Key Concepts in Designing GenAI Systems
Understand some domain-specific concepts related to text-to-text, text-to-image, text-to-speech, and text-to-video generation systems.
In the ever-evolving domain of GenAI, generating meaningful, high-quality outputs, whether text, audio, image, or video, relies on a shared set of concepts and techniques. These concepts help us design systems that understand, process, and generate data in ways that mimic human perception and creativity. This lesson describes the key techniques across diverse domains, from tokenization and embedding methods in natural language processing to phoneme conversion and acoustic models in audio generation, along with some basic concepts in video generation systems. Each concept is critical in bridging the gap between raw data and sophisticated, contextually rich outputs, forming the backbone of modern AI systems. We will discuss the concepts shown in the table below:
Lesson Structure
| Concept | Description |
|---------|-------------|
| Tokenization | Splits text into smaller units (tokens) to enable processing by language models in text-to-text, text-to-image, text-to-speech, and text-to-video generation systems. |
| Embedding | Converts tokens (words, phrases, etc.) into dense vector representations to capture semantic meanings and relationships in all generative systems. |
| Image resolution enhancement techniques | Improves the quality and detail of generated images by increasing their resolution, particularly in text-to-image or text-to-video generation systems. |
| Text-to-phoneme conversion | Converts written text into phonetic representations to generate accurate and natural-sounding speech in text-to-speech (TTS) systems. |
| Acoustic model | Models the relationship between phonetic units and their corresponding acoustic signals, enabling the generation of natural-sounding speech in TTS systems. |
| Concept extraction | Extracts key ideas or instructions from textual prompts to guide the generation of relevant visuals in text-to-image or text-to-video generation systems. |
| Scene graph generation | Creates structured representations of objects, their relationships, and attributes in images, used for generating more complex and context-aware visuals (images or videos). |
Let’s dive deep into each of the above concepts.
Tokenization
Tokenization is the process of breaking down text (a prompt) into smaller units called tokens. These tokens can be words, phrases, sentences, or even characters. Tokenization splits text into individual words or subwords that are easier for machines to process and analyze. For example, if the input is “The five boxing wizards jumped quickly,” the tokenized output (word level) would be: [“The,” “five,” “boxing,” “wizards,” “jumped,” “quickly”]. For languages with complex scripts like Chinese, or in specialized applications like text-to-speech systems, tokenization may operate at a finer granularity, such as individual characters or phonemes.
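As a minimal sketch (plain Python with no tokenizer library; the function names are illustrative, not from any specific framework), word-level and character-level tokenization of the example prompt could look like this:

```python
import re

def word_tokenize(text):
    # Word-level: split on word boundaries, dropping whitespace and punctuation.
    return re.findall(r"\w+", text)

def char_tokenize(text):
    # Character-level: every non-space character becomes its own token.
    return [ch for ch in text if not ch.isspace()]

prompt = "The five boxing wizards jumped quickly"
print(word_tokenize(prompt))
# ['The', 'five', 'boxing', 'wizards', 'jumped', 'quickly']
```

Note how the character-level variant of the same prompt yields a much longer token sequence, which is exactly the computational trade-off discussed below.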
Tokenization techniques
There are several techniques to perform tokenization. Let’s explore some of the common ones:
Word-level tokenization: This technique splits text at word boundaries, producing tokens corresponding to whole words. While simple and intuitive, it struggles with languages that lack clear word delimiters and leads to very large vocabularies.
Subword tokenization: This technique breaks words down into smaller units like prefixes, suffixes, and stems, handling rare or unknown words more efficiently by decomposing them into reusable subword tokens. For example, “unbelievable” may be split into [“un,” “believ,” “able”]. Popular approaches include Byte Pair Encoding (BPE) and SentencePiece. BPE iteratively merges the most frequent pairs of characters or subword units into a single unit, creating a compact vocabulary that handles rare and out-of-vocabulary words by breaking them into smaller, meaningful subunits. SentencePiece is a tokenization library that treats input text as a raw sequence of Unicode characters and learns subword units without requiring explicit spaces between words; it supports methods like BPE and Unigram, making it versatile across languages and tasks.
Character-level tokenization: This method treats each character as a token. While highly granular, it produces long token sequences, making it computationally expensive and requiring more sophisticated models to capture context.
Tokenization for speech: This method converts text into phonemes (basic sound units) to prepare for text-to-speech systems. This technique involves linguistic rules to ensure accurate pronunciation and intonation.
Multimodal tokenization: This technique combines textual and non-textual data, such as text with images or audio, into unified token representations. For example, ...
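To make the BPE merging idea from the subword technique above concrete, here is a toy sketch in plain Python. It greedily merges the most frequent adjacent pair within a single word; real BPE learns its merge rules from pair frequencies across a whole training corpus, so this is an illustration of the merge step only, and all function names here are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs and return the most common one.
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(word, num_merges):
    # Start from individual characters and apply greedy merges.
    tokens = list(word)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        tokens = merge_pair(tokens, most_frequent_pair(tokens))
    return tokens

print(bpe("unbelievable", 4))  # character tokens collapse into larger subword units
```

Each merge shortens the token sequence by fusing one frequent pair, which is how BPE builds its compact subword vocabulary.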