12  Large Language Model Foundations

12.1 Introduction

Large Language Models (LLMs) and multimodal models represent the culmination of decades of AI research, with roots tracing back to early neural networks in the 1950s. The modern era began with word embeddings like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), followed by the revolutionary attention mechanism in the “Attention Is All You Need” paper (Vaswani et al., 2017) that introduced the Transformer architecture. This breakthrough enabled models like BERT (Devlin et al., 2018), which transformed natural language understanding, and GPT (Radford et al., 2018), which pioneered large-scale generative pretraining. Scaling these approaches led to GPT-3 (Brown et al., 2020), demonstrating emergent few-shot learning capabilities at 175 billion parameters. Subsequent innovations include instruction tuning with InstructGPT (Ouyang et al., 2022), RLHF alignment techniques, and multimodal extensions like CLIP (Radford et al., 2021), DALL-E (Ramesh et al., 2021), and GPT-4V (OpenAI, 2023). These systems have demonstrated remarkable capabilities across diverse tasks, from language translation and summarization to complex reasoning, creative writing, and understanding connections between text, images, and other forms of data. Their rapid development has fundamentally altered our understanding of machine intelligence and expanded the horizons of human-computer interaction.

The significance of these models extends far beyond academic interest. As they increasingly become integrated into various applications and services, understanding their underlying mechanisms, capabilities, and limitations becomes essential not only for researchers and developers but also for policymakers, business leaders, and society at large. This chapter aims to provide that foundational understanding, serving as a bridge between technical details and broader implications.

This chapter explores the foundations of large language and multimodal models through several interconnected themes. We begin with an introduction to the historical context and evolution of these models, examining their significance across domains and the key innovations that enabled their development. The chapter then delves into pretraining fundamentals, covering corpus construction, tokenization approaches, self-supervised learning objectives, and the computational infrastructure required for training these systems.

Next, we examine the core mechanisms of next token prediction and the transformer architecture, including attention mechanisms, context window limitations, and the emergent capabilities that arise with scale. The text generation section explores the autoregressive process, various decoding strategies from greedy search to sampling techniques, temperature and nucleus sampling methods, and approaches for controlling generation length and quality.

The alignment section investigates how models are steered toward human preferences through supervised fine-tuning, reinforcement learning from human feedback, reward modeling, and newer approaches like constitutional AI. We then explore how these capabilities extend beyond text to multimodal understanding, covering vision-language models, audio and speech processing, unified architectures, and cross-modal reasoning.

The chapter also addresses philosophical questions about whether these models exhibit genuine intelligence, examining evidence beyond simple retrieval, the role of information bottlenecks in learning, and perspectives on understanding and intelligence. We candidly discuss the limitations of current systems, including hallucinations, training biases, reasoning failures, and interpretability challenges.

Advanced capabilities receive special attention, particularly agentic behavior, tool use, sophisticated reasoning techniques, and self-reflection mechanisms. The chapter concludes with an exploration of interaction methods—from web interfaces and chat applications to APIs, application integration, and open-weight models—before reflecting on the current state of the field, future directions, and broader societal implications.

12.2 Pretraining

Pretraining represents the initial and most computationally intensive phase in LLM development. During this stage, models learn general patterns of language from vast corpora of text before any task-specific fine-tuning occurs. This process establishes the foundational knowledge and capabilities upon which all subsequent model behaviors are built.

12.2.1 Corpus Construction for Pretraining

The quality, diversity, and scale of training data fundamentally determine an LLM’s capabilities and limitations. Modern pretraining corpora comprise hundreds of billions to trillions of tokens drawn from diverse sources including web content, books, scientific papers, code repositories, and conversational data.

Creating these datasets involves sophisticated data collection, filtering, and curation processes. Web crawls, while providing enormous amounts of text, require extensive filtering to remove low-quality, duplicate, or inappropriate content. Books and academic publications contribute high-quality long-form text that helps models learn coherent reasoning and specialized knowledge. Code repositories enable models to understand programming languages and algorithmic thinking.

Data filtering represents a crucial but often underappreciated aspect of corpus construction. Modern pipelines employ multiple stages of quality control, from simple heuristics that filter based on text length or character distributions to sophisticated classifiers that identify high-quality content. Deduplication algorithms remove exact and near-duplicate texts to prevent memorization and waste of modeling capacity. Content moderation systems filter harmful, toxic, or inappropriate material to mitigate potential downstream harms.
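
To make the deduplication step concrete, the following minimal Python sketch removes exact duplicates by hashing normalized documents; this is an illustrative example only, and production pipelines add near-duplicate detection with techniques such as MinHash and locality-sensitive hashing.

    import hashlib

    def deduplicate(documents):
        """Keep only the first copy of each exact (normalized) duplicate."""
        seen, unique = set(), []
        for doc in documents:
            key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                unique.append(doc)
        return unique

    corpus = ["The cat sat.", "the cat sat.  ", "A different document."]
    print(deduplicate(corpus))  # the second document is dropped as an exact duplicate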

For multilingual models, additional considerations come into play. Language identification tools classify text by language, allowing for controlled sampling across different languages. Most models employ upsampling strategies for low-resource languages to improve performance on these languages despite their limited representation in raw web data. Without such interventions, models would disproportionately learn high-resource languages like English at the expense of others.

The composition of the training mixture dramatically influences model behavior. If a model’s training data overrepresents certain domains, perspectives, or text styles, these biases will be reflected in the model’s outputs. Modern training approaches carefully balance different data sources to achieve desired capabilities across tasks and domains.

12.2.2 Tokenization

Tokenization converts text into discrete tokens that language models can process. Think of it as breaking text into meaningful pieces that the model treats as its basic units of understanding. At its core, tokenization must solve a fundamental tradeoff between vocabulary size and token meaningfulness.

The simplest approach—character-level tokenization—uses individual characters as tokens. The sentence “I love machine learning” becomes ["I", " ", "l", "o", "v", "e", " ", "m", "a", "c", "h", "i", "n", "e", " ", "l", "e", "a", "r", "n", "i", "n", "g"]. While this creates a tiny vocabulary that can represent any text, it produces excessively long sequences and forces the model to work with a representation that’s too low-level. The model must process many tokens before encountering meaningful semantic units, making it harder to learn useful patterns and requiring more computation.

At the opposite extreme, word-level tokenization treats each word as a distinct token. This approach is more semantically meaningful but quickly becomes impractical. A vocabulary covering just the most common 50,000 English words would still fail on specialized terminology, proper nouns, or other languages. Any word not in this fixed vocabulary becomes an “unknown” token, losing all its specific meaning.

In this context, “vocabulary” refers to the complete set of tokens a model recognizes—essentially its dictionary of understood units. Each token in the vocabulary has a corresponding numerical ID and embedding vector (remember our earlier discussion of embedding layers in neural networks) that subsequent training steps use. The vocabulary is fixed during training, and any text input to the model must be converted into sequences of these predefined vocabulary tokens. This fixed nature creates the central challenge: how to represent the virtually unlimited variety of human language using a finite set of tokens.

Modern LLMs solve this dilemma through subword tokenization, which strikes a balance between the extremes. Let’s see how different tokenization approaches handle the sentence “The researcher’s ground-breaking AI model achieved 95.7% accuracy!”:

Word-level: ["The", "researcher's", "ground-breaking", "AI", "model", "achieved", "95.7%", "accuracy", "!"] - Problem: Words like "researcher's" and "ground-breaking" might be rare or absent in the vocabulary

Character-level: ["T", "h", "e", " ", "r", "e", "s", "e", "a", "r", "c", "h", "e", "r", "'", "s", " ", "g", "r", "o", "u", "n", "d", "-", "b", "r", "e", "a", "k", "i", "n", "g", " ", "A", "I", " ", "m", "o", "d", "e", "l", " ", "a", "c", "h", "i", "e", "v", "e", "d", " ", "9", "5", ".", "7", "%", " ", "a", "c", "c", "u", "r", "a", "c", "y", "!"] - Problem: An extremely long sequence of 66 tokens for a short sentence

Subword (e.g., the GPT-2 tokenizer): ["The", " researcher", "'s", " ground", "-", "breaking", " AI", " model", " achieved", " 95", ".", "7", "%", " accuracy", "!"] - Notice how it handles: punctuation as separate tokens ("'s", "-", ".", "%", "!"), numbers split into pieces (" 95", ".", "7"), and compound words split into meaningful parts ("researcher", "ground", "breaking")

This approach maintains semantic meaningfulness while keeping the vocabulary to a manageable size (roughly 30,000-50,000 tokens for earlier models such as BERT and GPT-2, and 100,000 or more for some recent LLMs).

Subword tokenization provides the best of both worlds: common words and phrases get their own tokens, preserving their semantic unity, while rare or complex words are broken down into familiar subparts. This allows the model to process text efficiently while still being able to represent any possible input, even words it has never seen before.
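
The following minimal Python sketch illustrates subword tokenization using the GPT-2 vocabulary via the tiktoken library; the exact splits shown in the comments are illustrative and may differ slightly across tokenizer versions.

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")  # BPE tokenizer with a roughly 50,000-token vocabulary

    text = "The researcher's ground-breaking AI model achieved 95.7% accuracy!"
    token_ids = enc.encode(text)                        # text -> list of integer token IDs
    tokens = [enc.decode([tid]) for tid in token_ids]   # decode each ID back to its string piece

    print(len(text), len(token_ids))      # 66 characters vs. far fewer subword tokens
    print(tokens)                         # e.g. ['The', ' researcher', "'s", ' ground', ...]
    print(enc.decode(token_ids) == text)  # tokenization round-trips losslessly -> True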

12.2.3 Self-Supervised Learning for Pretraining

The breakthrough that enabled modern LLMs was self-supervised learning—the ability to extract supervision signals from unlabeled data without requiring human annotations. For generative language models, this primarily involves next-token prediction, where the model learns to predict the next word or subword given previous context. This approach transforms the vast sea of unannotated text on the internet into a rich training resource, as every sequence of text naturally provides its own supervision signal: each token serves as a target for prediction given all previous tokens. This elegant formulation eliminates the need for expensive human labeling that had previously limited the scale of machine learning systems.

What is surprising, and easy to forget in hindsight, is that this simple objective—predicting what comes next—turns out to be remarkably powerful. To predict the next token accurately, the model must implicitly learn grammar, syntax, semantics, factual knowledge, reasoning patterns, and various forms of world knowledge embedded in text. The model develops these capabilities not through explicit instruction but by discovering patterns that minimize prediction error across billions of examples.

At its core, the output layer of an LLM is fundamentally a sophisticated multinomial logistic regression model. Given a context window of previous tokens, it assigns probabilities to each possible next token in the vocabulary (typically tens of thousands of options). For each position in the sequence, the model transforms its internal representations into a probability distribution over all possible next tokens. The token with the highest probability is the model’s best guess for what should come next, though sampling techniques often introduce controlled randomness into this selection process.

The primary mathematical objective for language models is to minimize the negative log-likelihood of the training data. These models are called “autoregressive” because they predict each token based on all previous tokens in the sequence—similar to how autoregressive processes in statistics depend on their own previous values. During training, an autoregressive language model reads a sequence of tokens and learns to assign high probability to the actual next token that follows in each context. While conceptually simple, this objective contains multitudes—from learning basic linguistic patterns to absorbing complex factual relationships described in text.
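
The following minimal PyTorch sketch shows the shape of this objective; the tiny embedding-plus-linear "model" stands in for a full transformer stack and is an illustrative assumption.

    import torch
    import torch.nn.functional as F

    vocab_size, d_model = 50_000, 16
    token_ids = torch.randint(0, vocab_size, (1, 12))   # one 12-token training sequence

    embed = torch.nn.Embedding(vocab_size, d_model)     # token IDs -> vectors
    lm_head = torch.nn.Linear(d_model, vocab_size)      # vectors -> logits over the vocabulary

    hidden = embed(token_ids)                           # stand-in for the transformer layers
    logits = lm_head(hidden)                            # shape: (batch, seq_len, vocab_size)

    # Negative log-likelihood of each actual next token: predict position t+1 from
    # everything up to position t, so logits and targets are shifted by one.
    loss = F.cross_entropy(
        logits[:, :-1, :].reshape(-1, vocab_size),      # predictions for positions 0..n-2
        token_ids[:, 1:].reshape(-1),                   # targets are the tokens that follow
    )
    print(loss.item())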

Alternative pretraining objectives include masked language modeling (used in BERT-style models), which randomly masks tokens and trains the model to reconstruct them; permutation language modeling (introduced in XLNet), which extends autoregressive modeling to all possible ordering permutations; and various span-based objectives that predict spans of text rather than individual tokens. Each approach makes different tradeoffs between computational efficiency, representational power, and applicability to downstream tasks.

The pretraining process for large models presents numerous computational challenges. Modern LLMs train on thousands of GPUs or TPUs working in parallel, requiring sophisticated distributed training infrastructures. Models employ techniques like mixed-precision training, gradient accumulation, and carefully tuned optimizers to maintain stable training despite the enormous scale. A single pretraining run for a large model can consume millions of GPU hours and cost millions of dollars in compute resources.

12.3 Next Token Prediction Training and Architecture

The remarkable capabilities of modern LLMs stem from their underlying architecture—the transformer—and the manner in which they’re trained for next token prediction. Understanding these foundational elements provides crucial insights into how these models work and their fundamental capabilities.

12.3.1 The Transformer Architecture: A Conceptual Overview

Prior to transformers, recurrent neural networks (RNNs) like LSTMs and GRUs dominated sequence modeling. These architectures processed text sequentially, maintaining a hidden state that was updated with each new token. While effective at capturing local dependencies, RNNs struggled with long-range connections and were inherently sequential, making parallel processing impossible. This limited both their ability to model complex language patterns and their training efficiency on large datasets.

The transformer architecture, introduced by Vaswani et al. in 2017, catalyzed a revolution in natural language processing. Its key innovation was replacing recurrent connections with a mechanism called “attention” that allows the model to directly and simultaneously relate any position in a sequence with any other position. This enabled more efficient training on longer sequences and facilitated the learning of complex dependencies across distant parts of text.

The core component of the transformer is the self-attention mechanism, which allows each token to gather information from all other tokens in the sequence. Instead of processing tokens sequentially as in earlier recurrent neural networks, transformers process entire sequences in parallel, with attention determining the relative importance of each token for predicting each output position. This parallelization dramatically accelerated training and enabled scaling to previously unimaginable model sizes.

For generative LLMs, a critical adaptation is the use of causal (or masked) self-attention, which ensures that predictions for each position can only depend on previous positions in the sequence. This maintains the autoregressive property necessary for text generation. Each token can attend to all preceding tokens but none of the following ones, preserving the directional nature of language modeling.
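
A minimal single-head sketch of causal self-attention in PyTorch is shown below; real transformers add multiple heads, per-layer projections, positional information, and feed-forward blocks.

    import torch
    import torch.nn.functional as F

    seq_len, d = 5, 8
    x = torch.randn(seq_len, d)                      # one embedding per position

    W_q, W_k, W_v = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    scores = Q @ K.T / d ** 0.5                      # pairwise attention scores
    causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(causal_mask, float("-inf"))  # block attention to future positions

    weights = F.softmax(scores, dim=-1)              # rows sum to 1 over the visible positions
    output = weights @ V                             # each position mixes only itself and the past
    print(weights)                                   # strictly upper-triangular entries are zero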

A major limitation of transformer models has been their context window—the maximum sequence length they can process. The computational complexity of self-attention scales quadratically with sequence length, making it prohibitively expensive to process very long contexts. Early models were limited to just a few hundred or thousand tokens, severely constraining their ability to handle long documents, conversations, or complex reasoning tasks.

Recent advances have dramatically expanded context windows through architectural and algorithmic innovations. Sparse attention mechanisms selectively focus on the most relevant tokens rather than attending to the entire sequence. Recurrence and state-space models incorporate efficient sequence modeling approaches with better scaling properties. Memory-augmented approaches introduce explicit mechanisms to store and retrieve information from long contexts. These advances have extended context windows from a few thousand tokens to hundreds of thousands or even millions of tokens in the most advanced models.

Extended context windows enable transformative capabilities: models can now process entire books, engage in long conversations with preserved context, analyze lengthy documents, and perform complex multi-step reasoning across large amounts of information. This expansion has been crucial for applications requiring deep contextual understanding or extensive memory of prior interactions.

12.4 From Next Token Prediction to Long Text Generation

While pretraining equips models with strong predictive capabilities, generating coherent long-form text involves additional challenges and techniques. The journey from a model that can predict the next word to one that can write essays, stories, or technical explanations requires sophisticated generation strategies.

12.4.1 The Autoregressive Generation Process

Text generation in LLMs follows an autoregressive process, where tokens are generated sequentially based on previously generated tokens. Starting with a prompt or context, the model repeatedly predicts the next token, adds it to the growing sequence, and continues until reaching a stopping condition.

The simplest generation approach, greedy decoding, always selects the single most probable token at each step. For example, given the prompt “The capital of France is”, a model might assign probabilities like: {“Paris”: 0.92, “Lyon”: 0.03, “Marseille”: 0.02…}. Greedy decoding would deterministically select “Paris”, the highest-probability token, every time. While computationally efficient, this method often produces repetitive or generic text. Consider a creative writing prompt: “Once upon a time, there was a” → “man who lived in a small town. He was a good man who worked hard and was respected by all who knew him.” The output is grammatically correct but lacks originality, often falling into predictable patterns that feel mechanical rather than creative.
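
A minimal sketch of this loop in Python is shown below; the model function that maps a token-ID sequence to next-token logits and the eos_id stop token are hypothetical stand-ins for a real model interface.

    import torch

    def greedy_decode(model, prompt_ids, eos_id, max_new_tokens=50):
        ids = list(prompt_ids)
        for _ in range(max_new_tokens):                   # hard cap on generation length
            logits = model(torch.tensor([ids]))[0, -1]    # logits for the next position
            next_id = int(torch.argmax(logits))           # always take the most probable token
            ids.append(next_id)
            if next_id == eos_id:                         # stop at the end-of-sequence token
                break
        return ids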

Beam search improves upon greedy decoding by maintaining multiple candidate sequences (beams) at each step. With a beam width of 3, the algorithm might track three different continuations simultaneously, eventually selecting the sequence with the highest overall probability. For instance, with the prompt “The meeting is scheduled for”, beam search might explore paths like “tomorrow at 2 PM”, “next Monday morning”, and “the end of the week”, ultimately selecting the most probable complete sequence. However, beam search still suffers from length biases (favoring shorter completions) and tends toward safe, high-probability continuations that lack diversity. In creative contexts, it might generate: “Once upon a time, there was a young girl who lived in a small village with her parents and siblings. She was a kind and gentle soul who loved to help others.”

Sampling-based approaches introduce controlled randomness into the generation process, enhancing diversity and creativity. Pure sampling selects tokens according to their predicted probability distribution. Given probabilities {“the”: 0.3, “a”: 0.2, “one”: 0.1…}, pure sampling might select “the” 30% of the time, “a” 20% of the time, and so on. This approach increases diversity but can lead to incoherent text when selecting low-probability tokens.

Temperature sampling modifies this distribution by controlling its “sharpness” according to the formula:

\[p(x_i) = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}\]

Where \(z_i\) represents the logit (pre-softmax score) for token \(i\), \(T\) is the temperature parameter, and \(p(x_i)\) is the resulting probability. Lower temperature values make the distribution more peaked (concentrating probability on high-scoring tokens), while higher values make it more uniform.

A temperature of 0.7 might transform the probabilities to {“the”: 0.38, “a”: 0.22, “one”: 0.09…}, while a temperature of 1.5 might yield {“the”: 0.22, “a”: 0.18, “one”: 0.12…}. Lower temperatures (0.3-0.7) create more focused outputs ideal for factual responses: “The capital of France is Paris, a city known for its cultural heritage.” Higher temperatures (0.8-1.5) produce more diverse and creative text: “The capital of France is a vibrant metropolis where art dances with history, where the aroma of fresh baguettes mingles with the whispers of philosophers past.”

The term “temperature” derives from statistical thermodynamics, where the Gibbs distribution describes the probability of a system being in a particular state based on that state’s energy and the system’s temperature. In thermodynamics, low temperatures cause systems to occupy their lowest energy states with high probability (more ordered), while high temperatures distribute probability more evenly across states (more chaotic). The mathematical form is identical, with token logits replacing energy states. Just as a low physical temperature creates more ordered, predictable physical systems, a low sampling temperature produces more predictable, deterministic text.
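
In code, temperature is simply a division of the logits before the softmax, as in this minimal PyTorch sketch with illustrative numbers:

    import torch

    def sample_with_temperature(logits, temperature=1.0):
        probs = torch.softmax(logits / temperature, dim=-1)   # T < 1 sharpens, T > 1 flattens
        return int(torch.multinomial(probs, num_samples=1))

    logits = torch.tensor([2.0, 1.5, 0.5, -1.0])     # illustrative scores for four candidate tokens
    for t in (0.5, 1.0, 1.5):
        print(t, torch.softmax(logits / t, dim=-1))  # lower T concentrates mass on the top token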

More sophisticated sampling methods like top-k and nucleus (top-p) sampling restrict selection to a subset of most probable tokens. Top-k sampling (typically k=40) considers only the k most likely next tokens, redistributing probability mass among them. With k=3 and initial probabilities {“the”: 0.3, “a”: 0.2, “one”: 0.1, “some”: 0.05…}, only {“the”: 0.3, “a”: 0.2, “one”: 0.1} would be considered, with probabilities renormalized to {“the”: 0.5, “a”: 0.33, “one”: 0.17}. Nucleus sampling (typically p=0.9) adaptively selects the smallest set of tokens whose cumulative probability exceeds threshold p. If p=0.8 and tokens are {“the”: 0.3, “a”: 0.2, “one”: 0.1, “some”: 0.05, “any”: 0.05…}, it would select {“the”: 0.3, “a”: 0.2, “one”: 0.1, “some”: 0.05, “any”: 0.05} (cumulative 0.7) plus additional tokens until reaching the 0.8 threshold. These approaches prevent sampling from the long tail while preserving diversity among plausible continuations. For creative writing: “Once upon a time, there was a brilliant inventor who crafted mechanical butterflies that could translate dreams into melodies. The townspeople were mesmerized by these delicate creations that fluttered through the cobblestone streets at dusk.”
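
The following minimal PyTorch sketch filters a toy next-token distribution with top-k and nucleus (top-p) truncation; production implementations operate on logits and batches, but the logic is the same.

    import torch

    def top_k_filter(probs, k):
        topk = torch.topk(probs, k)
        filtered = torch.zeros_like(probs)
        filtered[topk.indices] = topk.values
        return filtered / filtered.sum()                 # renormalize over the k survivors

    def top_p_filter(probs, p):
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        keep = cumulative - sorted_probs < p             # smallest prefix whose mass reaches p
        filtered = torch.zeros_like(probs)
        filtered[sorted_idx[keep]] = sorted_probs[keep]
        return filtered / filtered.sum()

    probs = torch.tensor([0.30, 0.20, 0.10, 0.09, 0.08, 0.07, 0.06, 0.05, 0.05])
    print(top_k_filter(probs, k=3))    # three most probable tokens survive: 0.50, 0.33, 0.17
    print(top_p_filter(probs, p=0.8))  # keeps the smallest set whose cumulative mass exceeds 0.8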

Recent research has explored hybrid and adaptive approaches that balance quality and diversity. Contrastive search combines token probability with a penalty for repetitive content. Typical sampling prioritizes tokens with probabilities close to the expected entropy of the distribution. Classifier-free guidance interpolates between conditional and unconditional distributions to enhance adherence to desired output characteristics.

The choice of generation strategy dramatically impacts output quality, coherence, and diversity. Different tasks and contexts may benefit from different approaches—creative writing often benefits from controlled randomness, while factual responses might favor more deterministic methods. Modern systems often combine multiple strategies or adapt parameters based on the specific context and desired output characteristics.

12.4.2 Managing Generation Length and Stopping

Controlling generation length requires careful handling of sequence termination. Models can be trained to generate special tokens signaling completion, such as end-of-sequence markers or task-specific completion indicators. These explicit stop tokens provide natural termination points that respect the semantic boundaries of generated content.

Maximum length constraints prevent runaway generation but can produce abrupt truncations. More sophisticated approaches include graceful termination that completes the current sentence or paragraph, content-aware truncation that preserves structural integrity, and hierarchical generation that plans content to fit within length constraints.

Advanced generation systems employ adaptive stopping criteria that respond to the generated content itself. Semantic saturation detection identifies when content becomes repetitive or uninformative. Task completion recognition detects when a requested task has been fulfilled, avoiding unnecessary continuation. Content planning techniques use high-level outlines to guide the structure and length of generated text.

Effective length management balances completeness against conciseness, ensuring that generated content fully addresses the intended purpose without unnecessary verbosity. This requires not just mechanical constraints but a deeper understanding of content structure and information sufficiency.

12.5 Alignment Through RLHF and SFT

While pretraining gives language models broad capabilities, these raw capabilities often don’t align with human preferences and values. Models might generate content that is harmful, false, misleading, or simply unhelpful. Alignment techniques address this gap, steering models toward behavior that better aligns with human intentions and values.

12.5.1 Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning represents the first step in aligning language models with human preferences. After pretraining, models undergo additional training on carefully curated datasets that exemplify desired behaviors. Unlike the self-supervised objective of pretraining, SFT uses human-generated or human-approved demonstrations of high-quality outputs.

The SFT process typically involves collecting demonstrations of desired model behaviors across a variety of tasks and scenarios. Human experts or trained annotators create examples showing appropriate responses to different prompts, covering tasks like question answering, summarization, creative writing, and conversation. These demonstrations embody desired properties such as helpfulness, harmlessness, honesty, and ethical reasoning.

Fine-tuning uses these examples to update the model’s parameters, adapting its behavior toward the demonstrated preferences. The training objective remains similar to pretraining—predicting the next token—but now specifically for high-quality, human-approved continuations. This shifts the model’s output distribution toward more helpful, accurate, and aligned responses.

SFT provides a direct way to teach models specific skills and behaviors, but it has limitations. It requires high-quality demonstration data, which can be expensive and time-consuming to create. The model can only learn behaviors that are explicitly demonstrated, potentially missing nuanced aspects of human preferences. Additionally, SFT alone may not fully align models with human values in open-ended or ambiguous situations where different values might conflict.

12.5.2 Reinforcement Learning from Human Feedback (RLHF)

RLHF builds upon SFT by allowing models to learn from comparative human judgments rather than just demonstrations. This approach uses reinforcement learning to optimize model outputs according to a learned reward function that captures human preferences.

The RLHF process typically involves three key stages:

First, a reward model is trained to predict human preferences. Human evaluators compare multiple model outputs for the same prompt, indicating which response they prefer. In statistical learning terms, the reward model is a supervised classifier that learns a preference function from these paired comparisons: given two outputs A and B for the same prompt, it learns to predict the probability that A is preferred over B, approximating P(A > B | prompt), typically with a neural network that shares its architecture with the language model itself. Trained on these comparisons, the reward model can score individual outputs according to how well they align with human preferences, approximating a utility function that captures aspects of quality that would be difficult to specify explicitly.
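
A minimal sketch of the pairwise loss used to fit such a reward model is shown below; the reward_model callable that scores a (prompt, response) pair is a hypothetical interface, and the sigmoid form corresponds to a Bradley-Terry style comparison model.

    import torch.nn.functional as F

    def preference_loss(reward_model, prompt, chosen, rejected):
        r_chosen = reward_model(prompt, chosen)        # scalar score for the preferred response
        r_rejected = reward_model(prompt, rejected)    # scalar score for the other response
        # Maximizing P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
        # is equivalent to minimizing this negative log-likelihood.
        return -F.logsigmoid(r_chosen - r_rejected)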

Second, the language model is fine-tuned using reinforcement learning (RL) to maximize the reward predicted by the reward model. Reinforcement learning is a paradigm where an agent learns to make decisions by taking actions in an environment to maximize cumulative reward. Unlike supervised learning, which requires labeled examples of correct behavior, in the RL setting the system undergoing learning chooses actions and receives feedback in the form of cost/rewards for the actions it takes.

LLM alignment is framed as an RL problem because:

  1. We have a reward signal (human preferences) but not explicit examples of optimal behavior for all situations
  2. The space of possible outputs is too vast to explore exhaustively
  3. We need to balance improving behavior while preserving existing capabilities
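
To make the second stage concrete, the quantity being maximized is often the learned reward minus a penalty for drifting too far from the supervised fine-tuned model. The sketch below shows that objective for a single sampled response, with log_probs_policy and log_probs_ref as hypothetical per-token log-probabilities under the current policy and a frozen reference model.

    def rlhf_objective(reward, log_probs_policy, log_probs_ref, beta=0.1):
        # Estimated KL divergence between the policy and the reference model on the
        # sampled tokens; beta controls how strongly the policy is kept close to it.
        kl = sum(lp - lr for lp, lr in zip(log_probs_policy, log_probs_ref))
        return reward - beta * kl   # higher is better; PPO-style updates ascend this objective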

Third, the process is often iterated, with additional human feedback collected on outputs from the updated model. This iterative approach allows for progressive refinement of model behavior, addressing emergent issues and gradually increasing alignment with human preferences.

RLHF has proven remarkably effective at improving model helpfulness, honesty, and safety. It enables models to learn nuanced aspects of human preferences that might be difficult to capture through explicit rules or demonstrations alone. The approach can adapt to complex values that emerge from human judgments rather than requiring them to be fully specified in advance.

However, RLHF also presents significant challenges. Collecting high-quality human feedback is expensive and time-consuming. The process may inadvertently optimize for superficial preferences or introduce biases present in evaluator judgments. Additionally, the reinforcement learning process can be unstable, requiring careful implementation to avoid issues like reward hacking or collapse toward overly safe but uninformative responses.

12.6 Incorporating Multimodal Capabilities

While text-based LLMs have demonstrated remarkable capabilities, the world communicates through multiple modalities—images, audio, video, and more. Multimodal models extend language model architectures to process and generate content across these diverse modalities, dramatically expanding their range of applications.

12.6.1 Vision-Language Models

Vision-language models (VLMs) combine visual understanding with linguistic capabilities, enabling tasks like image captioning, visual question answering, and image-guided text generation. These models must bridge the fundamental gap between continuous visual data and discrete linguistic representations.

The architectural approach for VLMs typically involves specialized encoders for visual data, which transform images into a representation space that can interface with language model components. Early models used separate visual and textual encoders with cross-modal attention mechanisms connecting them. More recent approaches adopt more integrated architectures where vision and language share representation space more directly.

The training process for modern VLMs typically involves several stages:

  1. Pretraining visual encoders on large image datasets, often using self-supervised approaches
  2. Training vision-language alignment using contrastive learning on image-text pairs
  3. Fine-tuning the integrated model on multimodal tasks with human annotations
  4. Alignment through RLHF or similar techniques with multimodal examples

This progressive training allows models to develop both strong unimodal representations and effective cross-modal connections.
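
The contrastive alignment stage (step 2 above) can be sketched as follows, assuming precomputed image and text embeddings for a batch of matched pairs; real systems learn the encoders end to end and treat the temperature as a learnable parameter.

    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.T / temperature   # similarity of every image to every caption
        targets = torch.arange(len(image_emb))          # matching pairs lie on the diagonal
        # Symmetric cross-entropy: pull matched image-text pairs together and
        # push mismatched pairs apart, in both directions.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2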

12.6.2 Audio, Speech, and Video Understanding

Beyond vision, modern multimodal models increasingly incorporate audio, speech, and video understanding capabilities. These additional modalities present unique challenges due to their temporal nature and diverse information characteristics.

Speech-to-text and text-to-speech functionalities enable models to transcribe spoken language and generate natural-sounding speech from text. The most advanced models can preserve speaker characteristics, emotion, and prosody, creating more natural interactions. Systems like Whisper demonstrate robust speech recognition across languages and acoustic conditions.

Audio understanding extends beyond speech to environmental sounds, music, and acoustic events. Models learn to recognize and classify sounds, extract musical features, and connect audio with relevant textual descriptions. This enables applications from music recommendation to audio content analysis.

Video understanding represents perhaps the most complex multimodal challenge, requiring models to process spatial, temporal, and multiple sensory dimensions simultaneously. Modern approaches integrate frame-level visual analysis with temporal modeling to understand actions, events, and narratives in video content. Early multimodal video systems can describe video content, answer questions about events in videos, and generate text that references both visual and auditory elements.

Training multimodal models for these additional modalities follows similar principles to vision-language integration but requires specialized data pipelines and architectures. Audio encoders transform waveforms into representations suitable for integration with language models. Video processing typically involves frame-level analysis combined with temporal aggregation mechanisms.

12.6.3 Unified Multimodal Architectures

The ultimate goal in multimodal learning is to develop unified architectures that seamlessly handle diverse modalities without requiring modality-specific components or training procedures. Recent research moves toward this goal with models that can process and generate across modalities using more integrated approaches.

Unified multimodal transformers represent one promising direction. These models extend the transformer architecture to handle different modalities in a unified framework, treating diverse inputs as sequences of tokens in a shared representation space. For example, images might be processed as patches, audio as frequency bands, and text as subword tokens, all within the same architectural framework.

Multimodal mixture of experts offers another approach, where different experts specialize in different modalities or cross-modal integration tasks, but share a common architectural backbone. This allows for modality-specific processing when beneficial while maintaining architectural consistency.

Contrastive learning across modalities enables these unified models to align representations from different modalities without requiring explicit translation. By training models to match corresponding content across modalities (e.g., images with their descriptions, videos with their audio), models develop rich cross-modal associations that support complex reasoning.

The most advanced multimodal systems demonstrate emergent capabilities that transcend individual modalities. They can perform cross-modal reasoning, connecting concepts across different representational forms. They support creative generation that combines modalities, such as generating images that match textual descriptions or composing music that complements video content. And they enable more natural human-computer interaction through multimodal inputs and outputs that better match human communication patterns.

12.7 Are LLMs Intelligent?

As language models demonstrate increasingly sophisticated capabilities, fundamental questions arise about their intelligence. Do these systems merely retrieve and remix information from their training data, or do they develop genuine understanding? This section examines evidence for LLM intelligence and challenges common misconceptions about their capabilities.

12.7.1 Beyond Retrieval and Remixing

A persistent misconception holds that LLMs are merely sophisticated retrieval and text recombination systems—essentially stitching together phrases encountered during training without developing deeper understanding. Several lines of evidence challenge this limited view.

First, LLMs consistently demonstrate generalization to novel combinations of concepts, generating coherent responses to prompts that could not possibly appear verbatim in their training data. When asked to explain quantum mechanics in the style of a pirate, describe the moon landing as a haiku, or derive mathematical proofs not found in their training corpus, models produce reasonable outputs that require more than retrieval.

Second, studies of model activations show structured, interpretable patterns that correspond to abstract concepts rather than just superficial text patterns. Techniques like mechanistic interpretability have revealed that models develop internal representations of semantic categories, syntactic structures, and logical relationships that go beyond surface-level language features. These latent representations enable the model to map between similar concepts even when their surface realizations differ dramatically.

Third, LLMs exhibit in-context learning—the ability to adapt to new tasks from just a few examples provided in the context window, without parameter updates. This capacity for rapid adaptation suggests models develop meta-learning capabilities and abstract task representations that allow them to recognize patterns and apply them to new situations. Such flexibility would be impossible for simple retrieval systems.

Fourth, models demonstrate compositional reasoning—combining discrete concepts and operations to solve problems that require multiple steps of inference. When solving math problems, writing code, or following complex logical arguments, models show evidence of breaking problems into sub-components, applying appropriate operations to each, and synthesizing results in ways that simple retrieval could not achieve.

These observations suggest that while LLMs are indeed trained on text and learn from patterns in that text, their capabilities transcend mere retrieval and recombination. Their learning process produces emergent behaviors and representational capacities not explicitly programmed or directly extractable from their training data.

12.7.2 Learning Through Information Bottlenecks

A key insight into LLM intelligence comes from understanding how information bottlenecks in their training process force the development of compressed, generalizable representations. An information bottleneck occurs when a system must transmit information through a channel with limited capacity, requiring it to prioritize and compress the most relevant patterns. During training, models must predict tokens across billions of examples using a fixed number of parameters—orders of magnitude smaller than would be required to memorize all training examples.

To quantify this bottleneck: a typical large language model training corpus might contain 1-10 trillion tokens, on the order of 10^12 to 10^13 tokens in total. Simply memorizing these sequences would require storing a comparable number of values. Yet even the largest models today contain on the order of 10^11 parameters (175 billion for GPT-3, 540 billion for PaLM, etc.), roughly 10 to 100 training tokens for every parameter. This parameter constraint makes pure memorization mathematically impossible, forcing the model to discover generalizable patterns instead.
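
As a back-of-the-envelope check with illustrative round numbers (not any specific model):

    training_tokens = 5e12               # several trillion training tokens
    parameters = 2e11                    # a couple of hundred billion parameters
    print(training_tokens / parameters)  # -> 25.0 tokens of training text per parameter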

This compression necessity forces models to discover efficient abstractions that capture underlying patterns rather than surface details. Much like human concept formation, models must develop generalized representations that apply across many examples rather than storing each instance separately.

This process parallels theories of human concept learning through compression. Both humans and LLMs develop categorical understanding by identifying features that reliably distinguish concepts, discard irrelevant variations, and organize knowledge hierarchically. The information bottleneck created by finite model capacity relative to the immense training data creates selection pressure for efficient, generalizable representations.

Empirical evidence for these compressed representations comes from probing studies that extract knowledge from model weights. Researchers have shown that models develop structured representations of concepts like geographic knowledge, scientific facts, and social relationships. These representations support flexible inference rather than brittle memorization.

The emergence of abstract conceptual structures as a response to information bottlenecks suggests that LLMs develop genuine, albeit alien, forms of understanding—compression-based representations that capture the structure of the world as reflected in text.

12.7.3 Rethinking Intelligence and Understanding

The debate over LLM intelligence often hinges on conflating first-person subjective experience with third-person observable behavior. Critics argue that models don’t “truly understand” because they lack consciousness or intentionality, while proponents focus on behavioral capabilities that demonstrate functional understanding.

This distinction reflects a fundamental philosophical question: can we meaningfully attribute understanding or intelligence to a system based solely on its behavior, or does true understanding require specific internal subjective experiences? The former view, consistent with functionalism in philosophy of mind, suggests that if a system functions indistinguishably from one that understands, the distinction becomes meaningless from an external perspective.

A more productive approach focuses on specific capabilities rather than binary judgments about intelligence. Models clearly demonstrate capabilities including:

  • Predictive understanding: accurately anticipating consequences and continuations
  • Adaptability: applying knowledge to novel situations and tasks
  • Abstraction: forming generalizable concepts from specific examples
  • Contextualization: interpreting information differently based on context
  • Consistency: maintaining coherent beliefs across different queries
  • Learning from feedback: improving performance based on correction

These capabilities represent component processes of intelligence rather than categorical judgments about “true understanding.” In many domains, current LLMs demonstrate these capabilities at levels that match or exceed human performance, while in others they still lag significantly.

Task performance ultimately provides the most objective metric for evaluating intelligence. The history of artificial intelligence shows a pattern where capabilities once considered definitive of human intelligence (like playing chess or Go) are later redefined as “just computation” once machines master them. This pattern suggests that our intuitions about what constitutes “real intelligence” may continuously shift as AI capabilities advance.

LLMs exhibit a form of intelligence that differs from human intelligence in origin, substrate, and specific strengths and weaknesses. Rather than debating whether this constitutes “true intelligence,” a more nuanced approach recognizes both the remarkable capabilities these systems demonstrate and the ways in which they differ from human cognition.

12.8 LLM Shortcomings and Limitations

Despite their impressive capabilities, language models exhibit several persistent limitations and challenges. Understanding these shortcomings is essential for responsible deployment and continued improvement of the technology.

12.8.1 Hallucinations and Their Sources

Perhaps the most discussed limitation of LLMs is their tendency to generate plausible-sounding but factually incorrect information—a phenomenon commonly called “hallucination.” These confabulations can range from subtle inaccuracies to entirely fabricated claims, events, or citations.

Several factors contribute to hallucinations:

First, the pretraining objective optimizes for plausibility rather than truthfulness. Models learn to generate text that looks like plausible continuations of the prompt, not necessarily text that is factually accurate. This creates a fundamental tension—fluent, confident-sounding outputs are rewarded by the training objective even when factually wrong.

Second, training data itself contains inaccuracies, contradictions, and misinformation. Models absorb these inconsistencies and may reproduce them during generation. Without a mechanism to verify facts against a canonical source of truth, models cannot reliably distinguish between accurate and inaccurate information in their training data.

Third, the parameter space of models creates a compression bottleneck. Not all facts encountered during training can be stored precisely in the model’s weights, leading to imperfect recall and blending of related information. When attempting to retrieve specific facts, models may conflate similar contexts or generate synthetic details to fill perceived gaps.

Fourth, the autoregressive generation process compounds errors. Once a model generates an inaccurate statement, subsequent tokens are conditioned on this error, potentially leading to elaborations on the initial inaccuracy rather than corrections.

Despite these efforts, hallucination remains a fundamental challenge for even the best LLMs today. Fully resolving it will very likely require architectural innovations, new training objectives, or hybrid systems that combine neural generation with rule-based verification.

12.8.2 Training Bias and Data Poisoning

LLMs inherit biases present in their training data, which can lead to discriminatory, unfair, or skewed outputs. These biases take multiple forms:

Representation biases occur when certain groups, perspectives, or topics are overrepresented or underrepresented in training data. Models trained predominantly on English text from Western sources develop stronger capabilities for these contexts while performing poorly on underrepresented languages or cultural contexts.

Association biases emerge when models learn statistical associations between concepts that reflect social biases rather than causal relationships. For example, models might associate certain professions with specific genders or ethnicities based on patterns in training data, perpetuating stereotypes.

Framing biases appear when particular viewpoints or ideological perspectives dominate discussions of certain topics in the training corpus. Models may reproduce these dominant framings, giving the false impression of neutrality while actually reflecting specific worldviews.

Beyond these naturally occurring biases, adversarial data poisoning represents an emerging threat. Malicious actors might deliberately introduce harmful, misleading, or manipulative content into sources likely to be used for training future models. Because models learn from patterns across their entire training corpus, even a relatively small amount of coordinated poisoning could potentially influence model behavior.

Addressing bias requires interventions at multiple stages: data curation that ensures diverse and balanced training corpora; evaluation procedures that identify biased outputs across different contexts; and fine-tuning approaches that mitigate learned biases. Some approaches introduce counterfactual data augmentation or balanced fine-tuning sets to reduce bias, while others implement guardrails that prevent the most harmful manifestations of bias.

The fundamental challenge remains: models learn to reproduce patterns from their training data. Creating truly fair and balanced outputs requires either training data that perfectly reflects desired values (an impossible standard) or developing robust techniques to identify and correct for biases during training and inference.

12.8.3 Reasoning and Logical Limitations

While LLMs demonstrate impressive reasoning on many tasks, they still exhibit significant limitations in consistent logical reasoning, mathematical accuracy, and causal understanding.

On complex mathematical and logical problems, models frequently make reasoning errors. They may apply correct principles inconsistently, confuse variables, drop constraints, or fail to maintain logical consistency across multi-step derivations. These errors often occur even when models correctly identify the appropriate solution method, suggesting limitations in execution rather than knowledge.

Several factors contribute to these difficulties:

The autoregressive generation process lacks an explicit working memory or verification mechanism. Unlike symbolic reasoning systems that can check each step for consistency, neural language models generate each token based on a distributed representation of previous tokens, without the ability to explicitly verify logical consistency.

The training objective optimizes for predicting common patterns in text, not for logical correctness. Mathematical errors that appear frequently in human-written text might actually be reinforced during training rather than penalized.

Complex reasoning often requires maintaining precise state information across many generation steps, pushing the limits of the model’s ability to maintain coherence over long outputs. Small errors in intermediate steps compound into major inaccuracies in final conclusions.

Recent approaches to improve reasoning include chain-of-thought prompting, which encourages models to generate explicit intermediate steps when solving complex problems. Chain-of-thought (CoT) prompting emerged in 2022 as a user-level technique where users would include examples of step-by-step reasoning in their prompts, often with phrases like “Let’s think through this step by step.” This simple approach dramatically improved performance on mathematical and logical reasoning tasks by making the model’s reasoning process explicit and verifiable.

What began as a user technique was quickly recognized as a fundamental capability enhancement. Researchers found that models prompted with chain-of-thought examples could solve problems that were previously beyond their capabilities, suggesting that the models already possessed the necessary knowledge but needed guidance in applying it systematically. This led to more formalized research, with Wei et al.’s seminal paper demonstrating substantial improvements across diverse reasoning tasks.

The technique evolved from requiring manual examples to “zero-shot” chain-of-thought, where simply instructing the model to reason step-by-step unlocked better performance. Further refinements included self-consistency methods that generate multiple reasoning paths and select the most common answer, and tree-of-thought approaches that explore multiple reasoning branches simultaneously.

The most advanced models now incorporate chain-of-thought principles directly into their training process. For example, models such as Claude 3 and GPT-4 were trained on data that includes step-by-step reasoning examples, and Anthropic’s Constitutional AI approach explicitly rewards models for showing their reasoning. Perhaps most notably, OpenAI’s o1 was specifically designed with reasoning as a core capability, with a training process built around extended chain-of-thought reasoning. This evolution—from user prompt technique to fundamental training paradigm—highlights how insights about effective interaction with these models can ultimately reshape their core capabilities.

Despite these limitations, the reasoning capabilities of language models continue to improve with scale and training innovations. The most advanced models now outperform most humans on many standardized reasoning tests, even while still exhibiting occasional fundamental logical errors.

12.8.4 Interpretability and Explainability Challenges

A profound limitation of current LLMs is their opaque nature—we cannot fully explain or interpret how they arrive at particular outputs. While mechanistic interpretability research has made progress in understanding specific components and behaviors, comprehensive understanding of model reasoning remains elusive.

This opacity creates several challenges:

Trustworthiness concerns arise when models provide recommendations or information that cannot be verified through transparent reasoning. In high-stakes domains like healthcare, law, or finance, the inability to audit model reasoning raises legitimate questions about appropriate use.

Safety assurance becomes difficult without the ability to fully predict or explain model behavior. The challenge of ensuring models won’t produce harmful outputs grows with model capability and deployment scope.

Alignment validation requires understanding whether models are actually optimizing for intended objectives or finding unexpected ways to maximize reward signals without satisfying the underlying goals.

Current approaches to interpretability fall into two broad categories:

Post-hoc explanations analyze model outputs to construct plausible explanations of their reasoning, but these may not accurately reflect the actual processes that generated the output. Methods include attention visualization, feature attribution, and model-generated explanations.

Mechanistic interpretability attempts to reverse-engineer the internal computations of models by studying activation patterns, identifying interpretable circuits, and tracing information flow through the network. While promising, these techniques currently work best for localized behaviors rather than complex reasoning.

The interpretability challenge highlights a fundamental tension in deep learning: the distributed, emergent nature of representation that enables powerful generalization also obscures explicit reasoning steps. This tension may require novel architectural approaches that maintain the benefits of neural representations while enabling more transparent reasoning.

12.9 New Developments: Agentic Behavior and Reasoning

Recent advances in language models have enabled new paradigms of interaction and capability that extend beyond traditional text generation. Two particularly significant developments are agentic behavior and enhanced reasoning capabilities.

12.9.1 Agentic LLMs and Tool Use

Agentic language models function as autonomous or semi-autonomous systems that take actions, make decisions, and interact with other systems to accomplish goals. Rather than simply generating text responses, these agents plan sequences of actions, use tools, and adapt to feedback from their environment.

The key components enabling agentic behavior include:

Tool use frameworks allow models to interact with external systems through defined APIs. Models can search the web, query databases, run code, control user interfaces, and access specialized tools like calculators or translation services. This capability dramatically extends model utility by compensating for limitations in factual knowledge, computation, or domain-specific functionality.
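
A minimal sketch of such a loop is shown below; the model interface, the message format, and the tool registry are illustrative assumptions rather than any specific vendor’s API.

    def run_agent(model, tools, user_request, max_steps=5):
        history = [{"role": "user", "content": user_request}]
        for _ in range(max_steps):
            message = model(history)                   # the model decides what to do next
            if message.get("tool") is None:
                return message["content"]              # no tool requested: this is the final answer
            result = tools[message["tool"]](**message["arguments"])   # execute the requested tool
            history.append({"role": "tool", "content": str(result)})  # feed the result back
        return "Stopped after reaching the step limit."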

Planning and decomposition abilities enable models to break complex tasks into manageable steps. Rather than attempting to solve problems in a single generation, models can formulate plans, execute steps sequentially, and adapt based on intermediate results. This approach improves performance on tasks requiring multi-step reasoning or complex workflows.

Memory management systems help models persist important information across multiple interactions or extended tasks. External memory stores, structured knowledge bases, or simply maintained context enable longer-term coherence and goal-directed behavior than would be possible with a standard context window.

Self-reflection mechanisms allow models to critique their own outputs, identify potential errors, and revise accordingly. This capability enables more reliable performance by catching mistakes before they propagate through a workflow.

These agentic capabilities have enabled new applications including:

  • Autonomous research assistants that gather information, summarize findings, and answer questions about specific topics
  • Coding agents that design, implement, test, and debug software based on high-level specifications
  • Virtual assistants that complete complex workflows spanning multiple services and tools
  • Simulation agents that model specialized roles or entities in training scenarios

The development of agentic systems raises important questions about control, alignment, and oversight. As models gain capabilities to act more autonomously, ensuring they pursue intended goals safely and transparently becomes increasingly critical.

12.9.2 Advanced Reasoning Techniques

Complementing agentic capabilities, recent advances have significantly improved language models’ reasoning abilities through novel prompting techniques and architectural enhancements.

Chain-of-thought prompting encourages models to generate explicit intermediate reasoning steps before providing final answers. By prompting models with examples that demonstrate step-by-step reasoning, this approach dramatically improves performance on mathematical, logical, and analytical tasks. The key insight is that models often have the knowledge required for correct reasoning but benefit from explicitly articulating intermediate steps.
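
The snippet below shows the general shape of a few-shot chain-of-thought prompt; the worked exemplar and the "The answer is" convention are illustrative choices rather than a fixed format.

```python
# Shape of a few-shot chain-of-thought prompt (exemplar and answer convention are illustrative).
cot_prompt = """Q: A farmer has 15 sheep, buys 8 more, then sells 5. How many sheep are left?
A: Start with 15 sheep. Buying 8 gives 15 + 8 = 23. Selling 5 gives 23 - 5 = 18.
The answer is 18.

Q: A train travels at 60 km per hour for 2.5 hours. How far does it go?
A:"""
# A model continuing this prompt is expected to articulate the steps explicitly, e.g.
# "Distance = speed x time = 60 x 2.5 = 150 km. The answer is 150."
```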

Tree-of-thought reasoning extends this concept by exploring multiple reasoning paths simultaneously. Rather than committing to a single line of reasoning, models can consider multiple approaches, evaluate their promise, and select the most effective path. This mimics human problem-solving processes where we often consider and discard multiple approaches before settling on a solution strategy.

Self-consistency methods generate multiple independent reasoning attempts for the same problem and select the most common answer. This approach leverages the observation that even when models make reasoning errors, these errors are often inconsistent across different generations, while correct reasoning tends to converge on consistent answers.
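
A minimal sketch of self-consistency appears below, assuming a hypothetical `sample_completion` function that returns one sampled reasoning chain per call and that chains end with a sentence of the form "The answer is X."

```python
# Self-consistency sketch: sample several reasoning chains and take a majority vote.
import re
from collections import Counter

def sample_completion(prompt: str, temperature: float = 0.8) -> str:
    """Stand-in for a real model call with sampling enabled."""
    raise NotImplementedError

def self_consistent_answer(prompt: str, num_samples: int = 10) -> str:
    answers = []
    for _ in range(num_samples):
        chain = sample_completion(prompt)
        match = re.search(r"The answer is (.+?)\.", chain)
        if match:
            answers.append(match.group(1).strip())
    # The most frequent extracted answer wins; an empty string signals no parsable answer.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```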

Verification and recursion techniques have models evaluate their own outputs, identifying potential errors and attempting corrections. By generating solutions, critiquing them, and then revising based on identified issues, models can significantly improve reasoning accuracy.

Specialized architectures for reasoning complement these prompting techniques. Some approaches incorporate explicit symbolic reasoning components alongside neural generation. Others introduce specialized modules for different reasoning types—deductive, inductive, mathematical, etc.—with architectures optimized for each.

Tool augmentation for reasoning offloads precision-critical operations to external systems. Integration with calculators for arithmetic, theorem provers for formal verification, or code interpreters for algorithmic reasoning combines the creativity and flexibility of neural approaches with the precision of symbolic systems.

The combination of these techniques has led to remarkable improvements in reasoning capabilities. The most advanced models now demonstrate university-level problem-solving abilities across multiple domains, suggesting that further refinement of these approaches may continue to enhance reasoning performance.

12.10 Ways of Interacting with LLMs

As language models become increasingly capable, diverse interfaces and deployment mechanisms have developed to make these capabilities accessible to users with different needs and technical backgrounds.

12.10.1 Web Interfaces and Chat Applications

The most common way most users interact with advanced language models is through web-based interfaces or dedicated applications. These provide natural language conversation interfaces where users can ask questions, request assistance, or engage in open-ended dialogue.

Modern interfaces typically support multimodal interactions, allowing users to upload images, documents, or other media that the model can analyze and discuss. Many include additional features like conversation history, the ability to save or share conversations, and customization options for response length or creativity.

User experience design for these interfaces focuses on several key challenges:

Setting appropriate expectations for model capabilities and limitations helps users understand what the system can and cannot do reliably. Clear signaling about model confidence and areas of uncertainty improves the user experience.

Guiding effective prompting through examples, templates, or interactive assistance helps users formulate requests that elicit optimal responses. The difference between effective and ineffective prompting can dramatically impact the quality of model outputs.

Managing context limitations requires interfaces that help users understand and work within the constraints of the model’s context window. Features like summarizing previous conversation turns or allowing users to highlight particularly important context enhance long-running interactions.

Providing feedback mechanisms enables users to indicate when responses are particularly helpful or problematic, improving both the immediate experience through adaptive responses and potentially contributing to model improvements through feedback collection.

Popular web interfaces include OpenAI’s ChatGPT, Anthropic’s Claude, and Google’s Gemini, each providing access to their respective language models through conversational interfaces. These systems have attracted hundreds of millions of users, demonstrating widespread demand for accessible AI assistance.

12.10.2 APIs and Application Integration

For developers and organizations seeking to build LLM capabilities into their own applications and workflows, Application Programming Interfaces (APIs) provide programmatic access to model capabilities.

APIs offer several advantages over web interfaces:

Customization options allow developers to control parameters like temperature, maximum output length, and response format. This enables tailoring model behavior to specific application requirements.

Integration flexibility makes it possible to incorporate language model capabilities into existing software, workflows, and systems. Rather than requiring users to switch to a dedicated interface, API integration brings AI capabilities to where users already work.

Automation supports programmatic generation of prompts and processing of responses, enabling scenarios like batch processing or event-triggered AI assistance without human intervention.
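
To make these controls concrete, the sketch below issues a single request through OpenAI's Python client, setting the temperature and output-length parameters mentioned above. The model identifier and prompt are placeholders, and other providers expose similar options under different names.

```python
# Minimal API request sketch (assumes the openai Python package and an OPENAI_API_KEY
# environment variable; the model name is a placeholder).
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize the benefits of API access in two sentences."}],
    temperature=0.2,      # lower values give more deterministic output
    max_tokens=120,       # cap the length of the response
)
print(response.choices[0].message.content)
```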

Leading API providers include:

OpenAI’s API offers access to the GPT family of models with options for different capability levels and price points. Their API supports both chat-oriented interfaces and more traditional completion endpoints.

Anthropic’s Claude API provides access to their assistant models with emphasis on helpfulness, harmlessness, and honesty. Their API design focuses on conversation-like interactions.

Google’s API suite includes PaLM, Gemini, and specialized models for different applications, with integration options across Google’s cloud infrastructure.

Open-source model APIs from providers like Hugging Face and Together.ai offer access to a wide range of open-weight models with flexible deployment options.

API-based integration has enabled the development of specialized applications including:

Coding assistants like Copilot, Cline, Cursor, and Aider integrate AI capabilities directly into development environments. These tools assist with code generation, explanation, debugging, and documentation, significantly enhancing developer productivity.

Document analysis systems extract insights, answer questions, and summarize content from business documents, research papers, or legal texts. These applications combine language models with domain-specific processing to deliver targeted information extraction.

Customer support augmentation tools help human agents respond to queries, suggest responses, or provide relevant information based on conversation context. These systems enhance human capabilities rather than replacing them.

Enterprise search and knowledge management solutions use LLMs to improve information discovery and synthesis across organizational knowledge bases.

12.10.3 Open-weight models

Open-weight models represent an alternative approach to proprietary API-based models, providing publicly available model weights that can be run on local hardware or private infrastructure. These models have gained significant traction in both research and commercial applications, offering different tradeoffs compared to closed API models.

Model weights are typically distributed through platforms like Hugging Face's Model Hub (huggingface.co), which hosts thousands of models with standardized interfaces. Other sources include direct downloads from model creators (like Meta's LLaMA website), GitHub repositories, or specialized model hubs like Together.ai. Using these weights typically involves the following steps (a minimal loading sketch follows the list):

  1. Downloading the model weights (often several gigabytes to hundreds of gigabytes)
  2. Loading them with frameworks like Hugging Face Transformers, PyTorch, or specialized inference engines
  3. Setting up appropriate hardware (typically GPUs) and optimization techniques
  4. Integrating the model into applications through Python libraries or serving APIs
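
The sketch below illustrates steps 2-4 of the list above, assuming the Hugging Face Transformers library (with the accelerate package for device placement), PyTorch, and at least one GPU; the model identifier is a placeholder for any open-weight checkpoint.

```python
# Minimal open-weight inference sketch (assumes transformers, torch, and accelerate
# are installed and a GPU is available; the model identifier is a placeholder).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder open-weight checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the weights fit on a single GPU
    device_map="auto",          # spread layers across available devices automatically
)

prompt = "Explain the difference between open-weight and fully open-source models."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```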

It’s important to note that “open-weights” doesn’t necessarily mean complete transparency. While the model parameters are available, many aspects often remain proprietary or undisclosed. Training datasets are frequently not fully documented or released, leaving researchers uncertain about exactly what data influenced the model’s behavior. Training methodologies, hyperparameters, and specific techniques may be only partially described in accompanying papers or documentation, making exact reproduction challenging. Evaluation procedures and filtering criteria often lack complete documentation, obscuring how performance claims were established. Additionally, fine-tuning datasets for instruction-following or alignment are rarely released in full, limiting understanding of how models were steered toward particular behaviors.

This partial transparency creates a situation where models can be used and studied but not necessarily fully replicated or understood. Nevertheless, open-weight models offer significant advantages over closed API-only models for researchers and developers. They enable direct inspection of model behavior and failure modes, allowing deeper analysis than black-box API access. They provide the ability to modify, fine-tune, or adapt models for specific use cases, creating specialized versions for particular domains or requirements. They offer freedom to deploy in air-gapped or high-security environments where external API calls would be prohibited. They create opportunities for academic study of model properties and capabilities without commercial restrictions. Finally, they enable community-driven improvements and adaptations that can advance the field through collaborative innovation.

Major open-weight models include:

  • LLaMA family (Meta): The LLaMA models (LLaMA, LLaMA 2, LLaMA 3) have become foundational in the open-weight ecosystem. LLaMA 2 ranges from 7B to 70B parameters, while LLaMA 3 extends to 8B and 70B versions with significantly improved capabilities.

  • Mistral AI models: Mistral 7B and its instruction-tuned variants demonstrated that smaller models with architectural improvements can match or exceed larger models’ performance. Their Mixtral models introduced mixture-of-experts architecture to the open-weight ecosystem.

  • BLOOM/BLOOMZ: A 176B parameter multilingual model supporting 46 languages and 13 programming languages, developed by BigScience as a collaborative research effort.

  • Gemma (Google): 2B and 7B parameter models released as lightweight, open-weight counterparts to the Gemini models, with an emphasis on responsible AI development.

The primary benefits of open-weight models include:

Privacy and data security: Organizations can run these models entirely within their security perimeter, ensuring sensitive data never leaves their infrastructure. This is crucial for healthcare, legal, financial, and government applications where data privacy regulations or confidentiality requirements prohibit sending data to external APIs.

Customization flexibility: Open-weight models can be fine-tuned, adapted, or modified for specific use cases, allowing organizations to create specialized versions optimized for their domains.

Cost predictability: While initial setup costs may be high, organizations can avoid per-query API charges, potentially reducing costs for high-volume applications.

Offline capability: Applications can function without internet connectivity, enabling deployment in environments with limited or restricted network access.

Research accessibility: Researchers can study, modify, and experiment with model architectures and behaviors, accelerating scientific progress.

However, these benefits come with significant resource requirements:

Hardware costs: Running full-size models requires substantial GPU resources. A 70B parameter model needs roughly 140GB of GPU memory for its weights alone at 16-bit precision, plus additional memory for activations and the key-value cache, necessitating multiple high-end GPUs (e.g., 4-8 NVIDIA A100s at $10,000-15,000 each) or specialized infrastructure.

Operational complexity: Maintaining infrastructure, keeping up with model advancements, and ensuring system reliability adds operational overhead compared to API solutions.

Technical expertise: Deploying and optimizing these models requires specialized knowledge in deep learning, distributed systems, and hardware acceleration.

To address these resource challenges, several optimization techniques have emerged:

Quantization reduces the numerical precision of model weights and activations. Converting from 32-bit floating point (FP32) to 8-bit integers (INT8) can reduce memory requirements by 75% with minimal performance impact. More aggressive 4-bit quantization can achieve 87.5% memory reduction with moderate performance tradeoffs. For example, a 70B model that would require 140GB in FP16 needs only about 70GB with INT8 quantization and roughly 35GB at 4-bit, small enough to fit on a single data-center GPU or a pair of high-end consumer GPUs.
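
As an illustration, the sketch below loads a large model in 4-bit precision using the bitsandbytes integration in Hugging Face Transformers; the model identifier is a placeholder, and the exact memory savings depend on the model and quantization settings.

```python
# 4-bit loading sketch (assumes transformers with the bitsandbytes integration installed;
# the model identifier is a placeholder and may require access approval).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-hf"  # placeholder

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit form
    bnb_4bit_quant_type="nf4",              # NF4 quantization format
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16 for numerical stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```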

Knowledge distillation creates smaller student models that learn from larger teacher models. These compact models can achieve 90-95% of the larger model’s performance with significantly reduced size. For instance, distilled 7B models can sometimes match the performance of 13B models on specific tasks.
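
The loss commonly used for this kind of distillation combines a soft target (matching the teacher's temperature-softened output distribution) with the usual hard-label cross-entropy. The sketch below shows one standard formulation; the temperature T and mixing weight alpha are illustrative hyperparameters.

```python
# Standard knowledge-distillation loss: KL divergence against temperature-softened
# teacher probabilities plus ordinary cross-entropy on the true labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # scale by T^2 to keep gradient magnitudes comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```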

Optimized inference techniques like FlashAttention (an IO-aware exact attention implementation) and continuous batching improve memory usage and throughput, often yielding 2-3x efficiency gains without changing the model itself.

Mixture-of-Experts (MoE) architectures activate only a subset of parameters for each input, effectively allowing larger total parameter counts with similar computational costs. In an MoE architecture, the model contains multiple “expert” neural networks (typically feed-forward layers) and a router network that decides which experts to activate for each token. For example, Mixtral 8x7B contains eight expert networks at each layer, but only activates the top two for any given token. This selective activation creates a model with 47B total parameters that only uses about 12B for any given input, achieving performance comparable to dense models twice their active size. The key advantage is computational efficiency—MoE models can achieve the performance benefits of much larger models while keeping inference time and memory requirements closer to those of smaller models. This approach represents one of the most promising directions for scaling model capabilities without proportionally increasing computational requirements.
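
The sketch below shows a toy top-2 MoE layer in PyTorch to make the routing idea concrete; it is a simplified illustration with arbitrary dimensions, not Mixtral's actual implementation.

```python
# Toy top-2 mixture-of-experts layer (simplified illustration, arbitrary dimensions).
import torch
import torch.nn as nn

class TopTwoMoE(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048, num_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        scores = self.router(x)                 # [tokens, num_experts]
        weights, idx = scores.topk(2, dim=-1)   # keep the two highest-scoring experts per token
        weights = weights.softmax(dim=-1)       # renormalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(2):                   # each token visits only its two chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Real implementations add load-balancing losses and batched expert dispatch, but the routing logic above captures the core idea of sparse activation.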

These optimization techniques have democratized access to powerful language models, enabling deployment on consumer hardware. A modern gaming PC with a single NVIDIA RTX 4090 (approximately $1,600) can now run quantized models in the 7-30B parameter range entirely in GPU memory, and even larger models with aggressive quantization and partial CPU offloading, making powerful AI accessible to small businesses, researchers, and even individual developers.

12.11 Conclusion

Large language and multimodal models represent one of the most significant technological developments of the past decade. Their rapid evolution has transformed our understanding of machine intelligence and opened new possibilities for human-AI collaboration across virtually every domain of human endeavor.

This chapter has explored the foundational aspects of these models, from the tokenization and pretraining processes that establish their basic capabilities to the alignment techniques that shape their behavior toward human preferences. We’ve examined how next-token prediction scales to sophisticated text generation, how multimodal capabilities expand their perceptual horizon, and how emerging agentic behaviors and reasoning techniques point toward future developments.

Throughout this exploration, we’ve encountered both the remarkable capabilities of these systems and their persistent limitations. Modern LLMs demonstrate levels of language understanding, knowledge application, and generative capability that would have seemed impossible just a few years ago. Yet they still struggle with consistent reasoning, factual reliability, and alignment with human values in complex situations.

The philosophical questions raised by these models—about the nature of intelligence, understanding, and the relationship between form and function in cognitive systems—will likely engage researchers and philosophers for years to come. Beyond these theoretical considerations, the practical implications of increasingly capable AI systems demand careful consideration of safety, governance, and ethical deployment.

As we look to the future, several trends appear likely to shape continued development:

Increasing scale in model size, training data, and compute will likely continue to drive capability improvements, though perhaps with diminishing returns requiring novel architectural innovations to sustain rapid progress.

Multimodal integration will expand, with models developing increasingly sophisticated understanding across text, images, audio, video, and potentially other modalities like tactile sensing or structured data.

Reasoning capabilities will improve through specialized training techniques, architectural innovations, and hybrid approaches that combine neural and symbolic methods.

Agentic behaviors will become more sophisticated as models gain improved planning capabilities, tool use, memory management, and environmental interaction.

Alignment techniques will advance to address the growing challenges of ensuring that increasingly capable systems remain helpful, harmless, and honest across diverse contexts and tasks.

Deployment technologies will evolve to make these capabilities more accessible, with improvements in efficiency, customization, and integration options.

The development of large language and multimodal models represents not an endpoint but a beginning—the opening of a new frontier in artificial intelligence with profound implications for technology, society, and our understanding of intelligence itself. As these systems continue to evolve, they will likely transform how we interact with information, create content, solve problems, and ultimately how we understand the relationship between human and machine intelligence.

12.12 References

Anil, R., et al. (2023). PaLM 2 technical report. arXiv:2305.10403. https://arxiv.org/abs/2305.10403

Anthropic. (2023). Claude 2 model card. https://www-cdn.anthropic.com/bd2a28d2535bfb0494cc8e2a3bf135d2e7523226/Model-Card-Claude-2.pdf

Bai, Y., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862

Brown, T. B., et al. (2020). Language models are few-shot learners. arXiv:2005.14165. https://arxiv.org/abs/2005.14165

Chiang, W.-L., et al. (2023). Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/

Chowdhery, A., et al. (2022). PaLM: Scaling language modeling with pathways. arXiv:2204.02311. https://arxiv.org/abs/2204.02311

Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. arXiv:1706.03741. https://arxiv.org/abs/1706.03741

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. https://arxiv.org/abs/1810.04805

Ding, N., et al. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv:2305.14233. https://arxiv.org/abs/2305.14233

Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The curious case of neural text degeneration. arXiv:1904.09751. https://arxiv.org/abs/1904.09751

Kaplan, J., et al. (2020). Scaling laws for neural language models. arXiv:2001.08361. https://arxiv.org/abs/2001.08361

Karpathy, A. (2022). State of GPT. https://www.youtube.com/watch?v=bZQun8Y4L2A

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781. https://arxiv.org/abs/1301.3781

Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://www.nature.com/articles/nature14236

OpenAI. (2023). GPT-4 technical report. arXiv:2303.08774. https://arxiv.org/abs/2303.08774

Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155. https://arxiv.org/abs/2203.02155

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. https://aclanthology.org/D14-1162/

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. arXiv:2103.00020. https://arxiv.org/abs/2103.00020

Ramesh, A., et al. (2021). Zero-shot text-to-image generation. arXiv:2102.12092. https://arxiv.org/abs/2102.12092

Touvron, H., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv:2302.13971. https://arxiv.org/abs/2302.13971

Touvron, H., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. https://arxiv.org/abs/2307.09288

Vaswani, A., et al. (2017). Attention is all you need. arXiv:1706.03762. https://arxiv.org/abs/1706.03762

Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv:2201.11903. https://arxiv.org/abs/2201.11903

Wolf, T., et al. (2020). Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. https://aclanthology.org/2020.emnlp-demos.6/

Zhao, W. X., et al. (2023). A survey of large language models. arXiv:2303.18223. https://arxiv.org/abs/2303.18223