12  Large Language Model Foundations

12.1 Introduction

Large Language Models (LLMs) and multimodal AI systems have become foundational technologies, transforming how we interact with information, create content, and solve complex problems. Their roots trace back to early neural networks, but the field accelerated with the introduction of the Transformer architecture (Vaswani et al., 2017). This innovation enabled models to scale to unprecedented sizes, unlocking emergent capabilities in language understanding and generation.

The last few years have seen a wave of progress. Models such as GPT-3, PaLM, Llama, and Claude demonstrated remarkable in-context learning, while techniques like Reinforcement Learning from Human Feedback (RLHF) have made their outputs more helpful and aligned with human values. The rise of open-weight models has made advanced AI more accessible than ever. Simultaneously, multimodal models like GPT-4V and Gemini began processing not just text, but also images, audio, and video, enabling applications from image captioning to creative art generation. With context windows expanding to millions of tokens, these models can now process entire books or complex documents in a single pass.

The impact of these models extends far beyond research labs. LLMs and multimodal systems are now integrated into search engines, productivity tools, and scientific discovery. Their capabilities are reshaping industries and raising new questions about trust, safety, and the future of work.

This chapter provides an overview of the foundations of large language and multimodal models. We explore their core architectures, pretraining and alignment techniques, and the frontiers of reasoning and agentic behavior. We also examine their limitations—such as hallucinations and bias—to equip readers with the technical understanding needed to engage with this rapidly changing field.

12.2 Pretraining

Pretraining is the computationally intensive first step in creating an LLM. In this stage, the model learns general patterns of language and knowledge from a vast corpus of text. This process establishes the foundational capabilities upon which all subsequent behaviors are built.

12.2.1 Corpus Construction

An LLM’s capabilities are fundamentally determined by the quality and diversity of its training data. Pretraining corpora consist of trillions of tokens (words or sub-words) from diverse sources like web pages, books, scientific papers, and code. Constructing this dataset is a major engineering challenge involving sophisticated filtering to remove low-quality content, deduplication to prevent memorization, and careful balancing of sources to ensure broad knowledge and prevent bias. The composition of this data directly shapes the model’s eventual behavior and expertise.
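
To make one of these steps concrete, the short sketch below removes exact duplicates from a toy corpus by hashing normalized documents. Real pipelines rely on fuzzier techniques such as MinHash-based near-duplicate detection at far larger scale, so the normalization and hashing choices here are purely illustrative, not a description of any particular production pipeline.

```python
import hashlib

def normalize(doc: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash identically.
    return " ".join(doc.lower().split())

def deduplicate(corpus: list[str]) -> list[str]:
    """Keep only the first occurrence of each exactly-duplicated document."""
    seen, kept = set(), []
    for doc in corpus:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the  cat   sat.", "A different document."]
print(deduplicate(docs))  # ['The cat sat.', 'A different document.']
```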

12.2.2 Tokenization

Tokenization is the process of splitting raw text into the smaller units, or tokens, that a model processes. The choice of tokenization strategy impacts what the model can learn and its efficiency.

  • Character-level tokenization splits text into individual characters. This is simple but creates very long sequences, making it difficult for the model to learn the meaning of words.
  • Word-level tokenization uses whole words as tokens. This is more meaningful but struggles with rare words, typos, or new terms, which are mapped to an “unknown” token, resulting in information loss.
  • Subword tokenization, typically using an algorithm like Byte Pair Encoding (BPE), is the standard for modern LLMs. It breaks text into units that are often smaller than words but larger than characters. Common words become a single token, while rare words are broken into smaller, recognizable pieces (e.g., “unhappiness” -> [“un”, “happiness”]). This provides a balance, allowing for a fixed, manageable vocabulary while still being able to represent any possible text.
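
As a rough illustration of how such merges work at encoding time, the sketch below applies a small, hand-written merge table to a single word. Real tokenizers learn tens of thousands of merges from corpus statistics and apply them by priority across whole documents, so both the merge table and the resulting segmentation here are purely illustrative.

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Greedily apply a list of merges, in priority order, to a single word."""
    tokens = list(word)  # start from individual characters
    for a, b in merges:  # merges are ordered by the priority learned during training
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]  # merge the adjacent pair into one token
            else:
                i += 1
    return tokens

# A toy merge table (illustrative only): frequent pairs get merged first.
merges = [("u", "n"), ("h", "a"), ("ha", "p"), ("hap", "p"), ("happ", "i"),
          ("happi", "n"), ("happin", "e"), ("happine", "s"), ("happines", "s")]
print(bpe_encode("unhappiness", merges))  # ['un', 'happiness']
```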

12.2.3 Self-Supervised Learning

The breakthrough that enabled modern LLMs was self-supervised learning. This approach allows the model to learn from unlabeled data, using the data itself to create training signals. For generative language models, the primary method is next-token prediction.

The model is given a sequence of text and tasked with predicting the next token. This simple objective, when applied at a massive scale, is remarkably powerful. To accurately predict the next token, the model must implicitly learn grammar, semantics, factual knowledge, and even reasoning patterns embedded in the text. Every piece of text becomes a training example, with each token serving as a target for prediction given the preceding context.

At its core, an LLM’s output layer functions as a sophisticated multinomial logistic regression model. Given a context of previous tokens, it assigns a probability to every possible next token in its vocabulary. The training objective is to minimize the negative log-likelihood of the training data, making the model “autoregressive”—it predicts the next step based on its own previous outputs, much like an autoregressive model in time-series analysis. While the concept is simple, this process is what allows the model to absorb the vast knowledge contained in its training data.
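
In symbols, the model factorizes the probability of a token sequence autoregressively and is trained to minimize the negative log-likelihood of the corpus:

\[p_\theta(x_1, \ldots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}), \qquad \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\]

where \(\theta\) denotes the model parameters and each conditional distribution is the softmax output over the vocabulary described above.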

12.3 Next Token Prediction Training and Architecture

The remarkable capabilities of modern LLMs stem from their underlying architecture, the transformer, and from the way they are trained for next-token prediction. Understanding these foundational elements provides crucial insight into how these models work and what their fundamental capabilities are.

12.3.1 The Transformer Architecture: A Conceptual Overview

Prior to transformers, recurrent neural networks (RNNs) like LSTMs and GRUs dominated sequence modeling. These architectures processed text sequentially, maintaining a hidden state that was updated with each new token. While effective at capturing local dependencies, RNNs struggled with long-range connections and were inherently sequential, making parallel processing impossible. This limited both their ability to model complex language patterns and their training efficiency on large datasets.

The transformer architecture, introduced by Vaswani et al. in 2017, catalyzed a revolution in natural language processing. Its key innovation was replacing recurrent connections with a mechanism called “attention” that allows the model to directly and simultaneously relate any position in a sequence with any other position. This enabled more efficient training on longer sequences and facilitated the learning of complex dependencies across distant parts of text.

The core component of the transformer is the self-attention mechanism, which allows each token to gather information from all other tokens in the sequence. Instead of processing tokens sequentially as in earlier recurrent neural networks, transformers process entire sequences in parallel, with attention determining the relative importance of each token for predicting each output position. This parallelization dramatically accelerated training and enabled scaling to previously unimaginable model sizes.

For generative LLMs, a critical adaptation is the use of causal (or masked) self-attention, which ensures that predictions for each position can only depend on previous positions in the sequence. This maintains the autoregressive property necessary for text generation. Each token can attend to all preceding tokens but none of the following ones, preserving the directional nature of language modeling.
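
A minimal, single-head sketch of causal self-attention is shown below (in NumPy, with illustrative shapes). Production implementations add learned query/key/value projections, multiple heads, batching, and fused kernels; the point here is only to make the masked, softmax-weighted averaging concrete.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V: arrays of shape (seq_len, d); returns an array of shape (seq_len, d)."""
    seq_len, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # pairwise similarity between positions
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal mark future positions
    scores = np.where(mask == 1, -np.inf, scores)      # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the allowed positions
    return weights @ V                                 # each position mixes information from its past

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 tokens, 8-dimensional embeddings (illustrative)
out = causal_self_attention(x, x, x)
print(out.shape)                     # (5, 8)
```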

A major limitation of transformer models has been their context window—the maximum sequence length they can process. The computational complexity of self-attention scales quadratically with sequence length, making it prohibitively expensive to process very long contexts. Early models were limited to just a few hundred or thousand tokens, severely constraining their ability to handle long documents, conversations, or complex reasoning tasks.

Recent advances have dramatically expanded context windows through architectural and algorithmic innovations. Sparse attention mechanisms selectively focus on the most relevant tokens rather than attending to the entire sequence. Recurrence and state-space models incorporate efficient sequence modeling approaches with better scaling properties. Memory-augmented approaches introduce explicit mechanisms to store and retrieve information from long contexts. These advances have extended context windows from a few thousand tokens to hundreds of thousands or even millions of tokens in the most advanced models.

Extended context windows enable transformative capabilities: models can now process entire books, engage in long conversations with preserved context, analyze lengthy documents, and perform complex multi-step reasoning across large amounts of information. This expansion has been crucial for applications requiring deep contextual understanding or extensive memory of prior interactions.

12.4 From Next-Token Prediction to Text Generation

While pretraining teaches a model to predict the next token, generating coherent, long-form text requires additional strategies. The process is autoregressive: the model generates one token at a time, appends it to the input, and uses this new sequence to generate the next token.

The simplest method is greedy decoding, which always selects the single most probable token at each step. While computationally efficient, this approach often produces repetitive and unoriginal text. For creative tasks, it might fall into predictable patterns, feeling mechanical rather than inventive.
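
The autoregressive loop itself is simple to sketch. In the snippet below, `model_logits` is a hypothetical placeholder for a trained model's forward pass, not a real API; greedy decoding is just an argmax inside that loop.

```python
import numpy as np

def greedy_generate(model_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Greedy decoding: repeatedly append the single most probable next token.

    `model_logits(ids)` is assumed to return a vocabulary-sized array of logits
    for the next token given the token ids so far (a hypothetical interface)."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model_logits(ids)        # scores over the entire vocabulary
        next_id = int(np.argmax(logits))  # greedy choice; sampling methods replace this line
        ids.append(next_id)               # the model conditions on its own previous outputs
        if eos_id is not None and next_id == eos_id:
            break                         # stop early at an end-of-sequence token
    return ids
```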

To introduce creativity and diversity, sampling-based approaches are used. Instead of always picking the most likely token, these methods introduce controlled randomness. The most common technique is temperature sampling, which adjusts the probability distribution over possible next tokens using a temperature parameter \(T\), according to the formula:

\[p(x_i) = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}\]

where \(z_i\) is the model’s output score (logit) for token \(i\).

  • A low temperature (e.g., 0.2) makes the distribution sharper, concentrating probability on the highest-scoring tokens. This leads to more focused, deterministic output, suitable for factual answers.
  • A high temperature (e.g., 1.0 or higher) flattens the distribution, increasing the chance of selecting less likely tokens. This encourages diversity and creativity but risks generating less coherent text.
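
The formula translates directly into a few lines of code. In the sketch below, the logits are made-up values for a three-token vocabulary.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from softmax(logits / T)."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                       # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()   # the softmax formula from the text
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]                  # illustrative scores for a 3-token vocabulary
print(sample_with_temperature(logits, temperature=0.2))  # almost always token 0
print(sample_with_temperature(logits, temperature=2.0))  # much more varied
```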

The term “temperature” is an analogy to statistical thermodynamics, where higher temperatures correspond to more chaotic, higher-entropy systems.

Other methods like top-k and nucleus (top-p) sampling offer further control by restricting the sampling to a smaller, high-probability set of candidate tokens. Top-k considers only the k most likely tokens, while nucleus sampling considers the smallest set of tokens whose cumulative probability exceeds a threshold p. These techniques prevent the model from selecting highly improbable, “long-tail” tokens, striking a balance between creativity and coherence.
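
A sketch of both truncation strategies, applied to an illustrative five-token probability vector, is shown below; in practice these filters are applied to the (temperature-scaled) distribution before sampling.

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything but the k most probable tokens, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[-k:]                 # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]               # most probable tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # number of tokens needed to reach p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(top_k_filter(probs, k=2))    # mass concentrated on the two most likely tokens
print(top_p_filter(probs, p=0.9))  # keeps tokens until cumulative probability reaches 0.9
```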

The choice of generation strategy has a dramatic impact on the output. Factual question-answering may benefit from low temperatures, while creative writing often requires higher temperatures and nucleus sampling to produce novel and interesting text.

12.5 Alignment: Steering Model Behavior

While pretraining endows models with broad knowledge, it does not guarantee that their behavior will align with human preferences. Alignment is the process of fine-tuning a model to be more helpful, harmless, and honest.

The foundational step is Supervised Fine-Tuning (SFT). After pretraining, the model is trained further on a curated dataset of high-quality, human-generated examples of desired behavior (e.g., polite and accurate answers to questions). This directly teaches the model to follow instructions and adopt a helpful persona, but it is limited by the cost and scalability of creating demonstration data.

Reinforcement Learning from Human Feedback (RLHF) was the next major advance. This technique uses human preferences to train a reward model that acts as a proxy for human judgment:

  1. Humans are shown pairs of model-generated responses and asked to choose which one they prefer.
  2. This preference data is used to train a reward model to predict which outputs humans will find “good.”
  3. The language model is then fine-tuned using reinforcement learning to generate responses that maximize the score from the reward model.
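
The reward model in step 2 is typically trained with a pairwise, Bradley-Terry style objective: given a prompt \(x\) with a preferred response \(y_w\) and a rejected response \(y_l\), it learns to score the preferred one higher:

\[\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]\]

where \(r_\phi\) is the reward model and \(\sigma\) is the logistic function.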

RLHF proved highly effective but is complex and can be unstable. More recently, Direct Preference Optimization (DPO) has emerged as a more direct and stable alternative. DPO uses the same preference data (chosen vs. rejected responses) but skips the separate reward model. Instead, it directly fine-tunes the language model, simultaneously increasing the probability of generating the “chosen” responses and decreasing the probability of the “rejected” ones. This achieves the same goal as RLHF with a simpler, more robust, and computationally cheaper process.
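
Concretely, DPO optimizes a single classification-style loss over the same (chosen, rejected) pairs, using a frozen reference policy \(\pi_{\text{ref}}\) (typically the SFT model) and a hyperparameter \(\beta\) that controls how far the fine-tuned policy may drift from it:

\[\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]\]

where \(y_w\) and \(y_l\) are the chosen and rejected responses to prompt \(x\), and \(\sigma\) is again the logistic function.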

12.6 Unifying Intelligence: The Rise of Multimodal AI

While text was the original domain of LLMs, the frontier of AI has shifted to models that can understand, reason about, and generate content across multiple modalities. These models process not just text and images, but also video, audio, and other data streams, enabling a more holistic understanding of the world.

Modern systems are natively multimodal, designed from the ground up to treat all data types as a universal language. This is achieved by “tokenizing” different modalities into a common format that a single transformer architecture can process. Images are broken into “patches,” audio into snippets of sound, and text into subwords. By pre-training on vast, interleaved datasets of text, images, and video, these models learn a shared representational space where concepts are untethered from a single modality. A “cat” is not just the word ‘c-a-t’ but also its visual appearance and the sound of its meow.
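
As one concrete example of “tokenizing” a non-text modality, the sketch below cuts an image into fixed-size patches and flattens each patch into a vector, roughly the way vision transformers prepare image inputs before a learned projection maps them into the same embedding space as text tokens. The shapes and patch size here are illustrative assumptions.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches.
    Returns an array of shape (num_patches, patch_size * patch_size * C)."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "dimensions must divide evenly"
    patches = (image
               .reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
               .transpose(0, 2, 1, 3, 4)              # group the patch grid together
               .reshape(-1, patch_size * patch_size * C))
    return patches

image = np.zeros((224, 224, 3))   # a placeholder RGB image
tokens = patchify(image)          # each row now plays the role of one "image token"
print(tokens.shape)               # (196, 768): a 14 x 14 grid of 16 x 16 x 3 patches
```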

This unified approach unlocks capabilities far beyond single-modality models. Vision-language models can now analyze and edit video content from text instructions. Audio models can perform real-time, expressive translation that captures the user’s tone and emotion. Video generation models can create high-fidelity, coherent scenes from text prompts. This paradigm is even extending to robotics, where models are trained on sensor data to connect linguistic commands with the physical actions needed to execute them.

A key insight from this work is that multimodality provides a richer signal for learning about the world. By observing how objects behave in videos and how they are described in text, models implicitly build “world models”—internal representations of the rules and causal relationships that govern reality. This has proven to be a powerful path toward more robust and common-sense reasoning.

12.7 Are LLMs Intelligent?

As language models demonstrate increasingly sophisticated capabilities, a fundamental question arises: are they intelligent? A persistent misconception holds that LLMs are merely “stochastic parrots,” stitching together phrases from their training data without genuine understanding. Several lines of evidence challenge this view.

First, LLMs consistently generalize to novel combinations of concepts, generating coherent responses to prompts that almost certainly never appeared in their training data (e.g., “Explain quantum mechanics in the style of a pirate”).

Second, the training process itself forces generalization through an information bottleneck. A large model might be trained on trillions of tokens of text yet have orders of magnitude fewer parameters. This makes wholesale memorization of the corpus impossible: the model simply lacks the capacity to store the raw text it has seen. To minimize its prediction error, the model is instead forced to learn compressed, generalizable representations of the data. This is analogous to how humans form abstract concepts rather than memorizing every single experience.
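
Some back-of-the-envelope numbers (hypothetical, not describing any particular model) show the size of this gap:

```python
# Illustrative, made-up figures only; no specific model or dataset is described here.
params = 70e9                  # a 70-billion-parameter model (hypothetical)
bytes_per_param = 2            # 16-bit weights
model_bytes = params * bytes_per_param

tokens = 10e12                 # a 10-trillion-token training corpus (hypothetical)
bytes_per_token = 4            # rough average size of a token's text in bytes
corpus_bytes = tokens * bytes_per_token

print(f"model weights ~ {model_bytes / 1e12:.1f} TB")                       # ~0.1 TB
print(f"raw corpus    ~ {corpus_bytes / 1e12:.1f} TB")                      # ~40 TB
print(f"ratio         ~ {corpus_bytes / model_bytes:.0f}x more text than weights")
```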

Third, LLMs exhibit in-context learning—the ability to adapt to new tasks from just a few examples provided in the prompt, without any change to the model’s weights. This suggests the model has learned abstract task representations, a form of meta-learning.

The debate over LLM intelligence often conflates observable behavior with subjective experience. A more productive approach is to focus on specific, measurable capabilities. Models demonstrate predictive understanding, adaptability, and abstraction at levels that often match or exceed human performance in specific domains. Rather than debating whether this constitutes “true intelligence,” a more nuanced view recognizes the remarkable capabilities these systems demonstrate, while acknowledging the ways their intelligence differs from our own.

12.8 LLM Limitations and the Frontier of Reasoning

Despite their impressive capabilities, LLMs have significant limitations. Understanding these is crucial for both responsible deployment and for appreciating the frontiers of AI research.

12.8.1 Hallucinations and Bias

Perhaps the most discussed limitation is hallucination: the tendency to generate plausible-sounding but factually incorrect information. This occurs because the model’s objective is to generate statistically likely text, not to state verified truths. It may confidently invent facts, citations, or details to complete a response.

LLMs also inherit and can amplify biases present in their vast training data. This can manifest as stereotypes, skewed perspectives, or unfair representations. Adversarial data poisoning, where malicious actors intentionally introduce misleading content into training data sources, represents an emerging threat.

12.8.2 The Evolution of Reasoning

While early LLMs struggled with complex logic, their reasoning abilities have improved dramatically. This evolution began with Chain-of-Thought (CoT) prompting, a user-discovered technique where instructing a model to “think step by step” elicits explicit reasoning pathways and improves accuracy on mathematical and logical problems.
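
A minimal illustration of the idea, with a made-up arithmetic word problem: the only change is an added instruction that elicits intermediate steps before the final answer.

```python
question = ("A warehouse ships 120 boxes on Monday and 45 more boxes on Tuesday "
            "than on Monday. How many boxes are shipped in total?")

# Plain prompt: asks only for the final answer.
plain_prompt = f"{question}\nAnswer:"

# Chain-of-thought prompt: asks the model to lay out its reasoning first.
cot_prompt = f"{question}\nLet's think step by step."
```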

This insight—that making reasoning explicit improves performance—has been integrated directly into model development. The most advanced models are now trained extensively on datasets rich with step-by-step reasoning. The frontier of AI reasoning has moved even further, to techniques like:

  • Process-Supervised Reasoning (PSR): Instead of rewarding models only for a correct final answer, PSR provides feedback on each step of the reasoning process. This helps correct flawed logic mid-stream and teaches more robust problem-solving strategies.

  • Deliberative Reasoning: Inspired by human critical thinking, some models employ a multi-stage process. An initial “intuitive” answer is generated quickly, after which the model performs a slower, more deliberate analysis to critique and refine its initial thoughts, mimicking a “System 1 / System 2” cognitive architecture.

  • Hybrid Neuro-Symbolic Architectures: The most advanced research systems are beginning to integrate neural models with classical symbolic reasoners. In this paradigm, the LLM handles natural language understanding and heuristic planning, while offloading precise logical and mathematical calculations to a symbolic engine, leveraging the strengths of both worlds.

While logical inconsistencies still occur, these advanced techniques have transformed LLM reasoning, enabling them to tackle problems that require deep, multi-step, and verifiable thought.

12.8.3 Interpretability

A profound limitation of current LLMs is their opaque nature. It is not yet possible to fully explain how they arrive at a particular output. This “black box” problem poses challenges for trust and safety, especially in high-stakes domains like medicine or finance. Research into mechanistic interpretability, which aims to reverse-engineer the internal computations of models, is an active and critical field, but comprehensive understanding remains elusive.

12.9 New Developments: Agentic AI

The evolution from passive text generators to active, goal-oriented systems marks a significant shift in AI. Agentic behavior—the ability of a model to autonomously plan, act, and use tools to achieve goals—is now a central focus of AI development.

An agentic LLM can decompose a high-level objective (e.g., “research the latest trends in renewable energy and create a presentation”) into a sequence of actions. It then executes this plan by invoking tools, which are APIs that connect it to the outside world (e.g., a web search, a code interpreter, or a database query). The agent can then synthesize the results, adapt its plan, and even critique its own work to refine the final output.
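
A highly simplified sketch of such an agent loop is shown below. The tool names, the `llm_propose_action` function, and the stopping convention are hypothetical placeholders standing in for a real model call and real APIs.

```python
def run_agent(goal, tools, llm_propose_action, max_steps=10):
    """Minimal plan-act-observe loop.

    `tools` maps tool names to callables; `llm_propose_action(goal, history)` is a
    hypothetical model call that returns either ("finish", answer) or (tool_name, argument)."""
    history = []
    for _ in range(max_steps):
        action, payload = llm_propose_action(goal, history)  # the model decides the next step
        if action == "finish":
            return payload                                   # the agent's final answer
        observation = tools[action](payload)                 # invoke the chosen tool
        history.append((action, payload, observation))       # feed results back for replanning
    return None  # give up if no answer is reached within the step budget

# Illustrative tool registry: in practice these would call real search or code-execution APIs.
tools = {
    "web_search": lambda query: f"(stub) search results for: {query}",
    "calculator": lambda expr: eval(expr, {"__builtins__": {}}),  # toy arithmetic evaluator
}
```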

This paradigm is unlocking a new generation of applications. GitHub Copilot, for example, has evolved from a code-completion tool into a programming partner that can take a feature request, write the necessary code across multiple files, and attempt to debug it. In business operations, agents can manage complex workflows like processing invoices or handling customer support queries.

As these agentic systems proliferate, the need for standardized communication protocols has become critical. The Model Context Protocol (MCP) is an emerging standard that gives agents a uniform way to discover and invoke external tools and data sources, and related efforts aim to standardize how agents exchange information and delegate tasks to one another. Together, such standards point toward a dynamic, collaborative ecosystem in which an agent can discover and “hire” other specialized agents on the fly. Building robust and trustworthy agent ecosystems remains a critical area of research.

12.10 Conclusion

Large language and multimodal models represent a pivotal development in technology, shifting from task-specific tools to general-purpose reasoning engines. This chapter has traced their foundations, from the core concepts of pretraining and the transformer architecture to the alignment techniques that shape their behavior. We have seen how the simple act of next-token prediction, when scaled, gives rise to complex capabilities, and how the frontier is pushing toward multimodal understanding and autonomous, agentic systems.

For the quantitatively-minded professional, the key takeaways are:

  • These models are not merely retrieving information; they are learning compressed, generalizable representations of the world through a process constrained by an information bottleneck.
  • Their capabilities are a product of the data they are trained on, the architecture they use, and the objectives they are optimized for. Understanding these elements is key to appreciating both their strengths and weaknesses.
  • Limitations like hallucination, bias, and logical inconsistency are not just flaws to be patched but are deeply connected to the models’ current design. The evolution of reasoning techniques shows a clear path toward more reliable and verifiable outputs.

The development of LLMs is not an endpoint but the beginning of a new frontier in AI. As these systems continue to evolve, they will transform how we interact with information, solve problems, and ultimately understand the relationship between human and machine intelligence.

12.11 References

Anil, R., et al. (2023). PaLM 2 technical report. arXiv:2305.10403. https://arxiv.org/abs/2305.10403

Anthropic. (2023). Claude 2 model card. https://www-cdn.anthropic.com/bd2a28d2535bfb0494cc8e2a3bf135d2e7523226/Model-Card-Claude-2.pdf

Bai, Y., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862. https://arxiv.org/abs/2204.05862

Brown, T. B., et al. (2020). Language models are few-shot learners. arXiv:2005.14165. https://arxiv.org/abs/2005.14165

Chiang, W.-L., et al. (2023). Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/

Chowdhery, A., et al. (2022). PaLM: Scaling language modeling with pathways. arXiv:2204.02311. https://arxiv.org/abs/2204.02311

Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. arXiv:1706.03741. https://arxiv.org/abs/1706.03741

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. https://arxiv.org/abs/1810.04805

Ding, N., et al. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv:2305.14233. https://arxiv.org/abs/2305.14233

Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The curious case of neural text degeneration. arXiv:1904.09751. https://arxiv.org/abs/1904.09751

Kaplan, J., et al. (2020). Scaling laws for neural language models. arXiv:2001.08361. https://arxiv.org/abs/2001.08361

Karpathy, A. (2022). State of GPT. https://www.youtube.com/watch?v=bZQun8Y4L2A

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781. https://arxiv.org/abs/1301.3781

Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://www.nature.com/articles/nature14236

OpenAI. (2023). GPT-4 technical report. arXiv:2303.08774. https://arxiv.org/abs/2303.08774

Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155. https://arxiv.org/abs/2203.02155

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. https://aclanthology.org/D14-1162/

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. arXiv:2103.00020. https://arxiv.org/abs/2103.00020

Ramesh, A., et al. (2021). Zero-shot text-to-image generation. arXiv:2102.12092. https://arxiv.org/abs/2102.12092

Touvron, H., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv:2302.13971. https://arxiv.org/abs/2302.13971

Touvron, H., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. https://arxiv.org/abs/2307.09288

Vaswani, A., et al. (2017). Attention is all you need. arXiv:1706.03762. https://arxiv.org/abs/1706.03762

Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. arXiv:2201.11903. https://arxiv.org/abs/2201.11903

Wolf, T., et al. (2020). Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. https://aclanthology.org/2020.emnlp-demos.6/

Zhao, W. X., et al. (2023). A survey of large language models. arXiv:2303.18223. https://arxiv.org/abs/2303.18223