13  Large Language Models: Text Generation

Note

This is an EARLY DRAFT.

Large Language Models (LLMs) have revolutionized how we interact with and process text data. There are multiple ways in which we can interact with LLMs, which we will look at in the next few chapters. Here we start with the simplest of them: “text generation”. We provide a textual prompt to the LLM and it replies to us with text of its own. While the term “text generation” might suggest simply creating new content, depending on how they are prompted, LLMs can perform a wide variety of tasks through the generation paradigm, including classification, information extraction, and structured data analysis.

You very likely have interacted with LLMs through their web interfaces. In this chapter we will instead interact with them using Python programs. Programmatically interacting with LLMs offers significant advantages over web interfaces. By integrating LLMs into code, we can:

  1. Automate workflows: Process large volumes of data without manual intervention
  2. Create data pipelines: Seamlessly incorporate LLM capabilities into existing data processing systems
  3. Customize behavior: Fine-tune prompts and parameters for specific use cases
  4. Ensure reproducibility: Generate consistent results through standardized prompts and configurations
  5. Scale efficiently: Handle batch processing and parallel requests

The LLMs we interact with in this chapter will be Google’s Gemini series of models. Our primary reason for choosing Gemini from among the advanced LLMs is that at the moment it offers a free tier. However, the methods we learn are common to all LLMs and can be used with minor changes with other providers such as OpenAI or Anthropic.

13.1 APIs

An Application Programming Interface (API) is a set of rules and protocols that allows different software applications to communicate with each other. In the context of LLMs, APIs provide a standardized way for our code to interact with models hosted on remote servers.

APIs work through a request-response mechanism:

  1. Your application sends a request to the API endpoint (a server) with specific parameters (like the prompt text and configuration settings)
  2. The server processes this request (in our case, running the text through the LLM)
  3. The server returns a response containing the generated output

APIs offer several advantages for working with LLMs:

  • Abstraction: You don’t need to understand the complex inner workings of the model
  • Resource efficiency: The computationally intensive model runs on the provider’s servers, not your local machine
  • Versioning: Providers can update models without requiring changes to your code
  • Security: Access is controlled through API keys and authentication

At present the most common way to send API requests and responses across machines is through HTTP — the same protocol that powers the Web. The term “REST API” (Representational State Transfer) is commonly used to describe these HTTP-based APIs. While REST originally referred to a specific architectural style with strict principles, today the term is used loosely to denote any HTTP-based API.

The requests library that we have used in earlier chapters is a great tool for interacting with REST APIs. However, API providers often supply higher-level libraries that save you the trouble of dealing with HTTP directly. This is the case here. We will use the google-genai library to interact with Gemini. You will need to install it in the usual manner using pip.
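
To make the request-response mechanism concrete, here is a minimal sketch of what a raw REST call might look like with requests. The endpoint URL, payload fields, and header are purely illustrative placeholders, not the actual Gemini API; the google-genai library takes care of these details for us.

import requests

# Hypothetical endpoint and payload; real providers document their own
# URLs, parameter names, and authentication headers.
url = "https://api.example.com/v1/generate"
payload = {"model": "some-model",
           "prompt": "Define GDP in one sentence."}
headers = {"Authorization": "Bearer YOUR_API_KEY"}

r = requests.post(url, json=payload, headers=headers)
r.raise_for_status()   # raise an exception on HTTP errors
print(r.json())        # the response body is typically JSON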

13.1.1 API keys

Usually we need to authenticate ourselves to the API provider while making API requests. This is certainly necessary when making use of paid services. But authentication is often required even for free services so that the service-provider can apply per-user quotas and maintain logs. The standard way to authenticate with APIs is using what is known as an API-key. These are long strings of alphanumeric characters that serve as combined usernames and passwords. You obtain one by signing up on the service-provider’s website. Before you continue this chapter, please head to the Gemini website and obtain an API key for Gemini.

The API key will have to be included in our function calls to ensure that the Gemini server can authenticate us. But it is a very bad practice to include it as a string in your code. It would amount to storing your password in clear text in a file that is widely shared. We need a way to store our API keys secretly while still making them available to our code. There are a number of ways to do this.

One is to use environment variables. These are variables that the user can assign values to at the operating system level, with the values then being made available to programs run by the user. The Python function os.environ.get allows us to access values of environment variables. For example, the PATH variable, which exists on all major operating systems, holds the list of directories which are searched by the operating system to find programs to execute. We can check its value in Python thus:

import os
print(os.environ.get('PATH'))

So if we could find some way of assigning our API key to an environment variable, say, GOOGLE_API_KEY, we could use this method to access it while running the program without including it in the program text.

The methods of setting environment variables are many and differ from one operating system to another. We will not discuss them, since including your key in the global environment is not a great idea. It becomes available to all programs if you do so, which is not desirable because:

  • It violates the ‘need to know’ principle and poses a security risk
  • It makes it hard to use different keys for different projects

Instead we will make use of the Python library dotenv (installed with pip as python-dotenv). This library provides a function load_dotenv which looks for a file called .env in the current directory and, if not found there, in its parent directories. When it finds the file it loads the definitions of the environment variables from it as if they had been set by the operating system.

So, in our case, in the directory where you are running the code, or in one of its parent directories, create a file .env with the content:

GOOGLE_API_KEY=abcdefgh

with your actual Gemini API key in place of abcdefgh. Note that on Unix-derived systems such as Linux and macOS, file names starting with a period are hidden in directory listings by default. So don’t worry if you cannot see the file you created. You can see it by running the command ls -a at the terminal.

Once our .env file has been created, we can use dotenv to access it:

import os
from dotenv import load_dotenv
load_dotenv()
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY")

Make sure that your .env file is not uploaded or shared along with your source code. If you are using Git, add it to your .gitignore file.

If you are on Google Colab, there is another way of storing API secrets. On the left of the notebook you will see a key icon which opens a Secrets tab. You can save secrets there, and it also offers to import your Gemini API key automatically. Then in your notebook you can load the secrets as follows:

from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

Secrets stored in this way are not shared when you share the notebook.

13.2 Generating text

13.2.1 Basic usage

Assuming that you have obtained your API key and have installed the google-genai library, the basic interface for generating text is very simple:

from google import genai

client = genai.Client(api_key=GOOGLE_API_KEY)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="""
                Why is the government expenditure multiplier
                greater than 1?

                Answer in about 500 words.
            """
)

We first create a Client object through which we will interact with the Gemini service. Then we call the generate_content method on the client.models attribute with the following parameters:

  • model: the Gemini model to send the request to. You have to use the exact string for the model name from the Gemini documentation which lists the available models and their capabilities.
  • contents: the prompt to be sent to the LLM.

The model’s response is stored in the response variable. Its text attribute contains the actual text of the answer.

To print it nicely we write a small helper function to break long lines:

import textwrap

def pretty_print(text,width=60):
    lines = text.splitlines()
    return "\n".join(
              "\n".join(textwrap.wrap(line,width))
              for line in lines)
print(pretty_print(response.text))
The government expenditure multiplier being greater than 1
means that an increase in government spending leads to a
larger increase in overall economic output (GDP). This
amplification effect stems from a chain reaction of spending
and income generation within the economy, often referred to
as the "multiplier effect." Here's a breakdown of the
underlying mechanisms:

**The Basic Idea: Circular Flow of Income**

The multiplier effect is rooted in the concept of the
circular flow of income. Imagine the government decides to
build a new bridge and spends $1 million on the project.
This $1 million doesn't disappear; it becomes income for the
construction company hired to do the job.  The construction
company then uses this income to pay its workers, purchase
materials, and maybe pay off some debts.  The workers, in
turn, use their wages to buy groceries, pay rent, go to the
movies, and so on.  The suppliers of building materials use
their revenue to pay their workers, and so on.

This initial $1 million of government spending is injected
into the economy and ripples outward, generating further
rounds of spending and income.  Each round is smaller than
the previous one, but the cumulative effect can be
significantly larger than the original investment.

**Marginal Propensity to Consume (MPC)**

The key factor determining the size of the multiplier is the
**marginal propensity to consume (MPC)**.  The MPC
represents the proportion of an additional dollar of income
that an individual will spend rather than save.  For
example, an MPC of 0.8 means that for every extra dollar
earned, a person will spend 80 cents and save 20 cents.

A higher MPC leads to a larger multiplier because more of
each dollar earned is spent, further fueling economic
activity.  If people save almost all of their extra income
(low MPC), the stimulus will leak out of the circular flow
relatively quickly, limiting the multiplier effect.

**Calculating the Multiplier**

The government expenditure multiplier is calculated as:

Multiplier = 1 / (1 - MPC)

Let's consider a few examples:

*   **MPC = 0.5:** Multiplier = 1 / (1 - 0.5) = 1 / 0.5 = 2.
This means that every $1 of government spending increases
GDP by $2.

*   **MPC = 0.8:** Multiplier = 1 / (1 - 0.8) = 1 / 0.2 = 5.
This means that every $1 of government spending increases
GDP by $5.

*   **MPC = 0.9:** Multiplier = 1 / (1 - 0.9) = 1 / 0.1 =
10.  This means that every $1 of government spending
increases GDP by $10.

As you can see, as the MPC increases, the multiplier
increases significantly.

**Leakages and Limitations**

While the multiplier effect suggests a powerful impact from
government spending, it's important to consider some
limitations and factors that can reduce its effectiveness:

*   **Savings:** As mentioned before, savings are a
"leakage" from the circular flow. The higher the savings
rate, the smaller the multiplier.
*   **Taxes:** Taxes reduce the amount of disposable income
available for spending. Higher taxes mean a smaller
multiplier.
*   **Imports:** If a portion of the money spent is used to
purchase goods and services from abroad (imports), it
doesn't stimulate domestic production and income. This is
another leakage.
*   **Crowding Out:** In some cases, increased government
borrowing to finance spending can drive up interest rates.
Higher interest rates can discourage private investment,
partially offsetting the positive effects of government
spending. This is known as "crowding out."
*   **Time Lags:** The multiplier effect doesn't happen
instantaneously. It takes time for the initial spending to
ripple through the economy, meaning the full impact might
not be felt for months or even years.
*   **Supply-Side Constraints:**  The multiplier effect
assumes that the economy has the capacity to increase
production to meet the increased demand.  If the economy is
already operating at full capacity, increased government
spending might primarily lead to inflation rather than a
significant increase in real GDP.

**Conclusion**

The government expenditure multiplier is typically greater
than 1 because the initial government spending creates a
chain reaction of spending and income generation throughout
the economy. This amplification effect is driven by the
marginal propensity to consume (MPC), which determines how
much of each additional dollar of income is spent. However,
factors like savings, taxes, imports, crowding out, and
supply-side constraints can limit the size and effectiveness
of the multiplier. Understanding these factors is crucial
for policymakers when considering the use of government
spending as a tool to stimulate the economy.

The model has provided us its answer formatted in Markdown.

13.2.2 Providing model parameters

We can pass a configuration argument to the generate_content call to further control the generation process:

from google.genai.types import GenerateContentConfig

config = GenerateContentConfig(
    temperature = 2,
    candidate_count = 4,
    response_mime_type="text/plain"
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="""
                Why is the government expenditure multiplier
                greater than 1?

                Explain in less than 100 words.
            """,
    config=config
)

We pass the model parameters in a GenerateContentConfig object. Here we have set the temperature (recall from the last chapter that higher temperature means more idiosyncratic output), the number of candidate responses to generate, and the format of the output. By setting response_mime_type to text/plain we have indicated that we want plain text and not Markdown. MIME types are a file format identification standard not specific to Gemini. The full list of model parameters can be found in the Genai SDK documentation, the link to which is in the references.

The return value of generate_content is an object of GenerateContentResponse type, once again documented in the SDK documentation. In our earlier example we used its text attribute to give us the concatenated text of the response. This time, since we have asked for multiple candidate responses, let us iterate through them and print them separately:

for i,candidate in enumerate(response.candidates):

    ll = candidate.avg_logprobs
    print(f"Candidate {i+1}, avg logprog {ll:.3g}")
    print("---------------")

    for part in candidate.content.parts:
        print(pretty_print(part.text))
    print()
Candidate 1, avg logprob -0.409
---------------
The government expenditure multiplier is greater than 1
because of the circular flow of income. When the government
spends money, it becomes income for individuals or firms.
They, in turn, spend a portion of that income (depending on
their marginal propensity to consume), which becomes income
for others. This cycle continues, generating additional
rounds of spending, resulting in a total increase in GDP
larger than the initial government expenditure.

Candidate 2, avg logprob -0.247
---------------
The government expenditure multiplier is greater than 1
because of the "multiplier effect." When the government
spends money, it becomes income for someone else. That
person then spends a portion of their new income, which
becomes income for someone else again, and so on. This
ripple effect generates additional economic activity beyond
the initial government spending, resulting in a total
increase in GDP that is larger than the original investment.
The size of the multiplier depends on factors like the
marginal propensity to consume.

Candidate 3, avg logprob -0.421
---------------
The government expenditure multiplier is greater than 1
because of the ripple effect. When the government spends
money, it becomes income for individuals and businesses.
These recipients then spend a portion of this new income
(induced consumption), creating more income for others. This
continues, generating a multiple increase in aggregate
demand that exceeds the initial government spending. The
size depends on the marginal propensity to consume.

Candidate 4, avg logprob -0.338
---------------
The government expenditure multiplier is greater than 1
because of a chain reaction effect. When the government
spends money, it becomes income for someone else. This
recipient then spends a portion of that income (their
marginal propensity to consume), creating further income for
others. This process continues, leading to a larger increase
in aggregate demand than the initial government spending.

In this code we loop through the candidate responses. We print the average log-probability of the tokens in each candidate from its avg_logprobs attribute. The actual response is in the parts attribute of the candidate’s content. A response can consist of multiple parts; we iterate through them and print each.

It is interesting to compare the different candidate responses and see how they converge because they are the product of the same model and yet diverge because of random sampling. You should experiment with different temperature settings, as well as explore the other parameters controlling text generation, to see how they affect the responses produced.
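
As a starting point for such experiments, here is a small sketch that varies only the temperature and prints one response per setting, reusing the client, prompt style, and pretty_print helper from above:

for temp in [0.0, 0.7, 1.5]:
    config = GenerateContentConfig(temperature=temp)
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="""
                    In one sentence, why is the government
                    expenditure multiplier greater than 1?
                """,
        config=config
    )
    print(f"temperature = {temp}")
    print(pretty_print(response.text))
    print()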

13.2.3 Variable prompts and processing output

What we have done so far is essentially the same as using the model through its web interface. Next let us look at generating the prompts and using the outputs programmatically to get the benefits of accessing the model through an API.

As an example we generate a list of the most important books written by various economists, and put it together in a Pandas dataframe. We do this by calling the API in a loop, making our prompt vary on each iteration of the loop by using an f-string to interpolate the name of the economist within the prompt. Note that the config can also be provided as a dictionary.

import pandas as pd

economists = ['Keynes', 'Kaldor', 'Kalecki']
books = []
for economist in economists:
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=f"""
        Which is the most important book written by 
        the economist {economist}?

        Return just the answer as plain text.
        """,
        config = {
            'response_mime_type': 'text/plain'
        }
    )
    books.append(response.text.strip())
pd.DataFrame({'economist':economists,
            'books':books})
economist books
0 Keynes The General Theory of Employment, Interest and...
1 Kaldor Growth Models
2 Kalecki Theory of Economic Dynamics

Are you sure each of the economists wrote a book of the name specified? Remember LLMs can hallucinate and return wrong answers in a confident voice. If we really needed this data we would have confirmed it against a bibliographic database.

When working with LLMs to produce output that will be used further by machines, some degree of trial and error is required to ensure that the results returned are exactly what you wanted and in the format that you wanted. You are in effect writing programs in the English language for a system whose behaviour is not explicitly defined and must be explored empirically.
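
One simple defensive pattern is to check the model’s output before accepting it and to retry a limited number of times if the check fails. The sketch below is only illustrative: the checker used here (a non-empty, single-line answer) stands in for whatever format test your task actually needs.

def ask_with_retry(prompt, checker, max_tries=3):
    """Query the model, retrying if the output fails a simple format check."""
    for _ in range(max_tries):
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=prompt,
            config={'response_mime_type': 'text/plain'}
        )
        answer = response.text.strip()
        if checker(answer):
            return answer
    return None   # let the caller decide what to do if all tries fail

book = ask_with_retry(
    "Which is the most important book by Keynes? Return just the title.",
    checker=lambda s: s and "\n" not in s)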

13.3 Example: classifying FOMC minutes

In this example, we’ll demonstrate how LLMs can be used for a sophisticated classification task relevant to economics: analyzing the stance of Federal Open Market Committee (FOMC) members based on meeting transcripts. The FOMC is the monetary policymaking body of the Federal Reserve System in the United States that sets key interest rates and makes decisions about monetary policy. Their meetings and the stance of individual members are closely watched by economists, investors, and policymakers as they provide crucial signals about future economic policy. For macroeconomists, understanding whether FOMC members lean hawkish or dovish helps predict policy changes that affect inflation, employment, and overall economic growth, making this classification task particularly valuable for economic forecasting and research.

Our task is to classify each speaker in the FOMC transcripts as either a “hawk” (favoring tighter monetary policy to control inflation), a “dove” (favoring looser monetary policy to stimulate growth), “neutral,” or “indeterminate” based on their statements. This type of analysis is typically done manually by economists and financial analysts, but we’ll automate it using an LLM.

This example showcases several advanced techniques:

  1. Processing structured data (transcripts in JSON format)
  2. Using complex prompts with detailed instructions
  3. Generating structured output (JSON) that can be directly integrated into a data analysis pipeline
  4. Extracting key evidence (utterances) that support the classification decisions

13.3.1 Data

We use a dataset of FOMC meeting transcripts from 1976 onwards created by Miguel Acosta https://www.acostamiguel.com/data/fomc_data.html. The website for this book has a copy of the data in compressed CSV format.

transcripts = pd.read_csv("https://mlbook.jyotirmoy.net/static/data/fomc_transcripts.csv.zst")
transcripts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140806 entries, 0 to 140805
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   date         140806 non-null  int64 
 1   sequence     140806 non-null  int64 
 2   name         140806 non-null  object
 3   n_utterance  140806 non-null  int64 
 4   section      39244 non-null   object
 5   text         140776 non-null  object
dtypes: int64(3), object(3)
memory usage: 6.4+ MB

The description of the columns is as follows (quoted from Acosta’s website):

  • date is the date of the FOMC meeting
  • sequence this orders each speaker utterance within the meeting
  • name is the last name of the speaker
  • n_utterance which utterance by the current speaker within the meeting (e.g. the tenth time Greenspan spoke is 10)
  • section in which section of the meeting the current utterance was a part of (either ECSIT for the discussion of the economic situation, MPS for a discussion of monetary policy, or AGGREGATES for biannual discussions of the monetary aggregates)
  • text the text of the utterance

Next, let’s look at some of the observations:

transcripts.head()
date sequence name n_utterance section text
0 19760329 1 BURNS 1 NaN Gentlemen, we're ready. This will now be a mee...
1 19760329 2 OCONNELL 1 NaN Yes, Mr. Chairman. The Committee will recall t...
2 19760329 3 BURNS 2 NaN Thank you, Mr. O'Connell. Are there any questi...
3 19760329 4 COLDWELL 1 NaN One question, Tom. Do you gather any sense or ...
4 19760329 5 OCONNELL 2 NaN Mr. Chairman, Governor, we've had no conversat...

We will be feeding the entire transcript of one meeting to the LLM at a time. We use the pandas groupby operation to break up the data by meeting and collect the per-meeting dataframes into a list:

transcripts_by_date = [
    df for _,df in transcripts.groupby("date")]

13.3.2 JSON

We will need to provide data in a structured way to the LLM and obtain structured answers. The most common way to codify structured data as a string at the moment is JSON (JavaScript Object Notation). This is a lightweight data interchange format that is both human-readable and machine-parsable. It has become the standard format for data exchange in web applications and APIs due to its simplicity and flexibility.

JSON consists of two primary structures:

  • Objects: Collections of key-value pairs enclosed in curly braces {}. Keys must be strings, and values can be strings, numbers, objects, arrays, booleans, or null.
  • Arrays: Ordered lists of values enclosed in square brackets []. Values can be of any type, including other arrays or objects.

For example, a simple JSON object representing economic data might look like:

{
  "country": "United States",
  "gdp_growth": 2.3,
  "inflation_rate": 1.8,
  "unemployment": 3.6,
  "sectors": ["manufacturing", "services", "agriculture"]
}

In our FOMC classification task, we’re working with JSON data that represents meeting transcripts. Each transcript is structured as an array of utterance objects, with each object containing metadata (speaker, date, sequence number) and the actual text spoken.

Python provides built-in support for JSON through the json module, and pandas offers convenient methods like to_json() and read_json() for converting between DataFrames and JSON. This makes it easy to process structured data and prepare it for analysis with LLMs.
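
As a quick illustration of the json module, here is the round trip between a Python dictionary and a JSON string for an object like the one above:

import json

record = {
    "country": "United States",
    "gdp_growth": 2.3,
    "sectors": ["manufacturing", "services", "agriculture"]
}

as_text = json.dumps(record)    # Python object -> JSON string
back = json.loads(as_text)      # JSON string -> Python object
print(as_text)
print(back["sectors"][0])       # nested values come back as ordinary Python values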

Here we convert the list of per-meeting dataframes into a list of JSON-encoded strings.

jsonified = [df.to_json(orient="records") 
                for df in transcripts_by_date]
print(len(jsonified[0]))
145842

We used the to_json method of the pandas dataframe to convert the dataframe to JSON. The orient="records" argument converts each row into its own JSON object, with the column names as keys. Other values of orient encode the dataframe differently, for example one column at a time.

The JSON for an entire meeting is around 145 thousand characters. An important characteristic of an LLM is its context length: how long an input it can understand. We must make sure that our input does not exceed the context length of the LLM we are using. Context length is usually specified in terms of tokens, where tokens are the units into which an LLM breaks up a text. To a very rough approximation one token corresponds to a word. Our input would correspond to a few tens of thousands of tokens, which is not too large by today’s standards. But for larger inputs it may be necessary to feed the data in chunks.
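
A very rough sanity check is to count words as a stand-in for tokens, as in the sketch below; the exact count depends on the tokenizer of the specific model.

# Rough estimate only: real token counts depend on the model's tokenizer.
n_chars = len(jsonified[0])
n_words = len(jsonified[0].split())
print(f"characters: {n_chars}, words (roughly tokens): {n_words}")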

13.3.3 JSON Schemas

We will not only feed the data to the model as JSON, we will also ask it to return its response as JSON so that it is in a fixed format that is easy to parse. The format of a dataset is known as a schema. We’ll have to define an output schema for our query. We want the model to return its assessed stance for each speaker. Just so that we can assess the model’s performance informally, we will also ask it to indicate the most significant statement made by each speaker.

There are multiple ways to specify the schema. A simple one is to define a class or classes specifying the organization of your data.

To define our schema, we’ll use two powerful Python tools: the standard-library enum module and the Pydantic library. These help us create well-structured, type-safe data models that clearly communicate the expected format of our data.

import enum
from pydantic import BaseModel

class Stance(enum.Enum):
    HAWK = "hawk"
    DOVE = "dove"
    NEUTRAL = "neutral"
    INDETERMINATE = "indeterminate"

class StanceAnalysis(BaseModel):
    date: int
    name: str
    stance: Stance
    key_utterance_seq: int

The enum module provides the Enum class, which allows us to create sets of named constants. This is generally useful beyond machine learning since in many cases we want a variable to take on values only from some finite set. Using enums instead of bare strings ensures that the variable cannot take on invalid values. In our case, we’ve defined a Stance enum that restricts the possible stance values to exactly four options: hawk, dove, neutral, or indeterminate. This ensures consistency in our classification and prevents typos or invalid values from entering our data.

Pydantic is a data validation library that uses Python type annotations to enforce data types and structures. The BaseModel class provides a foundation for creating data models with automatic validation. Our StanceAnalysis model defines the exact structure we expect for each speaker’s analysis, with specific data types for each field. When data is parsed through this model, Pydantic will automatically validate it against these specifications, raising clear errors if the data doesn’t conform.
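
Here is a small sketch of this validation in action, assuming a recent (v2) version of Pydantic: valid data is accepted, with the string "dove" coerced into the Stance enum, while an unknown stance value raises a ValidationError.

from pydantic import ValidationError

# Valid data: the stance string is coerced into a Stance member
ok = StanceAnalysis(date=19760817, name="BURNS",
                    stance="dove", key_utterance_seq=161)
print(ok.stance)    # Stance.DOVE

# Invalid data: an unknown stance value is rejected
try:
    StanceAnalysis(date=19760817, name="BURNS",
                   stance="pigeon", key_utterance_seq=161)
except ValidationError as err:
    print(err)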

In the StanceAnalysis class, we’re using Python’s type annotation syntax, which follows the pattern variable_name: type. This syntax is part of Python’s type hinting system introduced in Python 3.5:

  • date: int indicates that the date field should be an integer
  • name: str indicates that the name field should be a string
  • stance: Stance indicates that the stance field should be a value from our custom Stance enum
  • key_utterance_seq: int indicates that the key_utterance_seq field should be an integer

Python’s type system allows us to use both built-in types (like str and int) and user-defined types (like our Stance enum) in these annotations. Pydantic uses these annotations to validate data at runtime, ensuring that values match their expected types.

Python’s type system also supports collection types for more complex data structures. For example:

  • list[int] indicates a list containing only integers
  • dict[str, float] indicates a dictionary with string keys and float values
  • tuple[str, int, bool] indicates a tuple with exactly three elements of the specified types
  • Optional[str] (from the typing module) indicates a value that can be either a string or None

These collection types are particularly useful when working with structured data like JSON, where you might need to validate arrays of objects or nested structures. For instance, if we wanted our model to return multiple utterances per speaker, we might use key_utterances: list[int] instead of just a single sequence number.
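
For instance, a hypothetical meeting-level schema that nests a list of per-speaker analyses could be written as:

# Hypothetical nested schema: one meeting containing many speaker analyses
class MeetingAnalysis(BaseModel):
    date: int
    analyses: list[StanceAnalysis]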

This structured approach is particularly valuable when working with LLMs, as it helps guide the model to produce outputs in exactly the format we need for further processing.

13.3.4 The prompt

Writing the prompt requires trial and error. This is our version. You have to be careful to clearly lay down what you expect. Excuse the melodrama of the first para, it is most likely superfluous.

The prompt is written in a way that the actual transcript can be appended to it.

PROMPT = """
INSTRUCTIONS:

You are a senior monetary economist studying 
monetary policy-making in the United States.

After these instructions you will be provided
the transcript of a meeting of the Federal 
Open Market Committee which determines monetary 
policy in the United States of America. The transcript
will be provided in JSON format. It'll be in the 
form of a list of utterances, with each utterance 
consisting of the following fields:

- date: date of the meeting
- sequence: sequence of the utterance in the transcript
- name: name of the speaker of the utterance
- n_utterance: speaker-specific sequence no
- section: ignore this field
- text: text of the utterance

You have to analyse the transcript to understand 
the viewpoints and attitudes of the speakers.

First, determine the list of names of the speakers.

For each of the speakers you will have to assess 
whether they were a:

- 'hawk': in favour of making monetary policy more 
contractionary, i.e. lower money supply growth, 
higher interest rates, feels that there is excess
economic activity or inflation.
- 'dove': in favour of making monetary policy more 
expansionary, i.e. higher money supply growth, 
lower interest rates, feels that there is too little
 economic activity or inflation.
- 'neutral': in favour of not changing the policy stance
- 'indeterminate': in cases where it is not clear from 
the text what their stance is

First determine the list of speakers in the transcript.

Your response must be a JSON string consisting of a list 
of dictionaries, with one entry per speaker. You must have 
an entry for each speaker whose name occurs
in the transcript. The dictionary must have the following fields:

- date: date of the meeting (from your input)
- name: name of the speaker
- stance: 'hawk', 'dove', 'neutral', or 'indeterminate' as described above
- key_utterance_seq: `sequence` value of the one utterance 
by that speaker that best reflects the stance assigned to them


"""

13.3.5 API call for structured output

To illustrate, we make an API call using only the transcript of the sixth meeting, stored in jsonified[5]. We append it to our prompt.

Also to force the LLM to give structured output we add two more parameters:

  • The response_mime_type parameter tells the API what format to return the response in - setting it to ‘application/json’ ensures we get valid JSON rather than markdown or plain text.
  • The response_schema parameter provides a type definition that guides the model to structure its output as a list of objects of our StanceAnalysis class.

Together, these parameters constrain the model’s output to follow our desired format, making it easier to parse and process programmatically.

client = genai.Client(api_key=GOOGLE_API_KEY)

# Generate content with structured output
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=PROMPT+"TRANSCRIPT: \n"+jsonified[5],
    config={
        'response_mime_type': 'application/json',
        'response_schema': list[StanceAnalysis]
    }
)
print(response.text)
[
  {
    "date": 19760817,
    "name": "BURNS",
    "stance": "neutral",
    "key_utterance_seq": 161
  },
  {
    "date": 19760817,
    "name": "HOLMES",
    "stance": "indeterminate",
    "key_utterance_seq": 4
  },
  {
    "date": 19760817,
    "name": "BLACK",
    "stance": "neutral",
    "key_utterance_seq": 168
  },
  {
    "date": 19760817,
    "name": "PARDEE",
    "stance": "indeterminate",
    "key_utterance_seq": 7
  },
  {
    "date": 19760817,
    "name": "COLDWELL",
    "stance": "dove",
    "key_utterance_seq": 162
  },
  {
    "date": 19760817,
    "name": "PARTEE",
    "stance": "neutral",
    "key_utterance_seq": 164
  },
  {
    "date": 19760817,
    "name": "EASTBURN",
    "stance": "dove",
    "key_utterance_seq": 166
  },
  {
    "date": 19760817,
    "name": "WALLICH",
    "stance": "neutral",
    "key_utterance_seq": 180
  },
  {
    "date": 19760817,
    "name": "JACKSON",
    "stance": "neutral",
    "key_utterance_seq": 174
  },
  {
    "date": 19760817,
    "name": "BAUGHMAN",
    "stance": "indeterminate",
    "key_utterance_seq": 69
  },
  {
    "date": 19760817,
    "name": "GRAMLEY",
    "stance": "neutral",
    "key_utterance_seq": 63
  },
  {
    "date": 19760817,
    "name": "WINN",
    "stance": "dove",
    "key_utterance_seq": 59
  },
  {
    "date": 19760817,
    "name": "KIMBREL",
    "stance": "neutral",
    "key_utterance_seq": 65
  },
  {
    "date": 19760817,
    "name": "AXILROD",
    "stance": "neutral",
    "key_utterance_seq": 117
  },
  {
    "date": 19760817,
    "name": "GARDNER",
    "stance": "neutral",
    "key_utterance_seq": 160
  },
  {
    "date": 19760817,
    "name": "MAYO",
    "stance": "neutral",
    "key_utterance_seq": 100
  },
  {
    "date": 19760817,
    "name": "WILLIAMS",
    "stance": "neutral",
    "key_utterance_seq": 137
  },
  {
    "date": 19760817,
    "name": "MORRIS",
    "stance": "neutral",
    "key_utterance_seq": 170
  },
  {
    "date": 19760817,
    "name": "GUFFEY",
    "stance": "neutral",
    "key_utterance_seq": 184
  },
  {
    "date": 19760817,
    "name": "VOLCKER",
    "stance": "dove",
    "key_utterance_seq": 182
  },
    {
    "date": 19760817,
    "name": "MACLAURY",
    "stance": "neutral",
    "key_utterance_seq": 186
  },
  {
    "date": 19760817,
    "name": "BROIDA",
    "stance": "indeterminate",
    "key_utterance_seq": 194
  },
    {
    "date": 19760817,
    "name": "LILLY",
    "stance": "indeterminate",
    "key_utterance_seq": 205
  }
]

We get the JSON as expected. Now to parse it. In a production setting we would use the features of Pydantic to validate the output (a sketch of that stricter route follows the output below). Here we take the lazy way out and just parse the JSON into Python lists and dictionaries and then convert it into a dataframe.

import json
# Parse the response
stance_data = json.loads(response.text)

# Convert to a pandas DataFrame
stance_df = pd.DataFrame(stance_data)

stance_df.head()
date name stance key_utterance_seq
0 19760817 BURNS neutral 161
1 19760817 HOLMES indeterminate 4
2 19760817 BLACK neutral 168
3 19760817 PARDEE indeterminate 7
4 19760817 COLDWELL dove 162
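
For the stricter route mentioned above, here is a minimal sketch (assuming Pydantic v2) that validates each record against our StanceAnalysis schema before building the same dataframe:

# Validate each record against the schema; bad records raise ValidationError
validated = [StanceAnalysis.model_validate(item) for item in stance_data]

# mode="json" serializes the Stance enum back to plain strings like "dove"
stance_df = pd.DataFrame([v.model_dump(mode="json") for v in validated])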

13.3.6 Sanity check

Let’s do some basic checks on our results. First let’s look at the frequency of assigned labels.

# Merge the stance dataframe with the transcript
stance_with_text = stance_df.merge(
    transcripts[['date', 'sequence', 'name', 'text']],
    how = 'left',
    left_on=['date', 'key_utterance_seq'],
    right_on=['date', 'sequence'],
    suffixes=('', '_transcript')
)

print("Frequency count of stance")
print(stance_with_text['stance'].value_counts())
Frequency count of stance
stance
neutral          14
indeterminate     5
dove              4
Name: count, dtype: int64

We see that there are a very large number of neutral or indeterminate labels, indicating that in many cases the model cannot make up its mind. We will need to read the transcript carefully to see if an expert human could have done better.

Next we make use of the fact that we have asked the model to produce key utterances to see if the model actually can attribute utterances to the right speaker:

stance_with_text['correct_speaker'] = (
    stance_with_text['name'] == stance_with_text['name_transcript'])

accuracy = stance_with_text['correct_speaker'].mean() * 100

# Print the verification results
print(f"Percentage of key utterances correctly attributed: {accuracy:.1f}%")
Percentage of key utterances correctly attributed: 95.7%

The answer turns out not to be 100%, so the model sometimes misattributes utterances. If we were using this for actual research we would have to filter out such cases.
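
If we did need to drop them, a one-line filter on the flag we just computed would do, as in the sketch below:

# Keep only rows where the key utterance really belongs to the named speaker
stance_verified = stance_with_text[stance_with_text['correct_speaker']]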

Finally let’s look at what statements the model considers hawkish or dovish.

# Print an example of a hawk and a dove
if 'hawk' in stance_with_text['stance'].values:
    hawk_example = (
        stance_with_text[stance_with_text['stance'] == 'hawk'].iloc[0])
    print("\nExample HAWK statement:")
    print(f"Speaker: {hawk_example['name']}")
    print("Utterance:")
    print(pretty_print(hawk_example['text']))

if 'dove' in stance_with_text['stance'].values:
    dove_example = (
        stance_with_text[stance_with_text['stance'] == 'dove'].iloc[0])
    print("\nExample DOVE statement:")
    print(f"Speaker: {dove_example['name']}")
    print("Utterance:")
    print(pretty_print(dove_example['text']))

Example DOVE statement:
Speaker: COLDWELL
Utterance:
Mr. Chairman, I don't find great difficulty with what you or
Governor Gardner has said. Perhaps a nuance. I'm not ready
to write off this recovery. I think we are involved in the
pause, at least with some dimension, probably for another
month and perhaps longer. [If] we recognize this pause in
the policy prescription, I would think it would be a matter
of minimal response. At least, however, [it] ought to call
for some caution on our part. Perhaps a shading of our
judgments in Desk operations might be sufficient. It doesn't
mean an overt move, but I suspect not perfectly steady in
the boat either. I'm a little bothered by the continued Desk
mechanization with policy and response. I'd prefer to shade
just a little bit as the Desk sees opportunities,
capitalizing on those opportunities as the market believes
it ought to move. I would like to raise, though, with the
Committee the possibility of again taking the opportunity of
the fall and the seasonal demand period to look at--what is
essentially a [Federal Reserve] Board action, of course--a
change in reserve requirements. I know that it is difficult
to do this in a period of recovery, but we have a seasonal
demand which, in the past, we have met with this device. I
continue to believe that our reserve requirements are larger
than necessary for policy action, and it would be helpful to
our [Federal Reserve System] membership problem if we could
get them further reduced.

What exasperating bureaucratese people can speak when they know their words will be pored over by curious characters even decades later! That the LLM can make some sense of it at all is remarkable.

13.4 Further steps

13.4.1 Evaluation

When using LLMs for classification tasks like our FOMC stance analysis, it’s crucial to evaluate the quality of the model’s outputs. Unlike traditional machine learning models where we can calculate precise metrics like accuracy or F1 scores against a ground truth dataset, evaluating LLM outputs often requires more nuanced approaches.

One effective evaluation method is to compare the LLM’s classifications with those made by human experts. This process typically involves:

  1. Blind coding: Having human experts (in this case, economists familiar with monetary policy) independently classify a sample of the same FOMC transcripts without seeing the LLM’s outputs. It’s important that multiple experts code the same data to establish inter-rater reliability.

  2. Confusion matrix analysis: Creating a matrix that compares the LLM’s classifications (hawk, dove, neutral, indeterminate) with the human consensus classifications. This helps identify patterns in where the model agrees with humans and where it diverges (a minimal pandas sketch appears after this list).

  3. Key utterance validation: Examining whether the key utterances identified by the LLM as evidence for its classifications align with those that humans find most significant. This qualitative assessment helps determine if the model is focusing on the same textual cues that human experts would.

  4. Error analysis: Categorizing disagreements between human and LLM classifications to identify systematic biases or weaknesses in the model’s understanding. For example, the model might consistently misclassify certain types of economic language or fail to recognize subtle contextual cues that human experts pick up on.
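
As a sketch of the confusion matrix step, suppose we had a column human_stance holding the expert consensus labels alongside the model’s stance column (the human labels are hypothetical here); pd.crosstab then tabulates agreements and disagreements:

# Hypothetical: 'human_stance' holds expert consensus labels for each speaker
confusion = pd.crosstab(stance_with_text['stance'],
                        stance_with_text['human_stance'],
                        rownames=['LLM'], colnames=['Human'])
print(confusion)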

13.4.2 Few shot learning

Few-shot learning is a technique where we provide the model with examples of the task we want it to perform before asking it to complete a similar task. This approach helps the model understand the specific format, style, and reasoning we expect in its responses. Unlike traditional machine learning that requires thousands of examples, LLMs can often learn from just a handful of demonstrations—hence the term “few-shot.” In contrast, our example above, which provides no examples and only describes the desired output, would be called “zero-shot”.

The few-shot technique works by structuring your prompt in this pattern:

  1. Task description and instructions
  2. Several example inputs and their corresponding expected outputs
  3. The new input for which you want a response

For our FOMC stance analysis, we could enhance our results by providing examples of how we want the model to classify different types of statements. Here’s how we might structure a few-shot prompt:

few_shot_examples = """
EXAMPLES:

Example 1:
Speaker statement: "I'm concerned about the persistent 
inflation we're seeing across sectors. The economy is 
clearly overheating, and I believe we need to raise rates 
by at least 50 basis points at this meeting to get ahead of the problem."
Classification: hawk
Reasoning: The speaker expresses concern about inflation 
and explicitly advocates for raising interest rates to cool 
down the economy.

Example 2:
Speaker statement: "The labor market remains fragile, with 
many Americans still struggling to find work. Raising rates 
now would risk choking off the recovery before it reaches 
all segments of society. I recommend we maintain our current 
stance and reassess next quarter."
Classification: dove
Reasoning: The speaker prioritizes employment concerns over 
inflation and explicitly recommends against raising rates.

Example 3:
Speaker statement: "The data presents a mixed picture. While 
inflation has ticked up, there are signs it may be transitory. 
Meanwhile, employment gains have been steady but not spectacular. 
On balance, I believe our current policy stance is appropriate 
for now."
Classification: neutral
Reasoning: The speaker acknowledges both inflation and employment 
concerns but recommends maintaining the current policy stance.

Now analyze the following transcript:
"""

# Modify the API call to include few-shot examples
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=(PROMPT+few_shot_examples + 
    "TRANSCRIPT: \n" + jsonified[5]),
    config={
        'response_mime_type': 'application/json',
        'response_schema': list[StanceAnalysis]
    }
)

13.4.3 System prompts

System prompts are special instructions provided to an LLM that establish its role, behavior, and constraints before any user input is processed. Unlike regular prompts that are part of the conversation, system prompts set the overall context and operating parameters for the model.

In the context of API interactions, system prompts are typically provided as a separate parameter in the API call. They serve several important functions:

  1. Role definition: Establishing the model’s persona (e.g., “You are an expert economist specializing in monetary policy”)
  2. Behavior guidelines: Setting rules for how the model should respond (e.g., “Always provide evidence for your claims”)
  3. Output formatting: Specifying the desired format of responses (e.g., “Structure your answers as bullet points”)
  4. Knowledge constraints: Defining what the model should or shouldn’t know (e.g., “You have access to data only up to 2023”)

For our FOMC analysis, we embedded these instructions directly in our prompt. However, a more structured approach would be to separate the system prompt from the user prompt. This separation makes the code more maintainable and allows for easier experimentation with different system behaviors while keeping the core user query consistent.

Here’s how we might restructure our API call to use a dedicated system prompt:

from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

response = client.models.generate_content(
    model='gemini-2.0-flash',
    config=types.GenerateContentConfig(
        system_instruction=PROMPT),
    contents="Transcript:\n"+jsonified[5]
)

This approach creates a clearer separation of concerns and makes it easier to refine either the system behavior or the specific task instructions independently.

13.5 Conclusion

In this chapter, we’ve explored how Large Language Models can be programmatically integrated into data analysis workflows through APIs. We’ve seen how text generation capabilities extend far beyond simple content creation to include sophisticated tasks like classification, information extraction, and structured data analysis.

Our FOMC transcript analysis example demonstrated several key concepts:

  1. Structured input and output: Using JSON to communicate with LLMs in a standardized format
  2. Schema definition: Creating explicit data models to guide and validate model outputs
  3. Prompt engineering: Crafting clear instructions that elicit the desired behavior from the model
  4. System prompts: Establishing context and constraints to improve model performance
  5. Evaluation approaches: Comparing model outputs with human expert judgments

These techniques can be applied to numerous economic research tasks, such as:

  • Analyzing sentiment in earnings calls or economic news
  • Extracting structured data from unstructured economic reports
  • Classifying economic policy statements across different countries
  • Summarizing academic papers for literature reviews
  • Generating explanations of complex economic concepts for different audiences

As LLM capabilities continue to evolve, the ability to programmatically interact with these models will become an increasingly valuable skill for economists working with text data. By treating LLMs as components in larger analytical pipelines rather than standalone tools, researchers can leverage their capabilities while maintaining the rigor and reproducibility expected in academic research.

When implementing LLM-based solutions, remember to:

  • Document your prompts as carefully as you would document code
  • Establish evaluation protocols to validate model outputs
  • Consider the limitations of current models, particularly regarding factual accuracy
  • Be mindful of potential biases in model outputs, especially for sensitive economic analyses

With these considerations in mind, LLMs can serve as powerful tools that augment traditional economic research methods and open new avenues for analyzing textual data at scale.

13.6 References

  • Google GenAI SDK Documentation: https://googleapis.github.io/python-genai/