15  LLM-Derived Embeddings

In machine learning practice an embedding is a mapping from some input space to some, usually high-dimensional, vector space. The expectation is that the vector captures the essential content of the input and that inputs with similar content will produce similar embeddings.

LLMs can be used to produce embeddings of text and, in the case of multimodal LLMs, of images, documents, etc. as well. You provide the LLM with the input and it returns a vector of numbers, the embedding vector.

The embeddings are essentially the LLM’s summary of its input, expressed in numerical form, with the property that texts similar in meaning or content (in fancier words, semantically similar texts) produce similar embedding vectors. This is superior to methods that summarize text by looking only at the occurrence of words, since LLMs can use their common sense to infer context and intent from the words. And the result, being a list of numbers, can easily be used in further computational steps, as we shall illustrate in this chapter.

Embeddings are particularly important when we want to work with bodies of unstructured data too large to be sent as a single prompt, or when we want to work repeatedly with the same bodies of data. In these cases we compute embeddings of chunks of our data once and store them for future use. Embeddings allow us to build information processing and information retrieval systems which operate on the meaning of the data as opposed to its surface appearance. We shall illustrate the potential of embeddings with two examples in this chapter. In the first, we create a database of financial rules which can be searched for the rules most relevant to a case. In the second, we group economics journal articles based on their titles and abstracts. The conclusion of the chapter describes further applications.

For this chapter we shall be using the Mistral AI API. Before you begin you will need to sign up for their platform and obtain an API key. As discussed in earlier chapters, the key is to be stored either in a .env file or in Colab’s secrets tab. You will also need to install the mistralai package.
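
The package can be installed with pip, either from the command line or from a notebook cell (in Colab or Jupyter, prefix the command with an exclamation mark):

pip install mistralai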

15.1 Question Answering: Find the Right Government Rule

As a first example of the use of embeddings, let us consider a situation where you have a large book of financial rules. You also have a large number of cases, for each of which you wish to find the most relevant rules from the rulebook.

If the rulebook is small then we do not need embeddings. We can make one request to the LLM API for each case, sending both the case description and the rulebook as inputs to the query. If necessary, you can save on costs by using the caching facility provided by the LLM API, which allows a fixed part of the input to be uploaded once and then referred to multiple times from different queries. Cached inputs cost less than regular inputs. However, even then costs will add up if there are many cases. And this approach is not feasible at all if the rulebook is so large that it exceeds the context window of the LLM.

That is when you use embeddings. We can break up the rulebook into individual rules and ask an LLM to produce an embedding for each. Then for each case, we also ask for an embedding of the case description. Since similar content is expected to produce similar embeddings, we search among our saved rule embeddings for those which are most similar to our case embedding. These are likely to be the most useful rules. In this approach, the rulebook has to be processed only once (and that too piecemeal) by the LLM. Then for each case we only send a request with the case description.

Let’s illustrate with a small ruleset and a single case:

rules = [
    """
    Every officer responsible for the
collection of Government dues or
expenditure of Government money shall
see that proper accounts of the receipts
and expenditure, as the case may be, are
maintained in such form as may have
been prescribed for the financial
transactions of Government with which
he is concerned and tender accurately
and promptly all such accounts and
returns relating to them as may be
required by Government, Controlling
Officer or Accounts Officer, as the case
may be.
    """,

    """
    Performance and conduct of every
registered supplier is to be watched by
the concerned Ministry or Department.
The registered supplier(s) are liable to
be removed from the list of approved
suppliers if they fail to abide by the
terms and conditions of the registration
or fail to supply the goods on time or
supply substandard goods or make any
false declaration to any Government
agency or for any ground which, in the
opinion of the Government, is not in
public interest
    """,

    """
     If a Ministry or Department
is unable to sell any surplus or obsolete or
unserviceable item in spite of its attempts
through advertised tender or auction, it
may dispose of the same at its scrap value
with the approval of the competent
authority in consultation with Finance
division. In case the Ministry or Department
is unable to sell the item even at its scrap
value, it may adopt any other mode of
disposal including destruction of the item in
an eco-friendly manner.
    """,

    """
    On receipt of an intimation from the Director,
Postal Life Insurance, Kolkata, about the issue
of a policy in favour of a subscriber authorizing
the Drawing Officer to commence recovery
from pay, or on receipt of a Last Pay
Certificate in respect of the subscriber
transferred from another office, the Drawing
Officer should make a note of the particulars
of the policy in the register. The name of the
office from which the subscriber has been
transferred should invariably be noted in the
remarks column. Wherever a subscriber is
transferred to another office or his policy is
discharged, his name should be scored out
from the register giving necessary remarks.
    """
]

We use the Mistral API to create embeddings for the rules first. Using the Mistral API is very similar to how we used the Gemini API in the earlier chapters.

We first get the API key from Colab secrets or our .env file. With dotenv:

from dotenv import load_dotenv
import os
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY")

With Colab:

from google.colab import userdata
MISTRAL_API_KEY = userdata.get('MISTRAL_API_KEY')

Then we import the Mistral object from the mistralai library and use it to create a client object:

from mistralai import Mistral
client = Mistral(api_key=MISTRAL_API_KEY)

Now we make an API call to produce the embeddings:

response = client.embeddings.create(
    model="mistral-embed",
    inputs=rules)

If all goes well, one embedding vector is created for each element of the list passed as inputs. The response object has a list attribute called data, with one element for each element of inputs (in this case, one per rule). Each of these elements is an object whose embedding attribute contains the embedding vector Mistral produced for that rule. We collect all of these together into a numpy array.

import numpy as np

rules_embed = np.array([e.embedding for e in response.data],
                             dtype=np.float32)
print(rules_embed.shape)
(4, 1024)

The created array has the shape (number of rules, embedding dimension), where the embedding dimension is the dimension of the vector space in which the embeddings live. For Mistral it is 1024.

Now suppose the following case comes to us.

case_query = """
The Ministry of AI Affairs is planning to sell some obsolete GPUs.
"""

We create an embedding for the case description:

response = client.embeddings.create(
    model="mistral-embed",
    inputs=[case_query])
case_embed = np.array(response.data[0].embedding)
print(case_embed.shape)
(1024,)

The most relevant rules are likely to be the ones whose embeddings are closest to the embedding of the case query. To judge this, we will need a numeric measure of similarity. A commonly used measure is the so-called cosine similarity:

\[\cos(x,y) = \left(\frac{x}{\|x\|}\right) \cdot \left(\frac{y}{\|y\|}\right)\]

Recall from linear algebra that the right-hand side is the cosine of the smaller angle between the vectors \(x\) and \(y\). If the two vectors point in the same direction, and are in that sense similar, we will have \(\cos(x,y)=1\). On the other hand, if they point in opposite directions we will have \(\cos(x,y)=-1\). Since both vectors are divided by their norms before taking the dot product, cosine similarity considers only the relative values of the components of the vectors and not their magnitudes. An alternative measure of similarity is the Euclidean distance \(\|x-y\|\), which does take magnitude into account. To find similar vectors we maximize cosine similarity or minimize Euclidean distance. For vectors which have been normalized to have length one, both criteria lead to exactly the same similarity ordering (Prove!). Mistral embeddings are in fact normalized to have unit length, and so we could have used either criterion. We shall use cosine similarity (as implemented in scikit-learn).
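
Before doing so, we can quickly confirm that the embeddings computed above are indeed (approximately) of unit length:

# Norms of the rule embeddings and of the case embedding;
# we expect values very close to 1
print(np.linalg.norm(rules_embed, axis=1))
print(np.linalg.norm(case_embed))

We now use scikit-learn’s cosine_similarity to compare the case with each rule: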

from sklearn.metrics.pairwise import cosine_similarity

# Calculate and print the similarity of the query with each rule.
# We need to reshape and flatten because `cosine_similarity`
# expects and returns 2D arrays.
rule_similarity = cosine_similarity(case_embed.reshape(1, -1), rules_embed).flatten()
print(rule_similarity)

# Print the most similar rule
print(rules[rule_similarity.argmax()])
[0.73370876 0.78445776 0.81273031 0.69485607]

     If a Ministry or Department
is unable to sell any surplus or obsolete or
unserviceable item in spite of its attempts
through advertised tender or auction, it
may dispose of the same at its scrap value
with the approval of the competent
authority in consultation with Finance
division. In case the Ministry or Department
is unable to sell the item even at its scrap
value, it may adopt any other mode of
disposal including destruction of the item in
an eco-friendly manner.
    

This does seem like the best rule for this situation. For a large ruleset we might print the best few rules instead of just one.
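
For instance, a small variation on the code above prints the three most similar rules, most similar first:

# Indices of the three rules most similar to the case, best first
top_k = rule_similarity.argsort()[::-1][:3]
for i in top_k:
    print(f"Similarity {rule_similarity[i]:.3f}:")
    print(rules[i])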

In our code above we stored the embeddings as numpy arrays and explicitly calculated and maximized similarity measures. In a real application we would store the embeddings in a database along with the rules. There exist vector databases which specialize in storing embeddings and retrieving them on the basis of similarity. They implement approximate nearest-neighbour search algorithms which are much faster than calculating the similarity of the query with every database entry as we did. This comes with a caveat though: the algorithms are only approximate, so they come with some probability of error. However, when databases become very large, using these approximately correct algorithms becomes practically necessary.
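
To give a flavour of what this looks like, here is a minimal sketch using the open-source faiss library (one option among many; it is not part of the Mistral API, and the sketch assumes the faiss-cpu package has been installed). Since Mistral embeddings have unit length, the nearest neighbours by Euclidean distance are also the most cosine-similar:

import faiss

d = rules_embed.shape[1]              # embedding dimension (1024)
index = faiss.IndexHNSWFlat(d, 32)    # an approximate (HNSW graph) index
index.add(rules_embed)                # faiss expects float32 arrays

# Find the two rules nearest to the case embedding
query = case_embed.astype(np.float32).reshape(1, -1)
distances, indices = index.search(query, 2)
print(indices)                        # indices of the closest rules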

15.2 Clustering: Cluster Economics Paper Titles and Abstracts

In our next example we use text embeddings to cluster economics articles on the basis of their titles and abstracts. This will also give us a peek into unsupervised learning, where we want to understand the distribution of our data instead of predicting the value of a target variable.

15.2.1 Data

Our data on economics articles is downloaded from RePEc.

import pandas as pd
papers = pd.read_csv('https://mlbook.jyotirmoy.net/static/data/papers.csv.zst')
papers.describe()
                                                id                                              title               abstract
count                                         2000                                               2000                   2000
unique                                        2000                                               1997                   1999
top     RePEc:ecm:emetrp:v:58:y:1990:i:5:p:1007-40  Entitled to Work: Urban Property Rights and La...  (Copyright: Elsevier)
freq                                             1                                                  2                      2

For each paper we combine the title and abstract of the paper into a text passage and embed it using Mistral. We embed the papers in batches of batch_size. We want to keep batch_size large enough so that we do not make too many API calls and exceed our quota of calls, but we need to keep it small enough to not exceed the limits on the size of a single API call.

import numpy as np
from mistralai import Mistral
from time import sleep

# Parameters
n_papers = papers.shape[0]
embed_dim = 1024 # Fixed for Mistral
batch_size = 100

# Array to store the embeddings
embeddings = np.empty((n_papers,embed_dim),
                        dtype = np.float32)


# Initialize the Mistral client
client = Mistral(api_key=MISTRAL_API_KEY)

# Starting position of chunk
start_pos = 0

# Loop through chunks
while start_pos < n_papers:
    # End position of chunk
    end_pos = min(start_pos+batch_size,n_papers)

    # Create the list of texts for the chunk
    # Each text combines the title and the abstract so that information
    # from both is used by the LLM for
    # understanding the paper
    texts = [
    f"""
    TITLE: 
    {papers.iloc[i]['title']}

    ABSTRACT:
    {papers.iloc[i]['abstract']}
    """

    for i in range(start_pos,end_pos)
    ]
    
    # Make the API call
    response = client.embeddings.create(
        model="mistral-embed",
        inputs=texts)

    # Embeddings for this chunk
    es = np.array([e.embedding for e in response.data],
                             dtype=np.float32)

    # Add it to the combined embeddings array
    embeddings[start_pos:end_pos] = es

    # Sleep for 2s to avoid exceeding
    #   the per-second API rate limit
    sleep(2)
    
    # Move to the next chunk
    start_pos = end_pos
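
Since these embeddings are costly to compute, it is a good idea to save them to disk so that later sessions can reuse them without any further API calls. One simple way (the file name here is just an example):

# Save the embeddings array to disk ...
np.save("paper_embeddings.npy", embeddings)

# ... and reload it later without calling the API again
embeddings = np.load("paper_embeddings.npy")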

15.2.2 Clustering with K-Means

K-means clustering is one of the most fundamental unsupervised learning algorithms used for partitioning data into distinct groups based on similarity. The algorithm works by dividing data points into K clusters, where each cluster is represented by its centroid (the mean of all points in the cluster).

The process follows an iterative approach:

  1. Initialize K centroids randomly within the data space
  2. Assign each data point to the nearest centroid, forming K clusters
  3. Recalculate each centroid as the mean of all points assigned to its cluster
  4. Repeat steps 2-3 until centroids stabilize or a maximum number of iterations is reached

K-means aims to minimize the within-cluster sum of squares (WCSS), which measures the total squared distance between each point and its assigned centroid. This optimization creates compact, well-separated clusters when the underlying data naturally forms distinct groups.
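
To make these steps concrete, here is a bare-bones sketch of the iteration in plain numpy (the function and its details are ours, purely for exposition; in practice we use scikit-learn’s implementation below):

import numpy as np
from scipy.spatial.distance import cdist

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids by picking k random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest centroid
        labels = cdist(X, centroids).argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        #    (keeping the old centroid if a cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # 4. Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids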

For high-dimensional data like text embeddings, K-means is particularly effective as it scales well with both the number of data points and the number of dimensions. However, it does require specifying the number of clusters (K) in advance and works best when clusters are roughly spherical and similar in size. Of course, we have no way of knowing whether these assumptions are satisfied, so it would be a good exercise to try out different clustering algorithms.

We use the K-means implementation from scikit-learn. We set 25 as the desired number of clusters. This has to be chosen by trial and error.

from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=25)
papers['label'] = cluster_model.fit_predict(embeddings)

The label column now holds the numeric identifier (from 0 to 24) of the cluster to which a paper is assigned.

15.2.3 Analyzing Cluster Results

After obtaining cluster assignments, we can examine the quality of the clustering by looking at the paper titles in each cluster. To illustrate, the following code iterates through the first five clusters, printing the number of papers each contains and the titles of its first five papers. The output reveals clear thematic patterns.

clustered = list(papers.groupby('label'))
for cluster_id,cluster_papers in clustered[:5]:
    n_papers = cluster_papers.shape[0]
    print(f"\n Cluster {cluster_id}: {n_papers} papers")
    for i in range(5):
        print(cluster_papers.iloc[i]['title'])

 Cluster 0: 57 papers
Entry by Foreign Firms in the United States under Exchange Rate Uncertainty.
Did the Strong Dollar Increase Competition in U.S. Product Markets?
Real-Exchange-Rate Uncertainty and Private Investment in LDCS
Understanding European Real Exchange Rates
The effect of exchange rate variability on US shareholder wealth

 Cluster 1: 107 papers
Interim efficiency with MEU-preferences
A unifying model for matrix-based pairing situations
On $${\alpha }$$ α -roughly weighted games
On 64%-Majority Rule.
Robustness of intermediate agreements for the discrete Raiffa solution

 Cluster 2: 22 papers
Mortgage Terminations, Heterogeneity and the Exercise of Mortgage Options
The Loan Structure and Housing Tenure Decisions in an Equilibrium Model of Mortgage Choice
State Misallocation and Housing Prices: Theory and Evidence from China
Tax Subsidies to Owner-Occupied Housing: An Asset-Market Approach
Housing Problems

 Cluster 3: 59 papers
Land Inequality and the Transition to Modern Growth
Was Postwar Suburbanization "White Flight"? Evidence from the Black Migration
Social networks and interactions in cities
Presidential Address Institutions and Culture
Killer Cities: Past and Present

 Cluster 4: 80 papers
Asset allocation and location over the life cycle with investment-linked survival-contingent payouts
Liquidity, Risk, and Occupational Choices
Building the Family Nest: Premarital Investments, Marriage Markets, and Spousal Allocations
Temptation and Self‐Control: Some Evidence and Applications
The Effect of Inheritance Receipt on Retirement

We can see that clustering based on text embeddings has been able to clearly group together papers from similar areas of economics.

Wouldn’t it be great if we could give each cluster a title? Well, that’s what we have LLMs for. We can feed a few of the extracted titles from each cluster to an LLM and have it give us a label. We leave that for you as an exercise.

15.3 Conclusion

Embeddings are a powerful tool for extracting the ‘meaning’ of a body of text or other media as a numeric vector that can be stored and processed further.

We have seen how embeddings can be used for searching for relevant documents based on content (semantic search) as well as for clustering.

We can also use embeddings as inputs into classification and regression models of our own, combining them with other numerical and categorical data. This lets the LLM do the heavy lifting of extracting meaning from text (and other media) while we focus on using the result as an input in modelling our variable of interest. For example, you could feed embeddings of corporate press releases into a regression model of firm performance.
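
As an illustration, here is a minimal sketch of this idea using made-up arrays (the press-release embeddings, firm features and performance measure below are all hypothetical placeholders):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical placeholders: embeddings of 500 press releases,
# five other firm-level variables, and a performance measure
press_release_embeddings = rng.normal(size=(500, 1024))
firm_features = rng.normal(size=(500, 5))
firm_performance = rng.normal(size=500)

# Combine the embeddings with the other features and fit a regression
X = np.hstack([press_release_embeddings, firm_features])
model = Ridge(alpha=1.0).fit(X, firm_performance)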

Another important use of embeddings is in record matching. Often economists need to join datasets from different sources using text-based columns such as company names and addresses. Different sources may have minor differences in spelling and abbreviation which can defeat direct character-by-character matching. First converting the text into embeddings and then matching on embedding similarity can lead to much more successful matching.
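
A minimal sketch of the idea, using two short hypothetical lists of company names and the Mistral client from earlier in the chapter:

from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical company names from two datasets to be joined
names_a = ["Acme Widgets Ltd.", "Bharat Steel Co."]
names_b = ["ACME Widgets Limited", "Bharat Steel Company"]

# Embed both lists with Mistral, as before
emb_a = np.array([e.embedding for e in
                  client.embeddings.create(model="mistral-embed", inputs=names_a).data])
emb_b = np.array([e.embedding for e in
                  client.embeddings.create(model="mistral-embed", inputs=names_b).data])

# For each name in the first dataset, pick the most similar name in the second
best = cosine_similarity(emb_a, emb_b).argmax(axis=1)
for i, j in enumerate(best):
    print(names_a[i], "->", names_b[j])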

Finally, embeddings are the foundation of retrieval-augmented generation (RAG), where embedding-based search is used to select relevant sources from a knowledge base which are then put together into an LLM prompt. This allows an LLM-based system to draw upon a knowledge base much larger than what can fit into an LLM context window.

We hope that this chapter has whetted your appetite to explore all these applications of embeddings.