Embeddings and Vectors: an explanation of how LLMs see and interpret the world.

Recently I’ve been investigating and leveling up my understanding of AI, LLMs, and AIOps in general.  You can’t look anywhere in technology without seeing that AI and Agentic AI are having the same level of impact on the software development industry that object-oriented programming once did.  Along those lines, I ended up doing a deep dive into how exactly embedding models work to ultimately feed LLMs via RAG, and that led to a lot of interesting discoveries and concepts that I wanted to share here.

What is vectorization?  What are vectors?

Vectorization is a key part of NLP (Natural Language Processing); specifically, it is how we translate the semantic meaning of, and syntactic relationships between, words into a numeric system.  The key point is that the vectors themselves are high-dimensional in nature.

Specifically, some of the key areas that are captured during vectorization of words are:

  • Semantic Understanding – By mapping words to vectors in high-dimensional space, we help an LLM understand the relationships between words.  The idea that “sedan,” “truck,” and “motorcycle” all have related meanings is captured numerically (a toy sketch of this follows the list).
  • Dimensionality Reduction – Part of this is that LLMs and computers speak differently than we do.  From a machine-compute perspective, our language is very inefficient to process as raw text.  By performing this vectorization we can convert the data, without losing its meaning, into a medium that is more efficient for machines.
  • Transfer Learning – While building these embedding models, we can pre-train on large bodies of text, which allows existing information to provide additional context to prompts and embeddings and improves efficiency.
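
To make the “words become numbers” idea concrete, here is a minimal, hand-crafted sketch.  The tiny 3-dimensional vectors and the words chosen are made up purely for illustration; real embedding models learn vectors with hundreds or thousands of dimensions.

```python
import numpy as np

# Toy 3-dimensional "embeddings" -- invented values purely for illustration.
toy_vectors = {
    "sedan":      np.array([0.9, 0.8, 0.1]),
    "truck":      np.array([0.8, 0.9, 0.2]),
    "motorcycle": np.array([0.7, 0.7, 0.3]),
    "banana":     np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The vehicles land near each other; the fruit does not.
print(cosine(toy_vectors["sedan"], toy_vectors["truck"]))   # high (~0.99)
print(cosine(toy_vectors["sedan"], toy_vectors["banana"]))  # much lower (~0.30)
```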

How does a model view text?

At the core, the model views the text data in 3 key parts:

  • Corpus – A collection of text segments that provide context and statistical background as part of the embedding.  This is important because it supplies data and informs the vectorization process with the context of the situation, which is how we ensure that words with different meanings in different situations are accounted for.
  • Document – One of the text segments within a corpus.  This is seen as a basic unit of text for processing.
  • Term – Each word or token that comprises the document.  This is the smallest unit of meaning to the model (a minimal sketch of these three parts follows this list).
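
As a quick illustration of those three layers, here is a minimal sketch in Python.  The example sentences and the naive whitespace split are just placeholders; real pipelines use proper tokenizers.

```python
# A corpus is a collection of documents (text segments).
corpus = [
    "Darth Vader turned to the dark side.",        # document 1
    "The sedan and the truck are both vehicles.",  # document 2
]

# Each document breaks down into terms (words / tokens).  A lowercase
# whitespace split stands in here for a real tokenizer.
for doc_id, document in enumerate(corpus):
    terms = document.lower().replace(".", "").split()
    print(f"document {doc_id}: {terms}")
```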

At the core, an LLM sees the text through 3 specific lenses:

  • Tokenization: The process, outlined previously on this blog, of taking text and breaking it down into tokens.  This applies both to the embedding process and to the prompt-and-response flow (a small tokenization sketch follows this list).
  • Embeddings: The high-dimensional vectors that capture not just the individual words or tokens but how they relate to one another, through things like semantic meaning, context, and relationships.
  • Attention Mechanisms: These mechanisms attach different weights to the embeddings of tokens based on their relevance in context, allowing the model to focus on the most relevant parts of the input when deriving its answers.
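
For the tokenization lens specifically, here is a small sketch using OpenAI’s tiktoken library (assuming it is installed, and treating the cl100k_base encoding as a stand-in for whatever model you care about).

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by several OpenAI models; here it is
# simply an example tokenizer.
enc = tiktoken.get_encoding("cl100k_base")

text = "Embeddings let models compare meaning numerically."
token_ids = enc.encode(text)

print(token_ids)              # the integer IDs the model actually sees
print(enc.decode(token_ids))  # round-trips back to the original text
print(len(token_ids), "tokens for", len(text), "characters")
```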

Why do we use vectors at all?

At the core, computers speak a mathematics-based language.  They work in a world of binary, 1’s and 0’s, but we humans speak in languages that are based on text.  Furthermore, mathematics is precise, down to defined levels of precision, while human languages are imprecise, with complex rules and exceptions to those rules.  When you look at the problem LLMs and Generative AI are solving, getting a machine to understand and interpret text, the answer is quite clear.  What happens when you have two people who speak different languages?  You need a translator!  And at the core, that’s what vectors are: a means of taking text, breaking it down into tokens, and then generating vectors based on those tokens to determine weight, importance, and similarity.

What is cosine similarity?

The functional answer here is that it is a metric used to measure how similar two blocks of text are, irrespective of the blocks’ sizes.  It supports LLM processing by taking a prompt, breaking the text down into numeric vectors, and then comparing those against the pretrained and RAG-stored data, all while including the context window to ensure further clarity.

Now from a mathematical perspective, cosine similarity is the measurement of the angle between two vectors in multi-dimensional space.  The main reason this is important is that even if two documents are far apart by Euclidean distance (which can simply be caused by the size of the documents), cosine similarity, by looking only at the angle between the vectors, still reveals how similar the terms are.

Cosine similarity also helps to resolve bias due to the number of times a word appears in a document.  For example, if a word appears 100 times in one document and 10 times in another, the two vectors still point in nearly the same direction despite the difference in magnitude, so the angle between them stays small.  This helps the LLM recognize the similarity in a more efficient way.
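
A minimal sketch of that point: cosine similarity only looks at direction, so multiplying a vector by 10 (think “the same words, just mentioned ten times more often”) does not change the score, even though the straight-line distance between the vectors grows.  The term counts below are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = np.array([10.0, 2.0, 0.0])  # made-up term counts
doc_b = doc_a * 10                  # same proportions, 10x the occurrences

print(cosine_similarity(doc_a, doc_b))  # 1.0 -- identical direction
print(np.linalg.norm(doc_a - doc_b))    # large Euclidean distance regardless
```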

Now why does this matter in practice?  The answer is that this mathematical approach helps to ensure that the size of the text doesn’t introduce a bias into the data.  For example, say I have a 2,000-word document on a broad topic, let’s say “Star Wars”, and another document that’s only 500 words on “Darth Vader”.

Now say I ask a prompt like “Why did Darth Vader turn to the dark side?”

There is a risk that if we only look at how many of the significant words appear in each document, the LLM would skew toward the larger document as being more relevant and pull from it to generate its answer, when really the smaller document contains the relevant information I’m looking for in this case.

If we take the vectors calculated by the embedding model and plot them in a 2-dimensional space based on occurrence and significance, this bias is easy to see.  But if we plot them in a 3-dimensional space, you can start to wrap your mind around how those points might be spread far apart on one set of criteria, while the extra dimension used to identify similarity provides a new criterion that changes the output.  Now, the specifics of the math are a little over my head at this moment, but this is the core concept behind it, and it really illustrates the power involved in driving LLM usage of data.
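
Here is a toy version of that Star Wars scenario with made-up term counts over a tiny vocabulary (the numbers are invented for illustration, not real embeddings): the long document mentions “Darth Vader” more times in absolute terms, but the short, focused document wins on cosine similarity.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Term counts over a made-up vocabulary:
# ["darth", "vader", "star", "wars", "luke", "yoda"]
query     = np.array([1,  1,  0,   0,   0,   0])    # "Why did Darth Vader turn..."
long_doc  = np.array([50, 50, 200, 200, 150, 100])  # 2,000-word Star Wars overview
short_doc = np.array([40, 40, 10,  10,  5,   0])    # 500-word Darth Vader article

# Raw counts favor the long document (it mentions "Darth Vader" more often)...
print(long_doc[:2].sum(), short_doc[:2].sum())  # 100 vs 80

# ...but cosine similarity ranks the focused short document far higher.
print(cosine_similarity(query, long_doc))   # ~0.21
print(cosine_similarity(query, short_doc))  # ~0.97
```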

If you want a great article that explains the math, see “Cosine Similarity – Understanding the math and how it works? (with python)” or the GeeksforGeeks article “Cosine Similarity.”

What are topic modeling and gensim?

Another term I came across regarding this topic was gensim.  When I researched it online, I found a bunch of resources describing it at a high level as “Topic Modeling for Humans,” which to me meant a whole lot of nothing at all.

So the first part to focus on is “Topic Modeling,” which is a technique for extracting the underlying topics from large volumes of text, and for this process a significant number of models exist (LDA, LSI, etc.).  At the core, topic modeling is the process of discovering and identifying hidden semantic patterns that help to inform context.

The key techniques I found used most in topic modeling are Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).  LSA is the older solution, theorized back in the 1980s, and it specifically assumes that words with similar meanings will appear in similar documents.  The idea is that if you construct a matrix containing word counts for each document and then use Singular Value Decomposition to reduce the number of dimensions, there is a mathematical way of deriving the similarity of the documents.  That similarity calculation is done using cosine similarity.
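
Here is a minimal LSA-style sketch using scikit-learn, assuming it is installed.  The tiny corpus is made up, and TruncatedSVD stands in for the Singular Value Decomposition step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# A made-up corpus: two documents about Star Wars, one about baking.
docs = [
    "darth vader and luke skywalker duel with lightsabers",
    "the jedi luke skywalker confronts darth vader",
    "bake the bread dough and let it rise in the oven",
]

# Step 1: build a weighted word-count matrix (one row per document).
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Step 2: SVD reduces the matrix to a small number of latent "concepts".
svd = TruncatedSVD(n_components=2, random_state=42)
X_latent = svd.fit_transform(X)

# Step 3: cosine similarity in the latent space.
print(cosine_similarity(X_latent))  # documents 0 and 1 should score high together
```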

Latent Dirichlet Allocation (LDA) is a more recent approach, which uses a Bayesian network to create a statistical model for deriving topics from a block of text.  In this case, the order of the words is irrelevant; instead, it looks to assign each word to a collection of topics based on the generative statistical model.

So as I wrap my head around this, it means that with embeddings we are not only assigning a vector to each word based on the word itself and how often it appears within the body of text, but also using other models to discern the topical meaning within the semantics of the text structure.  All of this provides context that makes RAG more useful to the LLM.

Specifically, gensim is a Python library that processes these unstructured digital texts using unsupervised machine learning to accomplish this, deriving context and meaning that add depth to the vectors.
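
And here is a minimal gensim LDA sketch, assuming gensim is installed; the tiny pre-tokenized corpus and the choice of two topics are arbitrary and just for illustration.

```python
from gensim import corpora
from gensim.models import LdaModel

# Tiny, pre-tokenized corpus -- invented for illustration.
texts = [
    ["darth", "vader", "sith", "lightsaber", "empire"],
    ["luke", "jedi", "lightsaber", "force", "yoda"],
    ["flour", "yeast", "oven", "bread", "bake"],
    ["dough", "oven", "bake", "bread", "flour"],
]

# Map each unique term to an integer id, then convert docs to bags of words.
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]

# Fit a 2-topic LDA model; word order is ignored, only the counts matter.
lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
               num_topics=2, passes=20, random_state=42)

for topic_id, words in lda.print_topics():
    print(topic_id, words)  # expect a "Star Wars" topic and a "baking" topic
```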

What is meant by TF-IDF (Term Frequency-Inverse Document Frequency)?

Now another key algorithm in natural language processing is Term Frequency-Inverse Document Frequency (TF-IDF).  This relates to the above as another weighting approach that sits alongside LSA and LDA, and it is a statistical measure used throughout natural language processing.

The first part is “Term Frequency,” which is about evaluating the importance of a word in a document relative to a collection of documents, or corpus.  It measures how often a word appears in a document, and a higher frequency generally indicates importance, which can in turn inform relevance to the document’s context.  This measure alone is problematic, as term frequency does not account for the global importance of terms; common words like articles will have high scores but are not meaningful.  So Term Frequency on its own can create a significant bias.

The second part is “Inverse Document Frequency,” which pulls down the weight of words that are common across many documents and boosts the weight of rare ones.  By itself this would not provide much value for embedding or deriving topics, but when combined with term frequency it helps to offset the bias created by the raw counts.
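
A minimal TF-IDF sketch with scikit-learn (assumed installed); the point to notice is that “the,” which appears in every made-up document, gets pulled down, while a rarer, more distinctive word scores higher.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration.
docs = [
    "the droid repairs a ship",
    "a ship enters the asteroid field",
    "the droid translates a message",
]

# TF-IDF is roughly tf(term, doc) * log(N / df(term)); sklearn adds smoothing.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

terms = vectorizer.get_feature_names_out()
weights = X.toarray()[0]  # weights for the first document

# "the" appears in every document, so its inverse document frequency drags
# its score down; "repairs" appears only here, so it scores the highest.
for term, weight in sorted(zip(terms, weights), key=lambda pair: -pair[1]):
    if weight > 0:
        print(f"{term:10s} {weight:.3f}")
```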

What is Euclidean distance?

Given the above, we discussed cosine similarity and, in the process, kept comparing it to Euclidean distance.  The true answer from a math perspective is that Euclidean distance is the straight-line distance between two points.  And at this point, I am very unimpressed with that definition.  Let’s dig deeper…

The truth is that this is a foundational metric; it plays a key role in data clustering and is part of what takes theoretical math and makes it practical.  The fact that it is so “simple” and “direct” makes it an ideal measure for real-world, practical systems.  For me, the key thing I found in researching this topic isn’t so much what Euclidean distance is, because it really is the foundational concept behind many data processing solutions, but how this single metric, by itself, can inject a bias into a system.  When you add in other metrics like cosine similarity, it really unlocks the power of modern AI and how NLP can work the way it does.
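
A minimal numpy sketch of the definition, plus the bias point from above: a longer document (bigger counts) ends up far away by straight-line distance even when it points in exactly the same direction.  The vectors are made-up counts for illustration.

```python
import numpy as np

def euclidean_distance(a, b):
    # sqrt(sum((a_i - b_i)^2)) -- the straight-line distance between two points.
    return float(np.linalg.norm(a - b))

short_doc = np.array([4.0, 1.0, 0.0])  # made-up term counts
long_doc = short_doc * 50              # same topic mix, much longer document

print(euclidean_distance(short_doc, long_doc))  # large, purely due to length

# The direction (what cosine similarity looks at) is unchanged:
print(np.dot(short_doc, long_doc) /
      (np.linalg.norm(short_doc) * np.linalg.norm(long_doc)))  # 1.0
```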

The benefit in traditional machine learning models is that it helps guide outcomes toward desirable solutions.  Take GPS, for example: when you put in a destination, the system calculates distances and then, based on weights and preferences, distance measures like Euclidean distance are a key part of the math that leads to the outcome we desire.

This metric is a key part of the math behind the clustering and classification algorithms that power much of what we use every day.  The reason is that it’s an easy metric to quantify, which enables and supports machine learning by providing a straightforward way to judge similarity and dissimilarity.

If you want a great step-by-step guide on doing the math here, see the article “Mastering Euclidean Distance in Machine Learning: A Comprehensive Guide.”

How do you measure embedding models?

When looking at how these models behave and function, it comes down to math, mainly because the models build numeric vectors out of string-based values.  Their quality is then measured by looking at the cosine similarity and Euclidean distance of the results.  These matter because, if we visualize the vectors as points, two similar words should end up closer together than dissimilar ones.  For example, take the following words:

  • Sedan
  • Truck
  • Helicopter

We would expect all of these to be relatively close together, as they are all vehicles, but the distance between sedan and truck should be less than the distance from truck to helicopter.  This is at a conceptual level, to be clear, but the goal is that a good embedding model generates vectors whose distances reflect those relationships (a small sketch of this check follows).
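
Here is a minimal sketch of that check using the sentence-transformers library, assuming it is installed and treating the all-MiniLM-L6-v2 model as a stand-in for whatever embedding model you are evaluating.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any embedding model works here; MiniLM is just a small, convenient example.
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["sedan", "truck", "helicopter"])

print("sedan vs truck:     ", cosine_similarity(vectors[0], vectors[1]))
print("truck vs helicopter:", cosine_similarity(vectors[1], vectors[2]))
# Expectation: sedan/truck should score higher than truck/helicopter.
```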

How are these similarities used?

When you implement Retrieval-Augmented Generation, you are enabling a vector data source for the model to use and consume.  The intention is that the prompt itself gets broken down into a vector for evaluation, and cosine similarity is used to find which document vectors are closest to the prompt / query vector; those documents are then used to generate a response.
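
To see the retrieval half of that flow end to end, here is a hedged sketch: an in-memory “vector store” built with sentence-transformers (assumed installed), a query embedding, and a cosine-similarity ranking.  A real system would use a vector database, but the flow is the same.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

# 1. Embed the documents once; this list is our tiny in-memory vector store.
documents = [
    "A 2,000-word overview of the Star Wars saga and its films.",
    "A 500-word article on why Anakin Skywalker became Darth Vader.",
    "A recipe for sourdough bread.",
]
doc_vectors = model.encode(documents)

# 2. Embed the incoming prompt the same way.
query = "Why did Darth Vader turn to the dark side?"
query_vector = model.encode(query)

# 3. Rank documents by cosine similarity to the query vector.
scores = [cosine_similarity(query_vector, dv) for dv in doc_vectors]
best = int(np.argmax(scores))
print(documents[best])  # this retrieved text would be handed to the LLM as context
```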

Why does this performance matter?

This matters because of the basic principle of “garbage in, garbage out”: higher-quality data means a better chance of better responses.  The better the embedding model, the better the results a prompt gets through RAG.

What is the difference between text-embedding-ada-002, text-embedding-3-large, and text-embedding-3-small?

To be clear, part of the difference here is temporal.  text-embedding-ada-002 has been around for a “long time” (relatively speaking, nowadays), but this model has tested rather well against English-language data specifically.  Released by OpenAI, many have considered it the industry standard.  It produces 1536-dimensional embeddings, which is on the higher end for dimensionality.  What do we mean by that?  It means the number of features per vector is higher, which generally means the embeddings are pretty high quality and leads to better results for prompt responses.

Now text-embedding-3-large has a higher dimensionality than ada-002, at 3072 by default, and the model has proven to be more semantically accurate overall.  The cost is higher, so the question is whether the data being fed for RAG requires that higher level of semantic accuracy to justify the price point.

The final option here is text-embedding-3-small, which provides the same 1536-dimensional vectors as ada-002.  This model is less semantically accurate than the large model, but it is cheaper and more efficient, which makes it ideal for cases where you are willing to trade some accuracy for speed of processing and cost.
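
For reference, here is a hedged sketch of calling these models through OpenAI’s current Python SDK; it assumes the openai package is installed and the OPENAI_API_KEY environment variable is set, and the model names and default dimensions are as documented at the time of writing.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

text = "Why did Darth Vader turn to the dark side?"

for model_name in ("text-embedding-ada-002",
                   "text-embedding-3-small",
                   "text-embedding-3-large"):
    response = client.embeddings.create(model=model_name, input=text)
    vector = response.data[0].embedding
    # ada-002 and 3-small return 1536 dimensions; 3-large returns 3072 by default.
    print(model_name, len(vector))
```

The 3-series models also accept a dimensions parameter if you want shorter vectors, which ada-002 does not support.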

At the end of the day, the classic trade-off triangle is still present for these models: you can have quality, speed, or cost…pick 2.

Conclusion:

After going down this rabbit hole, I found it particularly interesting and insightful to understand how embeddings work and how LLMs use them.  The math is way more complex than I was expecting.  Much like anything in technology, I find that breaking something down to understand how it works offers a lot of insight into how to leverage the tools better and where we can take things.  The architectural and cost implications of the embedding model you choose are an important consideration when architecting an application, because with different use cases the embeddings that are generated are going to influence the responses an LLM gives based on your data.  Is speed a primary concern, or is it completeness and context?  And along those lines, the volume and structure of the data can also have an impact, as we saw with the different classification approaches and algorithms.