When I started looking into the architecture of Large Language Models (LLMs), I got confused when I encountered Retrieval Augmented Generation (RAG). Both LLMs themselves and RAG use embeddings (a numerical vector representation of a token) and, because of the shared terminology, I wrongly assumed that the embeddings in both are strongly related. It is in fact much simpler: while both use embeddings, they are unrelated to each other.
Note: I'm still dipping my toes into the world of LLMs (and other generative AI, like diffusion-models for image generation), so my posts might be inaccurate. I welcome any feedback or comments on this.
Embeddings in a large language model
LLMs are trained to predict text given a certain input. The predicted text comes in so-called tokens, small text snippets. Each predicted token is added to the input text, and the LLM then predicts the next token, continuing until it predicts a special token that indicates the end of a text sequence.
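To make that loop concrete, here is a minimal sketch. The `toy_predict` function is a hypothetical stand-in for the model (it just replays a memorised line), and the `<end>` marker is an assumed name for the end-of-sequence token, not the one any particular model uses.

```python
from typing import Callable, List

END = "<end>"  # assumed name for the end-of-sequence token

def generate(prompt: List[str],
             predict_next: Callable[[List[str]], str],
             max_new: int = 50) -> List[str]:
    """Append one predicted token at a time until END or the limit is reached."""
    tokens = list(prompt)
    for _ in range(max_new):
        nxt = predict_next(tokens)
        if nxt == END:
            break
        tokens.append(nxt)
    return tokens

# Toy stand-in for a trained model: it replays a memorised line.
POEM = ["Two", " roads", " diverged", " in", " a", " yellow", " wood", END]

def toy_predict(tokens: List[str]) -> str:
    return POEM[len(tokens)] if len(tokens) < len(POEM) else END

result = generate(["Two", " roads"], toy_predict)
print("".join(result))  # Two roads diverged in a yellow wood
```

A real model would of course compute the next token from the whole input instead of looking it up by position, but the outer loop has the same shape.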
Suppose the text at that point is the following:
Two roads diverged in a yellow wood, and sorry I
You might know this as the start of "The Road Not Taken", a poem by Robert Frost. If the LLM is trained with this poem, it might be able to predict the next tokens. When I ran this as input through Qwen3-VL 8B, one of the more recent open-weights models released by the Qwen team at Alibaba Cloud, it was able to continue the poem for a while, but eventually strayed off course.
Two roads diverged in a yellow wood, and sorry I couldn’t travel both and be one traveler, long I stood. and looked down one as far as I could to where it bent in the undergrowth; Then took the other, as just as fair, and having perhaps a better claim, because it was grassy and wanted wear; though as for that the passing there had worn them really about the same, And both that morning equally lay in leaves no step had trodden black. And both … The question is — which way does he take? It’s not clear. He says “I took the other”, but then says “the passing there had worn them really about the same”. So why did he choose one over the other? Is it a matter of chance? Or is there something more symbolic going on?
Note: There is some randomness involved here. Other iterations with the same model and input did result in the poem being quoted correctly, followed by an analysis of the poem.
While generating the output, the model generates one part of text at a time. This part of text is called a token, and the LLM has a built-in tokenizer that converts text into tokens, and tokens back into text. For the Qwen3 models, the Qwen tokenizer is used. If I understand its vocabulary correctly, the text "couldn't travel" would be tokenized into:
[ "couldn", "'t", " ", "travel" ]
Different LLMs can use different tokenization methods, but there is a lot of re-use here. Different LLM models can use the same tokenizer.
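The text-to-tokens-and-back round trip can be sketched as follows. This is not the Qwen tokenizer: the vocabulary is a tiny hypothetical one, and the greedy longest-match rule is a simplification I swapped in for the byte-pair-encoding style algorithms that real tokenizers typically use.

```python
# Hypothetical mini vocabulary; a real one has tens of thousands of entries.
VOCAB = ["couldn", "'t", " ", "travel", "Two", "roads"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(text):
    """Greedy longest-match tokenization (a simplification, not BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        match = max((t for t in VOCAB if text.startswith(t, pos)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f"no token covers {text[pos:]!r}")
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids

def decode(ids):
    """Turn token ids back into text by concatenating the token strings."""
    return "".join(VOCAB[i] for i in ids)

ids = encode("couldn't travel")
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # couldn't travel
```

The ids are just positions in the vocabulary. Two models that share a tokenizer would produce the same ids for the same text.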
These tokens are converted into embeddings, which form the foundational representation for use in LLMs. They are numerical vectors that represent those text tokens. LLMs work with these numerical vectors: LLMs (and AI in general) are software systems that perform heavy computational operations, performing many matrix operations with each matrix being a massive set of numbers. Well, text is represented as a huge matrix.
Embeddings are not just a simple index, but are pretrained values. These values enable token mapping based on semantic similarity. When the training material often combines "corona" and "COVID", then these two will have embeddings that allow both terms to be seen as close to each other. But the same is true if there is material combining "corona" and "beer". So the embedding that represents "corona" (assuming it is a single token) would have semantic understanding of both corona being a viral disease (related to COVID-19) as well as an alcoholic beverage.
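"Close to each other" is usually measured with cosine similarity. Below is a small sketch with made-up three-dimensional vectors. Real embeddings have thousands of dimensions and trained values, so the numbers here are purely illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up vectors; imagine the axes as (disease-ness, drink-ness, other).
toy_embeddings = {
    "corona": [0.7, 0.7, 0.1],
    "COVID":  [0.9, 0.0, 0.1],
    "beer":   [0.0, 0.9, 0.1],
}

corona_covid = cosine(toy_embeddings["corona"], toy_embeddings["COVID"])
corona_beer = cosine(toy_embeddings["corona"], toy_embeddings["beer"])
covid_beer = cosine(toy_embeddings["COVID"], toy_embeddings["beer"])
```

With these values "corona" ends up similar to both "COVID" and "beer", while "COVID" and "beer" are nearly unrelated to each other.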
Unlike tokenizers, which can be reused across different LLM models, the embeddings are unique to each model. Sure, within the same family (e.g. Qwen3) there can be reuse as well, but it is much less common to see this re-use across different families.
The phrase "Two roads" would consist of three tokens ("Two", " ", "roads"), each of which is converted into a corresponding 4096-dimensional embedding vector during processing. The dimension is fixed for a particular LLM: Qwen3 8B for instance uses embeddings of 4096 numbers. So that start would be a matrix with dimensions 3x4096. The entire text itself thus would be represented by a very large matrix, with one dimension being this embedding size (4096 in my case), the other dimension being the number of tokens already used as text (both input and generated output).
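The lookup itself is nothing more than picking one row per token from a table. A minimal sketch follows, where random numbers stand in for the trained values and the table only holds the three tokens from my example.

```python
import random

EMBED_DIM = 4096  # embedding size mentioned above for Qwen3 8B

# Random values are placeholders for the values learned during training.
rng = random.Random(0)
vocab = ["Two", " ", "roads"]
embedding_table = {tok: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
                   for tok in vocab}

def embed(tokens):
    """Look up one embedding vector per token: a len(tokens) x EMBED_DIM matrix."""
    return [embedding_table[tok] for tok in tokens]

matrix = embed(["Two", " ", "roads"])
shape = (len(matrix), len(matrix[0]))
print(shape)  # (3, 4096)
```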
These matrices are then used as input within the LLM, which then starts doing magic with them (well, not really magic, it's rather maths, multiplying the matrix against other in-LLM stored matrices, iterating over multiple blocks of matrix operations, etc.) to eventually output a score for every token in the vocabulary. One token is picked based on those scores, and its embedding is appended to the input matrix to re-iterate the entire process over and over again.
The maximum number of tokens that a model can handle is also predefined, although there are methods to extend this. For Qwen3 8B, this is 32768 natively, and 131072 with an extension method called YaRN. So, for the native implementation, that means the maximum text size would be represented as a matrix of dimensions 32768x4096.
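To get a feeling for those sizes, the arithmetic can be worked out as below. The two bytes per value is my assumption (16-bit numbers), and this only covers the input matrix, not the memory the model needs while processing it.

```python
EMBED_DIM = 4096
NATIVE_CONTEXT = 32768
YARN_CONTEXT = 131072
BYTES_PER_VALUE = 2  # assumption: each number is stored in 16 bits

def input_matrix_size(tokens, dim=EMBED_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Return (number of values, size in bytes) of a tokens x dim matrix."""
    values = tokens * dim
    return values, values * bytes_per_value

native_values, native_bytes = input_matrix_size(NATIVE_CONTEXT)
yarn_values, yarn_bytes = input_matrix_size(YARN_CONTEXT)

print(f"{native_values:,} values, {native_bytes // 2**20} MiB")  # 134,217,728 values, 256 MiB
print(f"{yarn_values:,} values, {yarn_bytes // 2**20} MiB")      # 536,870,912 values, 1024 MiB
```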
Retrieval Augmented Generation
An LLM is trained on a fixed set of data, so once training is finished, it cannot learn anything new. To make it more useful, you want the LLM to have access to recent insights. Nowadays, the hype is all about MCP (Model Context Protocol), which relies on LLMs being trained to understand that they have tools at their disposal and to know how to call them. (Well, in reality, they are trained to generate output that the software running the LLM detects as a tool call. That software executes the tool and adds its result to the text already generated, allowing the LLM to continue.)
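The detect-execute-append cycle described in that parenthesis could look roughly like this. It is not MCP itself: the `<tool>name:argument</tool>` marker, the `weekday` tool and the `toy_model` function are all hypothetical, chosen only to show how the host software sits between the model and the tool.

```python
import re

# Hypothetical marker format that the model is assumed to emit for tool calls.
TOOL_CALL = re.compile(r"<tool>(\w+):(.*?)</tool>")

def run_with_tools(prompt, model, tools, max_rounds=3):
    """Let the model generate, run any tool it asks for, and feed the result back."""
    text = prompt
    for _ in range(max_rounds):
        output = model(text)
        text += output
        call = TOOL_CALL.search(output)
        if call is None:
            break  # no tool requested: the answer is complete
        name, arg = call.groups()
        result = tools[name](arg)
        text += f"\n[tool result: {result}]\n"
    return text

def toy_model(text):
    """Stand-in for an LLM: asks for the tool once, then answers."""
    if "[tool result:" not in text:
        return "<tool>weekday:now</tool>"
    return "It is Tuesday."

tools = {"weekday": lambda arg: "Tuesday"}  # hypothetical tool
transcript = run_with_tools("What day is it? ", toy_model, tools)
```

The model never runs anything itself. It only produces text, and the software around it decides what that text triggers.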
Before MCP the world was (and still is) using Retrieval Augmented Generation (RAG). The idea behind RAG is that, before the LLM responds to a user's query (prompt) it also receives new information from external data sources. With both the user query and information from the sources, the LLM is able to generate more useful output.
When I looked at RAG, I noticed it using embeddings as well prior to the actual retrieval, so I wrongly thought that those are the same embeddings, and that the outcome of the RAG would be an embedding matrix as well, that the LLM then receives and further processes...
I was misled by documentation on RAG stating things like "the data to be referenced is converted into LLM embeddings", and by the fact that the technology used for RAG retrieval is a vector database specialized for embedding-based operations. Many online resources also looked at RAG as a complete, singular solution with multiple components. So I jumped to the conclusion that these are the same embeddings. But then, that would mean the RAG solution would be tailored to the LLM being used, because other LLMs (like Llama3, or Mistral) use a different embedding vocabulary.
Instead, what RAG does is take the same prompt, convert it into tokens and embeddings (using its own tokenizer/embedding vocabulary) and then use those to perform a search operation against the data that is added to the RAG database. This data (the recent insights or other documents you want your LLM to know about) is also tokenized and converted into embeddings. However, it is not those embeddings that are brought back to the main LLM, but the plain text outcome (or other media types that your LLM understands, such as images).
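A minimal sketch of that retrieval step is shown below. The `toy_embed` function is a crude stand-in (plain word counts) for a real embedding model, and the two documents are invented. The part that matters is the return value: the vectors are only used for ranking, and what comes back is text.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: word counts as a sparse vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented knowledge base; each document is stored next to its own embedding.
documents = [
    "the office closes at six on fridays",
    "the cafeteria menu changes every monday",
]
index = [(doc, toy_embed(doc)) for doc in documents]

def retrieve(query, k=1):
    """Rank documents by similarity to the query and return their plain text."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

hits = retrieve("when does the office close on fridays")
print(hits)  # ['the office closes at six on fridays']
```

Because only strings leave `retrieve`, the embedding model used here can differ completely from whatever the LLM uses internally.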
Why does RAG then use embeddings? Wouldn't a simple search engine be sufficient? Well, RAG's primary advantage is its ability to locate relevant information more effectively through embeddings. Thanks to the embedding representation, RAG can find information that is related to the user query without relying on keyword matches. You could effectively replace the RAG engine with a simple search, and many LLM-powered software applications do support this. For instance, Koboldcpp, which I use to run LLMs locally, supports a simple DuckDuckGo-based websearch as well.
The use of embeddings for search operations (again, completely independent of the LLM) allows for contextual understanding. When a user prompts for "What are the ingredients for Corona", a simple keyword-based search operation might incorrectly result in findings of COVID-19, whereas in this case the query is about the Corona beer.
These improved search operations are often called "semantic search", as they have a better understanding of the semantics and meanings of text (through the embeddings), resulting in more contextually relevant insights.
When is it "RAG" and when semantic search
Retrieval Augmented Generation is the process of converting the user query, performing a semantic search against the knowledge base, and appending the best results (e.g. top-3 hits in the knowledge base) to the user input text. This completed input text thus contains both the user query, as well as pieces of insights obtained from the semantic search. The LLM uses this additional information for generating better outcomes. This entire pipeline (retrieving context, augmenting the prompt, and then generating output) is what defines "RAG".
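The three steps (retrieve, augment, generate) can be sketched in a few lines. The prompt layout with "Context:" and "Question:" is my own assumption, and `toy_search` and `echo_llm` are hypothetical stand-ins for the semantic search and the LLM.

```python
def augment(query, hits, top_k=3):
    """Build the completed input text: best hits first, then the user query."""
    context = "\n".join(f"- {hit}" for hit in hits[:top_k])
    return f"Context:\n{context}\n\nQuestion: {query}"

def rag_answer(query, search, llm, top_k=3):
    """Retrieve context, augment the prompt, then generate."""
    hits = search(query)
    return llm(augment(query, hits, top_k))

# Hypothetical stand-ins so the pipeline can run without any real services.
def toy_search(query):
    return ["hit one", "hit two", "hit three", "hit four"]

def echo_llm(prompt):
    return prompt  # a real LLM would generate an answer from this text

completed_input = rag_answer("What changed?", toy_search, echo_llm)
print(completed_input)
```

Since the stand-in LLM simply echoes its input, the printed text is exactly what a real LLM would receive: the top three hits followed by the question.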
I personally see RAG as technology-wise very similar to a regular search: replace the semantic search with a search engine (which could also use semantic search under the hood anyway) and the outcome is the same. The main difference is that RAG is meant for finding exact truth, with information snippets tailored to bring context information accurately, whereas a search engine based retrieval would rather bring back less curated snippets of data.
In the market, RAG also focuses on the management of the semantic search (and vector database), optimizing the data that is added to the knowledge base to be LLM-friendly (shorter pieces of accurate data, rather than fully-indexed complete pages which could easily overload the maximum size that an LLM can handle). It prioritizes efficient data management and insights lifecycle control.
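Cutting documents into shorter pieces before adding them to the knowledge base is often called chunking. One simple way to do it is sketched below, with the chunk size and overlap counted in words. Both the approach and the default numbers are assumptions for illustration, as real products have their own strategies.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk so that sentences cut in half keep some context."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, size=5, overlap=2)
print(chunks)
# ['w0 w1 w2 w3 w4', 'w3 w4 w5 w6 w7', 'w6 w7 w8 w9 w10', 'w9 w10 w11']
```

Each chunk would then get its own embedding, so a search hit returns a short passage instead of a complete page.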
For LLMs, it also provides a bit more nuance. A web search would be presented to the LLM as "The following information can be useful to answer the question", whereas RAG results would be presented as actual insights/context. LLMs might be trained to deal differently with that distinction.
Understanding that the semantic search is independent of the LLM of course makes much more sense. It allows companies or organizations to build up a knowledge base and maintain this knowledge independent of the LLMs. Multiple different LLMs can then use RAG to obtain the latest information from this knowledge base. Or you can just use the engine for semantic searches alone, as you do not need LLMs to get beneficial searches. Many popular web search engines use semantic search under the hood (i.e. when they index pages, they also generate the embeddings from it and store those in their own vector databases to improve search results).
When new embedding algorithms emerge that you want to use, you must re-generate the embeddings for the entire knowledge base. But that will most likely occur much, much less frequently than using new LLM models (given the rapid evolution here).
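Why a full re-generation is needed becomes clear in a sketch: vectors from two different embedding models are not comparable, so none of the old ones can be kept. The `KnowledgeBase` class and both toy embedders below are hypothetical, not the API of any vector database.

```python
class KnowledgeBase:
    """Keeps the source texts next to their vectors, so re-embedding stays possible."""

    def __init__(self, embed, model_name):
        self.embed = embed
        self.model_name = model_name
        self.texts = []
        self.vectors = []

    def add(self, text):
        self.texts.append(text)
        self.vectors.append(self.embed(text))

    def switch_model(self, embed, model_name):
        """Replace the embedding model and re-generate every stored vector."""
        self.embed = embed
        self.model_name = model_name
        self.vectors = [embed(text) for text in self.texts]

# Toy embedders with different output sizes, standing in for two model versions.
def embed_v1(text):
    return [float(len(text))]

def embed_v2(text):
    return [float(len(text)), float(text.count(" "))]

kb = KnowledgeBase(embed_v1, "toy-v1")
kb.add("first note")
kb.add("a second, longer note")
kb.switch_model(embed_v2, "toy-v2")
```

Swapping the LLM, on the other hand, would not touch this class at all, since the LLM only ever sees the texts.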
Conclusion
RAG is a feature of the software that runs the LLM, allowing for retrieving contextual information from a curated knowledge base. RAG's use of embeddings is related to its semantic search, not to the same embeddings as those used by the LLM. The contextual information is added to the user prompt as text, and only then 'converted' into the embeddings used by the LLM itself.
Feedback? Comments? Don't hesitate to get in touch on Mastodon.
Images are created in Inkscape, using icons from Streamline (GitHub), released under the CC BY 4.0 license, indexed at OpenSVG.



















