RAG & Embeddings

A Hybrid Architecture at the Core of Modern AI

What is RAG?

Retrieval-Augmented Generation (RAG) is an architecture that combines two major components of natural language processing:

Retrieval: an information retrieval step that fetches relevant documents or passages from a knowledge base.
Generation: a response generation step driven by a Large Language Model (LLM), leveraging the retrieved documents to produce contextualized answers.

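This approach sidesteps a key limitation of fine-tuning, where knowledge is frozen into the model weights at training time, by injecting external knowledge into the prompt at inference time. The knowledge base can therefore be updated without retraining the model.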

The Central Role of Embeddings

Embeddings are vector representations of textual units (sentences, paragraphs, documents) in a dense, fixed-dimensional space (typically ranging from 384 to 1536 dimensions, depending on the model).

They are generated by specialized embedding models, often distinct from the LLM used for generation. Examples include sentence-transformers models, OpenAI text-embedding-3-small, Instructor, and GTE.
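To make the "fixed-dimensional" property concrete, here is a minimal sketch. The `toy_embed` function is a hypothetical stand-in for a real embedding model: it hashes words into buckets, so it reflects word overlap only, not meaning. The dimension of 16 is an arbitrary choice for illustration.

```python
import hashlib
import math

DIM = 16  # toy dimension; real embedding models use hundreds to thousands


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words.

    Each word is hashed into one of `dim` buckets, then the count vector
    is L2-normalized. This captures word overlap only, not semantics.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


short_vec = toy_embed("vector search")
long_vec = toy_embed("a much longer paragraph about storing vectors in a database " * 5)

# Inputs of very different lengths map to vectors of the same size.
print(len(short_vec), len(long_vec))  # prints: 16 16
```

Whatever the input length, the output has the same number of dimensions, which is what lets a vector store compare any query against any chunk.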

How are embeddings used in RAG?

Here are the typical processing steps in a RAG pipeline:

Semantic Indexing (offline)
• Reference documents are segmented into chunks (e.g., 200–500 tokens).
• Each chunk is encoded into a vector by an embedding model.
• Vectors are stored in a vector index or database (Vector DB) such as FAISS, Weaviate, Qdrant, or Pinecone.

Semantic Search (online)
• When a user submits a query, the query is encoded into a vector with the same embedding model.
• Vector similarity (cosine, dot product, etc.) is computed between the query vector and the indexed chunk vectors.
• The top-k most similar passages are retrieved.

Augmented Generation
• Retrieved passages are injected into the LLM prompt, typically by placing them directly in the context window (prompt stuffing, a form of in-context learning). This can be combined with prompting techniques such as chain-of-thought.
• The LLM generates a response based on the retrieved information, resulting in answers that are:
  o Contextualized
  o Grounded in the retrieved sources, and therefore less prone to hallucination
  o Domain- or organization-specific
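The three steps above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not a production pipeline: a word-count vector stands in for the embedding model, a plain Python list stands in for the vector database, words stand in for tokens when chunking, and no LLM is actually called. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into chunks of at most `max_words` words (words stand in for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents: list[str], max_words: int = 40) -> list[tuple[str, Counter]]:
    """Offline step: chunk every document and store (chunk, vector) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc, max_words)]


def search(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Online step: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, passages: list[str]) -> str:
    """Generation step: place the retrieved passages in the prompt."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


documents = [
    "Employees may work remotely up to three days per week with manager approval.",
    "Expense reports must be submitted within 30 days of purchase.",
    "The office cafeteria is open from 8 am to 3 pm on weekdays.",
]
index = build_index(documents)
query = "How many days per week can I work remotely?"
prompt = build_prompt(query, search(index, query, k=2))
print(prompt)  # this string would be sent to the LLM of your choice
```

In a real system, `embed` would call an embedding model and `search` would delegate to the vector store's own nearest-neighbor query, but the flow of data stays the same.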

Typical Technology Stack for an Efficient RAG Pipeline

Component	Possible Technologies
Embedding Model	OpenAI, HuggingFace (e.g., sentence-transformers), Cohere
Vector Store	FAISS, Qdrant, etc.
LLM	LLaMA 3/4, Mistral, GPT-4, Claude, etc.
Orchestration	LangChain, etc.
Cloud Stack	Azure AI Search, AWS Kendra + Bedrock, GCP Vertex AI
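Because each row of the table lists interchangeable options, one way to keep a pipeline flexible is to code against small interfaces rather than a specific vendor. The sketch below assumes three hypothetical interfaces (`Embedder`, `VectorStore`, `LLM`) and trivial stub implementations; none of these names come from LangChain or any other library.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class Embedder(Protocol):
    def embed(self, text: str) -> Sequence[float]: ...


class VectorStore(Protocol):
    def add(self, text: str, vector: Sequence[float]) -> None: ...
    def query(self, vector: Sequence[float], k: int) -> list[str]: ...


class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...


@dataclass
class RagPipeline:
    """Orchestrates the three components; any implementation can be swapped in."""
    embedder: Embedder
    store: VectorStore
    llm: LLM

    def index(self, texts: Sequence[str]) -> None:
        for text in texts:
            self.store.add(text, self.embedder.embed(text))

    def answer(self, question: str, k: int = 3) -> str:
        passages = self.store.query(self.embedder.embed(question), k)
        prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}"
        return self.llm.complete(prompt)


# Stub implementations, for illustration only.
class LengthEmbedder:
    def embed(self, text: str) -> Sequence[float]:
        return [float(len(text))]  # one dimension: the text length


class ListStore:
    def __init__(self) -> None:
        self.items: list[tuple[str, Sequence[float]]] = []

    def add(self, text: str, vector: Sequence[float]) -> None:
        self.items.append((text, vector))

    def query(self, vector: Sequence[float], k: int) -> list[str]:
        ranked = sorted(self.items, key=lambda item: abs(item[1][0] - vector[0]))
        return [text for text, _ in ranked[:k]]


class EchoLLM:
    def complete(self, prompt: str) -> str:
        return prompt  # returns the prompt unchanged instead of calling a model


pipeline = RagPipeline(LengthEmbedder(), ListStore(), EchoLLM())
pipeline.index(["short note", "a considerably longer reference document"])
print(pipeline.answer("some query", k=1))
```

Replacing a stub with a wrapper around a real embedding model, vector store, or LLM client would not require changes to `RagPipeline` itself.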
