Day 2: Vector DB in RAG

Today, we are going to understand how Large Language Models (LLMs) interact with PDF documents using Vector Databases and Retrieval-Augmented Generation (RAG).

When we upload a PDF into an AI application, the LLM does not directly “read” the file like humans do. Instead, the document goes through multiple processing stages before the model can answer questions or summarize the content.


LLM + PDF = Intelligent Summarization

The main goals of combining an LLM with a PDF are to:

  • summarize information
  • answer questions
  • retrieve relevant content
  • perform semantic search

To achieve this, the PDF content is first converted into embeddings and stored inside a Vector Database.


Step 1: Loading the PDF into a Vector Database

The first step is extracting the text from the PDF.

Once extracted, the content is split into smaller chunks:

  • sentence by sentence
  • paragraph by paragraph
  • or token-based chunks

Why do we split the content?

Because smaller chunks improve:

  • retrieval accuracy
  • semantic matching
  • context relevance

These chunks are then stored inside a Vector Database.
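
Here is a minimal sketch of the splitting step, using a hypothetical chunk_text helper that cuts the text into fixed-size pieces with a small overlap (production pipelines often use ready-made text splitters instead):

  # chunk_text is an illustrative helper, not a library function
  def chunk_text(text, chunk_size=500, overlap=50):
      """Split text into overlapping, fixed-size character chunks."""
      chunks = []
      start = 0
      while start < len(text):
          chunks.append(text[start:start + chunk_size])
          start += chunk_size - overlap  # overlap preserves context at boundaries
      return chunks

  extracted_pdf_text = "..."  # stands in for text extracted from the PDF
  chunks = chunk_text(extracted_pdf_text)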


Why Do We Need a Vector Database?

Traditional databases store:

  • rows
  • columns
  • text values

But AI systems require a different way of storing information.

A Vector Database stores data as:

  • embeddings
  • vectors
  • multi-dimensional numerical points

Flow:

Input Text → Embedding Model → Multi-Dimensional Vector

Instead of storing plain text, the system stores the “meaning” of the text as numbers.


Understanding Embeddings

Embedding models convert text into vectors.

For example:

"The weather is good today"

may become:

[0.21, -0.44, 0.98, ...]

These values represent the semantic meaning of the sentence.

Models like:

  • Nomic Embed Text
  • OpenAI Embeddings
  • Sentence Transformers

generate vectors with hundreds of dimensions.

Example:

  • a 768-dimensional vector
  • each dimension is one numerical axis, like x, y, z extended to hundreds of axes

This high-dimensional representation helps AI systems capture relationships between words and sentences.
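
As a minimal sketch with the sentence-transformers library (the all-MiniLM-L6-v2 model used here outputs 384 dimensions; larger models such as nomic-embed-text output 768):

  from sentence_transformers import SentenceTransformer

  # Load a small, widely used open-source embedding model
  model = SentenceTransformer("all-MiniLM-L6-v2")

  vector = model.encode("The weather is good today")
  print(vector[:5])    # first few numbers of the embedding
  print(vector.shape)  # (384,), i.e. one number per dimension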


One-Hot Encoding: A Basic Representation Technique

Before modern embeddings, a simple encoding technique called One-Hot Encoding was commonly used.

Example sentences:

["Hi", "how are you", "are you okay"]

First, the vocabulary is created:

["Hi", "how", "are", "you", "okay"]

Repeated words are consolidated into a single vocabulary list.

Now each sentence is converted into a numerical representation using 0s and 1s.

Sentence 1 ("Hi"):

[1, 0, 0, 0, 0]

Sentence 2 ("how are you"):

[0, 1, 1, 1, 0]

Sentence 3 ("are you okay"):

[0, 0, 1, 1, 1]

Notice how:

  • “are”
  • “you”

appear in both Sentence 2 and Sentence 3.

This shared vocabulary creates a measurable relationship between the two sentences.

However, One-Hot Encoding has limitations:

  • it treats every word as equally unrelated to every other word
  • vectors grow long and sparse as the vocabulary grows
  • it does not capture semantic similarity

That is why modern AI systems use embeddings instead.
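
The encoding above can be reproduced with a few lines of plain Python:

  sentences = ["Hi", "how are you", "are you okay"]

  # Build the vocabulary: unique words in order of first appearance
  vocab = []
  for sentence in sentences:
      for word in sentence.split():
          if word not in vocab:
              vocab.append(word)
  # vocab is now ['Hi', 'how', 'are', 'you', 'okay']

  # Encode each sentence as 0s and 1s against the vocabulary
  for sentence in sentences:
      words = sentence.split()
      print([1 if word in words else 0 for word in vocab])
  # [1, 0, 0, 0, 0]
  # [0, 1, 1, 1, 0]
  # [0, 0, 1, 1, 1]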


What is Semantic Search?

Semantic Search means searching based on meaning rather than exact keyword matching.

Example: the English word "rice" and the Tamil word "Arisi" both refer to the same thing.

A semantic search system can understand this relationship and retrieve relevant results even if the exact words differ.

This is called:

  • Semantic Search
  • Similarity Search
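
As a minimal sketch of this idea, assuming a multilingual embedding model (sentence-transformers' paraphrase-multilingual-MiniLM-L12-v2 maps many languages into one vector space; how well it scores a transliterated pair like this one depends on its training data):

  from sentence_transformers import SentenceTransformer, util

  model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

  a = model.encode("rice", convert_to_tensor=True)
  b = model.encode("arisi", convert_to_tensor=True)  # Tamil word for rice

  # A high cosine similarity means the model treats the words as related
  print(util.cos_sim(a, b))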

How LLMs Use Semantic Search

LLMs are trained on massive datasets.

If the training data already contains relationships between:

  • languages
  • concepts
  • meanings

then the model can retrieve semantically related information.

But what if your custom business data was not part of the model's training data?

In that case:

  1. Create embeddings from your own documents
  2. Store them in a Vector Database
  3. Retrieve the most relevant chunks when a user asks a question

This is the foundation of RAG (Retrieval-Augmented Generation).


Query Flow in a Vector Database

When a user asks a question:

User Query → Embedding Model → Vector

The query is converted into a vector.

The Vector Database then searches for the nearest matching vectors.

Example: retrieve the top 5 nearest vectors.

The closest matching chunks are retrieved and passed to the LLM.

Finally, the LLM generates a contextual answer using the retrieved information.
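
A minimal numpy sketch of this retrieval step, assuming the chunks were already embedded into a matrix doc_vectors by the same embedding model:

  import numpy as np

  def top_k(query_vector, doc_vectors, k=5):
      """Return the indices of the k nearest document vectors."""
      # Normalize so the dot product equals cosine similarity
      q = query_vector / np.linalg.norm(query_vector)
      d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
      scores = d @ q
      return np.argsort(scores)[::-1][:k]  # best matches first

  doc_vectors = np.random.rand(1000, 384)  # stand-in for stored chunk embeddings
  query_vector = np.random.rand(384)       # stand-in for the embedded user query
  print(top_k(query_vector, doc_vectors))  # indices of the top 5 chunks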


How Does the Database Decide Which Vectors Are Close?

This is where similarity metrics come into play.

The most commonly used metric is:

Cosine Similarity

Cosine similarity measures the cosine of the angle between two vectors.

If two vectors point in similar directions, they are considered semantically similar:

  • smaller angle → higher similarity
  • larger angle → lower similarity

The comparison is based on the angle (theta) between the vectors, i.e. on their direction rather than their magnitude.

Cosine similarity works especially well for:

  • text embeddings
  • high-dimensional vectors
  • semantic comparisons

Other similarity metrics also exist:

  • Euclidean Distance
  • Manhattan Distance
  • Dot Product Similarity

Cosine similarity is one of the most effective metrics for NLP applications.
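
A minimal numpy sketch comparing these metrics on two small vectors:

  import numpy as np

  a = np.array([0.2, 0.8, 0.5])
  b = np.array([0.3, 0.7, 0.6])

  cosine    = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based
  euclidean = np.linalg.norm(a - b)   # straight-line distance
  manhattan = np.sum(np.abs(a - b))   # sum of per-axis differences
  dot       = a @ b                   # unnormalized similarity

  print(cosine, euclidean, manhattan, dot)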


KNN and ANN Algorithms

Vector Databases use specialized search algorithms.

KNN (K-Nearest Neighbour)

KNN performs an exact search: it compares the query against every stored vector and returns the true nearest neighbours.

Example:

  • find the top 5 closest vectors

ANN (Approximate Nearest Neighbour)

ANN is optimized for speed.

Instead of checking every vector, it finds approximate nearest matches efficiently, typically using index structures such as HNSW (Hierarchical Navigable Small World) graphs.

Most production-grade Vector Databases use ANN because:

  • datasets are huge
  • search needs to be fast
  • exact comparison is computationally expensive
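
A minimal sketch of exact KNN using scikit-learn (ANN engines answer the same kind of query, but trade a little accuracy for much faster search):

  import numpy as np
  from sklearn.neighbors import NearestNeighbors

  # 1,000 stored vectors of 384 dimensions, standing in for chunk embeddings
  doc_vectors = np.random.rand(1000, 384)
  query = np.random.rand(1, 384)

  # Brute-force search compares the query against every stored vector
  knn = NearestNeighbors(n_neighbors=5, metric="cosine", algorithm="brute")
  knn.fit(doc_vectors)
  distances, indices = knn.kneighbors(query)  # the 5 exact nearest vectors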

Simple RAG Workflow Example

Imagine you have 5 custom sentences stored locally.

Flow 1: Storing Data

User Data → Embedding Model → Vector Conversion → Store in ChromaDB

Flow 2: Querying Data

User Query → Embedding Model → Similarity Search in ChromaDB → Relevant Chunks Retrieved → LLM Generates Answer

This is the basic architecture behind most RAG applications.
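
A minimal end-to-end sketch of both flows with ChromaDB, which embeds the documents with its built-in default embedding model (the five sentences are placeholders):

  import chromadb

  client = chromadb.Client()  # in-memory instance
  collection = client.create_collection(name="my_sentences")

  # Flow 1: store five custom sentences (ChromaDB embeds them for us)
  collection.add(
      ids=["s1", "s2", "s3", "s4", "s5"],
      documents=[
          "The weather is good today",
          "Rice is a staple food in Tamil Nadu",
          "Vector databases store embeddings",
          "Cosine similarity measures the angle between vectors",
          "LLMs generate answers from retrieved context",
      ],
  )

  # Flow 2: query; the retrieved chunks would then be passed to the LLM
  results = collection.query(
      query_texts=["How is similarity between vectors measured?"],
      n_results=2,
  )
  print(results["documents"])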


Popular Tools You Can Try

Embedding Models

  • Nomic Embed Text
  • Sentence Transformers
  • OpenAI Embeddings

Vector Databases

  • ChromaDB
  • Qdrant
  • Redis Vector Search
  • PostgreSQL with pgvector

Dimensionality Reduction

Embedding vectors can contain hundreds or thousands of dimensions.

To visualize or optimize them, dimensionality reduction techniques are used.

Popular algorithms:

  • PCA (Principal Component Analysis)
  • t-SNE
  • UMAP

For example, the vector (1, 1, 1) represents a point in 3D space with coordinates x = 1, y = 1, z = 1; embedding vectors extend the same idea to hundreds of dimensions.

Dimensionality reduction helps compress and visualize such high-dimensional data.
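
A minimal sketch using scikit-learn's PCA to project embeddings down to two dimensions for plotting:

  import numpy as np
  from sklearn.decomposition import PCA

  # Stand-in for 100 embedding vectors with 768 dimensions each
  embeddings = np.random.rand(100, 768)

  pca = PCA(n_components=2)  # keep the two most informative directions
  points_2d = pca.fit_transform(embeddings)
  print(points_2d.shape)     # (100, 2), ready to plot on x/y axes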


Text Search vs Image Search

Different Vector Databases are often associated with different workloads, although most of them can handle both.

Commonly used for text search:

  • ChromaDB
  • pgvector

Commonly used for image and multimodal search:

  • Qdrant

Image-based systems also use techniques like:

  • feature extraction
  • max pooling
  • vector compression

Max pooling is commonly used in convolutional neural networks to down-sample feature maps: it keeps the strongest activations while shrinking the spatial size, as sketched below.
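
A minimal numpy sketch of 2x2 max pooling, which halves each spatial dimension by keeping only the strongest value in every 2x2 window:

  import numpy as np

  feature_map = np.array([
      [1, 3, 2, 0],
      [4, 6, 1, 2],
      [7, 2, 9, 5],
      [3, 1, 4, 8],
  ])

  # Group the 4x4 map into 2x2 windows, then take the max of each window
  pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
  print(pooled)
  # [[6 2]
  #  [7 9]]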


Final Thoughts

Vector Databases are one of the core building blocks of modern AI systems and RAG applications.

They enable:

  • semantic search
  • contextual retrieval
  • intelligent question answering
  • document summarization

Without Vector Databases, LLMs would struggle to work effectively with large custom datasets like PDFs, enterprise documents, and knowledge bases.

Understanding embeddings, similarity search, cosine similarity, and vector retrieval is essential for building scalable AI applications in the real world.
