Today, we are going to understand how Large Language Models (LLMs) interact with PDF documents using Vector Databases and Retrieval-Augmented Generation (RAG).
When we upload a PDF into an AI application, the LLM does not directly “read” the file like humans do. Instead, the document goes through multiple processing stages before the model can answer questions or summarize the content.
LLM + PDF = Intelligent Summarization
The main goal of combining an LLM with a PDF is to:
- summarize information
- answer questions
- retrieve relevant content
- perform semantic search
To achieve this, the PDF content is first converted into embeddings and stored inside a Vector Database.
Step 1: Loading the PDF into a Vector Database
The first step is extracting the text from the PDF.
Once extracted, the content is split into smaller chunks:
- sentence by sentence
- paragraph by paragraph
- or token-based chunks
Why do we split the content?
Because smaller chunks improve:
- retrieval accuracy
- semantic matching
- context relevance
These chunks are then stored inside a Vector Database.
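The extract-and-chunk step can be sketched in plain Python. The splitter below is a simplified stand-in for real text splitters (such as those in LangChain); it assumes the PDF text has already been extracted into a string and chunks it sentence by sentence under a character budget.

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of whole sentences, each under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

text = (
    "Vector databases store embeddings. Embeddings capture meaning as numbers. "
    "Smaller chunks improve retrieval accuracy. They also improve context relevance."
)
for chunk in chunk_text(text, max_chars=80):
    print(chunk)
```

Each chunk is then embedded and stored; keeping chunks small is what makes the later similarity search precise.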
Why Do We Need a Vector Database?
Traditional databases store:
- rows
- columns
- text values
But AI systems require a different way of storing information.
A Vector Database stores data as:
- embeddings
- vectors
- multi-dimensional numerical points
Flow:
Input Text → Embedding Model → Multi-Dimensional Vector
Instead of storing plain text, the system stores the “meaning” of the text as numbers.
Understanding Embeddings
Embedding models convert text into vectors.
For example:
"The weather is good today"
may become:
[0.21, -0.44, 0.98, ...]
These values represent the semantic meaning of the sentence.
Models like:
- Nomic Embed Text
- OpenAI Embeddings
- Sentence Transformers
generate vectors with hundreds of dimensions.
Example:
- 768 dimensions
Each dimension can be thought of as one axis of the space:
x, y, z, a, b, c … and so on
The higher-dimensional representation helps AI systems understand relationships between words and sentences.
One-Hot Encoding: A Basic Representation Technique
Before modern embeddings, a simple encoding technique called One-Hot Encoding was commonly used.
Example sentences:
["Hi", "how are you", "are you okay"]
First, the vocabulary is created:
["Hi", "how", "are", "you", "okay"]
Repeated words are consolidated into a single vocabulary list.
Now each sentence is converted into numerical representation using 0s and 1s.
Sentence 1 ("Hi"):
[1, 0, 0, 0, 0]
Sentence 2 ("how are you"):
[0, 1, 1, 1, 0]
Sentence 3 ("are you okay"):
[0, 0, 1, 1, 1]
Notice how:
- “are”
- “you”
appear in both Sentence 2 and Sentence 3.
This creates a relationship between the sentences.
However, One-Hot Encoding has limitations:
- it cannot understand meaning
- vectors become sparse
- it does not capture semantic similarity
That is why modern AI systems use embeddings instead.
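The one-hot example above fits in a few lines of Python; this sketch reproduces exactly the vocabulary and vectors shown:

```python
def one_hot_encode(sentences: list[str]) -> tuple[list[str], list[list[int]]]:
    """Build a vocabulary and encode each sentence as a 0/1 vector."""
    vocab: list[str] = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:  # repeated words are consolidated
                vocab.append(word)
    vectors = [
        [1 if word in sentence.split() else 0 for word in vocab]
        for sentence in sentences
    ]
    return vocab, vectors

vocab, vectors = one_hot_encode(["Hi", "how are you", "are you okay"])
print(vocab)    # ['Hi', 'how', 'are', 'you', 'okay']
print(vectors)  # [[1, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 0, 1, 1, 1]]
```

Notice that every vector is as long as the whole vocabulary; with a realistic vocabulary of tens of thousands of words, that is exactly the sparsity problem described above.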
What is Semantic Search?
Semantic Search means searching based on meaning rather than exact keyword matching.
Example:
rice
and
arisi (the Tamil word for rice)
Both represent the same meaning.
A semantic search system can understand this relationship and retrieve relevant results even if the exact words differ.
This is called:
- Semantic Search
- Similarity Search
How LLMs Use Semantic Search
LLMs are trained on massive datasets.
If the training data already contains relationships between:
- languages
- concepts
- meanings
then the model can retrieve semantically related information.
But what if your custom business data is not part of the model training?
In that case:
- You create embeddings from your own documents
- Store them in a Vector Database
- Retrieve relevant chunks during user queries
This is the foundation of RAG (Retrieval-Augmented Generation).
Query Flow in a Vector Database
When a user asks a question:
User Query → Embedding Model → Vector
The query is converted into a vector.
The Vector Database then searches for the nearest matching vectors.
Example:
Top 5 nearest vectors
The closest matching chunks are retrieved and passed to the LLM.
Finally, the LLM generates a contextual answer using the retrieved information.
How Does the Database Decide Which Vectors Are Close?
This is where similarity metrics come into play.
The most commonly used metric is:
Cosine Similarity
Cosine similarity measures the angle between two vectors.
If two vectors point in similar directions:
- they are considered semantically similar
Smaller angle:
- higher similarity
Larger angle:
- lower similarity
The comparison is based on:
- the angle (theta) between the vectors
- vector direction, rather than magnitude
Cosine similarity works especially well for:
- text embeddings
- high-dimensional vectors
- semantic comparisons
Other similarity metrics also exist:
- Euclidean Distance
- Manhattan Distance
- Dot Product Similarity
Cosine similarity is simply one of the most effective methods for NLP applications.
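The formula itself is short: the dot product of the two vectors divided by the product of their lengths. A minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score near 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ≈ 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
```

Because only the angle matters, a long document chunk and a short query can still score highly if they point in the same semantic direction.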
KNN and ANN Algorithms
Vector Databases use specialized search algorithms.
KNN (K-Nearest Neighbour)
KNN searches for the exact nearest vectors.
Example:
- find the top 5 closest vectors
ANN (Approximate Nearest Neighbour)
ANN is optimized for speed.
Instead of checking every vector, it approximates the nearest matches efficiently.
Most production-grade Vector Databases use ANN because:
- datasets are huge
- search needs to be fast
- exact comparison is computationally expensive
Simple RAG Workflow Example
Imagine you have 5 custom sentences stored locally.
Flow 1: Storing Data
User Data
↓
Embedding Model
↓
Vector Conversion
↓
Store in ChromaDB
Flow 2: Querying Data
User Query
↓
Embedding Model
↓
Similarity Search in ChromaDB
↓
Relevant Chunks Retrieved
↓
LLM Generates Answer
This is the basic architecture behind most RAG applications.
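The two flows above can be collapsed into a toy end-to-end sketch. The bag-of-words `embed` function and the in-memory `store` list are deliberately crude stand-ins for a real embedding model and for ChromaDB; only the shape of the pipeline is the point.

```python
import math
import re

VOCAB = ["rice", "is", "a", "grain", "paris", "capital", "of", "france",
         "embeddings", "capture", "meaning"]

def embed(text: str) -> list[float]:
    """Toy embedding: word counts over a fixed vocabulary
    (a crude stand-in for a real embedding model)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Flow 1: store -- embed each sentence and keep (vector, text) pairs.
documents = [
    "Rice is a grain",
    "Paris is the capital of France",
    "Embeddings capture meaning",
]
store = [(embed(doc), doc) for doc in documents]

# Flow 2: query -- embed the question, retrieve the closest chunk.
query = "What is the capital of France?"
best = max(store, key=lambda item: cosine(embed(query), item[0]))
print(best[1])  # the retrieved chunk handed to the LLM as context
```

In a real application the retrieved chunk(s) would be placed into the LLM's prompt as context, and the model would generate the final answer from them.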
Popular Tools You Can Try
Embedding Models
- Nomic Embed Text
- Sentence Transformers
- OpenAI Embeddings
Vector Databases
- ChromaDB
- Qdrant
- Redis Vector Search
- PostgreSQL with pgvector
Dimensionality Reduction
Embedding vectors can contain hundreds or thousands of dimensions.
To visualize or optimize them, dimensionality reduction techniques are used.
Popular algorithms:
- PCA (Principal Component Analysis)
- t-SNE
- UMAP
For example, the vector:
(1, 1, 1)
represents a point in 3D space along the x, y, and z axes.
Dimensionality reduction helps compress and visualize such high-dimensional data.
Text Search vs Image Search
Vector Databases are general-purpose, but some are more commonly associated with particular workloads.
Text Search
- ChromaDB
- pgvector
Image Search
- Qdrant
Image-based systems also use techniques like:
- feature extraction
- max pooling
- vector compression
Max pooling is commonly used in deep learning to downsample feature maps, keeping the strongest activations while reducing the number of values to process.
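The pooling operation itself is simple: slide a window over a grid of values and keep only the maximum in each window. A minimal sketch of 2x2 max pooling over a small hand-written "feature map":

```python
def max_pool_2x2(grid: list[list[float]]) -> list[list[float]]:
    """Downsample a 2D feature map by taking the max of each 2x2 window."""
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 6, 1, 1],
]
print(max_pool_2x2(feature_map))  # [[4, 5], [6, 3]]
```

Each 2x2 window collapses to a single number, so the 4x4 map becomes 2x2 while the strongest responses survive.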
Final Thoughts
Vector Databases are one of the core building blocks of modern AI systems and RAG applications.
They enable:
- semantic search
- contextual retrieval
- intelligent question answering
- document summarization
Without Vector Databases, LLMs would struggle to work effectively with large custom datasets like PDFs, enterprise documents, and knowledge bases.
Understanding embeddings, similarity search, cosine similarity, and vector retrieval is essential for building scalable AI applications in the real world.