Today, we are going to understand how Large Language Models (LLMs) interact with PDF documents using Vector Databases and Retrieval-Augmented Generation (RAG).
When we upload a PDF into an AI application, the LLM does not directly “read” the file like humans do. Instead, the document goes through multiple processing stages before the model can answer questions or summarize the content.
LLM + PDF = Intelligent Summarization
The main goal of combining an LLM with a PDF is to:
- summarize information
- answer questions
- retrieve relevant content
- perform semantic search
To achieve this, the PDF content is first converted into embeddings and stored inside a Vector Database.
Step 1: Loading the PDF into a Vector Database
The first step is extracting the text from the PDF.
Once extracted, the content is split into smaller chunks:
- sentence by sentence
- paragraph by paragraph
- or token-based chunks
Why do we split the content?
Because smaller chunks improve:
- retrieval accuracy
- semantic matching
- context relevance
These chunks are then stored inside a Vector Database.
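The extract-and-chunk step can be sketched in plain Python. The splitter below is a simplified stand-in for real text splitters (such as those in LangChain); it assumes the PDF text has already been extracted into a string and chunks it sentence by sentence under a character budget.

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into chunks of whole sentences, each under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

text = (
    "Vector databases store embeddings. Embeddings capture meaning as numbers. "
    "Smaller chunks improve retrieval accuracy. They also improve context relevance."
)
for chunk in chunk_text(text, max_chars=80):
    print(chunk)
```

Each chunk is then embedded and stored; keeping chunks small is what makes the later similarity search precise.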
Why Do We Need a Vector Database?
Traditional databases store:
- rows
- columns
- text values
But AI systems require a different way of storing information.
A Vector Database stores data as:
- embeddings
- vectors
- multi-dimensional numerical points
Flow:
Input Text → Embedding Model → Multi-Dimensional Vector
Instead of storing plain text, the system stores the “meaning” of the text as numbers.
Understanding Embeddings
Embedding models convert text into vectors.
For example:
"The weather is good today"
may become:
[0.21, -0.44, 0.98, ...]
These values represent the semantic meaning of the sentence.
Models like:
- Nomic Embed Text
- OpenAI Embeddings
- Sentence Transformers
generate vectors with hundreds of dimensions.
Example:
- 768 dimensions
Each dimension can be thought of as one axis of the space:
x, y, z, a, b, c … and so on
The higher-dimensional representation helps AI systems understand relationships between words and sentences.
One-Hot Encoding: A Basic Representation Technique
Before modern embeddings, a simple encoding technique called One-Hot Encoding was commonly used.
Example sentences:
["Hi", "how are you", "are you okay"]
First, the vocabulary is created:
["Hi", "how", "are", "you", "okay"]
Repeated words are consolidated into a single vocabulary list.
Now each sentence is converted into numerical representation using 0s and 1s.
Sentence 1 ("Hi"):
[1, 0, 0, 0, 0]
Sentence 2 ("how are you"):
[0, 1, 1, 1, 0]
Sentence 3 ("are you okay"):
[0, 0, 1, 1, 1]
Notice how:
- “are”
- “you”
appear in both Sentence 2 and Sentence 3.
This creates a relationship between the sentences.
However, One-Hot Encoding has limitations:
- it cannot understand meaning
- vectors become sparse
- it does not capture semantic similarity
That is why modern AI systems use embeddings instead.
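The one-hot example above fits in a few lines of Python; this sketch reproduces exactly the vocabulary and vectors shown:

```python
def one_hot_encode(sentences: list[str]) -> tuple[list[str], list[list[int]]]:
    """Build a vocabulary and encode each sentence as a 0/1 vector."""
    vocab: list[str] = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocab:  # repeated words are consolidated
                vocab.append(word)
    vectors = [
        [1 if word in sentence.split() else 0 for word in vocab]
        for sentence in sentences
    ]
    return vocab, vectors

vocab, vectors = one_hot_encode(["Hi", "how are you", "are you okay"])
print(vocab)    # ['Hi', 'how', 'are', 'you', 'okay']
print(vectors)  # [[1, 0, 0, 0, 0], [0, 1, 1, 1, 0], [0, 0, 1, 1, 1]]
```

Notice that every vector is as long as the whole vocabulary; with a realistic vocabulary of tens of thousands of words, that is exactly the sparsity problem described above.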
What is Semantic Search?
Semantic Search means searching based on meaning rather than exact keyword matching.
Example:
rice
and
arisi (the Tamil word for rice)
Both represent the same meaning.
A semantic search system can understand this relationship and retrieve relevant results even if the exact words differ.
This is called:
- Semantic Search
- Similarity Search
How LLMs Use Semantic Search
LLMs are trained on massive datasets.
If the training data already contains relationships between:
- languages
- concepts
- meanings
then the model can retrieve semantically related information.
But what if your custom business data is not part of the model training?
In that case:
- You create embeddings from your own documents
- Store them in a Vector Database
- Retrieve relevant chunks during user queries
This is the foundation of RAG (Retrieval-Augmented Generation).
Query Flow in a Vector Database
When a user asks a question:
User Query → Embedding Model → Vector
The query is converted into a vector.
The Vector Database then searches for the nearest matching vectors.
Example:
Top 5 nearest vectors
The closest matching chunks are retrieved and passed to the LLM.
Finally, the LLM generates a contextual answer using the retrieved information.
How Does the Database Decide Which Vectors Are Close?
This is where similarity metrics come into play.
The most commonly used metric is:
Cosine Similarity
Cosine similarity measures the angle between two vectors.
If two vectors point in similar directions:
- they are considered semantically similar
Smaller angle:
- higher similarity
Larger angle:
- lower similarity
The comparison is based on:
- the angle (theta) between the vectors
- vector direction, rather than magnitude
Cosine similarity works especially well for:
- text embeddings
- high-dimensional vectors
- semantic comparisons
Other similarity metrics also exist:
- Euclidean Distance
- Manhattan Distance
- Dot Product Similarity
Cosine similarity is simply one of the most effective methods for NLP applications.
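The formula itself is short: the dot product of the two vectors divided by the product of their lengths. A minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score near 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ≈ 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
```

Because only the angle matters, a long document chunk and a short query can still score highly if they point in the same semantic direction.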
KNN and ANN Algorithms
Vector Databases use specialized search algorithms.
KNN (K-Nearest Neighbour)
KNN searches for the exact nearest vectors.
Example:
- find the top 5 closest vectors
ANN (Approximate Nearest Neighbour)
ANN is optimized for speed.
Instead of checking every vector, it approximates the nearest matches efficiently.
Most production-grade Vector Databases use ANN because:
- datasets are huge
- search needs to be fast
- exact comparison is computationally expensive
Simple RAG Workflow Example
Imagine you have 5 custom sentences stored locally.
Flow 1: Storing Data
User Data
↓
Embedding Model
↓
Vector Conversion
↓
Store in ChromaDB
Flow 2: Querying Data
User Query
↓
Embedding Model
↓
Similarity Search in ChromaDB
↓
Relevant Chunks Retrieved
↓
LLM Generates Answer
This is the basic architecture behind most RAG applications.
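The two flows above can be collapsed into a toy end-to-end sketch. The bag-of-words `embed` function and the in-memory `store` list are deliberately crude stand-ins for a real embedding model and for ChromaDB; only the shape of the pipeline is the point.

```python
import math
import re

VOCAB = ["rice", "is", "a", "grain", "paris", "capital", "of", "france",
         "embeddings", "capture", "meaning"]

def embed(text: str) -> list[float]:
    """Toy embedding: word counts over a fixed vocabulary
    (a crude stand-in for a real embedding model)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Flow 1: store -- embed each sentence and keep (vector, text) pairs.
documents = [
    "Rice is a grain",
    "Paris is the capital of France",
    "Embeddings capture meaning",
]
store = [(embed(doc), doc) for doc in documents]

# Flow 2: query -- embed the question, retrieve the closest chunk.
query = "What is the capital of France?"
best = max(store, key=lambda item: cosine(embed(query), item[0]))
print(best[1])  # the retrieved chunk handed to the LLM as context
```

In a real application the retrieved chunk(s) would be placed into the LLM's prompt as context, and the model would generate the final answer from them.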
Popular Tools You Can Try
Embedding Models
- Nomic Embed Text
- Sentence Transformers
- OpenAI Embeddings
Vector Databases
- ChromaDB
- Qdrant
- Redis Vector Search
- PostgreSQL with pgvector
Dimensionality Reduction
Embedding vectors can contain hundreds or thousands of dimensions.
To visualize or optimize them, dimensionality reduction techniques are used.
Popular algorithms:
- PCA (Principal Component Analysis)
- t-SNE
- UMAP
For example, the vector:
(1, 1, 1)
represents a point in 3D space along the x, y, and z axes.
Dimensionality reduction helps compress and visualize such high-dimensional data.
Text Search vs Image Search
Vector Databases are general-purpose, but some are more commonly associated with particular workloads.
Text Search
- ChromaDB
- pgvector
Image Search
- Qdrant
Image-based systems also use techniques like:
- feature extraction
- max pooling
- vector compression
Max pooling is commonly used in deep learning to downsample feature maps, keeping the strongest activations while reducing the number of values to process.
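The pooling operation itself is simple: slide a window over a grid of values and keep only the maximum in each window. A minimal sketch of 2x2 max pooling over a small hand-written "feature map":

```python
def max_pool_2x2(grid: list[list[float]]) -> list[list[float]]:
    """Downsample a 2D feature map by taking the max of each 2x2 window."""
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 6, 1, 1],
]
print(max_pool_2x2(feature_map))  # [[4, 5], [6, 3]]
```

Each 2x2 window collapses to a single number, so the 4x4 map becomes 2x2 while the strongest responses survive.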
Final Thoughts
Vector Databases are one of the core building blocks of modern AI systems and RAG applications.
They enable:
- semantic search
- contextual retrieval
- intelligent question answering
- document summarization
Without Vector Databases, LLMs would struggle to work effectively with large custom datasets like PDFs, enterprise documents, and knowledge bases.
Understanding embeddings, similarity search, cosine similarity, and vector retrieval is essential for building scalable AI applications in the real world.