Why do we go for chunking?
If we upload the full file, the data gets stored as a single value in the vector database, so different contexts end up in a single place, which is not correct.
So we chunk the data, and after chunking the multiple contexts can be stored and retrieved separately.
We can do the chunking in two different ways: fixed length and overlapping. Mostly we go with the overlapping concept, so the content is split into pieces that share some text, and each piece is stored.
Problems with overlapping:
Take P1 (paragraph 1) and P2 (paragraph 2). With overlap, a few words from P1 are repeated at the start of P2. But if the sentences are related, if the second sentence depends on the first point, that relation can still be missed. This kind of problem is what semantic chunking addresses.
Sometimes, if both paragraphs are almost similar, we don't need to combine them at all; both points will already be stored nearby in the vector DB. In that case the overlap is just redundant, which is another problem. These problems are also solved by semantic chunking.
Example: text = "Java is an enterprise application and Python is open source".
After overlap chunking, the data gets split like:
["Java is an ", "enterprise application and Python", "is open source"]
As you can see above, "enterprise" and "Python" get stored at a single point in the vector DB, which is not right.
This is where semantic chunking comes in.
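The overlap split above can be sketched with a small word-based window. This is a minimal illustration; the function name and the chunk/overlap sizes are chosen here for the example, not fixed by any library:

```python
def overlap_chunk_words(text, chunk_size=4, overlap=1):
    """Split text into word chunks where each chunk repeats
    `overlap` words from the end of the previous chunk."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

text = "Java is an enterprise application and Python is open source"
print(overlap_chunk_words(text))
# ['Java is an enterprise', 'enterprise application and Python', 'Python is open source']
```

Notice the middle chunk mixes "enterprise" (Java context) with "Python" (a different context) in one chunk, which is exactly the problem described above.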
Semantic chunking - similar content is stored together based on context: grouping the content by meaning rather than by a fixed size.
A threshold is set for comparing the context of two different chunks. If two different sentences match with 75% similarity, the threshold is set to 0.75; while comparing any two sentences, this threshold limit is checked.
We can achieve this threshold comparison using the NLTK package.
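The threshold check can be sketched as below. In practice the similarity score would come from sentence embeddings (NLTK or a transformer model); here a simple word-overlap (Jaccard) score stands in for it, and the 0.75 threshold is the one from the text:

```python
def jaccard_similarity(s1, s2):
    """Word-overlap similarity in [0, 1]; a simple stand-in
    for an embedding-based similarity score."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

THRESHOLD = 0.75  # sentences matching at 75% or more count as the same context

def same_context(s1, s2, threshold=THRESHOLD):
    return jaccard_similarity(s1, s2) >= threshold

print(same_context("Python is open source", "Python is open source"))  # True
print(same_context("Java is great", "Java is great for servers and more"))  # False
```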
How do we split the data? The chunking strategy is defined by us, and there are different ways we can do it.
Embedding chunking - a costly option, since we involve an LLM/embedding model to convert the content.
With an embedding transformer, a vector value is generated for the given content, and that value is stored in the vector DB.
The model gives a score as a cosine similarity value, which is compared against the threshold; the data is then stored in the vector DB as separate chunks.
If two sentences are close together, they share a single vector point.
If three different documents are compared and P1 and P3 are closer in vector space, then P2 gets stored at another point.
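The P1/P2/P3 grouping can be shown with cosine similarity. The three vectors below are hand-made for illustration; in a real pipeline an embedding model (transformer encoder) would generate them:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embedding vectors for three paragraphs.
p1 = [0.9, 0.1, 0.0]
p2 = [0.1, 0.9, 0.2]
p3 = [0.8, 0.2, 0.1]

print(cosine_similarity(p1, p3))  # high -> P1 and P3 stored nearby
print(cosine_similarity(p1, p2))  # low  -> P2 stored at another point
```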
Conclusion:
1. Fixed Chunking
Split the text into equal‑sized chunks based on a fixed number of characters or tokens. Example:
Chunk size = 50 characters
Output →
[0–50],[50–100],[100–150]
Useful when you need predictable chunk sizes.
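A minimal sketch of fixed chunking (the function name is illustrative; 50 is the chunk size from the example above):

```python
def fixed_chunk(text, size=50):
    """Split text into equal-sized character chunks: [0-50], [50-100], ..."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = fixed_chunk("a" * 120, size=50)
print([len(c) for c in chunks])  # [50, 50, 20] - only the last chunk may be shorter
```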
2. Overlap Chunking
Split the text into chunks with a sliding window so each chunk overlaps with the previous one. Example with 50‑character chunks and 10‑character overlap:
[0–50][40–90][80–130]
This helps preserve context across chunk boundaries.
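The sliding window above can be sketched as follows (50-character chunks with a 10-character overlap, matching the example):

```python
def overlap_chunk(text, size=50, overlap=10):
    """Sliding-window character chunks: [0-50], [40-90], [80-130], ..."""
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

chunks = overlap_chunk("x" * 130)
print(len(chunks))  # 3 windows, each sharing 10 characters with the previous one
```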
3. Semantic Chunking
Split the text based on meaning, not size. Uses NLP tools (e.g., NLTK) to detect sentence boundaries, topics, or semantic similarity. Chunks end where the meaning naturally shifts.
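A rough sketch of semantic chunking: start a new chunk whenever adjacent sentences stop being similar. The regex sentence splitter stands in for `nltk.sent_tokenize`, and the word-overlap score stands in for embedding similarity; the 0.2 threshold is illustrative:

```python
import re

def split_sentences(text):
    # Simple stand-in for nltk.sent_tokenize: split after ., !, or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_overlap(s1, s2):
    w1 = set(re.findall(r"\w+", s1.lower()))
    w2 = set(re.findall(r"\w+", s2.lower()))
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def semantic_chunk(text, threshold=0.2):
    """Group adjacent sentences; a new chunk begins where the meaning
    (approximated here by word overlap) shifts below the threshold."""
    sents = split_sentences(text)
    if not sents:
        return []
    chunks = [[sents[0]]]
    for prev, cur in zip(sents, sents[1:]):
        if word_overlap(prev, cur) >= threshold:
            chunks[-1].append(cur)   # same topic: extend current chunk
        else:
            chunks.append([cur])     # topic shift: start a new chunk
    return [" ".join(c) for c in chunks]

text = ("Java is an enterprise language. Java is widely used in enterprise apps. "
        "Python is an open source language.")
print(semantic_chunk(text))  # the two Java sentences group; Python starts a new chunk
```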
4. Embedding‑Based Chunking
Use an embedding model (e.g., transformer encoder) to convert each chunk into a vector representation. These vectors are then stored in a Vector Database (like ChromaDB, Pinecone, Weaviate) for semantic search and retrieval.
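To see the store-and-retrieve flow end to end, here is a toy in-memory stand-in for a vector DB. Real systems would use ChromaDB, Pinecone, or Weaviate and real model embeddings; the class name and the two-dimensional vectors below are invented for the sketch:

```python
import math

class ToyVectorStore:
    """In-memory stand-in for a vector database like ChromaDB or Pinecone."""

    def __init__(self):
        self.records = []  # list of (chunk_text, vector) pairs

    def add(self, chunk, vector):
        self.records.append((chunk, vector))

    def query(self, vector, top_k=1):
        """Return the top_k chunks whose vectors are most similar to `vector`."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))
        ranked = sorted(self.records, key=lambda r: cos(vector, r[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]

# Hypothetical embeddings; in practice a transformer encoder produces them.
store = ToyVectorStore()
store.add("Java chunk", [0.9, 0.1])
store.add("Python chunk", [0.1, 0.9])
print(store.query([0.85, 0.2]))  # ['Java chunk'] - the nearest vector wins
```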