Embedding concept

Now we are going to look at the concept of embedding:

Chunking -> Embedding

text -> Number

Why do we need to convert the text to numbers?

For semantic search we need to convert the text into numbers (vectors), so that the cosine similarity between the query and each stored text can be calculated and the nearest matches returned.

Why is the cosine formula used?

Cosine measures the angle between two vectors, which tells us how close the two data points are.

As per the trigonometry table below, look at the cos values alone: when the angle is close to 0 degrees the value is close to 1, and as the angle grows the value drops to 0 and then becomes negative.

So the closeness can be found from cos alone, because every angle from 0 to 180 degrees gives a different value.

With sin, 0 degrees and 180 degrees both give the same value, 0, which causes confusion: vectors pointing the same way and vectors pointing in opposite directions could not be told apart. With cos, unrelated points show up clearly as values near 0 or below.

The image below is taken from a Google search result.

If the angle between two data points is small, cos returns a value close to 1, which means they are close.
Otherwise the two values are not close.
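To make the cos idea concrete, here is a minimal sketch in plain Python. The function name `cosine_similarity` and the sample vectors are made up for illustration; real embeddings have hundreds of dimensions, but the formula is the same.

```python
import math

def cosine_similarity(a, b):
    """Return cos(angle) between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative 2-dimensional vectors (not real embeddings).
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 90 degrees apart
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # 180 degrees apart

print(same, orthogonal, opposite)
```

The three results come out at roughly 1, 0 and -1, matching the cos column of a trigonometry table.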

Embedding models currently on the market produce vectors with roughly 256 to 3000 dimensions. Model context refers to the combinations the model uses to respond to users.

Quantization - the size of the stored vectors is reduced (for example, float values stored as small integers) with only a small loss of precision.
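As a rough illustration of quantization, the sketch below stores each float as an integer between -127 and 127 plus one scale factor per vector. This is only one simple scheme, chosen here as an assumption; vector databases may use different methods. The names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(vector):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    max_abs = max(abs(v) for v in vector) or 1.0  # avoid dividing by zero
    scale = max_abs / 127.0
    codes = [round(v / scale) for v in vector]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

vector = [0.12, -0.5, 0.33, 0.9]  # illustrative values only
codes, scale = quantize_int8(vector)
restored = dequantize(codes, scale)

# The restored values are close to the originals, but not identical.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, max_error)
```

Each value now needs one byte instead of four (for float32), and the error per value stays within half of the scale step.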

KNN and ANN are used for retrieving the results for the user.

KNN - K nearest neighbours: an exact search that compares the query with every stored point (size constraint).

ANN - approximate nearest neighbours: a faster search that accepts a small loss of accuracy (time constraint).

If there are more points in the model context, the search will take more time. The choice between the two depends on the scenario.

Do more points mean better performance? It depends on the scenario.
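Exact KNN can be sketched as a simple loop that scores every stored point against the query. The data and the function name `knn` are illustrative assumptions. An ANN library would avoid this full scan by building an index first, which is not shown here.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, points, k):
    """Exact search: score every stored point, then keep the top k."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in points.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Illustrative 2-dimensional points (not real embeddings).
points = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}

nearest = knn([1.0, 0.05], points, k=2)
print(nearest)
```

Because every point is visited, the time grows with the number of stored points, which is why the time constraint pushes larger collections towards ANN.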

Different models are used for different operations.

How can we segregate the models?

By query type (user input):

1) Symmetric model - the query is similar in form to the documents in the store. Ex: Nomic Embed Text, Qwen3.

Documents with the same details as the query are returned. Ex: given a news article, the same or similar articles are returned.

Ex: I have a leave document; are there any similar documents or not?

2) Asymmetric model - the query is short and the documents are longer explanations. Ex: Gemini model. A small comment retrieves a blog post; a question about the leave count retrieves the company policy. Suited to straightforward answers.

Retrieval mode:

1) Dense embedding -> understands synonyms well. Ex: Cohere Embed models, chatgpt oss 120B

if text is given then we will get value. value will be 0.

Dense = semantic similarity search

2) Sparse embedding -> exact keyword match -> ISBN, part IDs, serial IDs.

Sparse = exact search

TF = Term Frequency - the number of occurrences of a term in a text.

IDF = Inverse Document Frequency - measures how rare a term is across all the documents. A term that appears in almost every document gets a low weight.

For example, when the same word is repeated across many stored entries, IDF tells how important that word really is.
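A small sketch of TF and IDF in plain Python. The three sample documents are made up, and the IDF formula used here, log(N / number of documents containing the term), is one common textbook variant, assumed for illustration.

```python
import math

# Illustrative documents only.
docs = [
    "leave policy for annual leave",
    "policy for travel expenses",
    "annual report for the company",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    """Term frequency: how many times the term occurs in one text."""
    return tokens.count(term)

def idf(term, corpus):
    """Inverse document frequency: log(N / docs containing the term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing) if containing else 0.0

print(tf("leave", tokenized[0]))  # "leave" occurs twice in the first text
print(idf("for", tokenized))      # appears in every document, so weight 0
print(idf("leave", tokenized))    # appears in one document, so highest weight
```

The word "for" is everywhere, so it carries no importance; "leave" is rare, so it carries the most.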

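BM25 (Best Matching 25) is a ranking algorithm that scores how well each document matches the query; the 25 is part of the name, not a count of results. It depends on both TF and IDF.

TF = keyword match

IDF = importance of the keyword; BM25 combines the two.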
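The sketch below shows how a BM25 score can be computed from TF and IDF. It uses the commonly cited formula with parameters `k1` and `b` and a smoothed IDF; the default values and the sample corpus are assumptions for illustration, and search engines may differ in the details.

```python
import math

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in the corpus against the query with BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # docs containing term
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            tf = tokens.count(term)
            # Length normalisation: long documents are penalised slightly.
            denom = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

# Illustrative corpus only.
corpus = [
    "part id AX-100 is out of stock",
    "part id BX-200 ships tomorrow",
    "the annual leave policy allows twenty days",
]

scores = bm25_scores("part AX-100", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(scores, best)
```

The first document wins because it contains the rare term "AX-100"; the second matches only the more common word "part", and the third scores 0.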

Search option

OpenSearch (BM25 concept) / Elasticsearch -> hybrid search is possible, combining keyword search and vector search.
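One simple way to picture hybrid search is to merge a keyword ranking and a vector ranking into a single list. The sketch below uses reciprocal rank fusion as an assumed example of such merging; it is not taken from the OpenSearch or Elasticsearch configuration, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists; items ranked high in any list score more."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical results from the two kinds of search.
keyword_ranking = ["doc_b", "doc_a", "doc_c"]  # e.g. from BM25
vector_ranking = ["doc_a", "doc_d", "doc_b"]   # e.g. from dense embeddings

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that appear near the top of both lists, such as doc_a and doc_b here, end up ahead of documents found by only one search.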

Transformers:

Encoder and decoder mechanism

Encoder - text to embedding

Decoder - embedding to text, after a few operations.

Read "Attention Is All You Need":

[1706.03762] Attention Is All You Need


Model - gives the dimension points (the values of the vector).

Vector search.

For learning purposes:

Chunk -> local model [Nomic Embed Text] -> Chroma DB

Try the use case:

Text in a list [] -> convert the text into embeddings using an Ollama embedding model -> store them in Chroma DB -> query (convert the query into embedding values as well) and print the response.

Search for mock data for RAG building and you will get sample RAG data if you want a larger data set. Otherwise, go with a list of similar items, a maximum of 5 sentences. That is more than enough to build this app.
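The flow of the use case can be tried end to end before installing anything. In the sketch below, `toy_embed` is a stand-in for the Ollama embedding model (it only counts words, so it does not understand meaning) and a plain dictionary is a stand-in for Chroma DB. Both are assumptions made so the example runs on its own; the five sentences are mock data.

```python
import math
from collections import Counter

def toy_embed(text, vocabulary):
    """Stand-in for a real embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Step 1: text in a list (mock data).
sentences = [
    "employees get twenty days of annual leave",
    "sick leave requires a medical certificate",
    "the office is closed on public holidays",
    "travel expenses are reimbursed monthly",
    "laptops are replaced every three years",
]
vocabulary = sorted({w for s in sentences for w in s.lower().split()})

# Step 2 and 3: embed each sentence and store it (a vector DB would do this).
store = {i: (toy_embed(s, vocabulary), s) for i, s in enumerate(sentences)}

def search(query, top_k=2):
    """Step 4: embed the query the same way and return the closest texts."""
    q = toy_embed(query, vocabulary)
    ranked = sorted(store.values(),
                    key=lambda item: cosine_similarity(q, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:top_k]]

results = search("how many days of annual leave")
print(results)
```

Once this works, the stand-ins can be swapped for the real embedding model and the real vector store while keeping the same four steps.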

For a visual representation of the embedding data you can use sklearn, Matplotlib, and NumPy, so you can create a sample vector search on your own.
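Embeddings have too many dimensions to draw directly, so they are usually reduced to 2 dimensions first. The sketch below does that reduction with NumPy only (PCA via SVD); the function name `project_to_2d` and the 4-dimensional sample vectors are made up for illustration.

```python
import numpy as np

def project_to_2d(vectors):
    """PCA via SVD: centre the data, keep the two strongest directions."""
    data = np.asarray(vectors, dtype=float)
    centred = data - data.mean(axis=0)
    _, _, components = np.linalg.svd(centred, full_matrices=False)
    return centred @ components[:2].T

# Illustrative vectors: the first two are similar, the last two are similar.
embeddings = [
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.1],
    [0.1, 0.9, 0.8, 0.0],
    [0.0, 0.8, 0.9, 0.1],
]

points = project_to_2d(embeddings)
print(points.shape)  # one (x, y) pair per embedding
```

The resulting x and y columns can be passed to a Matplotlib scatter plot, where similar texts should appear as nearby dots. sklearn's PCA offers the same kind of reduction.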

HappY CodinG!!!

Tester Friend


