The Mathematics of Text Embeddings: High-Dimensional Semantic Vector Spaces
By LLM Practitioner
To an AI model, words are not letters; they are coordinates in a high-dimensional mathematical space. When a model processes a block of text, it maps each sentence or passage to a dense vector (an array of numbers, typically 768 to 3072 dimensions depending on the embedding model). In this space, the direction of a vector encodes its semantic meaning: words or sentences with similar meanings are mapped to vectors that point in nearly the same direction.
To determine how similar two sentences are, we compare the directions of their vectors using a metric called cosine similarity, which is the cosine of the angle between them. Mathematically, the cosine similarity of two vectors, A and B, is the dot product of the vectors divided by the product of their magnitudes: `similarity = (A · B) / (||A|| * ||B||)`. The result is a value between -1 and 1. A similarity of 1 means the vectors point in exactly the same direction, representing near-identical semantic content; a similarity of 0 means the vectors are orthogonal, representing unrelated meanings; and a similarity of -1 means they point in opposite directions. Because the magnitudes are divided out, only the direction of each vector affects the score, not its length.
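As a minimal sketch of the formula above, the following computes cosine similarity in plain Python. The function name `cosine_similarity` and the small example vectors are illustrative assumptions; real embeddings have hundreds or thousands of dimensions, and a production system would more likely use a vectorized numerical library.

```python
import math


def cosine_similarity(a, b):
    """Return (A · B) / (||A|| * ||B||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle to a zero vector is undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors chosen for illustration only (not real embeddings).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])                # close to -1
```

Note that the first pair differs in length by a factor of two yet still scores close to 1, since the two vectors point the same way.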
In TellPDF, this mathematical mechanism powers the Document QA feature. When you ask a question like 'what is the termination clause in this contract?', we first generate the query's vector embedding. We then compare it against the pre-computed embeddings of all text segments in the PDF. The segments with the highest cosine similarity are extracted and fed directly to the Gemini LLM as context. This process, known as Retrieval-Augmented Generation (RAG), lets the model ground its answer in the most relevant passages without needing to read the entire document on every turn.
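The retrieval step described above can be sketched roughly as follows. This is not TellPDF's actual code: the helper `top_k_segments`, the parameter `k`, the sample sentences, and the three-dimensional vectors are all hypothetical stand-ins, and in a real pipeline the vectors would come from an embedding model rather than being written by hand.

```python
import math


def cosine_similarity(a, b):
    """Return (A · B) / (||A|| * ||B||) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_segments(query_vec, segment_vecs, segments, k=2):
    """Rank segments by similarity to the query and return the best k.

    Returns a list of (score, segment_text) pairs, highest score first.
    """
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for vec, text in zip(segment_vecs, segments)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical document segments with hand-written toy "embeddings".
segments = [
    "Either party may terminate with 30 days notice.",
    "Payment is due within 45 days of invoice.",
    "This agreement ends if either party breaches it.",
]
segment_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
]

# Toy vector standing in for the embedded question about termination.
query_vec = [1.0, 0.1, 0.0]

top = top_k_segments(query_vec, segment_vecs, segments, k=2)
context = "\n".join(text for _, text in top)  # text that would be passed to the LLM
```

With these toy values the two termination-related sentences rank above the payment sentence, so only they end up in `context`. Scanning every segment like this is linear in the number of segments; larger collections typically rely on an approximate nearest-neighbor index instead.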