“Attention” is a mathematical process that helps make AI models “context-aware”.
It is the backbone of LLMs, calculating how different words interact to convey meaning.
Attention has 4 primary components: Embeddings, Queries, Keys, and Values. Each is built from learned “weights”.
Working together, these weights let the model predict the next word in a sentence.
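A rough sketch of those components in code (toy sizes and illustrative names, not any real model’s):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 50  # how many words the model knows (toy value)
d_model = 8      # size of each embedding vector (toy value)

# The four learned components, each just a matrix of trainable weights.
embeddings = rng.normal(size=(vocab_size, d_model))  # one row per word
W_query = rng.normal(size=(d_model, d_model))
W_key = rng.normal(size=(d_model, d_model))
W_value = rng.normal(size=(d_model, d_model))
```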
Embeddings are mathematical representations of words (strictly, of sub-word “tokens”).
Larger embeddings (more dimensions) capture more nuance about how a word is used.
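For example, a tiny lookup table (toy vocabulary and sizes, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat"]  # toy vocabulary
d_model = 8                    # dimensions per embedding; bigger = more nuance
embeddings = rng.normal(size=(len(vocab), d_model))

# Turn a sentence into a stack of embedding vectors, one per word.
sentence = ["the", "cat", "sat"]
X = embeddings[[vocab.index(w) for w in sentence]]
print(X.shape)  # (3, 8): 3 words, 8 numbers each
```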
Each embedding vector is multiplied separately by the “Query”, “Key”, and “Value” matrices, producing a Query, Key, and Value vector for every word.
Comparing each Query vector with every Key vector gives us our “Attention Pattern”: a score for how relevant each word is to updating the meaning of every other word.
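A minimal sketch of the projection and scoring steps (toy sizes; the square-root scaling and row-wise softmax follow the standard transformer recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model = 3, 8
X = rng.normal(size=(n_words, d_model))  # embeddings for a 3-word sentence

W_query = rng.normal(size=(d_model, d_model))
W_key = rng.normal(size=(d_model, d_model))
W_value = rng.normal(size=(d_model, d_model))

Q = X @ W_query  # one Query vector per word
K = X @ W_key    # one Key vector per word
V = X @ W_value  # one Value vector per word (used in the next step)

# Compare every Query against every Key (dot products), scale,
# then softmax each row so the relevance scores sum to 1.
scores = Q @ K.T / np.sqrt(d_model)
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
attention_pattern = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(attention_pattern.shape)  # (3, 3): word-to-word relevance scores
```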
The Value vectors carry the information to pass along: weighting them by these attention scores tells us how to update each word’s meaning.
This result is added back to the original word embedding, so each word’s vector now captures the context from all surrounding words.
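A sketch of that update step, assuming an attention pattern has already been computed (random stand-in values here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model = 3, 8
X = rng.normal(size=(n_words, d_model))  # original word embeddings
V = rng.normal(size=(n_words, d_model))  # Value vectors from the prior step
# Stand-in attention pattern: each row sums to 1.
attention_pattern = rng.dirichlet(np.ones(n_words), size=n_words)

# Each word's update is an attention-weighted blend of all Value vectors.
update = attention_pattern @ V

# Add the update back onto the original embeddings (the "residual" add),
# so every vector now carries context from the whole sentence.
X_contextual = X + update
```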
During “training”, this process is used to predict a probability for every possible next word.
These probabilities are compared to the true next word, and the model is “penalized” for incorrect or low-confidence predictions.
This penalty is used to adjust the “weights” so that subsequent predictions become more accurate.
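A hedged sketch of that feedback loop: a softmax over the vocabulary, a cross-entropy penalty, and one hand-written gradient step (real frameworks automate all of this):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8
W_out = rng.normal(size=(d_model, vocab_size))  # maps a vector to word scores
h = rng.normal(size=d_model)  # contextual vector for the current position
true_word = 7                 # index of the word that actually came next

# Predict a probability for every word in the vocabulary.
logits = h @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Cross-entropy "penalty": large when the true word got low probability.
loss = -np.log(probs[true_word])

# Nudge the weights so the true word scores higher next time
# (gradient of the loss w.r.t. W_out, then one small step against it).
grad = np.outer(h, probs)
grad[:, true_word] -= h
W_out -= 0.1 * grad
```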