Attention in Transformers: Query, Key and Value in Machine Learning

Published: July 9, 2024
on the channel: Stephen Blum

When you use query, key, and value (Q, K, V) in a transformer model's self-attention mechanism, they all come from the same input data. That input is text that has been split into tokens and then turned into numbers through a process called embedding, and it is from these embeddings that Q, K, and V are derived. Think of embedding as a dictionary lookup: each word gets a unique ID and is then represented as an array of numbers (features).
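As a rough sketch of that idea, assuming PyTorch and a toy four-word vocabulary (the words, IDs, and sizes below are made up for illustration), the embedding step might look like this:

import torch
import torch.nn as nn

# Toy vocabulary: each word gets a unique integer ID (the "dictionary lookup").
vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3}
token_ids = torch.tensor([[vocab["the"], vocab["quick"], vocab["brown"], vocab["fox"]]])

# Embedding table: maps each ID to an array of 512 learned features.
d_model = 512
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

x = embedding(token_ids)          # one sentence of 4 tokens -> shape (1, 4, 512)

# In self-attention, query, key, and value all start from this same tensor.
q = k = v = x
print(q.shape, k.shape, v.shape)  # torch.Size([1, 4, 512]) for all three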

For example, the word "fox" might be represented by 512 floating-point numbers, each reflecting a different property or relationship to other words. Despite having different names, query, key, and value start off identical: they are copies of the same embedded input. These arrays can then have positional encodings added to them, which help the model understand the order of the words.
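One common way to add that order information is the sinusoidal positional encoding from the original Transformer paper. The sketch below assumes PyTorch and simply adds those values onto the embedded sentence; the sequence length and sizes are the same toy numbers as above:

import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # position: where the word sits in the sentence; div_term: one frequency per feature pair.
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                   # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even feature indices
    pe[:, 1::2] = torch.cos(position * div_term)   # odd feature indices
    return pe

seq_len, d_model = 4, 512
x = torch.randn(1, seq_len, d_model)                        # stand-in for the embedded sentence
x = x + sinusoidal_positional_encoding(seq_len, d_model)    # broadcasts over the batch dimension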

Position information changes how the model attends to different parts of the data. The model performs attention by multiplying these matrices together, scaling the result, and then running it through a softmax function. This produces attention weights that capture how important each word is relative to every other word, and those weights are used to blend the values together.
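A minimal sketch of that computation, assuming PyTorch (and skipping the learned projection matrices that usually turn the input into separate Q, K, and V), could look like this:

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    # Score every word against every other word, then scale so softmax stays well behaved.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                 # each row sums to 1
    return weights @ v                                  # weighted mix of the value vectors

q = k = v = torch.randn(1, 4, 512)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)                                        # torch.Size([1, 4, 512])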

With multi-headed attention, you split the array of numbers into smaller parts so that different heads can learn different aspects in parallel, which speeds up learning and improves performance. For example, 512 features could be split into four groups of 128 features each, allowing the model to process these chunks simultaneously.
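A sketch of that splitting step, again assuming PyTorch and the 512-feature, four-head example from the text:

import torch

batch, seq_len, d_model, num_heads = 1, 4, 512, 4
d_head = d_model // num_heads                      # 512 / 4 = 128 features per head

x = torch.randn(batch, seq_len, d_model)

# Split the 512 features into 4 heads of 128 and bring the head axis forward
# so attention can run on every head at the same time.
heads = x.view(batch, seq_len, num_heads, d_head).transpose(1, 2)
print(heads.shape)                                 # torch.Size([1, 4, 4, 128])

# After attention, the heads are stitched back into one 512-feature array.
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
print(merged.shape)                                # torch.Size([1, 4, 512])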