Multi-Head Attention & Word Embeddings Explorer

Understanding Word Embeddings

Word embeddings are dense vector representations of words in a continuous vector space. Unlike simple one-hot encoding, these vectors capture semantic relationships between words. Each dimension in the embedding space represents a latent feature learned from the data.

The famous example "king - man + woman ≈ queen" demonstrates how word embeddings capture semantic relationships:
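A minimal sketch of this arithmetic in NumPy, using the toy 3-dimensional vectors from the learning path below; the "woman" and "apple" vectors are made-up additions for illustration, not trained embeddings.

```python
import numpy as np

# Toy 3-dimensional embeddings (illustrative values, not trained vectors).
embeddings = {
    "king":  np.array([1.2, 0.8, 1.0]),
    "queen": np.array([1.0, 0.9, 1.1]),
    "man":   np.array([0.7, 0.3, 0.5]),
    "woman": np.array([0.5, 0.4, 0.6]),   # hypothetical value chosen for the example
    "apple": np.array([0.1, 1.5, 0.2]),   # hypothetical unrelated word
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Analogy arithmetic: king - man + woman
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# The nearest remaining word (by cosine similarity) should be "queen".
candidates = {w: v for w, v in embeddings.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best, cosine(target, embeddings["queen"]))
```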

Learning Path: From Words to Attention

1. Word Representations

Start with raw text converted to numerical vectors:

"king" → [1.2, 0.8, 1.0] "queen" → [1.0, 0.9, 1.1] "man" → [0.7, 0.3, 0.5]
2. Embedding Space

Words are mapped to a continuous vector space where similar words cluster together.
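As a rough illustration of how similar words end up close together, the sketch below compares cosine similarities among the toy vectors above, plus a made-up "apple" vector for contrast.

```python
import numpy as np

vectors = {
    "king":  np.array([1.2, 0.8, 1.0]),
    "queen": np.array([1.0, 0.9, 1.1]),
    "man":   np.array([0.7, 0.3, 0.5]),
    "apple": np.array([0.1, 1.5, 0.2]),   # hypothetical, unrelated word
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for w in ("queen", "man", "apple"):
    print(f"similarity(king, {w}) = {cosine(vectors['king'], vectors[w]):.2f}")
# Related words ("queen", "man") score noticeably higher than the unrelated "apple".
```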

3. Attention Mechanism

Words interact through Query, Key, Value transformations:

Query (Q), Key (K), Value (V) → Attention

4. Multi-Head Processing

Multiple attention heads capture different aspects of relationships.

Multi-Head Attention Mechanism

Attention mechanisms allow models to focus on relevant parts of input data when making predictions. Multi-head attention splits this process into several parallel "heads," each learning different aspects of relationships in the data.

Key Components:

  • Query (Q): What we're looking for
  • Key (K): What we match against
  • Value (V): The information we retrieve

Benefits:

  • Parallel processing of information
  • Capture different types of relationships
  • More robust feature learning
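For a concrete feel of these pieces in practice, here is a minimal self-attention sketch using PyTorch's torch.nn.MultiheadAttention; the embedding size, head count, and sequence length are arbitrary example values.

```python
import torch
import torch.nn as nn

# 8 tokens, embedding dimension 64, split across 4 heads (64 / 4 = 16 dims per head).
embed_dim, num_heads, seq_len = 64, 4, 8
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)          # (batch, sequence, embedding)

# Self-attention: the same tensor serves as query, key, and value.
output, weights = mha(x, x, x, need_weights=True)

print(output.shape)   # torch.Size([1, 8, 64])  -> one contextualized vector per token
print(weights.shape)  # torch.Size([1, 8, 8])   -> attention weights, averaged over heads
```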

Mathematical Foundations of Word Embeddings

Word2Vec Skip-gram Objective Function

\[ J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \neq 0} \log p(w_{t+j} \mid w_t) \]

\[ p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)} \]

Where:

  • \(T\): number of words in the training corpus
  • \(c\): size of the context window
  • \(w_t\): the center word at position \(t\), and \(w_{t+j}\) a surrounding context word
  • \(v_w\), \(v'_w\): the "input" and "output" vector representations of word \(w\)
  • \(W\): the vocabulary size
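A small sketch of the softmax term \(p(w_O \mid w_I)\) on a toy vocabulary; the vocabulary and the randomly initialized embedding matrices are illustrative stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
W, d = len(vocab), 8                      # vocabulary size, embedding dimension

V_in  = rng.normal(size=(W, d))           # "input"  vectors v_w   (center words)
V_out = rng.normal(size=(W, d))           # "output" vectors v'_w  (context words)

def p_context_given_center(w_O, w_I):
    """Skip-gram softmax: p(w_O | w_I) = exp(v'_{w_O} . v_{w_I}) / sum_w exp(v'_w . v_{w_I})."""
    scores = V_out @ V_in[vocab.index(w_I)]        # one dot product per vocabulary word
    probs = np.exp(scores - scores.max())          # subtract max for numerical stability
    probs /= probs.sum()
    return probs[vocab.index(w_O)]

print(p_context_given_center("queen", "king"))     # probability of seeing "queen" near "king"
```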

Multi-Head Attention Mathematics

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O \]

\[ \text{where } \text{head}_i = \text{Attention}(QW^Q_i,\; KW^K_i,\; VW^V_i) \]

Key Components:

  • \(d_k\): dimension of the key vectors; dividing by \(\sqrt{d_k}\) keeps the dot products in a range where the softmax has useful gradients
  • \(W^Q_i\), \(W^K_i\), \(W^V_i\): learned projection matrices for head \(i\)
  • \(W^O\): learned output projection applied to the concatenated heads
  • \(h\): number of attention heads
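A from-scratch sketch of these equations in NumPy; the shapes, head count, and random projection matrices are illustrative assumptions, and biases and masking are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n): similarity of every query to every key
    return softmax(scores, axis=-1) @ V      # weighted sum of values

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Concat(head_1, ..., head_h) W_O, with head_i = Attention(X W_Q[i], X W_K[i], X W_V[i])."""
    heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4                     # sequence length, model dim, number of heads
d_k = d_model // h
X = rng.normal(size=(n, d_model))            # token embeddings (toy values)
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)                             # (6, 16): one contextualized vector per token
```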

Positional Encoding

\[ PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right) \]

\[ PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right) \]
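A short sketch of these two formulas; the sequence length and model dimension are arbitrary example values (an even \(d_{model}\) is assumed).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)      # pos / 10000^(2i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                           # PE(pos, 2i + 1)
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)            # (10, 16): added to the token embeddings before attention
```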

Self-Attention Complexity Analysis

Time Complexity: \(O(n^2 \cdot d)\)

Space Complexity: \(O(n^2)\)

Where \(n\) is the sequence length and \(d\) is the embedding dimension. The quadratic terms come from the score matrix \(QK^T\): each of the \(n\) queries is compared against all \(n\) keys, every comparison is a \(d\)-dimensional dot product, and the full \(n \times n\) matrix of attention weights is materialized for the softmax and the weighted sum over values.
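A rough back-of-envelope snippet showing how these terms grow with sequence length for a single attention head; the values of \(n\) and \(d\) are arbitrary.

```python
# Rough counts for a single attention head (naive implementation, no batching).
d = 512
for n in (128, 1024, 8192):
    score_flops = n * n * d          # Q K^T: n^2 dot products of length d
    weights = n * n                  # attention-matrix entries held in memory
    print(f"n={n:5d}: ~{score_flops:.2e} multiply-adds, {weights:.2e} weights")
```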

Advanced Concepts

Word Embedding Analysis

Word embedding analogy: king - man + woman ≈ queen

Multi-Head Attention Visualization
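One simple way to visualize multi-head attention is to plot each head's attention weights as a heatmap. The sketch below does this with matplotlib on random toy inputs, so the patterns are illustrative only, not learned behavior; the sample sentence and shapes are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, h = 6, 8, 2                               # tokens, per-head dim, heads (toy values)
tokens = ["the", "king", "spoke", "to", "the", "queen"]

fig, axes = plt.subplots(1, h, figsize=(8, 4))
for head, ax in enumerate(axes):
    Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
    weights = softmax(Q @ K.T / np.sqrt(d_k))     # (n, n) attention weights for this head
    im = ax.imshow(weights, cmap="viridis")
    ax.set_title(f"head {head}")
    ax.set_xticks(range(n), tokens, rotation=45)
    ax.set_yticks(range(n), tokens)
fig.colorbar(im, ax=axes, shrink=0.8)
plt.show()
```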