Multi-Head Attention & Word Embeddings Explorer

Understanding Word Embeddings

Word embeddings are dense vector representations of words in a continuous vector space. Unlike simple one-hot encoding, these vectors capture semantic relationships between words. Each dimension in the embedding space represents a latent feature learned from the data.

The famous example "king - man + woman ≈ queen" demonstrates how word embeddings capture semantic relationships:
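A minimal sketch of this arithmetic in NumPy, using the toy 3-dimensional vectors from the learning path below; the "woman" and "apple" vectors are made-up additions for illustration, not trained embeddings.

```python
import numpy as np

# Toy 3-dimensional embeddings (illustrative values, not trained vectors).
embeddings = {
    "king":  np.array([1.2, 0.8, 1.0]),
    "queen": np.array([1.0, 0.9, 1.1]),
    "man":   np.array([0.7, 0.3, 0.5]),
    "woman": np.array([0.5, 0.4, 0.6]),   # hypothetical value chosen for the example
    "apple": np.array([0.1, 1.5, 0.2]),   # hypothetical unrelated word
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Analogy arithmetic: king - man + woman
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# The nearest remaining word (by cosine similarity) should be "queen".
candidates = {w: v for w, v in embeddings.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best, cosine(target, embeddings["queen"]))
```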

Learning Path: From Words to Attention

1. Word Representations

Start with raw text converted to numerical vectors:

"king" → [1.2, 0.8, 1.0] "queen" → [1.0, 0.9, 1.1] "man" → [0.7, 0.3, 0.5]
2. Embedding Space

Words are mapped to a continuous vector space where similar words cluster together.
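As a rough illustration of how similar words end up close together, the sketch below compares cosine similarities among the toy vectors above, plus a made-up "apple" vector for contrast.

```python
import numpy as np

vectors = {
    "king":  np.array([1.2, 0.8, 1.0]),
    "queen": np.array([1.0, 0.9, 1.1]),
    "man":   np.array([0.7, 0.3, 0.5]),
    "apple": np.array([0.1, 1.5, 0.2]),   # hypothetical, unrelated word
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for w in ("queen", "man", "apple"):
    print(f"similarity(king, {w}) = {cosine(vectors['king'], vectors[w]):.2f}")
# Related words ("queen", "man") score noticeably higher than the unrelated "apple".
```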

3. Attention Mechanism

Words interact through Query, Key, Value transformations:

Query (Q), Key (K), Value (V) → Attention

4. Multi-Head Processing

Multiple attention heads capture different aspects of relationships.

Multi-Head Attention Mechanism

Attention mechanisms allow models to focus on relevant parts of input data when making predictions. Multi-head attention splits this process into several parallel "heads," each learning different aspects of relationships in the data.

Key Components:

  • Query (Q): What we're looking for
  • Key (K): What we match against
  • Value (V): The information we retrieve

Benefits:

  • Parallel processing of information
  • Capture different types of relationships
  • More robust feature learning
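For a concrete feel of these pieces in practice, here is a minimal self-attention sketch using PyTorch's torch.nn.MultiheadAttention; the embedding size, head count, and sequence length are arbitrary example values.

```python
import torch
import torch.nn as nn

# 8 tokens, embedding dimension 64, split across 4 heads (64 / 4 = 16 dims per head).
embed_dim, num_heads, seq_len = 64, 4, 8
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)          # (batch, sequence, embedding)

# Self-attention: the same tensor serves as query, key, and value.
output, weights = mha(x, x, x, need_weights=True)

print(output.shape)   # torch.Size([1, 8, 64])  -> one contextualized vector per token
print(weights.shape)  # torch.Size([1, 8, 8])   -> attention weights, averaged over heads
```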

Mathematical Foundations of Word Embeddings

Word2Vec Skip-gram Objective Function

\[ J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \neq 0} \log p(w_{t+j} \mid w_t) \]

\[ p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)} \]

Where:

  • \(T\): number of words in the training corpus
  • \(c\): size of the context window
  • \(w_t\): the center word at position \(t\), and \(w_{t+j}\) a surrounding context word
  • \(v_w\), \(v'_w\): the "input" and "output" vector representations of word \(w\)
  • \(W\): the vocabulary size
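A small sketch of the softmax term \(p(w_O \mid w_I)\) on a toy vocabulary; the vocabulary and the randomly initialized embedding matrices are illustrative stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple"]
W, d = len(vocab), 8                      # vocabulary size, embedding dimension

V_in  = rng.normal(size=(W, d))           # "input"  vectors v_w   (center words)
V_out = rng.normal(size=(W, d))           # "output" vectors v'_w  (context words)

def p_context_given_center(w_O, w_I):
    """Skip-gram softmax: p(w_O | w_I) = exp(v'_{w_O} . v_{w_I}) / sum_w exp(v'_w . v_{w_I})."""
    scores = V_out @ V_in[vocab.index(w_I)]        # one dot product per vocabulary word
    probs = np.exp(scores - scores.max())          # subtract max for numerical stability
    probs /= probs.sum()
    return probs[vocab.index(w_O)]

print(p_context_given_center("queen", "king"))     # probability of seeing "queen" near "king"
```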

Multi-Head Attention Mathematics

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O \]

\[ \text{where } \text{head}_i = \text{Attention}(QW^Q_i,\; KW^K_i,\; VW^V_i) \]

Key Components:

  • \(d_k\): dimension of the key vectors; dividing by \(\sqrt{d_k}\) keeps the dot products in a range where the softmax has useful gradients
  • \(W^Q_i\), \(W^K_i\), \(W^V_i\): learned projection matrices for head \(i\)
  • \(W^O\): learned output projection applied to the concatenated heads
  • \(h\): number of attention heads
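A from-scratch sketch of these equations in NumPy; the shapes, head count, and random projection matrices are illustrative assumptions, and biases and masking are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n): similarity of every query to every key
    return softmax(scores, axis=-1) @ V      # weighted sum of values

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Concat(head_1, ..., head_h) W_O, with head_i = Attention(X W_Q[i], X W_K[i], X W_V[i])."""
    heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4                     # sequence length, model dim, number of heads
d_k = d_model // h
X = rng.normal(size=(n, d_model))            # token embeddings (toy values)
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)                             # (6, 16): one contextualized vector per token
```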

Positional Encoding

\[ PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right) \]

\[ PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right) \]
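A short sketch of these two formulas; the sequence length and model dimension are arbitrary example values (an even \(d_{model}\) is assumed).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)      # pos / 10000^(2i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                           # PE(pos, 2i + 1)
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)            # (10, 16): added to the token embeddings before attention
```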

Self-Attention Complexity Analysis

Time Complexity: \(O(n^2 \cdot d)\)

Space Complexity: \(O(n^2)\)

Where \(n\) is the sequence length and \(d\) is the embedding dimension. The quadratic terms come from the score matrix \(QK^T\): each of the \(n\) queries is compared against all \(n\) keys, every comparison is a \(d\)-dimensional dot product, and the full \(n \times n\) matrix of attention weights is materialized for the softmax and the weighted sum over values.
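A rough back-of-envelope snippet showing how these terms grow with sequence length for a single attention head; the values of \(n\) and \(d\) are arbitrary.

```python
# Rough counts for a single attention head (naive implementation, no batching).
d = 512
for n in (128, 1024, 8192):
    score_flops = n * n * d          # Q K^T: n^2 dot products of length d
    weights = n * n                  # attention-matrix entries held in memory
    print(f"n={n:5d}: ~{score_flops:.2e} multiply-adds, {weights:.2e} weights")
```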

Advanced Concepts

Word Embedding Analysis

Word embedding analogy: king - man + woman ≈ queen

Multi-Head Attention Visualization
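One simple way to visualize multi-head attention is to plot each head's attention weights as a heatmap. The sketch below does this with matplotlib on random toy inputs, so the patterns are illustrative only, not learned behavior; the sample sentence and shapes are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, h = 6, 8, 2                               # tokens, per-head dim, heads (toy values)
tokens = ["the", "king", "spoke", "to", "the", "queen"]

fig, axes = plt.subplots(1, h, figsize=(8, 4))
for head, ax in enumerate(axes):
    Q, K = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k))
    weights = softmax(Q @ K.T / np.sqrt(d_k))     # (n, n) attention weights for this head
    im = ax.imshow(weights, cmap="viridis")
    ax.set_title(f"head {head}")
    ax.set_xticks(range(n), tokens, rotation=45)
    ax.set_yticks(range(n), tokens)
fig.colorbar(im, ax=axes, shrink=0.8)
plt.show()
```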