Transformer Notes

Transformer Architecture

Key Definitions

  • Token index:

A token index is an integer that identifies a specific token (word, subword, or character) within a vocabulary.


  • Vocab Size:

    This is the size of the vocabulary - the total number of unique tokens.


  • d_model:

    The embedding dimension - the length of the vector that will represent each token.


  • Embedding Layer:

The embedding layer maps each token index to a learned d_model-dimensional vector. It initializes a parameter matrix W of shape [Vocab Size, d_model]; row i of this matrix is the embedding vector for token i, so a lookup simply selects the row for each input index (a minimal sketch follows these definitions).


  • Self-Attention:

    A type of attention where the model relates different positions of a single sequence to compute a representation of that sequence.


  • Query (Q):

    Vector representing what we're looking for in the attention computation.


  • Key (K):

    Vector used to match against queries to determine attention weights.


  • Value (V):

    Vector containing the actual content to be aggregated in attention computation.


  • Encoder:

    Part of the transformer that processes the input sequence and creates a representation.


  • Decoder:

    Part of the transformer that generates the output sequence based on the encoder's representation.
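

  • Embedding lookup (sketch):

    A minimal PyTorch sketch of the lookup described above. The vocabulary size and d_model values are illustrative assumptions, not values from these notes.

        import torch
        import torch.nn as nn

        vocab_size, d_model = 10_000, 512   # illustrative sizes (assumed)

        # nn.Embedding allocates the [vocab_size, d_model] parameter matrix W;
        # row i is the learned vector for token index i.
        embedding = nn.Embedding(vocab_size, d_model)

        token_ids = torch.tensor([[5, 42, 7]])   # one sequence of 3 token indices
        vectors = embedding(token_ids)           # lookup: selects rows of W
        print(vectors.shape)                     # torch.Size([1, 3, 512])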


Transformer Fundamentals

  • Self-Attention Mechanism:

    Core computation in transformers:

    $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q$ holds the queries, $K$ the keys, $V$ the values, and $d_k$ is the key dimension. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with vanishing gradients (a runnable sketch follows this list).


  • Multi-Head Attention:
    • Parallel attention computations
    • Each head captures different aspects of relationships
• Head outputs are concatenated and projected to the final output (see the sketch below)
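
  • Attention in code:

    A direct transcription of the formula above into PyTorch - a minimal sketch, without dropout or other production details. The optional mask argument is used by the causal masking described under Training Techniques.

        import math
        import torch

        def scaled_dot_product_attention(Q, K, V, mask=None):
            # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
            d_k = Q.size(-1)
            scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # [..., len_q, len_k]
            if mask is not None:
                # positions where the mask is False get -inf, i.e. zero weight
                scores = scores.masked_fill(mask == 0, float('-inf'))
            weights = torch.softmax(scores, dim=-1)            # rows sum to 1
            return weights @ V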

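  • Multi-head attention (sketch):

    A minimal sketch of multi-head attention, reusing scaled_dot_product_attention from the sketch above: heads are split with reshapes, attended in parallel, then concatenated and projected back to d_model.

        import torch.nn as nn

        class MultiHeadAttention(nn.Module):
            def __init__(self, d_model, num_heads):
                super().__init__()
                assert d_model % num_heads == 0
                self.num_heads = num_heads
                self.d_k = d_model // num_heads
                self.w_q = nn.Linear(d_model, d_model)
                self.w_k = nn.Linear(d_model, d_model)
                self.w_v = nn.Linear(d_model, d_model)
                self.w_o = nn.Linear(d_model, d_model)  # final output projection

            def forward(self, x, mask=None):
                B, T, _ = x.shape

                def split(t):  # [B, T, d_model] -> [B, num_heads, T, d_k]
                    return t.view(B, T, self.num_heads, self.d_k).transpose(1, 2)

                q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
                out = scaled_dot_product_attention(q, k, v, mask)      # per head
                out = out.transpose(1, 2).contiguous().view(B, T, -1)  # concat heads
                return self.w_o(out)
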
Architecture Components

  • Encoder Block:
    • Multi-head self-attention
    • Feed-forward neural network
    • Layer normalization
    • Residual connections

    A minimal sketch combining these components appears after the positional-encoding formulas below.

  • Positional Encoding:

    $$PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$$

    $$PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$$

These sinusoids are added to the token embeddings to inject position information (sketch below).
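
  • Positional encoding in code:

    A minimal sketch computing the sinusoidal table above. It assumes an even d_model and returns a [max_len, d_model] tensor that is added to the token embeddings.

        import torch

        def sinusoidal_positional_encoding(max_len, d_model):
            assert d_model % 2 == 0, "sketch assumes an even d_model"
            pos = torch.arange(max_len).unsqueeze(1)   # [max_len, 1]
            two_i = torch.arange(0, d_model, 2)        # 2i = 0, 2, 4, ...
            div = torch.pow(10000.0, two_i / d_model)  # 10000^(2i/d_model)
            pe = torch.zeros(max_len, d_model)
            pe[:, 0::2] = torch.sin(pos / div)         # even dimensions
            pe[:, 1::2] = torch.cos(pos / div)         # odd dimensions
            return pe

    Usage: x = embedding(token_ids) + sinusoidal_positional_encoding(T, d_model).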

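  • Encoder block (sketch):

    A minimal post-norm encoder layer tying together the four components listed above, reusing MultiHeadAttention from the earlier sketch. The feed-forward width d_ff is an assumed hyperparameter.

        import torch.nn as nn

        class EncoderBlock(nn.Module):
            def __init__(self, d_model, num_heads, d_ff):
                super().__init__()
                self.attn = MultiHeadAttention(d_model, num_heads)
                self.ffn = nn.Sequential(
                    nn.Linear(d_model, d_ff),
                    nn.ReLU(),
                    nn.Linear(d_ff, d_model),
                )
                self.norm1 = nn.LayerNorm(d_model)
                self.norm2 = nn.LayerNorm(d_model)

            def forward(self, x, mask=None):
                x = self.norm1(x + self.attn(x, mask))  # residual + layer norm
                x = self.norm2(x + self.ffn(x))         # residual + layer norm
                return x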

Training Techniques

  • Masked Training:

Used in the decoder for autoregressive generation:

    • Prevents each position from attending to future tokens (see the mask sketch after this list)
    • Essential for language modeling

  • Pre-training Objectives:
    • Masked Language Modeling (MLM)
    • Next Sentence Prediction (NSP)
    • Causal Language Modeling
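
  • Causal mask (sketch):

    As noted under Masked Training, the decoder must not attend to future tokens. A minimal sketch of the lower-triangular mask, which plugs into the mask argument of the attention sketch above:

        import torch

        def causal_mask(seq_len):
            # True on and below the diagonal: position t may attend to <= t
            return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

        print(causal_mask(4))
        # tensor([[ True, False, False, False],
        #         [ True,  True, False, False],
        #         [ True,  True,  True, False],
        #         [ True,  True,  True,  True]])

    Masked scores become -inf before the softmax, so those positions receive zero attention weight.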

Model Variants

  • Popular Architectures:
    • BERT: Bidirectional Encoder
    • GPT: Autoregressive Decoder
    • T5: Text-to-Text Transfer Transformer
    • BART: Denoising Autoencoder

  • Specialized Variants:
    • ViT: Vision Transformer
    • DALL-E: Text-to-Image Generation
    • Perceiver: Universal Architecture

Advanced Concepts

  • Efficiency Improvements:
    • Sparse Attention Patterns
    • Linear Attention Mechanisms
    • Parameter Sharing Techniques

  • Scaling Techniques:
    • Model Parallelism
    • Pipeline Parallelism
    • Mixed Precision Training