
Linear Algebra Is All You Need: The Math Behind Every AI Model

From matrix multiplications to attention heads — why vectors and transformations are the actual language of artificial intelligence.

Introduction

Every AI paper you read, every model you train, every inference you run reduces to the same thing: linear algebra. Not calculus, not probability, not information theory. Those matter, but they are not the substrate. The substrate is vectors, matrices, and the transformations between them.

A neural network's forward pass is a sequence of matrix multiplications interleaved with nonlinearities. Backpropagation computes gradients through those same matrices. Attention, the mechanism that made transformers dominant, is a structured dot product. Embeddings are vectors. Loss landscapes are surfaces in high-dimensional vector spaces. The entire edifice of modern AI is built on operations that a second-year linear algebra student could, in principle, perform by hand.

This post is not a textbook review. It is a guided tour of the specific linear algebra concepts that actually run inside every AI model, from the simplest feedforward network to GPT-scale transformers. If you understand these operations, you understand what the machine is doing. If you do not, no amount of API calls will give you that understanding.

[Image: high-performance GPU hardware powering matrix operations at scale. Caption: Every GPU cycle spent on AI training is a matrix operation; linear algebra is the computational bedrock. Photo on Unsplash.]

Vectors: The Language of Data

Before anything can be processed by a neural network, it must become a vector: an ordered list of numbers in $\mathbb{R}^n$. This is not a metaphor. It is a literal requirement. Every input to every AI model is a point in some high-dimensional vector space.

A word becomes a vector through an embedding table. The word “king” might map to a vector in $\mathbb{R}^{768}$: 768 floating-point numbers that encode the word's semantic identity. An image becomes a vector (or a tensor of vectors) by reading pixel intensities into arrays. An audio waveform becomes a vector by sampling amplitudes at discrete time steps.

1. Text: Each token maps to an embedding vector $\mathbf{x} \in \mathbb{R}^d$ via a learned lookup table $E \in \mathbb{R}^{V \times d}$, where $V$ is the vocabulary size and $d$ is the embedding dimension.
2. Images: A 224x224 RGB image is a tensor in $\mathbb{R}^{3 \times 224 \times 224}$. Vision transformers patch this into a sequence of vectors, each patch a flattened $\mathbb{R}^{768}$ vector.
3. Audio: A spectrogram converts waveforms into a 2D matrix of frequency magnitudes over time, which is then treated as a sequence of feature vectors.

The key insight is that once data is in vector form, the same mathematical operations apply regardless of modality. A matrix multiply does not care whether the input vector represents a word, a pixel patch, or an audio frame. This is why the same transformer architecture works for text, images, and audio. The architecture operates on vectors. The embedding layer is the only part that knows what the data originally was.
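
To make the text case concrete, here is a minimal sketch of an embedding lookup in NumPy. The vocabulary size, embedding dimension, and token ids are placeholders chosen for illustration; in a real model the table $E$ is learned during training.

python
import numpy as np

np.random.seed(0)
V, d = 10_000, 768                    # Hypothetical vocabulary size and embedding dimension
E = np.random.randn(V, d)             # Lookup table E in R^(V x d); random here, learned in practice

token_ids = np.array([42, 1337, 7])   # A toy three-token sequence
X = E[token_ids]                      # Row lookup: each token id becomes a d-dimensional vector

print(X.shape)   # (3, 768): a sequence of three vectors, ready for matrix multiplication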

Why this matters

The universality of AI architectures is not a design choice. It is a consequence of representing everything as vectors. Once you are in vector space, the geometry of linear algebra provides a universal language for similarity, transformation, and composition.

Matrix Multiplication: The Core Operation

If there is one operation that defines neural networks, it is matrix multiplication. A single layer of a neural network computes:

$\mathbf{y} = W\mathbf{x} + \mathbf{b}$

where $\mathbf{x} \in \mathbb{R}^n$ is the input vector, $W \in \mathbb{R}^{m \times n}$ is the weight matrix, and $\mathbf{b} \in \mathbb{R}^m$ is a bias vector. The output $\mathbf{y} \in \mathbb{R}^m$ is a new vector in a potentially different-dimensional space. This is a linear transformation: the matrix $W$ rotates, scales, and projects the input.

A feedforward neural network is just this operation applied repeatedly with nonlinear activation functions between layers:

$\mathbf{h}_1 = \sigma(W_1\mathbf{x} + \mathbf{b}_1)$
$\mathbf{h}_2 = \sigma(W_2\mathbf{h}_1 + \mathbf{b}_2)$
$\hat{\mathbf{y}} = W_3\mathbf{h}_2 + \mathbf{b}_3$

Without the activation function $\sigma$, the entire stack of layers would collapse into a single matrix multiplication: $W_3 W_2 W_1 \mathbf{x}$ is still a linear transformation. The nonlinearity is what gives depth its power. But the skeleton of every forward pass is matrix multiplication.
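
A quick NumPy check of that collapse: with no activation between them, three stacked linear layers are numerically identical to one precomputed matrix (a minimal sketch with arbitrary small dimensions).

python
import numpy as np

np.random.seed(0)
W1 = np.random.randn(5, 4)
W2 = np.random.randn(6, 5)
W3 = np.random.randn(3, 6)
x = np.random.randn(4)

stacked = W3 @ (W2 @ (W1 @ x))      # Three "layers" applied one after another, no nonlinearity
collapsed = (W3 @ W2 @ W1) @ x      # One matrix, computed once

print(np.allclose(stacked, collapsed))   # True: depth without nonlinearity adds nothing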

Dimensions Tell the Story

Reading a neural network's architecture is really reading a chain of matrix dimensions. A transformer block with hidden dimension $d = 768$ and MLP intermediate size $4d = 3072$ has weight matrices $W_1 \in \mathbb{R}^{3072 \times 768}$ and $W_2 \in \mathbb{R}^{768 \times 3072}$. The first expands the representation into a higher-dimensional space where the nonlinearity can separate features. The second projects it back. Every matrix shape is a design decision about how information flows.
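
As a sanity check on those shapes, here is the expand-then-project pattern of that MLP block in NumPy (random weights and a plain ReLU stand in for the learned weights and whatever activation the model actually uses):

python
import numpy as np

d, d_ff = 768, 3072
W1 = np.random.randn(d_ff, d) * 0.02   # Expansion: 768 -> 3072
W2 = np.random.randn(d, d_ff) * 0.02   # Projection back: 3072 -> 768

x = np.random.randn(d)
h = np.maximum(0, W1 @ x)              # Hidden representation lives in the wider space
y = W2 @ h                             # Back to the model's hidden dimension

print(x.shape, h.shape, y.shape)       # (768,) (3072,) (768,)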

The computational bottleneck

A GPT-3-scale model (175B parameters) performs roughly $3.5 \times 10^{14}$ floating-point multiply-accumulate operations per forward pass. Almost all of them are matrix multiplications. This is why AI progress tracks GPU FLOPS, and why NVIDIA has become one of the most valuable companies in the world.
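
As a rough back-of-envelope (my own estimate, using the common approximation of one multiply-accumulate per parameter per token), the arithmetic lands in the same ballpark:

python
params = 175e9           # GPT-3 parameter count
tokens = 2048            # Assumed context length for a single forward pass
macs = params * tokens   # ~1 multiply-accumulate per parameter per token

print(f"{macs:.2e}")     # ~3.58e+14, the order of magnitude quoted above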

Dot Products and Similarity

The dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as:

$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$

where $\theta$ is the angle between the vectors. This single operation is the foundation of similarity in AI. When two embedding vectors have a high dot product, they point in roughly the same direction in the embedding space. When the dot product is near zero, they are nearly orthogonal, meaning unrelated.

Cosine similarity normalizes the dot product by the magnitudes of both vectors:

$\text{cos\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}$

This gives a value between $-1$ and $1$, independent of vector magnitude. It measures direction, not scale, which is exactly what you want when comparing semantic meaning.
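
Both quantities are a line of NumPy each. A small sketch with made-up vectors:

python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # Same direction as a, twice the magnitude
c = np.array([-3.0, 0.0, 1.0])    # Orthogonal to a

print(np.dot(a, b))               # 28.0: large dot product
print(cosine_similarity(a, b))    # 1.0: identical direction, magnitude ignored
print(cosine_similarity(a, c))    # 0.0: orthogonal, hence "unrelated"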

1. Search engines: Encode the query and every document as vectors. Return documents with the highest cosine similarity to the query. This is how semantic search works at every major tech company.
2. Recommendation systems: User preferences and item features are vectors. The dot product between them predicts how much a user will like an item. Netflix, Spotify, and Amazon all use variants of this.
3. Word analogies: The famous result $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ is vector arithmetic. The relationships between words are encoded as directions in embedding space, and the dot product measures alignment with those directions.

The geometric intuition

Every time you use a search engine, a recommendation system, or a chatbot with retrieval, the system is computing dot products in a high-dimensional space. Similarity is geometry. Relevance is proximity. The dot product is how machines measure both.

Eigenvalues and Eigenvectors

An eigenvector of a matrix $A$ is a nonzero vector $\mathbf{v}$ that, when multiplied by $A$, only gets scaled, not rotated:

$A\mathbf{v} = \lambda\mathbf{v}$

The scalar $\lambda$ is the eigenvalue. In plain terms: eigenvectors are the directions along which a linear transformation acts as simple scaling. Every other vector gets both scaled and rotated. Eigenvectors are the special directions where the matrix's action is maximally simple.

Why does this matter for AI? Because of Principal Component Analysis (PCA). Given a dataset of $n$ vectors in $\mathbb{R}^d$, the covariance matrix $C \in \mathbb{R}^{d \times d}$ captures how features vary together. The eigenvectors of $C$ are the principal components, the directions along which the data varies most.

$C = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^\top$

The eigenvalues of $C$ tell you how much variance lies along each principal component. If the first 50 eigenvalues capture 95% of the total variance, you can project your data onto those 50 eigenvectors and discard the rest. You have reduced dimensionality from $d$ to 50 with only 5% information loss.
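
A minimal PCA sketch on synthetic data: build the covariance matrix, take its eigendecomposition, and project onto the top components. The data, dimensions, and the choice of keeping three components are all made up for illustration.

python
import numpy as np

np.random.seed(0)
n, d = 500, 10
# Synthetic data whose variance is concentrated in roughly 3 directions
X = np.random.randn(n, 3) @ np.random.randn(3, d) + 0.05 * np.random.randn(n, d)

mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / n                        # Covariance matrix, (d, d)

eigvals, eigvecs = np.linalg.eigh(C)                 # eigh: for symmetric matrices, ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # Re-sort so the largest eigenvalue comes first

explained = eigvals.cumsum() / eigvals.sum()
print(explained[:3])                                 # Top 3 components capture nearly all the variance

k = 3
X_reduced = (X - mu) @ eigvecs[:, :k]                # Project onto the top-k principal components
print(X_reduced.shape)                               # (500, 3)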

1. Dimensionality reduction: PCA on high-dimensional embeddings compresses them to a manageable size while preserving the most informative structure.
2. Data visualization: Project 768-dimensional embeddings to 2D or 3D using the top eigenvectors of the covariance matrix. PCA is a common preprocessing or initialization step for t-SNE and UMAP (before their nonlinear refinements).
3. Noise removal: Small eigenvalues correspond to directions of minimal variance, which are often noise. Discarding these components denoises the data.

Eigenvalues as a lens

When you compute eigenvalues, you are asking: what are the natural axes of this data? Where does it stretch, and where does it compress? The eigenvalue spectrum of a dataset's covariance matrix is a fingerprint of its internal structure.

Singular Value Decomposition

The Singular Value Decomposition (SVD) generalizes eigendecomposition to any matrix, not just square ones. Every matrix $A \in \mathbb{R}^{m \times n}$ can be decomposed as:

$A = U \Sigma V^\top$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices (rotations/reflections), and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix of singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$. The decomposition says: every linear transformation is a rotation, followed by axis-aligned scaling, followed by another rotation.

Low-Rank Approximation

The power of SVD lies in truncation. If you keep only the top $k$ singular values and their corresponding columns of $U$ and $V$, you get the best rank-$k$ approximation of the original matrix:

$A_k = U_k \Sigma_k V_k^\top$

This is the Eckart-Young theorem: no other rank-$k$ matrix is closer to $A$ in the Frobenius norm. This result is everywhere in AI:

1. Image compression: An image is a matrix of pixel values. SVD decomposes it into a sum of rank-1 matrices. Keeping the top 50 components often captures 95% of the visual information at a fraction of the storage.
2. Latent spaces: Variational autoencoders and diffusion models learn compressed representations. The bottleneck is essentially a learned low-rank approximation of the data manifold.
3. LoRA (Low-Rank Adaptation): Instead of fine-tuning all parameters of a large model, LoRA decomposes weight updates as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with $r \ll d$. This is a direct application of low-rank matrix factorization.

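Here is a minimal sketch of truncated SVD as compression, on a synthetic low-rank-plus-noise matrix (with an image you would load pixel values into A instead):

python
import numpy as np

np.random.seed(0)
# A 200 x 100 matrix that is approximately rank 10
A = np.random.randn(200, 10) @ np.random.randn(10, 100) + 0.01 * np.random.randn(200, 100)

U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 10
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]             # Best rank-k approximation (Eckart-Young)

rel_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(f"Rank-{k} relative error: {rel_error:.4f}")   # Near zero: almost all structure is kept

# Storage cost: full matrix vs. truncated factors
print(A.size, k * (A.shape[0] + A.shape[1] + 1))     # 20000 vs. 3010 numbers
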
The compression principle

SVD reveals that most real-world data matrices are approximately low-rank. The information content is concentrated in a small number of singular values. This is not a mathematical trick. It is a statement about the structure of the world: high-dimensional data almost always lives on a much lower-dimensional manifold.

The Transformer Is Pure Linear Algebra

The transformer architecture, the foundation of GPT, BERT, and every modern language model, is a carefully orchestrated sequence of linear algebra operations. The core mechanism is scaled dot-product attention, and understanding it requires no concept beyond what we have already covered.

Given an input sequence of $n$ token embeddings packed into a matrix $X \in \mathbb{R}^{n \times d}$, the attention mechanism first computes three matrices through linear projections:

$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are learned weight matrices. $Q$ is the query (what each token is looking for), $K$ is the key (what each token advertises about itself), and $V$ is the value (the actual information each token carries).

Attention is then computed as:

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$

Let me unpack this operation by operation:

1. $QK^\top$ computes a dot product between every query and every key. The result is an $n \times n$ matrix where entry $(i, j)$ measures how much token $i$ should attend to token $j$. This is a similarity matrix built from dot products.
2. Scaling by $\frac{1}{\sqrt{d_k}}$ prevents dot products from growing too large in high dimensions. Without this, the softmax would saturate and produce near-one-hot distributions, killing gradient flow.
3. Softmax normalizes each row to a probability distribution. Token $i$'s attention weights over all tokens sum to 1. This is the only nonlinear operation in the attention mechanism.
4. Multiplying by $V$ produces a weighted sum of value vectors for each token. Each token's output is a mixture of all other tokens' values, weighted by attention scores. This is how information flows between positions.

Multi-Head Attention

In practice, attention is computed in parallel across $h$ heads, each with its own projection matrices $W_Q^i, W_K^i, W_V^i \in \mathbb{R}^{d \times d_k}$ where $d_k = d / h$. The outputs are concatenated and projected back:

$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W_O$

Each head learns to attend to different types of relationships: syntactic, semantic, positional. The concatenation and final projection $W_O \in \mathbb{R}^{d \times d}$ combine these perspectives into a single representation. The entire operation is matrix multiplications and a softmax. Nothing more.
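
A compact sketch of the multi-head computation (toy dimensions, random weights; the single-head attention inside each loop iteration is the same operation implemented in the verification section below):

python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, one per head."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        outputs.append(softmax(scores) @ V)           # One head's output: (n, d_k)
    return np.concatenate(outputs, axis=-1) @ W_O     # Concatenate heads, project back to d

np.random.seed(0)
n, d, h = 6, 8, 2
d_k = d // h
heads = [tuple(np.random.randn(d, d_k) * 0.5 for _ in range(3)) for _ in range(h)]
W_O = np.random.randn(d, d) * 0.5

X = np.random.randn(n, d)
print(multi_head_attention(X, heads, W_O).shape)      # (6, 8): same shape as the input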

The attention insight

Attention is not a mysterious neural mechanism. It is a lookup table built from dot products. The queries ask questions, the keys provide addresses, and the values supply answers. The softmax just normalizes the addressing. Every breakthrough in large language models is, at its core, an improvement to this dot-product addressing scheme.

Why GPUs Exist for This

Matrix multiplication is embarrassingly parallel. To compute $C = AB$ where $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$, each element $C_{ij}$ is an independent dot product:

$C_{ij} = \sum_{l=1}^{k} A_{il} B_{lj}$

All $m \times n$ elements of $C$ can be computed simultaneously. A CPU with 16 cores can compute 16 dot products at once. An NVIDIA H100 GPU has 16,896 CUDA cores and can execute thousands of dot products per clock cycle through SIMD (Single Instruction, Multiple Data) parallelism.
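
A tiny demonstration of that independence: computing every entry of $C$ as its own explicit dot product gives the same result as the vectorized matmul, and none of those per-element computations depends on any other (a sketch of the parallelism, not of how GPUs actually tile the work):

python
import numpy as np

np.random.seed(0)
m, k, n = 4, 5, 3
A = np.random.randn(m, k)
B = np.random.randn(k, n)

# Each C[i, j] is an independent dot product of row i of A with column j of B
C_manual = np.empty((m, n))
for i in range(m):
    for j in range(n):
        C_manual[i, j] = np.dot(A[i, :], B[:, j])

print(np.allclose(C_manual, A @ B))   # True: identical result, m*n independent dot products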

1. SIMD parallelism: Every CUDA core applies the same multiply-accumulate instruction to different data elements simultaneously. Matrix multiplication is the ideal workload because every element requires the same operation pattern.
2. Memory bandwidth: The H100 has 3.35 TB/s of HBM3 bandwidth. Large matrix multiplications are compute-bound (not memory-bound), meaning the GPU can keep its cores fed with data. This is why larger batch sizes improve GPU utilization.
3. Tensor Cores: Specialized hardware units that compute small matrix multiplications (such as $4 \times 4$ tiles) in a single cycle. The H100's Tensor Cores deliver nearly 1,000 TFLOPS of FP16 throughput, roughly 100x faster than general-purpose CUDA cores for matrix math.

This is not a coincidence. GPUs were originally designed for rendering 3D graphics, which requires transforming millions of vertices through $4 \times 4$ transformation matrices every frame. The same hardware that rotates polygons in video games multiplies weight matrices in neural networks. When NVIDIA pivoted to AI compute, they were not building new capability. They were repurposing existing capability for a structurally identical workload.

The hardware-math alignment

AI did not become practical when algorithms improved. It became practical when hardware caught up to the math. The algorithms (backpropagation, gradient descent, attention) existed for decades. What changed was the availability of massively parallel hardware purpose-built for the only operation that matters: matrix multiplication.

Python Verification

Theory is convincing. Code is conclusive. Let us verify the core linear algebra operations that run inside a neural network using NumPy. First, a minimal forward pass through a two-layer network:

python
import numpy as np

# Seed for reproducibility
np.random.seed(42)

# Two-layer feedforward network
d_in, d_hidden, d_out = 4, 8, 3
W1 = np.random.randn(d_hidden, d_in) * 0.5
b1 = np.zeros(d_hidden)
W2 = np.random.randn(d_out, d_hidden) * 0.5
b2 = np.zeros(d_out)

def relu(x):
    return np.maximum(0, x)

def forward(x):
    h = relu(W1 @ x + b1)    # Linear transform + activation
    y = W2 @ h + b2          # Output projection
    return y

x = np.array([1.0, 0.5, -0.3, 0.8])
output = forward(x)
print(f"Input:  {x}")
print(f"Output: {output}")
print(f"Shape:  {x.shape} → {output.shape}")
# The entire forward pass is two matrix multiplications

Every line of the forward pass is a matrix-vector product. The @ operator in NumPy is matrix multiplication. The ReLU is applied elementwise between the two linear transformations. That is the entire computation.

Now, the attention mechanism. This is a complete, working implementation of scaled dot-product attention in 15 lines:

python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention from scratch."""
    Q = X @ W_Q                        # Queries:  (n, d_k)
    K = X @ W_K                        # Keys:     (n, d_k)
    V = X @ W_V                        # Values:   (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # Similarity matrix:  (n, n)
    weights = softmax(scores)          # Attention weights:  (n, n)
    return weights @ V                 # Weighted values:    (n, d_v)

# Simulate a 6-token sequence with d=8
n_tokens, d_model, d_k = 6, 8, 4
X = np.random.randn(n_tokens, d_model)
W_Q = np.random.randn(d_model, d_k) * 0.5
W_K = np.random.randn(d_model, d_k) * 0.5
W_V = np.random.randn(d_model, d_k) * 0.5

output = attention(X, W_Q, W_K, W_V)
print(f"Input shape:   {X.shape}")          # (6, 8)
print(f"Output shape:  {output.shape}")     # (6, 4)
print(f"Attention out:\n{output.round(3)}")
# 3 matrix multiplications + softmax = attention

Count the linear algebra operations: three matrix multiplications to compute $Q$, $K$, $V$; one matrix multiplication for the scores $QK^\top$; one matrix multiplication for the output $\text{weights} \times V$. Five matrix multiplications and a softmax. That is the attention mechanism. There is nothing else hiding underneath.

The verification

Both examples are runnable as-is. Copy them into a Python file or notebook. The forward pass demonstrates that neural network inference is chained matrix multiplications. The attention function demonstrates that the transformer's core innovation is a dot-product similarity lookup. No frameworks, no abstractions, just NumPy doing linear algebra.

Conclusion

Every concept in this post (vectors, matrix multiplication, dot products, eigenvalues, SVD, attention) maps directly to operations running inside AI models right now. These are not background theory. They are the actual computation.

When a language model generates a sentence, it is multiplying matrices. When an image model recognizes a face, it is computing dot products. When a recommendation engine suggests a song, it is measuring cosine similarity in a vector space. When a fine-tuned model adapts to a new task through LoRA, it is exploiting low-rank matrix structure. The math is not an abstraction over the computation. The math is the computation.

This is why I find linear algebra more illuminating than any framework tutorial. PyTorch and TensorFlow are interfaces. The operations beneath them are the same ones Euler, Gauss, and Sylvester studied centuries ago. Understanding those operations, not just calling functions that execute them, is what separates someone who uses AI from someone who understands it.

If you want to go deeper, the path is clear: implement things from scratch. Write a forward pass by hand. Build attention with nothing but NumPy. Compute an SVD and reconstruct an image from truncated singular values. The linear algebra will stop being abstract the moment you see it produce real outputs. And once it does, every AI paper you read will make more sense, because you will recognize the operations underneath the notation.

The substrate of intelligence

Linear algebra is not a prerequisite for AI. It is not a tool used by AI. It is the substrate of AI. Vectors are the representation. Matrices are the transformation. Dot products are the similarity measure. Eigenvalues are the structure detector. Every advance in artificial intelligence is, at bottom, an advance in how we multiply matrices. Understand the math, and you understand the machine.
