Introduction
Every AI paper you read, every model you train, every inference you run reduces to the same thing: linear algebra. Not calculus, not probability, not information theory. Those matter, but they are not the substrate. The substrate is vectors, matrices, and the transformations between them.
A neural network's forward pass is a sequence of matrix multiplications interleaved with nonlinearities. Backpropagation computes gradients through those same matrices. Attention, the mechanism that made transformers dominant, is a structured dot product. Embeddings are vectors. Loss landscapes are surfaces in high-dimensional vector spaces. The entire edifice of modern AI is built on operations that a second-year linear algebra student could, in principle, perform by hand.
This post is not a textbook review. It is a guided tour of the specific linear algebra concepts that actually run inside every AI model, from the simplest feedforward network to GPT-scale transformers. If you understand these operations, you understand what the machine is doing. If you do not, no amount of API calls will give you that understanding.

Vectors: The Language of Data
Before anything can be processed by a neural network, it must become a vector: an ordered list of numbers in $\mathbb{R}^n$. This is not a metaphor. It is a literal requirement. Every input to every AI model is a point in some high-dimensional vector space.
A word becomes a vector through an embedding table. The word “king” might map to a vector in $\mathbb{R}^{768}$: 768 floating-point numbers that encode the word's semantic identity. An image becomes a vector (or a tensor of vectors) by reading pixel intensities into arrays. An audio waveform becomes a vector by sampling amplitudes at discrete time steps.
The key insight is that once data is in vector form, the same mathematical operations apply regardless of modality. A matrix multiply does not care whether the input vector represents a word, a pixel patch, or an audio frame. This is why the same transformer architecture works for text, images, and audio. The architecture operates on vectors. The embedding layer is the only part that knows what the data originally was.
The universality of AI architectures is not a design choice. It is a consequence of representing everything as vectors. Once you are in vector space, the geometry of linear algebra provides a universal language for similarity, transformation, and composition.
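The embedding lookup described above is worth seeing concretely. Here is a minimal sketch with a hypothetical three-word vocabulary (real models use vocabularies of tens of thousands of tokens and dimensions like 768 or 4096; the names and sizes here are purely illustrative):

```python
import numpy as np

np.random.seed(0)

# Hypothetical toy vocabulary and embedding table.
vocab = {"king": 0, "queen": 1, "apple": 2}
d_model = 8
embedding_table = np.random.randn(len(vocab), d_model)

def embed(word):
    # An embedding lookup is just a row index into a matrix.
    return embedding_table[vocab[word]]

v = embed("king")
print(v.shape)  # (8,)
```

The point is that "embedding" is nothing more exotic than indexing a row of a matrix: once the row is retrieved, the model sees only a vector.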
Matrix Multiplication: The Core Operation
If there is one operation that defines neural networks, it is matrix multiplication. A single layer of a neural network computes:

$$y = Wx + b$$

where $x$ is the input vector, $W$ is the weight matrix, and $b$ is a bias vector. The output $y$ is a new vector in a potentially different-dimensional space. This is a linear transformation: the matrix rotates, scales, and projects the input.
A feedforward neural network is just this operation applied repeatedly with nonlinear activation functions between layers:

$$h_1 = \sigma(W_1 x + b_1), \quad h_2 = \sigma(W_2 h_1 + b_2), \quad \dots$$

Without the activation function $\sigma$, the entire stack of layers would collapse into a single matrix multiplication: $W_2(W_1 x) = (W_2 W_1)x$ is still a linear transformation. The nonlinearity is what gives depth its power. But the skeleton of every forward pass is matrix multiplication.
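The collapse of stacked linear layers is easy to verify numerically. A quick sketch with arbitrary random weights:

```python
import numpy as np

np.random.seed(0)
W1 = np.random.randn(8, 4)
W2 = np.random.randn(3, 8)
x = np.random.randn(4)

# Two linear layers applied in sequence, with no nonlinearity...
two_layers = W2 @ (W1 @ x)

# ...are exactly one linear layer whose weight is the matrix product.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```

No matter how many linear layers you stack, the composition is a single matrix. Depth only buys expressive power once a nonlinearity sits between the multiplications.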
Dimensions Tell the Story
Reading a neural network's architecture is really reading a chain of matrix dimensions. A transformer block with hidden dimension $d_{\text{model}}$ and MLP intermediate size $4d_{\text{model}}$ has weight matrices $W_1 \in \mathbb{R}^{d_{\text{model}} \times 4d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{4d_{\text{model}} \times d_{\text{model}}}$. The first expands the representation into a higher-dimensional space where the nonlinearity can separate features. The second projects it back. Every matrix shape is a design decision about how information flows.
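The expand-then-project shape of a transformer MLP can be sketched in a few lines. The dimensions below (768 hidden, 4x expansion) match the GPT-style convention described above; the 0.02 weight scale is an arbitrary illustrative choice:

```python
import numpy as np

np.random.seed(0)
d_model, d_ff = 768, 3072  # 4x expansion, as in the text

W_up = np.random.randn(d_model, d_ff) * 0.02    # expand: 768 -> 3072
W_down = np.random.randn(d_ff, d_model) * 0.02  # project back: 3072 -> 768

x = np.random.randn(d_model)
h = np.maximum(0, x @ W_up)  # nonlinearity in the wide space
y = h @ W_down               # back to the residual stream's width

print(x.shape, h.shape, y.shape)  # (768,) (3072,) (768,)
```

Following the shapes is following the architecture: the representation widens, the nonlinearity acts, and the projection narrows it again.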
A GPT-3-scale model (175B parameters) performs roughly $1.75 \times 10^{11}$ floating-point multiply-accumulate operations per token of a forward pass, about one per parameter. Almost all of them are matrix multiplications. This is why AI progress tracks GPU FLOPS, and why NVIDIA became one of the most valuable companies in the world.
Dot Products and Similarity
The dot product of two vectors $a$ and $b$ is defined as:

$$a \cdot b = \sum_{i=1}^{n} a_i b_i = \|a\| \, \|b\| \cos\theta$$

where $\theta$ is the angle between the vectors. This single operation is the foundation of similarity in AI. When two embedding vectors have a high dot product, they point in roughly the same direction in the embedding space. When the dot product is near zero, they are orthogonal, meaning unrelated.
Cosine similarity normalizes the dot product by the magnitudes of both vectors:

$$\text{sim}(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$$

This gives a value between $-1$ and $1$, independent of vector magnitude. It measures direction, not scale, which is exactly what you want when comparing semantic meaning.
Every time you use a search engine, a recommendation system, or a chatbot with retrieval, the system is computing dot products in a high-dimensional space. Similarity is geometry. Relevance is proximity. The dot product is how machines measure both.
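The geometry described above takes one line of NumPy. A minimal sketch, using a parallel pair and an orthogonal pair of vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of magnitudes.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude
c = np.array([-3.0, 0.0, 1.0])  # dot product with a is zero

print(cosine_similarity(a, b))  # 1.0: parallel, magnitude ignored
print(cosine_similarity(a, c))  # 0.0: orthogonal, "unrelated"
```

Note that $b$ is twice $a$ yet scores a perfect 1.0: cosine similarity sees only direction, which is why it is the standard choice for comparing embeddings.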
Eigenvalues and Eigenvectors
An eigenvector of a matrix $A$ is a nonzero vector $v$ that, when multiplied by $A$, only gets scaled, not rotated:

$$Av = \lambda v$$

The scalar $\lambda$ is the eigenvalue. In plain terms: eigenvectors are the directions along which a linear transformation acts as simple scaling. Every other vector gets both scaled and rotated. Eigenvectors are the special directions where the matrix's action is maximally simple.
Why does this matter for AI? Because of Principal Component Analysis (PCA). Given a dataset of vectors in $\mathbb{R}^d$, the covariance matrix $C$ captures how features vary together. The eigenvectors of $C$ are the principal components, the directions along which the data varies most.
The eigenvalues of $C$ tell you how much variance lies along each principal component. If the first 50 eigenvalues capture 95% of the total variance, you can project your data onto those 50 eigenvectors and discard the rest. You have reduced dimensionality from $d$ to 50 with only 5% information loss.
When you compute eigenvalues, you are asking: what are the natural axes of this data? Where does it stretch, and where does it compress? The eigenvalue spectrum of a dataset's covariance matrix is a fingerprint of its internal structure.
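A small experiment makes the eigenvalue spectrum concrete. The sketch below builds synthetic data that secretly lives near a 2-D plane inside $\mathbb{R}^3$ (the construction is an illustrative assumption, not a real dataset), then reads the structure off the covariance eigenvalues:

```python
import numpy as np

np.random.seed(0)

# 500 points in R^3 that really live near a 2-D plane, plus tiny noise.
latent = np.random.randn(500, 2)
mixing = np.random.randn(2, 3)
X = latent @ mixing + 0.01 * np.random.randn(500, 3)

# Covariance matrix of the centered data.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (len(X) - 1)

# Eigenvalues of a symmetric matrix, sorted descending.
eigvals = np.linalg.eigvalsh(C)[::-1]
explained = eigvals / eigvals.sum()

print(explained)  # first two components carry almost all the variance
```

The third eigenvalue is essentially noise: the spectrum has detected that the data is two-dimensional, exactly the fingerprint of internal structure described above.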
Singular Value Decomposition
The Singular Value Decomposition (SVD) generalizes eigendecomposition to any matrix, not just square ones. Every matrix $A \in \mathbb{R}^{m \times n}$ can be decomposed as:

$$A = U \Sigma V^T$$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices (rotations/reflections), and $\Sigma$ is a diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$. The decomposition says: every linear transformation is a rotation, followed by axis-aligned scaling, followed by another rotation.
Low-Rank Approximation
The power of SVD lies in truncation. If you keep only the top $k$ singular values and their corresponding columns of $U$ and $V$, you get the best rank-$k$ approximation of the original matrix:

$$A_k = U_k \Sigma_k V_k^T$$

This is the Eckart-Young theorem: no other rank-$k$ matrix is closer to $A$ in the Frobenius norm. This result shows up everywhere in AI, from low-rank fine-tuning to model compression and recommender systems.
SVD reveals that most real-world data matrices are approximately low-rank. The information content is concentrated in a small number of singular values. This is not a mathematical trick. It is a statement about the structure of the world: high-dimensional data almost always lives on a much lower-dimensional manifold.
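The low-rank claim is directly checkable. This sketch constructs a matrix that is rank 5 by design (plus small noise), then shows that the truncated SVD recovers it almost perfectly:

```python
import numpy as np

np.random.seed(0)

# A 100x80 matrix that is exactly rank 5, plus small noise.
A = np.random.randn(100, 5) @ np.random.randn(5, 80)
A += 0.01 * np.random.randn(100, 80)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-5 approximation per Eckart-Young: keep top 5 singular values.
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
print(s[:7].round(2))  # singular values collapse after the 5th
print(rel_err)         # tiny: the matrix is effectively rank 5
```

Real data matrices rarely have such a sharp cutoff, but their singular values typically decay fast enough that the same truncation discards very little.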
The Transformer Is Pure Linear Algebra
The transformer architecture, the foundation of GPT, BERT, and every modern language model, is a carefully orchestrated sequence of linear algebra operations. The core mechanism is scaled dot-product attention, and understanding it requires no concept beyond what we have already covered.
Given an input sequence of $n$ token embeddings packed into a matrix $X \in \mathbb{R}^{n \times d}$, the attention mechanism first computes three matrices through linear projections:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ are learned weight matrices. $Q$ is the query (what each token is looking for), $K$ is the key (what each token advertises about itself), and $V$ is the value (the actual information each token carries).
Attention is then computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Let me unpack this operation by operation. $QK^T$ computes every pairwise dot product between queries and keys, producing an $n \times n$ similarity matrix. Dividing by $\sqrt{d_k}$ keeps those scores from growing with dimension, which would saturate the softmax. The softmax turns each row of scores into a probability distribution over tokens. Multiplying by $V$ then takes, for each token, a weighted average of all the value vectors. Attention is dot-product similarity followed by a weighted sum.
Multi-Head Attention
In practice, attention is computed in parallel across $h$ heads, each with its own projection matrices $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \in \mathbb{R}^{d \times d_k}$ where $d_k = d/h$. The outputs are concatenated and projected back:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W_O$$
Each head learns to attend to different types of relationships: syntactic, semantic, positional. The concatenation and final projection combine these perspectives into a single representation. The entire operation is matrix multiplications and a softmax. Nothing more.
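The concatenate-and-project step can be sketched directly. This is a minimal single-sequence version, assuming 2 heads over an 8-dimensional model (all sizes illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_O):
    # heads: list of (W_Q, W_K, W_V) tuples, one triple per head.
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        outputs.append(weights @ V)
    # Concatenate head outputs and project back to d_model.
    return np.concatenate(outputs, axis=-1) @ W_O

np.random.seed(0)
n, d_model, n_heads = 6, 8, 2
d_k = d_model // n_heads
heads = [tuple(np.random.randn(d_model, d_k) for _ in range(3))
         for _ in range(n_heads)]
W_O = np.random.randn(n_heads * d_k, d_model)

out = multi_head_attention(np.random.randn(n, d_model), heads, W_O)
print(out.shape)  # (6, 8)
```

Each head runs the same attention computation with its own small projections; the final matrix $W_O$ mixes the heads' outputs back into a single $d_{\text{model}}$-wide representation.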
Attention is not a mysterious neural mechanism. It is a lookup table built from dot products. The queries ask questions, the keys provide addresses, and the values supply answers. The softmax just normalizes the addressing. Every breakthrough in large language models is, at its core, an improvement to this dot-product addressing scheme.
Why GPUs Exist for This
Matrix multiplication is embarrassingly parallel. To compute $C = AB$ where $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, each element $C_{ij}$ is an independent dot product:

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

All elements of $C$ can be computed simultaneously. A CPU with 16 cores can compute 16 dot products at once. An NVIDIA H100 GPU has 16,896 CUDA cores and can execute thousands of dot products per clock cycle through SIMD (Single Instruction, Multiple Data) parallelism.
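The independence of the elements is easy to demonstrate: computing each one separately, in any order, reproduces the library matmul exactly. A sketch:

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(3, 4)
B = np.random.randn(4, 5)

# Each element C[i, j] is an independent dot product of a row of A
# with a column of B; nothing stops all 15 from running in parallel.
C = np.empty((3, 5))
for i in range(3):
    for j in range(5):
        C[i, j] = A[i, :] @ B[:, j]

print(np.allclose(C, A @ B))  # True
```

The loop order here is arbitrary, which is the whole point: a GPU assigns these independent dot products to thousands of cores at once instead of iterating.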
This is not a coincidence. GPUs were originally designed for rendering 3D graphics, which requires transforming millions of vertices through transformation matrices every frame. The same hardware that rotates polygons in video games multiplies weight matrices in neural networks. When NVIDIA pivoted to AI compute, they were not building new capability. They were repurposing existing capability for a structurally identical workload.
AI did not become practical when algorithms improved. It became practical when hardware caught up to the math. The core algorithms of backpropagation and gradient descent existed for decades. What changed was the availability of massively parallel hardware purpose-built for the only operation that matters: matrix multiplication.
Python Verification
Theory is convincing. Code is conclusive. Let us verify the core linear algebra operations that run inside a neural network using NumPy. First, a minimal forward pass through a two-layer network:
```python
import numpy as np

# Seed for reproducibility
np.random.seed(42)

# Two-layer feedforward network
d_in, d_hidden, d_out = 4, 8, 3
W1 = np.random.randn(d_hidden, d_in) * 0.5
b1 = np.zeros(d_hidden)
W2 = np.random.randn(d_out, d_hidden) * 0.5
b2 = np.zeros(d_out)

def relu(x):
    return np.maximum(0, x)

def forward(x):
    h = relu(W1 @ x + b1)  # Linear transform + activation
    y = W2 @ h + b2        # Output projection
    return y

x = np.array([1.0, 0.5, -0.3, 0.8])
output = forward(x)
print(f"Input: {x}")
print(f"Output: {output}")
print(f"Shape: {x.shape} → {output.shape}")
# The entire forward pass is two matrix multiplications
```

Every line of the forward pass is a matrix-vector product. The @ operator in NumPy is matrix multiplication. The ReLU is applied elementwise between the two linear transformations. That is the entire computation.
Now, the attention mechanism. This is a complete, working implementation of scaled dot-product attention in 15 lines:
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Scaled dot-product attention from scratch."""
    Q = X @ W_Q  # Queries: (n, d_k)
    K = X @ W_K  # Keys:    (n, d_k)
    V = X @ W_V  # Values:  (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # Similarity matrix: (n, n)
    weights = softmax(scores)        # Attention weights: (n, n)
    return weights @ V               # Weighted values:   (n, d_v)

# Simulate a 6-token sequence with d=8
n_tokens, d_model, d_k = 6, 8, 4
X = np.random.randn(n_tokens, d_model)
W_Q = np.random.randn(d_model, d_k) * 0.5
W_K = np.random.randn(d_model, d_k) * 0.5
W_V = np.random.randn(d_model, d_k) * 0.5

output = attention(X, W_Q, W_K, W_V)
print(f"Input shape: {X.shape}")        # (6, 8)
print(f"Output shape: {output.shape}")  # (6, 4)
print(f"Attention out:\n{output.round(3)}")
# 3 matrix multiplications + softmax = attention
```

Count the linear algebra operations: three matrix multiplications to compute $Q$, $K$, and $V$; one matrix multiplication for the scores $QK^T$; one matrix multiplication for the output. Five matrix multiplications and a softmax. That is the attention mechanism. There is nothing else hiding underneath.
Both examples are runnable as-is. Copy them into a Python file or notebook. The forward pass demonstrates that neural network inference is chained matrix multiplications. The attention function demonstrates that the transformer's core innovation is a dot-product similarity lookup. No frameworks, no abstractions, just NumPy doing linear algebra.
Conclusion
Every concept in this post, vectors, matrix multiplication, dot products, eigenvalues, SVD, attention, maps directly to operations running inside AI models right now. These are not background theory. They are the actual computation.
When a language model generates a sentence, it is multiplying matrices. When an image model recognizes a face, it is computing dot products. When a recommendation engine suggests a song, it is measuring cosine similarity in a vector space. When a fine-tuned model adapts to a new task through LoRA, it is exploiting low-rank matrix structure. The math is not an abstraction over the computation. The math is the computation.
This is why I find linear algebra more illuminating than any framework tutorial. PyTorch and TensorFlow are interfaces. The operations beneath them are the same ones Euler, Gauss, and Sylvester studied centuries ago. Understanding those operations, not just calling functions that execute them, is what separates someone who uses AI from someone who understands it.
If you want to go deeper, the path is clear: implement things from scratch. Write a forward pass by hand. Build attention with nothing but NumPy. Compute an SVD and reconstruct an image from truncated singular values. The linear algebra will stop being abstract the moment you see it produce real outputs. And once it does, every AI paper you read will make more sense, because you will recognize the operations underneath the notation.
Linear algebra is not a prerequisite for AI. It is not a tool used by AI. It is the substrate of AI. Vectors are the representation. Matrices are the transformation. Dot products are the similarity measure. Eigenvalues are the structure detector. Every advance in artificial intelligence is, at bottom, an advance in how we multiply matrices. Understand the math, and you understand the machine.