Designing a Programming Language for Machine Learning From Scratch

The design decisions, type system, and runtime architecture behind a purpose-built ML language.

Introduction

Every machine learning engineer has had the same experience: you write a model in Python, the shapes do not match, and you do not find out until 45 minutes into a training run. The error message says RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x512 and 256x1024). You stare at it. You trace backward through six function calls. You fix it, run again, and hit a different shape mismatch three layers deeper.

This is the problem that motivated Synthr. Not Python's syntax. Not its speed. Its fundamental inability to reason about tensor shapes at compile time. Synthr is a programming language designed from the ground up for machine learning, with a type system that catches shape errors before any code runs, automatic differentiation as a first-class primitive, and a compiler that generates GPU-native code.

Why Existing Languages Fall Short

The critique of Python for ML is well-worn, but most discussions focus on the wrong problems. Speed is solvable with compilation. The GIL is solvable with multiprocessing. The real problems are structural.

1. No static shape reasoning. In Python, a tensor is an opaque blob at the type level. torch.Tensor tells you nothing about its rank, dimensions, or dtype. Shape mismatches are caught at runtime, which means they are caught late and expensively. NumPy and PyTorch have made heroic efforts with runtime checks, but the information simply is not available at the type level (see the snippet after this list).

2. Automatic differentiation is bolted on. Autograd in PyTorch is a runtime tape. It records operations as they execute and replays them backward. This works, but it means the compiler cannot see the full computation graph ahead of time. Optimizations like operator fusion, memory planning, and gradient checkpointing scheduling must be done by the framework, not the compiler.

3. GPU execution is a foreign function call. Python dispatches to CUDA kernels through C++ bindings. The language has no concept of a GPU, a kernel, or a memory space. This impedance mismatch means that every performance-critical optimization requires dropping out of Python into a different language entirely.
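
A few lines of PyTorch make the first problem concrete. Nothing in the static type of x or w records their dimensions, so the mismatch from the introduction only surfaces when the matmul actually executes:

import torch

x = torch.randn(128, 512)   # batch of 128 activations, 512 features
w = torch.randn(256, 1024)  # weight matrix whose input dim does not match

try:
    y = x @ w  # only now does the error appear
except RuntimeError as e:
    print(e)  # mat1 and mat2 shapes cannot be multiplied (128x512 and 256x1024)
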
The core thesis

A language designed for ML should make tensors, automatic differentiation, and GPU execution first-class citizens of the type system and compiler, not library features grafted onto a general-purpose language.

Language Design Goals

Synthr was designed around five non-negotiable goals. Every syntax decision, type system feature, and compiler optimization traces back to one of these.

1. First-class tensors with shape types. Tensor shapes are part of the type. A Tensor[Float32, 128, 784] is a different type from a Tensor[Float32, 128, 512]. Shape mismatches are compile-time errors.

2. Built-in automatic differentiation. The grad keyword is a language primitive, not a library function. The compiler sees the full computation graph and can optimize the backward pass at compile time.

3. GPU-native execution. Synthr code compiles to GPU kernels directly. There is no host language dispatching to device code. The compiler decides what runs on the GPU and what runs on the CPU based on the computation graph.

4. Python-like readability. ML researchers should be able to read Synthr code without a learning curve. Indentation-sensitive syntax, familiar function definitions, and minimal ceremony.

5. Rust-like safety. Ownership semantics for GPU memory, no null references, exhaustive pattern matching. The language should prevent classes of bugs that waste GPU hours.

The Type System

The type system is the heart of Synthr, and the part that required the most design iteration. The key innovation is dependent types for tensor shapes.

In a standard type system, types are fixed at compile time: int, float, String. In a dependent type system, types can depend on values. This lets you express constraints like “this tensor has the same batch dimension as that tensor” directly in the type signature.

// Synthr: A linear layer with shape-checked types
fn linear[B: Nat, In: Nat, Out: Nat](
    x: Tensor[Float32, B, In],
    w: Tensor[Float32, In, Out],
    b: Tensor[Float32, Out]
) -> Tensor[Float32, B, Out]:
    return x @ w + b

The type parameters B, In, and Out are natural number type variables. When you call linear, the compiler unifies these variables with the actual tensor shapes and verifies that all constraints are satisfied. If you pass a weight matrix with the wrong dimensions, the error appears at compile time with a clear message about which dimension does not match.
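
The unification step is simple enough to sketch. Here is a toy Python model of the idea, illustrative only and not Synthr's actual checker: declared shapes mix type variables and concrete dimensions, and checking a call means binding each variable consistently across all arguments.

def unify_shapes(declared, actual, bindings):
    """Unify a declared shape like ("B", "In") against a concrete shape
    like (128, 784), extending `bindings` or raising on conflict."""
    if len(declared) != len(actual):
        raise TypeError(f"rank mismatch: {declared} vs {actual}")
    for dim, size in zip(declared, actual):
        if isinstance(dim, str):  # a type variable: bind it or check the binding
            if bindings.setdefault(dim, size) != size:
                raise TypeError(f"{dim} already bound to {bindings[dim]}, got {size}")
        elif dim != size:         # a concrete dimension: must match exactly
            raise TypeError(f"expected {dim}, got {size}")
    return bindings

# Checking a call to linear(x: Tensor[B, In], w: Tensor[In, Out]):
bindings = {}
unify_shapes(("B", "In"), (128, 784), bindings)       # binds B=128, In=784
try:
    unify_shapes(("In", "Out"), (512, 10), bindings)  # conflict on In
except TypeError as e:
    print(e)  # In already bound to 784, got 512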

This extends to more complex patterns. Here is a multi-head attention implementation:

// Synthr: Multi-head attention with compile-time shape checking
fn multi_head_attention[B: Nat, Seq: Nat, D: Nat, H: Nat](
    q: Tensor[Float32, B, Seq, D],
    k: Tensor[Float32, B, Seq, D],
    v: Tensor[Float32, B, Seq, D],
    num_heads: Static[H],
) -> Tensor[Float32, B, Seq, D]
where D % H == 0:
    let head_dim = D / H
    let q_heads = reshape[B, Seq, H, head_dim](q)
    let k_heads = reshape[B, Seq, H, head_dim](k)
    let v_heads = reshape[B, Seq, H, head_dim](v)

    let scores = einsum("bshd,bthd->bsht", q_heads, k_heads)
    let weights = softmax(scores / sqrt(head_dim), axis=-1)
    let output = einsum("bsht,bthd->bshd", weights, v_heads)

    return reshape[B, Seq, D](output)

The where D % H == 0 clause is a type-level constraint. The compiler proves at compile time that the model dimension is evenly divisible by the number of heads. If it is not, the program does not compile. In PyTorch, this would be a runtime assertion that might not trigger until you change a hyperparameter three months later.
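
For contrast, the typical PyTorch rendering of that constraint is a runtime assertion, checked only when the module happens to be constructed with those hyperparameters:

import torch

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        # The runtime counterpart of Synthr's `where D % H == 0` clause.
        assert d_model % num_heads == 0, (
            f"d_model ({d_model}) must be divisible by num_heads ({num_heads})")
        self.head_dim = d_model // num_heads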

Why this matters in practice

A single shape mismatch caught at compile time typically saves 20 to 60 minutes of debugging. Over a research project with hundreds of experiments, this compounds into days of saved effort.

Automatic Differentiation as a Primitive

In Synthr, grad is a keyword, not a function call. The compiler performs source-to-source transformation to generate the backward pass at compile time.

// Synthr: Automatic differentiation
fn mse_loss[B: Nat, D: Nat](pred: Tensor[Float32, B, D], target: Tensor[Float32, B, D]) -> Scalar[Float32]:
    return mean((pred - target) ** 2)

fn train_step[B: Nat, In: Nat, Out: Nat](model: &mut Linear, x: Tensor[Float32, B, In], y: Tensor[Float32, B, Out]):
    let loss, grads = grad(mse_loss, wrt=model.params)(model(x), y)
    model.params -= 0.01 * grads

The grad keyword tells the compiler to differentiate mse_loss with respect to model.params. Because the compiler sees the full computation graph statically, it can perform optimizations that a runtime tape cannot.
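
To make that concrete, here is a toy Python model of compile-time reverse-mode AD for the mse_loss example above. This is an illustration of the idea, not Synthr's actual transform: the program is a static graph, and the backward pass is generated as data before anything runs.

# Topologically sorted graph for loss = mean((pred - target) ** 2)
graph = [
    ("diff", "sub",    ["pred", "target"]),
    ("sq",   "square", ["diff"]),
    ("loss", "mean",   ["sq"]),
]

def emit_backward(graph, seed="loss"):
    grads, steps = {seed: "1.0"}, []
    for name, op, inputs in reversed(graph):
        g = grads.get(name)
        if g is None:
            continue  # node does not influence the loss
        if op == "sub":       # d(a-b)/da = 1, d(a-b)/db = -1
            rules = [(inputs[0], g), (inputs[1], f"-({g})")]
        elif op == "square":  # d(a**2)/da = 2a
            rules = [(inputs[0], f"2*{inputs[0]}*{g}")]
        elif op == "mean":    # d(mean(a))/da = 1/N per element
            rules = [(inputs[0], f"({g})/N")]
        for var, expr in rules:
            steps.append(f"grad_{var} += {expr}")
            grads[var] = f"grad_{var}"
    return steps

print("\n".join(emit_backward(graph)))
# grad_sq += (1.0)/N
# grad_diff += 2*diff*grad_sq
# grad_pred += grad_diff
# grad_target += -(grad_diff)

With both the forward and backward graphs available statically, three optimizations fall out: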

1. Operator fusion. Adjacent operations that would be separate kernel launches in PyTorch can be fused into a single kernel. The compiler identifies fusible subgraphs and generates combined kernels automatically.

2. Memory planning. The compiler knows the lifetime of every intermediate tensor in both the forward and backward passes. It allocates a single memory pool and reuses buffers, reducing peak memory usage by 20-40% compared to eager-mode autograd. A sketch of the idea appears after this list.

3. Gradient checkpointing scheduling. Instead of requiring the user to manually wrap layers in checkpoint calls, the compiler analyzes the memory-compute tradeoff for each layer and inserts checkpointing automatically based on a target memory budget.
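
Here is the promised memory-planning sketch: a minimal Python model, illustrative only. With a static first use and last use for every intermediate tensor, buffer assignment reduces to a linear scan that reuses any buffer whose previous owner is dead.

def plan_buffers(tensors):
    """tensors: list of (name, first_use, last_use), sorted by first_use.
    Returns name -> buffer id. Sizes are omitted to keep the sketch short."""
    assignment, buffer_free_at = {}, []  # buffer_free_at[i] = owner's last use
    for name, first, last in tensors:
        for i, free_at in enumerate(buffer_free_at):
            if free_at < first:          # previous owner is dead: reuse
                assignment[name], buffer_free_at[i] = i, last
                break
        else:                            # nothing reusable: allocate fresh
            assignment[name] = len(buffer_free_at)
            buffer_free_at.append(last)
    return assignment

# Intermediates of a tiny forward + backward pass: (name, first_use, last_use)
plan = plan_buffers([("h1", 0, 2), ("h2", 1, 3), ("g2", 3, 4), ("g1", 4, 5)])
print(plan)  # {'h1': 0, 'h2': 1, 'g2': 0, 'g1': 1} -- two buffers, not four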

Compiler Architecture

Synthr compiles through four stages, from source to executable GPU code.

1. Parsing and type checking. A recursive descent parser produces the untyped AST; the type checker then performs bidirectional type inference with shape unification to produce a typed AST. Type errors are reported with source-mapped diagnostics that show the exact shape mismatch.

2. High-level IR (Synthr IR). The typed AST is lowered to a graph-based intermediate representation where nodes are tensor operations. This IR is where the autodiff transformation, operator fusion, and memory planning happen.

3. Low-level IR (LLVM IR). The optimized Synthr IR is lowered to LLVM IR for CPU code and PTX (via NVPTX) for GPU kernels. LLVM handles register allocation, instruction scheduling, and target-specific optimizations.

4. JIT compilation. For interactive use (notebooks, REPL), Synthr supports JIT compilation. Functions are compiled on first call and cached. The JIT runs the same type checking and optimization pipeline as ahead-of-time compilation, so you never sacrifice safety for interactivity.

// Compilation pipeline (pseudocode)
source.sn -> [Parser] -> Untyped AST
          -> [Type Checker + Shape Inference] -> Typed AST
          -> [Lowering] -> Synthr IR (graph-based)
          -> [Autodiff Transform] -> Synthr IR with backward graph
          -> [Fusion + Memory Planning] -> Optimized Synthr IR
          -> [LLVM Codegen] -> LLVM IR / PTX
          -> [LLVM Backend] -> Native binary + GPU kernels

The LLVM backend was the pragmatic choice. Writing a production-quality code generator from scratch would have taken years. LLVM gives Synthr access to decades of optimization passes for free. The tradeoff is compile time. A full model compilation takes 3-8 seconds, which is acceptable for training scripts but noticeable in a REPL. The JIT mitigates this by caching compiled functions across sessions.
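
The signature-keyed caching that makes the JIT workable can be modeled in a few lines of Python. Everything here is hypothetical and illustrative, including compile_to_kernel, which stands in for the real pipeline:

import numpy as np

_kernel_cache = {}

def compile_to_kernel(fn, signature):
    # Stand-in for the real compilation pipeline (hypothetical): return the
    # plain Python function so this sketch runs end to end.
    print("compiling for", signature)
    return fn

def jit_call(fn, *arrays):
    # Dtypes and shapes form the cache key: an unseen signature compiles,
    # a previously seen one reuses the cached kernel.
    key = (fn.__name__, tuple((a.dtype.str, a.shape) for a in arrays))
    if key not in _kernel_cache:
        _kernel_cache[key] = compile_to_kernel(fn, key)
    return _kernel_cache[key](*arrays)

def relu(x):
    return np.maximum(x, 0)

jit_call(relu, np.zeros((128, 512)))  # first call: compiles
jit_call(relu, np.zeros((128, 512)))  # same signature: cache hit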

Comparison to Existing Approaches

Synthr is not the first attempt to improve ML tooling. It is worth positioning it against the alternatives.

Tool    | Shape checking                 | Autodiff           | GPU native        | Tradeoff
--------|--------------------------------|--------------------|-------------------|----------------------------------------------------------
PyTorch | Runtime only                   | Runtime tape       | Via C++ dispatch  | Maximum flexibility, minimum static guarantees
JAX     | Runtime only                   | Tracing-based      | Via XLA           | Functional purity enables optimization, but no shape types
Julia   | Parametric types (partial)     | Library (Zygote)   | Via CUDA.jl       | Multiple dispatch is powerful but not shape-specialized
Mojo    | Planned                        | Via MAX            | Native            | Systems-level performance, but ML semantics are secondary
Synthr  | Compile-time (dependent types) | Compiler primitive | Native (LLVM/PTX) | Full static guarantees, but smaller ecosystem

The honest weakness of Synthr is ecosystem maturity. PyTorch has thousands of pretrained models, community libraries, and production tooling. Synthr has none of that. The bet is that the safety and performance guarantees will justify the ecosystem cost for teams where GPU hours are expensive and debugging shape errors is a recurring time sink.

What I Learned About Language Design

Designing a programming language taught me things that building applications never could.

1. Error messages are a user interface. The most important feature of a compiler is not the code it generates. It is the error messages it produces when something goes wrong. I spent more time on Synthr's diagnostic system than on the code generator. A shape mismatch error that says expected Tensor[Float32, 128, 512] but got Tensor[Float32, 128, 256] in argument 2 of linear() saves ten minutes of debugging. A generic type error saves nothing.

2. Syntax is politics. Every syntax decision is a tradeoff between familiarity and expressiveness. I chose Python-like indentation over braces because the target audience (ML researchers) reads Python fluently. I chose Rust-like ownership for GPU memory because explicit resource management prevents an entire class of memory leaks. Every choice alienated someone. That is unavoidable.

3. The type system is the language. In Synthr, the type system does most of the interesting work. Shape inference, autodiff eligibility checking, and memory lifetime analysis all live in the type checker. The runtime is almost boring by comparison. This inverts the typical language design experience, where the runtime is complex and the type system is simple.

The deepest lesson

A programming language is an argument about how people should think about a problem domain. Synthr argues that tensor shapes are types, differentiation is compilation, and GPU execution is not special. Whether that argument is right is something only adoption can answer.

Conclusion

Synthr is not finished. The compiler works. The type system catches real bugs. The generated code runs on GPUs. But the ecosystem is nascent, the standard library is thin, and the language needs years of use before the rough edges are smoothed.

What I am most confident about is the core design bet: that a language which treats tensor shapes as types and automatic differentiation as a compiler feature will produce better ML code than one that treats both as runtime concerns. The evidence so far, in compile-time bug catches, in generated code performance, and in the clarity of Synthr programs, supports that bet.

Building a programming language is the most humbling project I have undertaken. Every design decision has consequences that ripple through the entire system. Every shortcut in the compiler becomes a limitation that users hit months later. It has made me a better engineer in every other domain, because it forced me to think about systems at every level of abstraction simultaneously.

Get involved

Synthr is an ongoing research project. If you are interested in programming language design, type systems, or ML compilers, I would love to hear from you.
