
Training a 12-Billion Parameter Model: Architecture Decisions and Tradeoffs

The architectural choices, training infrastructure, and hard-won lessons from training a large language model.

Introduction

When I started building Clover, the question everyone asked was the same: why train your own model? Fair question. The open-source ecosystem is rich. But the honest answer is that I wanted to understand every layer of the stack, from tokenizer to RLHF, and the only way to do that is to actually train the thing.

Clover is a 12-billion parameter autoregressive language model. Not 7B, not 70B. The choice of scale was deliberate, and this post walks through every major decision, the infrastructure that made it possible, and the failures that taught me more than the successes.

[Image: Training a 12-billion parameter model requires serious GPU infrastructure — every architectural decision is shaped by the hardware constraints.]

Why 12 Billion Parameters

The parameter count was not arbitrary. 7B models are well-explored territory. They train fast, fit on a single high-end GPU for inference, and the community has established strong baselines. But in practice, 7B models hit a ceiling on multi-step reasoning and code generation that I could not accept for my target use cases.

70B and above require distributed inference, which constrains deployment. The economics stop making sense for a solo researcher without a dedicated cluster budget.

12B occupies a sweet spot. It fits on a single A100 80GB for inference at half precision. It shows measurable gains over 7B on the benchmarks that matter for my applications: reasoning chains, tool use, and structured output. And it trains on a modest multi-GPU setup without requiring full-scale model parallelism.
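The arithmetic behind the single-GPU claim is short. A quick sketch, in decimal gigabytes and ignoring runtime overhead:

python
# Back-of-envelope inference memory for a 12B model (sketch, not a measurement)
params = 12e9
bytes_per_param = 2                       # half precision (BF16/FP16)

weight_mem_gb = params * bytes_per_param / 1e9
print(f"weights at half precision: {weight_mem_gb:.0f} GB")   # ~24 GB

# Everything left on an 80GB A100 is headroom for the KV cache and activations
print(f"headroom on an A100 80GB: {80 - weight_mem_gb:.0f} GB")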

The tradeoff in plain terms

At 12B, you get roughly 80% of the reasoning capability of a 30B model while remaining deployable on a single GPU. That margin is where Clover lives.

Architecture Decisions

Clover is a decoder-only transformer, following the lineage of GPT and LLaMA. The encoder-decoder pattern (T5-style) has its merits for sequence-to-sequence tasks, but for general-purpose generation, decoder-only has become the dominant paradigm for good reason: it simplifies the training pipeline, scales more predictably, and the community tooling is better.

Within that frame, four architectural choices defined the model.

1. Rotary Position Embeddings (RoPE). Absolute position embeddings degrade at context lengths the model has not seen during training. RoPE encodes relative position directly into the attention computation, which gives Clover a natural ability to extrapolate beyond its training context. In practice, this means an 8K training context can generalize to 16K at inference with minimal quality loss.
2. Grouped Query Attention (GQA). Standard multi-head attention allocates separate key and value heads for each query head. GQA shares key-value heads across groups. Clover uses 32 query heads and 8 key-value heads. This cuts the KV-cache memory by 4x during inference, which directly translates to higher batch sizes and lower latency in production.
3. SwiGLU activation. The feed-forward blocks use SwiGLU instead of standard ReLU or GELU. SwiGLU is a gated activation that multiplies the output of a Swish activation with a linear gate. It requires a third weight matrix in the FFN, but the empirical gains in loss-per-FLOP are consistent across scales. Clover's FFN hidden dimension is 14,336 with a model dimension of 5,120.
4. RMSNorm over LayerNorm. Pre-normalization with RMSNorm removes the mean-centering step from LayerNorm. The computational savings are small per layer but compound across 40 transformer blocks. More importantly, RMSNorm produces more stable gradients during the early phase of training. (A minimal sketch of the RMSNorm and SwiGLU blocks follows the config below.)
python
# Clover architecture config

model_config = {
    "hidden_size": 5120,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,        # GQA: 4 query heads per KV head
    "num_hidden_layers": 40,
    "max_position_embeddings": 8192,
    "rope_theta": 500000.0,
    "rms_norm_eps": 1e-5,
    "vocab_size": 64000,
    "activation": "silu",            # SwiGLU uses SiLU internally
    "tie_word_embeddings": False,
}
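To make choices 3 and 4 concrete, here is a minimal PyTorch sketch of the RMSNorm and SwiGLU blocks at Clover's dimensions. The class names are mine, not the actual training code:

python
# Minimal sketch of the normalization and FFN blocks (illustrative names)
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """LayerNorm without mean-centering: rescale by the RMS of the activations."""
    def __init__(self, dim=5120, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLUFFN(nn.Module):
    """Gated FFN: SiLU(x W_gate) * (x W_up), projected back down.
    Three weight matrices instead of the usual two."""
    def __init__(self, dim=5120, hidden=14336):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))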

Training Infrastructure

Clover was trained on a cluster of 8x A100 80GB GPUs connected via NVLink. Not a massive setup by industry standards, but enough to make 12B work with the right parallelism strategy.

The training stack was built on PyTorch FSDP (Fully Sharded Data Parallelism). FSDP shards the model parameters, gradients, and optimizer states across all GPUs, only gathering them when needed for the forward and backward passes. This brought the per-GPU memory footprint from an impossible 90GB down to roughly 62GB at peak.

1. Mixed precision (BF16). All matrix multiplications run in bfloat16. The key advantage of BF16 over FP16 is the wider dynamic range, which eliminates the need for loss scaling in almost all cases. Accumulations happen in FP32 for numerical stability.
2. Gradient checkpointing. With 40 layers and a sequence length of 8192, storing all activations would exceed memory. Gradient checkpointing recomputes activations during the backward pass instead of storing them. The tradeoff is roughly 30% more compute for a 60% reduction in activation memory.
3. Flash Attention 2. Standard attention is quadratic in sequence length. Flash Attention fuses the attention computation into a single IO-aware kernel, reducing memory from O(n^2) to O(n) and providing a 2-3x wallclock speedup on 8K sequences.
yaml
# Training launch config (torchrun)
training:
  gpus: 8
  strategy: fsdp
  mixed_precision: bf16
  gradient_checkpointing: true
  micro_batch_size: 2
  gradient_accumulation_steps: 16
  effective_batch_size: 256  # 2 * 16 * 8 GPUs
  sequence_length: 8192
  flash_attention: true
  max_steps: 150000
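A sketch of how this launch config maps onto PyTorch FSDP. The function name and block_cls are placeholders of mine, and the real training script may differ in detail; torchrun is assumed to have initialized the process group already:

python
# FSDP wrapping matching the config above (sketch; names are illustrative)
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def wrap_with_fsdp(model, block_cls):
    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,   # matmuls run in BF16
        reduce_dtype=torch.float32,   # gradient reduction accumulates in FP32
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
        mixed_precision=bf16,
        auto_wrap_policy=functools.partial(
            transformer_auto_wrap_policy,
            transformer_layer_cls={block_cls},          # shard at per-block granularity
        ),
        device_id=torch.cuda.current_device(),
    )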

The effective batch size was 256 sequences of 8,192 tokens each, roughly 2 million tokens per step. At an average throughput of 1,400 tokens/second/GPU, each step took about 180 seconds. The full training run consumed approximately 1.2 trillion tokens over six weeks.

[Image: The compute cluster behind Clover — eight A100 GPUs connected via NVLink, running continuously for six weeks.]

Data Pipeline

Data quality determines model quality. This is the part that does not get enough attention in most training write-ups, because it is unglamorous and time-consuming. I spent more time on the data pipeline than on the model architecture.

1. Deduplication. Near-duplicate documents inflate the dataset without improving model capability. I used MinHash LSH deduplication with a Jaccard threshold of 0.8, applied at the paragraph level rather than the document level. This caught boilerplate text that exact-match dedup misses. (A sketch of this step follows the list.)
2. Quality filtering. A lightweight classifier trained on a hand-labeled subset of 50,000 documents scored every document on coherence, information density, and formatting. Documents below the threshold were removed. This cut the raw corpus by approximately 40%.
3. Domain mixing. The training mixture was not static. Clover used a form of curriculum learning: the first 60% of training was dominated by web text and books, while the final 40% upweighted code, scientific papers, and instruction-formatted data. Shifting the distribution late in training biases the model toward the capabilities you care about most at inference time.
Hard-won lesson

A 5% increase in data quality consistently outperformed a 20% increase in data quantity. Filtering aggressively and deduplicating thoroughly will do more for your model than adding another terabyte of scraped web text.

The Training Run

The loss curve was smooth for the first 40,000 steps. Then it was not. Around step 42,000, loss spiked sharply and failed to recover for nearly 2,000 steps. This is a loss spike, and it is terrifying the first time you see one.

The cause, in Clover's case, was a batch of pathologically long documents that contained repeated token sequences. The gradient norms exceeded the clipping threshold by an order of magnitude. The fix was twofold: I reduced the gradient clipping threshold from 1.0 to 0.5 and added a data quality gate that rejected any batch with more than 15% repeated n-grams.
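A sketch of both fixes: a cheap repeated-n-gram gate applied to each batch, plus the tighter clipping threshold. The function name is mine and the exact gate in the training loop may differ:

python
# Data quality gate: reject batches dominated by repeated n-grams (sketch)
import torch

def repeated_ngram_fraction(token_ids, n=4):
    """Fraction of n-grams in a token sequence that are repeats."""
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# In the training loop (illustrative):
#   if repeated_ngram_fraction(batch_token_ids) > 0.15:
#       continue  # reject the pathological batch before it hits the optimizer
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)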

Learning rate scheduling followed a cosine decay with warmup. Warmup ran for 2,000 steps to a peak learning rate of 3e-4, then decayed to 3e-5 over the remaining steps. The warmup phase is critical at this scale: the optimizer's moment estimates are noisy in the earliest steps, and aggressive updates before they stabilize are a common trigger for exactly the kind of loss spike described above.

python
# Learning rate schedule: linear warmup, then cosine decay

import math

def get_lr(step, warmup=2000, max_lr=3e-4, min_lr=3e-5, total_steps=150000):
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

Checkpointing happened every 1,000 steps. I kept the last 5 checkpoints and every 10,000-step checkpoint permanently. Disk is cheap. Losing a week of training to a hardware fault is not.

Alignment: DPO Over RLHF

After pretraining, the base model generates coherent text but has no sense of instruction following. Alignment fixes this. The two dominant approaches are RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization).

I chose DPO. The reason is operational simplicity. RLHF requires training a separate reward model, maintaining a PPO training loop with a reference model, and tuning a KL penalty coefficient that is notoriously sensitive. DPO collapses this into a single supervised loss over preference pairs.

python
# DPO loss (simplified)

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: the policy's log-prob shift relative to the frozen reference
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

The preference dataset contained approximately 80,000 pairs. Each pair consisted of a prompt, a preferred response, and a rejected response. I generated candidates using the base model at different temperatures, then ranked them using a combination of human review and GPT-4 as a judge. In practice, the quality of preference data matters far more than the quantity. Poorly constructed pairs teach the model nothing or, worse, teach it the wrong thing.
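For concreteness, a single preference pair might look like this. This is a made-up record; the field names are illustrative, not Clover's actual schema:

python
# One preference pair (hypothetical example record)
preference_pair = {
    "prompt": "Explain why the sky is blue in two sentences.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue "
              "wavelengths scatter far more strongly (Rayleigh scattering). "
              "That scattered blue light reaches your eyes from every direction.",
    "rejected": "The sky is blue because the ocean reflects into the "
                "atmosphere.",   # plausible-sounding but wrong
}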

Key takeaway

DPO achieves 90% of RLHF's alignment quality with 30% of the engineering complexity. For a solo researcher or small team, it is the correct default choice.

Deployment and Inference

The aligned model needed to serve requests at reasonable latency. I deployed Clover using vLLM, which implements PagedAttention for efficient KV-cache management. With GQA and BF16, the model fits on a single A100 80GB and serves requests with a time-to-first-token of under 200ms for typical prompt lengths.
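Serving with vLLM is only a few lines. A minimal sketch, where the model path is a placeholder for wherever the Clover weights live:

python
# Minimal vLLM serving sketch (model path is a placeholder)
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/clover-12b", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(["Summarize the tradeoffs of GQA in two sentences."], params)
print(outputs[0].outputs[0].text)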

Quantization was the final optimization. GPTQ 4-bit quantization reduced the model size from 24GB to 7GB with negligible degradation on my target benchmarks. This made it possible to run Clover on consumer hardware such as an RTX 4090, which opened up local deployment scenarios I had not originally planned for.

Conclusion

Training Clover took six weeks of compute and six months of preparation. The architecture decisions (RoPE, GQA, SwiGLU, RMSNorm) were not novel individually. The value was in understanding how they interact at the 12B scale and in making informed tradeoffs between capability, memory, and deployment constraints.

The hardest parts were not the ones I expected. Data quality consumed more time than model architecture. Loss spikes were more stressful than hyperparameter tuning. And alignment, which I initially treated as an afterthought, turned out to be the difference between a model that generates text and a model that is actually useful.

If you take one thing away

The model architecture is the easy part. The data pipeline, training stability, and alignment process are where large language model projects succeed or fail. Start there.
