Introduction
When I started building Clover, the question everyone asked was the same: why train your own model? Fair question. The open-source ecosystem is rich. But the honest answer is that I wanted to understand every layer of the stack, from tokenizer to RLHF, and the only way to do that is to actually train the thing.
Clover is a 12-billion parameter autoregressive language model. Not 7B, not 70B. The choice of scale was deliberate, and this post walks through every major decision, the infrastructure that made it possible, and the failures that taught me more than the successes.

Why 12 Billion Parameters
The parameter count was not arbitrary. 7B models are well-explored territory. They train fast, fit on a single high-end GPU for inference, and the community has established strong baselines. But in practice, 7B models hit a ceiling on multi-step reasoning and code generation that I could not accept for my target use cases.
70B and above require distributed inference, which constrains deployment. The economics stop making sense for a solo researcher without a dedicated cluster budget.
12B occupies a sweet spot. It fits on a single A100 80GB for inference at half precision. It shows measurable gains over 7B on the benchmarks that matter for my applications: reasoning chains, tool use, and structured output. And it trains on a modest multi-GPU setup without requiring full-scale model parallelism.
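As a rough sanity check on the single-GPU claim, here is the back-of-envelope arithmetic, using the layer and head counts from the architecture config later in this post. These are illustrative numbers, not profiler measurements.

# Back-of-envelope inference memory for a 12B model in BF16
params = 12e9
weight_gb = params * 2 / 1e9                      # 2 bytes per parameter -> ~24 GB

# KV cache per token with GQA: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes
layers, kv_heads, head_dim = 40, 8, 160
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2
kv_gb_at_8k = kv_bytes_per_token * 8192 / 1e9     # ~1.7 GB per full 8K-token sequence

print(f"weights ~{weight_gb:.0f} GB, KV cache at 8K ~{kv_gb_at_8k:.1f} GB")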
At 12B, you get roughly 80% of the reasoning capability of a 30B model while remaining deployable on a single GPU. That margin is where Clover lives.
Architecture Decisions
Clover is a decoder-only transformer, following the lineage of GPT and LLaMA. The encoder-decoder pattern (T5-style) has its merits for sequence-to-sequence tasks, but for general-purpose generation, decoder-only has become the dominant paradigm for good reason: it simplifies the training pipeline, scales more predictably, and the community tooling is better.
Within that frame, four architectural choices defined the model.
Rotary position embeddings (RoPE) handle positions. There is no learned absolute position table; positions are encoded by rotating the query and key vectors, with the base frequency set to 500,000 (rope_theta in the config below) for the 8,192-token context window.
Grouped-query attention (GQA) pairs 32 query heads with 8 key-value heads. Sharing KV heads shrinks the KV cache by 4x during inference, which directly translates to higher batch sizes and lower latency in production.
SwiGLU instead of standard ReLU or GELU. SwiGLU is a gated activation that multiplies the output of a Swish activation with a linear gate. It requires a third weight matrix in the FFN, but the empirical gains on loss-per-FLOP are consistent across scales. Clover's FFN hidden dimension is 14,336 with a model dimension of 5,120.
RMSNorm in place of LayerNorm. It normalizes by the root mean square alone, dropping the mean subtraction, which is cheaper per layer and equally stable in practice.

# Clover architecture config
model_config = {
    "hidden_size": 5120,
    "intermediate_size": 14336,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,        # GQA: 4 query heads per KV head
    "num_hidden_layers": 40,
    "max_position_embeddings": 8192,
    "rope_theta": 500000.0,
    "rms_norm_eps": 1e-5,
    "vocab_size": 64000,
    "activation": "silu",            # SwiGLU uses SiLU internally
    "tie_word_embeddings": False,
}
Training Infrastructure
Clover was trained on a cluster of 8x A100 80GB GPUs connected via NVLink. Not a massive setup by industry standards, but enough to make 12B work with the right parallelism strategy.
The training stack was built on PyTorch FSDP (Fully Sharded Data Parallelism). FSDP shards the model parameters, gradients, and optimizer states across all GPUs, only gathering them when needed for the forward and backward passes. This brought the per-GPU memory footprint from an impossible 90GB down to roughly 62GB at peak.
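The wrapping pattern looks roughly like the following. This is a sketch of the standard PyTorch FSDP setup, not Clover's exact training code, and the transformer block class is whatever your model defines.

# Sketch: wrapping the model in FSDP with BF16 parameters and FP32 gradient reduction
import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def shard_model(model, transformer_block_cls):
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={transformer_block_cls},   # one FSDP unit per block
    )
    return FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,                   # compute in BF16
            reduce_dtype=torch.float32,                   # accumulate gradients in FP32
        ),
        device_id=torch.cuda.current_device(),
    )

Wrapping per transformer block is what lets FSDP gather only one block's parameters at a time instead of the whole model.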
Mixed precision training ran in bfloat16. The key advantage of BF16 over FP16 is the wider dynamic range, which eliminates the need for loss scaling in almost all cases. Accumulations happen in FP32 for numerical stability. Attention uses FlashAttention, which cuts the memory footprint of the attention computation from O(n^2) to O(n) and provides a 2-3x wallclock speedup on 8K sequences.

# Training launch config (torchrun)
training:
  gpus: 8
  strategy: fsdp
  mixed_precision: bf16
  gradient_checkpointing: true
  micro_batch_size: 2
  gradient_accumulation_steps: 16
  effective_batch_size: 256        # 2 * 16 * 8 GPUs
  sequence_length: 8192
  flash_attention: true
  max_steps: 150000

The effective batch size was 256 sequences of 8,192 tokens each, roughly 2 million tokens per step. At an average throughput of 1,400 tokens/second/GPU, each step took about 180 seconds. The full training run consumed approximately 1.2 trillion tokens over six weeks.
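The per-step arithmetic is easy to verify. This is a quick sanity check, not output from the training logs.

# Tokens per optimizer step and approximate step time
micro_batch, grad_accum, gpus, seq_len = 2, 16, 8, 8192
tokens_per_step = micro_batch * grad_accum * gpus * seq_len   # 2,097,152 (~2M)
cluster_throughput = 1400 * gpus                              # tokens/sec across 8 GPUs
step_seconds = tokens_per_step / cluster_throughput           # ~187 s
print(tokens_per_step, round(step_seconds))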

Data Pipeline
Data quality determines model quality. This is the part that does not get enough attention in most training write-ups, because it is unglamorous and time-consuming. I spent more time on the data pipeline than on the model architecture.
A 5% increase in data quality consistently outperformed a 20% increase in data quantity. Filtering aggressively and deduplicating thoroughly will do more for your model than adding another terabyte of scraped web text.
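As a flavor of what "deduplicating thoroughly" means in practice, here is a toy exact-match pass. A real pipeline layers fuzzy deduplication (for example, MinHash) on top of this; none of these function names come from Clover's codebase.

# Toy exact-duplicate removal over normalized document text
import hashlib

def normalize(text):
    # Collapse whitespace and lowercase so trivial variants hash identically
    return " ".join(text.lower().split())

def dedup_exact(docs):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique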
The Training Run
The loss curve was smooth for the first 40,000 steps. Then it was not. Around step 42,000, loss spiked sharply and failed to recover for nearly 2,000 steps. This is a loss spike, and it is terrifying the first time you see one.
The cause, in Clover's case, was a batch of pathologically long documents that contained repeated token sequences. The gradient norms exceeded the clipping threshold by an order of magnitude. The fix was twofold: I reduced the gradient clipping threshold from 1.0 to 0.5 and added a data quality gate that rejected any batch with more than 15% repeated n-grams.
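Sketched in code, the two fixes look something like this. The n-gram size is an assumption, and the clipping call is the standard PyTorch utility rather than anything custom.

# Batch quality gate plus the tighter gradient clipping threshold (illustrative)
import torch

def repeated_ngram_fraction(token_ids, n=8):
    # Fraction of n-grams that repeat an earlier n-gram; expects a plain list of token IDs
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def batch_passes_gate(batch_token_ids, max_repeated=0.15):
    return all(repeated_ngram_fraction(seq) <= max_repeated for seq in batch_token_ids)

def clip_gradients(model, max_norm=0.5):
    # Called after loss.backward(), before optimizer.step()
    return torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)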
Learning rate scheduling followed a cosine decay with warmup. Warmup ran for 2,000 steps to a peak learning rate of 3e-4, then the rate decayed to 3e-5 over the remaining steps. The warmup phase is critical: early in training the optimizer's moment estimates are still noisy, and taking aggressive steps before they stabilize is exactly what invites instability.
# Learning rate schedule
import math

def get_lr(step, warmup=2000, max_lr=3e-4, min_lr=3e-5, total_steps=150000):
    # Linear warmup to the peak, then cosine decay down to the floor
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

Checkpointing happened every 1,000 steps. I kept the last 5 checkpoints and every 10,000-step checkpoint permanently. Disk is cheap. Losing a week of training to a hardware fault is not.
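The retention policy is simple enough to show in full. This is a sketch; the directory layout and the "step_<N>" naming are assumptions.

# Keep the 5 most recent checkpoint directories plus every 10,000-step checkpoint
import os
import re
import shutil

def prune_checkpoints(ckpt_dir, keep_last=5, keep_every=10000):
    def step_of(name):
        m = re.search(r"step_(\d+)$", name)
        return int(m.group(1)) if m else None

    ckpts = sorted([n for n in os.listdir(ckpt_dir) if step_of(n) is not None], key=step_of)
    recent = set(ckpts[-keep_last:])
    for name in ckpts:
        if name not in recent and step_of(name) % keep_every != 0:
            shutil.rmtree(os.path.join(ckpt_dir, name))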
Alignment: DPO Over RLHF
After pretraining, the base model generates coherent text but has no sense of instruction following. Alignment fixes this. The two dominant approaches are RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization).
I chose DPO. The reason is operational simplicity. RLHF requires training a separate reward model, maintaining a PPO training loop with a reference model, and tuning a KL penalty coefficient that is notoriously sensitive. DPO collapses this into a single supervised loss over preference pairs.
# DPO loss (simplified)
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how far the policy has moved from the reference on each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

The preference dataset contained approximately 80,000 pairs. Each pair consisted of a prompt, a preferred response, and a rejected response. I generated candidates using the base model at different temperatures, then ranked them using a combination of human review and GPT-4 as a judge. In practice, the quality of preference data matters far more than the quantity. Poorly constructed pairs teach the model nothing or, worse, teach it the wrong thing.
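For context on where those log-probabilities come from, here is a simplified sketch of scoring a response under a causal LM. It assumes HF-style model outputs, ignores padding, and is not the actual training loop.

# Per-sequence log-probability of the response tokens, given the prompt
import torch

def sequence_logprob(model, input_ids, prompt_len):
    logits = model(input_ids).logits[:, :-1, :]           # position t predicts token t+1
    targets = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum(dim=-1)    # sum over response tokens only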
DPO achieves 90% of RLHF's alignment quality with 30% of the engineering complexity. For a solo researcher or small team, it is the correct default choice.
Deployment and Inference
The aligned model needed to serve requests at reasonable latency. I deployed Clover using vLLM, which implements PagedAttention for efficient KV-cache management. With GQA and BF16, the model fits on a single A100 80GB and serves requests with a time-to-first-token of under 200ms for typical prompt lengths.
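For reference, the offline-inference entry point in vLLM is only a few lines. The model path here is a placeholder and the sampling settings are illustrative.

# Serving Clover with vLLM (offline batch interface shown for brevity)
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/clover-12b", dtype="bfloat16")
sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(["Explain grouped-query attention in two sentences."], sampling)
print(outputs[0].outputs[0].text)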
Quantization was the final optimization. GPTQ 4-bit quantization reduced the model size from 24GB to 7GB with negligible degradation on my target benchmarks. This made it possible to run Clover on consumer hardware with an RTX 4090, which opened up local deployment scenarios I had not originally planned for.
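Quantizing through the Hugging Face Transformers GPTQ integration looks roughly like this. The model path, calibration dataset choice, and output directory are placeholders, not the exact commands used for Clover, and the flow assumes the optimum/auto-gptq dependencies are installed.

# 4-bit GPTQ quantization via the Transformers integration (sketch)
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "path/to/clover-12b"                 # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,            # calibrates and quantizes on load
    device_map="auto",
)
quantized.save_pretrained("clover-12b-gptq-4bit")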
Conclusion
Training Clover took six weeks of compute and six months of preparation. The architecture decisions (RoPE, GQA, SwiGLU, RMSNorm) were not novel individually. The value was in understanding how they interact at the 12B scale and making informed tradeoffs between capability, memory, and deployment constraints.
The hardest parts were not the ones I expected. Data quality consumed more time than model architecture. Loss spikes were more stressful than hyperparameter tuning. And alignment, which I initially treated as an afterthought, turned out to be the difference between a model that generates text and a model that is actually useful.
The model architecture is the easy part. The data pipeline, training stability, and alignment process are where large language model projects succeed or fail. Start there.