The Cost Problem Nobody Talks About
Every demo looks cheap. The playground is essentially free. Then you ship the feature and the invoice arrives. Running LLM inference at production scale is surprisingly expensive, and the cost structure is unlike anything most engineering teams have dealt with before.
The problem is not that individual requests are costly — it is that the cost is multiplicative. A verbose prompt, an uncontrolled output length, a naive retry strategy, and zero caching can easily turn a $500/month feature into a $15,000/month one. When I was running Trovex at scale, cost optimization was not an afterthought. It was a core architectural concern from day one. The techniques in this post came from real production systems, not theoretical exercises.

Understanding the Cost Model
LLM pricing has two components: input tokens and output tokens. Output tokens are typically 4-5x more expensive than input tokens (see the table below). This single fact should reshape how you think about prompt design.
| Model Tier | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Typical Use Case |
|---|---|---|---|
| Frontier (GPT-4o, Claude Opus) | $2.50 - $15.00 | $10.00 - $75.00 | Complex reasoning, code generation, analysis |
| Mid-tier (GPT-4o-mini, Claude Sonnet) | $0.15 - $3.00 | $0.60 - $15.00 | Most production tasks, classification, extraction |
| Small (Gemini Flash, Haiku) | $0.01 - $0.25 | $0.04 - $1.25 | Routing, formatting, simple transformations |
The implication is direct: controlling output length is more impactful than reducing prompt length. A prompt that is 200 tokens longer but produces outputs 500 tokens shorter will be cheaper overall. This is counterintuitive, and it is one of the first things teams get wrong.
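To make the tradeoff concrete, here is a back-of-the-envelope comparison under the two-part pricing model. The per-token prices are illustrative mid-tier rates, not any provider's actual quote:

```python
# Illustrative mid-tier pricing (USD per token); real rates vary by provider.
INPUT_PRICE = 0.60 / 1_000_000
OUTPUT_PRICE = 2.40 / 1_000_000  # output ~4x input

def request_cost(input_tokens, output_tokens):
    """Cost of one request under two-part (input + output) pricing."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Baseline: short prompt, verbose output.
verbose = request_cost(input_tokens=800, output_tokens=900)
# Longer prompt that constrains the answer: +200 input, -500 output.
constrained = request_cost(input_tokens=1000, output_tokens=400)

print(f"verbose:     ${verbose:.6f}")
print(f"constrained: ${constrained:.6f}")  # cheaper despite the longer prompt
```

The longer, more constraining prompt wins because every output token it prevents costs several times what the extra input tokens do.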
Prompt Compression Techniques
Most prompts contain significant redundancy. Compressing them without losing effectiveness is the lowest-effort, highest-impact optimization available.
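As a sketch of what compression looks like in practice (the prompts and the rough 4-characters-per-token heuristic here are illustrative, not from a real system):

```python
verbose_prompt = (
    "You are a helpful assistant. Your task is to carefully read the user's "
    "message and then, after reading it, classify the sentiment of the "
    "message. The sentiment can be positive, negative, or neutral. Please "
    "make sure to think about which of these three categories best applies, "
    "and then respond with the category that best applies to the message."
)

compressed_prompt = (
    "Classify the sentiment of the user's message. "
    "Answer with one word: positive, negative, or neutral."
)

def approx_tokens(text):
    # Rough heuristic: ~4 characters per token for English prose.
    return len(text) // 4

saved = 1 - approx_tokens(compressed_prompt) / approx_tokens(verbose_prompt)
print(f"~{saved:.0%} fewer prompt tokens, same instruction")
```

The compressed version drops the filler ("carefully read", "make sure to think about") and keeps only the task and the output contract, which is what actually steers the model.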
Semantic Caching
The most effective cost optimization is not making the API call at all. Semantic caching stores responses for queries that are semantically similar to previous ones, not just exact matches. This is transformative for applications with repetitive query patterns.
The architecture is straightforward: embed the incoming query, search your cache for vectors above a similarity threshold, and return the cached response if found. The subtlety is in the threshold — too high and you miss valid cache hits, too low and you return stale or incorrect responses.
```python
import hashlib
import time

import numpy as np
from openai import OpenAI

client = OpenAI()

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.cache = {}  # {hash: (embedding, response, timestamp)}

    def _embed(self, text):
        resp = client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return np.array(resp.data[0].embedding)

    def _cosine_sim(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def get(self, query):
        # Linear scan over the cache; fine for a prototype, replace with a
        # vector index at scale.
        query_emb = self._embed(query)
        best_score, best_response = 0.0, None
        for emb, response, ts in self.cache.values():
            score = self._cosine_sim(query_emb, emb)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            return best_response
        return None

    def set(self, query, response):
        emb = self._embed(query)
        key = hashlib.sha256(query.encode()).hexdigest()
        self.cache[key] = (emb, response, time.time())
```

In production, replace the in-memory dictionary with Redis or a vector database like Pinecone or Qdrant. Add a TTL-based invalidation strategy — cached responses for factual queries should expire faster than cached responses for stable tasks like classification or formatting.
Semantic caching with a 0.92 similarity threshold typically achieves 30-60% cache hit rates on production workloads. At scale, this translates directly to a 30-60% reduction in API costs.
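A TTL check is easy to layer on top of a cache like the one above. Everything in this sketch (the task types, the TTL values, the `is_fresh` helper) is an illustrative assumption to tune against your own data:

```python
import time

# Per-task-type TTLs in seconds; the split is an assumption, tune it.
TTLS = {
    "factual": 3600,                  # facts go stale quickly
    "classification": 7 * 24 * 3600,  # stable tasks can live for a week
    "formatting": 7 * 24 * 3600,
}

def is_fresh(timestamp, task_type, now=None):
    """Return True if a cached entry is still within its TTL."""
    now = time.time() if now is None else now
    return (now - timestamp) < TTLS.get(task_type, 3600)

# A factual answer cached two hours ago is stale; a classification is not.
two_hours_ago = time.time() - 2 * 3600
print(is_fresh(two_hours_ago, "factual"))         # False
print(is_fresh(two_hours_ago, "classification"))  # True
```

On a cache hit, check freshness before returning; on a stale hit, evict the entry and fall through to the API call.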
Model Routing
Not every request needs a frontier model. Model routing is the practice of classifying incoming requests by complexity and sending them to the cheapest model capable of handling them correctly.
The implementation pattern is a lightweight classifier — often a small model itself — that evaluates the query and routes it. Simple classification tasks, formatting operations, and data extraction go to a small model. Multi-step reasoning, creative generation, and ambiguous queries go to a large model.
```python
def route_request(query, context):
    # Use a small, fast model to classify complexity.
    routing_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Classify this query as SIMPLE or COMPLEX. "
                       "SIMPLE: factual lookup, formatting, extraction. "
                       "COMPLEX: reasoning, analysis, ambiguous intent.",
        }, {
            "role": "user",
            "content": query,
        }],
        max_tokens=10,
    )
    complexity = routing_response.choices[0].message.content.strip()
    target_model = "gpt-4o-mini" if "SIMPLE" in complexity else "gpt-4o"
    return client.chat.completions.create(
        model=target_model,
        messages=[{"role": "user", "content": query}],
    )
```

When we implemented routing in Trovex, approximately 65% of queries were classified as simple and handled by a smaller model. The quality difference was negligible for those queries, but the cost reduction was substantial — roughly a 4x decrease in average cost per request.

Output Token Control
Uncontrolled output is the single largest source of cost overruns. Models will generate verbose, rambling responses unless explicitly constrained. Three techniques address this.
1. Set `max_tokens` explicitly for every call. If the expected output is a single sentence, set `max_tokens=100`. If it should be a JSON object with three fields, set `max_tokens=200`. Leaving `max_tokens` at the default is leaving money on the table. Measure your actual output lengths and set the limit to the 95th percentile plus a small buffer.
2. Instruct brevity in the prompt. Adding "Be concise. Respond in 2-3 sentences maximum." to a system prompt measurably reduces output length. This sounds obvious, but the effect is significant — typical output length reductions of 40-60% with no quality loss on factual tasks.
3. Constrain the response format. Structured output (a JSON schema or a fixed response template) eliminates preamble, restatement, and rambling, because the model has nowhere to put them.

A production system generating 1M responses per month at an average of 800 output tokens instead of 300 is overspending by roughly 2.5x on the most expensive component of the API call. This adds up to thousands of dollars monthly.
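The overspend figure is straightforward arithmetic. Assuming an illustrative frontier-tier output price of $10 per 1M tokens:

```python
OUTPUT_PRICE_PER_M = 10.00       # illustrative output price, USD per 1M tokens
RESPONSES_PER_MONTH = 1_000_000

def monthly_output_cost(avg_output_tokens):
    """Monthly spend on output tokens alone."""
    return RESPONSES_PER_MONTH * avg_output_tokens / 1_000_000 * OUTPUT_PRICE_PER_M

uncontrolled = monthly_output_cost(800)  # $8,000/month
controlled = monthly_output_cost(300)    # $3,000/month
print(f"overspend: {uncontrolled / controlled:.2f}x")  # ~2.67x
```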
Batching and Fine-Tuning Tradeoffs
Batching is underutilized. Most LLM API providers offer batch endpoints that process requests asynchronously at a 50% discount. If your use case tolerates latency — background processing, overnight analysis, non-real-time classification — batching is free money. Structure your pipeline to accumulate requests and submit them in bulk.
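Batch endpoints typically accept one JSON request object per line. This sketch builds a JSONL payload in the shape the OpenAI Batch API expects; the accumulation step and the ID scheme are illustrative:

```python
import json

def build_batch_lines(queries, model="gpt-4o-mini", max_tokens=200):
    """Serialize accumulated queries as JSONL lines for a batch endpoint."""
    lines = []
    for i, query in enumerate(queries):
        lines.append(json.dumps({
            "custom_id": f"req-{i}",  # used to match results back to requests
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": query}],
                "max_tokens": max_tokens,
            },
        }))
    return lines

lines = build_batch_lines(["Summarize doc A", "Summarize doc B"])
# Each line is an independent JSON object, ready to write to a .jsonl file
# and upload for asynchronous processing at the discounted rate.
print(len(lines))  # 2
```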
Fine-tuning vs. prompting is a cost decision, not just a quality decision. A fine-tuned small model that handles your specific task can replace a prompted frontier model at a fraction of the cost. The break-even point depends on your volume. In practice, if you are spending more than $2,000/month on a single prompt pattern, it is worth evaluating whether a fine-tuned model can handle it. The upfront cost of fine-tuning is fixed; the per-request savings compound indefinitely.
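The break-even point is a one-line calculation. The dollar figures below are placeholders, not real quotes:

```python
def breakeven_requests(finetune_cost, prompted_cost_per_req, finetuned_cost_per_req):
    """Number of requests before a one-time fine-tune pays for itself."""
    savings_per_req = prompted_cost_per_req - finetuned_cost_per_req
    return finetune_cost / savings_per_req

# Assumed: $500 one-time fine-tune, $0.010/req on a prompted frontier model,
# $0.002/req on the fine-tuned small model.
n = breakeven_requests(500.0, 0.010, 0.002)
print(f"break-even after ~{n:,.0f} requests")  # ~62,500
```

Past that volume, every additional request is pure savings, which is why the calculus favors fine-tuning for high-volume, stable task patterns.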
| Strategy | Typical Cost Reduction | Implementation Effort | Best For |
|---|---|---|---|
| Prompt compression | 15-30% | Low (hours) | Every project, start here |
| Semantic caching | 30-60% | Medium (days) | Repetitive query patterns |
| Model routing | 40-70% | Medium (days) | Mixed-complexity workloads |
| Output control | 20-50% | Low (hours) | Verbose output patterns |
| Fine-tuning | 60-90% | High (weeks) | High-volume, stable task patterns |
Putting It Together: A Real Example
Here is what a cost optimization pass looks like in practice. On Trovex, we had a query-answering pipeline that was costing approximately $8,200/month at our traffic level. After applying the techniques in this post systematically, we reduced that to roughly $1,900/month — a 77% reduction — with no measurable decrease in answer quality as measured by our eval suite.
For output control, we set `max_tokens=400` (down from the default) and added structured output for the response schema. Average output dropped from 620 tokens to 280. Savings: ~30% on output costs.

Each technique was validated independently against our eval suite before combining them. This is critical — optimizations that degrade quality are not optimizations, they are bugs.
Conclusion
LLM cost optimization is not optional at production scale. It is a core engineering discipline that sits alongside reliability and latency as a system quality. The techniques are not complex — prompt compression, caching, routing, output control, and strategic fine-tuning — but they require measurement, systematic application, and continuous evaluation.
Start by instrumenting your current costs per endpoint. Measure input tokens, output tokens, and cache hit rates. Then apply the cheapest interventions first: prompt compression and output control cost hours of engineering time and deliver immediate savings. Layer in caching and routing as your volume justifies the infrastructure investment. The goal is not the lowest possible cost — it is the best quality per dollar spent.
Measure before you optimize. Instrument token usage per endpoint, track cost per request, and let the data tell you where the biggest savings are. The highest-impact optimization is almost never where you expect it.
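A minimal starting point for that instrumentation might look like this; the prices, model names, and endpoint labels are illustrative:

```python
from collections import defaultdict

# Illustrative prices, USD per 1M tokens: (input, output).
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

class CostTracker:
    def __init__(self):
        self.totals = defaultdict(float)  # endpoint -> cumulative USD

    def record(self, endpoint, model, input_tokens, output_tokens):
        """Record one request's cost, attributed to its endpoint."""
        in_price, out_price = PRICES[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.totals[endpoint] += cost
        return cost

tracker = CostTracker()
tracker.record("search", "gpt-4o", input_tokens=1200, output_tokens=400)
tracker.record("search", "gpt-4o-mini", input_tokens=300, output_tokens=100)
print(f"search endpoint: ${tracker.totals['search']:.4f}")
```

Token counts come back on every API response (the usage field), so recording them per endpoint is a one-line change at your call sites.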