
LLM Cost Optimization: Token Efficiency, Caching, and Prompt Design

Practical strategies for reducing LLM inference costs without sacrificing output quality.

The Cost Problem Nobody Talks About

Every demo looks cheap. The playground is essentially free. Then you ship the feature and the invoice arrives. Running LLM inference at production scale is surprisingly expensive, and the cost structure is unlike anything most engineering teams have dealt with before.

The problem is not that individual requests are costly — it is that the cost is multiplicative. A verbose prompt, an uncontrolled output length, a naive retry strategy, and zero caching can easily turn a $500/month feature into a $15,000/month one. When I was running Trovex at scale, cost optimization was not an afterthought. It was a core architectural concern from day one. The techniques in this post came from real production systems, not theoretical exercises.

[Image: analytics dashboard displaying cost metrics and usage data. Cost monitoring dashboards are essential for tracking token usage, cache hit rates, and per-endpoint spend in real time.]

Understanding the Cost Model

LLM pricing has two components: input tokens and output tokens. Output tokens are typically 4-5x more expensive than input tokens. This single fact should reshape how you think about prompt design.

Model Tier | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Typical Use Case
Frontier (GPT-4o, Claude Opus) | $2.50 - $15.00 | $10.00 - $75.00 | Complex reasoning, code generation, analysis
Mid-tier (GPT-4o-mini, Claude Sonnet) | $0.15 - $3.00 | $0.60 - $15.00 | Most production tasks, classification, extraction
Small (Gemini Flash, Haiku) | $0.01 - $0.25 | $0.04 - $1.25 | Routing, formatting, simple transformations

The implication is direct: controlling output length is more impactful than reducing prompt length. A prompt that is 200 tokens longer but produces outputs 500 tokens shorter will be cheaper overall. This is counterintuitive, and it is one of the first things teams get wrong.
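
To make that concrete, here is a minimal per-request cost estimator. The per-million prices are illustrative placeholders drawn from the ranges in the table above, not a live price sheet:

python
# Rough per-request cost model. Prices are illustrative assumptions -- plug in
# your provider's actual rates before relying on the numbers.
PRICES_PER_1M = {
    "frontier": {"input": 2.50, "output": 10.00},
    "mid":      {"input": 0.15, "output": 0.60},
}

def request_cost(tier, input_tokens, output_tokens):
    p = PRICES_PER_1M[tier]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A prompt 200 tokens longer that trims 500 output tokens is still cheaper:
verbose = request_cost("frontier", 1_000, 800)  # short prompt, long output
trimmed = request_cost("frontier", 1_200, 300)  # longer prompt, short output
print(f"verbose: ${verbose:.4f}  trimmed: ${trimmed:.4f}")  # verbose costs ~75% more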

Prompt Compression Techniques

Most prompts contain significant redundancy. Compressing them without losing effectiveness is the lowest-effort, highest-impact optimization available.

1. Remove redundant context. If your system prompt restates information that is already in the user message, you are paying for it twice. Audit your prompts line by line (a token-counting sketch follows this list). Every sentence should convey information the model cannot infer from other parts of the prompt.
2. Use references instead of full text. Instead of pasting an entire document into context, extract the relevant sections first. A retrieval step that selects the top 3 relevant paragraphs from a 50-page document can reduce input tokens by 90% while maintaining answer quality.
3. Abbreviate few-shot examples. Few-shot examples are powerful but expensive. Test whether shorter examples work just as well. Often a two-line example is as effective as a ten-line one; the model needs the pattern, not the prose.
4. Strip formatting tokens. Markdown headers, bullet points, and excessive whitespace all consume tokens. For system prompts that are never shown to users, plain text with minimal formatting is cheaper and equally effective.
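
A minimal sketch of that line-by-line audit, assuming the tiktoken tokenizer (the section labels and text are placeholders):

python
# Count tokens per prompt section so you can see where the budget actually goes.
# Assumes the tiktoken package; use your provider's tokenizer if it differs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding choice is an assumption

def audit_prompt(sections):
    """sections: {label: text}. Prints token counts per section and the total."""
    total = 0
    for label, text in sections.items():
        n = len(enc.encode(text))
        total += n
        print(f"{label:<20} {n:>6} tokens")
    print(f"{'TOTAL':<20} {total:>6} tokens")

audit_prompt({
    "system_rules": "You are a helpful assistant that answers billing questions...",
    "few_shot_examples": "Q: How do refunds work?\nA: Refunds are issued within 5 days.",
    "retrieved_context": "Paragraph 12: Annual plans are billed upfront each January...",
})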

Semantic Caching

The most effective cost optimization is not making the API call at all. Semantic caching stores responses for queries that are semantically similar to previous ones, not just exact matches. This is transformative for applications with repetitive query patterns.

The architecture is straightforward: embed the incoming query, search your cache for vectors above a similarity threshold, and return the cached response if found. The subtlety is in the threshold — too high and you miss valid cache hits, too low and you return stale or incorrect responses.

python
import hashlib
import time

import numpy as np
from openai import OpenAI

client = OpenAI()

class SemanticCache:
    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.cache = {}  # {hash: (embedding, response, timestamp)}

    def _embed(self, text):
        resp = client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return np.array(resp.data[0].embedding)

    def _cosine_sim(self, a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def get(self, query):
        query_emb = self._embed(query)
        best_score, best_response = 0, None

        for key, (emb, response, ts) in self.cache.items():
            score = self._cosine_sim(query_emb, emb)
            if score > best_score:
                best_score, best_response = score, response

        if best_score >= self.threshold:
            return best_response
        return None

    def set(self, query, response):
        emb = self._embed(query)
        key = hashlib.sha256(query.encode()).hexdigest()
        self.cache[key] = (emb, response, time.time())

In production, replace the in-memory dictionary with Redis or a vector database like Pinecone or Qdrant. Add a TTL-based invalidation strategy — cached responses for factual queries should expire faster than cached responses for stable tasks like classification or formatting.
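
One possible shape for that TTL layer, reusing the SemanticCache above (the task categories and expiry windows here are assumptions, not recommendations):

python
import time

# Illustrative TTLs: factual answers expire quickly, stable tasks live longer.
TTL_BY_TASK = {
    "factual": 60 * 60,           # 1 hour
    "classification": 7 * 86400,  # 7 days
}

def get_with_ttl(cache, query, task_type="factual"):
    """Like cache.get(), but evicts entries older than the task's TTL."""
    query_emb = cache._embed(query)
    now, ttl = time.time(), TTL_BY_TASK[task_type]
    best_score, best_response = 0, None

    for key, (emb, response, ts) in list(cache.cache.items()):
        if now - ts > ttl:
            del cache.cache[key]  # expired: evict instead of serving stale data
            continue
        score = cache._cosine_sim(query_emb, emb)
        if score > best_score:
            best_score, best_response = score, response

    return best_response if best_score >= cache.threshold else None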

Key takeaway

Semantic caching with a 0.92 similarity threshold typically achieves 30-60% cache hit rates on production workloads. At scale, this translates directly to a 30-60% reduction in API costs.

Model Routing

Not every request needs a frontier model. Model routing is the practice of classifying incoming requests by complexity and sending them to the cheapest model capable of handling them correctly.

The implementation pattern is a lightweight classifier — often a small model itself — that evaluates the query and routes it. Simple classification tasks, formatting operations, and data extraction go to a small model. Multi-step reasoning, creative generation, and ambiguous queries go to a large model.

python
def route_request(query, context):
    # Use a small, fast model to classify complexity

    routing_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Classify this query as SIMPLE or COMPLEX. "
                       "SIMPLE: factual lookup, formatting, extraction. "
                       "COMPLEX: reasoning, analysis, ambiguous intent."
        }, {
            "role": "user",
            "content": query
        }],
        max_tokens=10
    )

    complexity = routing_response.choices[0].message.content.strip()

    target_model = "gpt-4o-mini" if "SIMPLE" in complexity else "gpt-4o"

    return client.chat.completions.create(
        model=target_model,
        messages=[{"role": "user", "content": query}]
    )

When we implemented routing in Trovex, approximately 65% of queries were classified as simple and handled by a smaller model. The quality difference was negligible for those queries, but the cost reduction was substantial — roughly a 4x decrease in average cost per request.

[Image: visualization of a neural network architecture with interconnected layers. Model architecture choices directly impact cost; routing queries to the right model size is one of the highest-leverage optimizations available.]

Output Token Control

Uncontrolled output is the single largest source of cost overruns. Models will generate verbose, rambling responses unless explicitly constrained. Three techniques address this.

1. Set max_tokens aggressively. If you know the answer should be a single sentence, set max_tokens=100. If it should be a JSON object with three fields, set max_tokens=200. Leaving max_tokens at the default is leaving money on the table. Measure your actual output lengths and set the limit to the 95th percentile plus a small buffer.
2. Use structured outputs. JSON mode and function calling force the model to produce compact, parseable responses. A structured output with defined fields is almost always shorter than the equivalent free-text response. This saves tokens and eliminates parsing failures simultaneously.
3. Design prompts that request brevity. Adding "Be concise. Respond in 2-3 sentences maximum." to a system prompt measurably reduces output length. This sounds obvious, but the effect is significant: typical output length reductions of 40-60% with no quality loss on factual tasks. A sketch combining all three controls follows this list.
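
A minimal sketch combining all three controls, reusing the OpenAI client from the caching example (the field names and token limit are illustrative):

python
# Tight max_tokens + JSON mode + an explicit brevity instruction in one call.
query = "Summarize the refund policy for annual plans."  # example input

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Respond in JSON with keys 'answer' and 'confidence'. "
                    "Be concise: the answer must be 2-3 sentences maximum."},
        {"role": "user", "content": query},
    ],
    response_format={"type": "json_object"},  # structured, compact output
    max_tokens=200,  # sized to the expected schema, not left at the default
)
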
Why this matters

A production system generating 1M responses per month at an average of 800 output tokens instead of 300 is overspending by more than 2.5x on the most expensive component of the API call. At frontier output prices, that difference adds up to thousands of dollars monthly.

Batching and Fine-Tuning Tradeoffs

Batching is underutilized. Most LLM API providers offer batch endpoints that process requests asynchronously at a 50% discount. If your use case tolerates latency — background processing, overnight analysis, non-real-time classification — batching is free money. Structure your pipeline to accumulate requests and submit them in bulk.
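
A sketch of what that accumulation looks like with OpenAI's Batch API (other providers differ; pending_texts stands in for whatever queue feeds your pipeline):

python
import json

# Write accumulated requests to a JSONL file, then submit them as one batch job
# that is processed asynchronously at the discounted rate.
requests = [
    {"custom_id": f"req-{i}",
     "method": "POST",
     "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini",
              "messages": [{"role": "user", "content": text}],
              "max_tokens": 150}}
    for i, text in enumerate(pending_texts)  # pending_texts: your queued inputs
]

with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")

batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # results arrive within 24 hours, not instantly
)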

Fine-tuning vs. prompting is a cost decision, not just a quality decision. A fine-tuned small model that handles your specific task can replace a prompted frontier model at a fraction of the cost. The break-even point depends on your volume. In practice, if you are spending more than $2,000/month on a single prompt pattern, it is worth evaluating whether a fine-tuned model can handle it. The upfront cost of fine-tuning is fixed; the per-request savings compound indefinitely.
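
A back-of-the-envelope version of that break-even check; every number below is an assumption to be replaced with your own measurements:

python
# Hypothetical costs -- substitute measured values from your own workload.
finetune_upfront = 500.00      # one-off training cost, $
prompted_cost = 0.0105         # prompted frontier model, $ per request
finetuned_cost = 0.0015        # fine-tuned small model, $ per request
requests_per_month = 1_000_000

saving_per_request = prompted_cost - finetuned_cost
monthly_saving = saving_per_request * requests_per_month
breakeven_requests = finetune_upfront / saving_per_request

print(f"monthly saving: ${monthly_saving:,.0f}")               # $9,000 with these numbers
print(f"break-even after {breakeven_requests:,.0f} requests")  # ~55,556 requests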

Strategy | Typical Cost Reduction | Implementation Effort | Best For
Prompt compression | 15-30% | Low (hours) | Every project, start here
Semantic caching | 30-60% | Medium (days) | Repetitive query patterns
Model routing | 40-70% | Medium (days) | Mixed-complexity workloads
Output control | 20-50% | Low (hours) | Verbose output patterns
Fine-tuning | 60-90% | High (weeks) | High-volume, stable task patterns

Putting It Together: A Real Example

Here is what a cost optimization pass looks like in practice. On Trovex, we had a query-answering pipeline that was costing approximately $8,200/month at our traffic level. After applying the techniques in this post systematically, we reduced that to roughly $1,900/month — a 77% reduction — with no measurable decrease in answer quality as measured by our eval suite.

1. Prompt audit: removed 340 redundant tokens from the system prompt, switched from 5 few-shot examples to 2. Savings: ~18%.
2. Semantic caching: deployed with a 0.93 threshold. Cache hit rate stabilized at 42%. Savings: ~42% of remaining volume.
3. Model routing: routed simple factual lookups (about 55% of queries) to a smaller model. Savings: ~35% on routed queries.
4. Output control: set max_tokens=400 (down from default) and added structured output for the response schema. Average output dropped from 620 tokens to 280. Savings: ~30% on output costs.

Each technique was validated independently against our eval suite before combining them. This is critical — optimizations that degrade quality are not optimizations, they are bugs.

Conclusion

LLM cost optimization is not optional at production scale. It is a core engineering discipline that sits alongside reliability and latency as a system quality. The techniques are not complex — prompt compression, caching, routing, output control, and strategic fine-tuning — but they require measurement, systematic application, and continuous evaluation.

Start by instrumenting your current costs per endpoint. Measure input tokens, output tokens, and cache hit rates. Then apply the cheapest interventions first: prompt compression and output control cost hours of engineering time and deliver immediate savings. Layer in caching and routing as your volume justifies the infrastructure investment. The goal is not the lowest possible cost — it is the best quality per dollar spent.
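
A minimal sketch of that instrumentation, assuming the OpenAI SDK's usage fields (adapt the field names for other providers):

python
from collections import defaultdict

# Accumulate token usage per endpoint so you can see where the spend actually is.
usage_by_endpoint = defaultdict(lambda: {"input": 0, "output": 0, "requests": 0})

def track(endpoint, response):
    u = usage_by_endpoint[endpoint]
    u["input"] += response.usage.prompt_tokens
    u["output"] += response.usage.completion_tokens
    u["requests"] += 1

# Usage: call track() after every completion, e.g.
#   resp = client.chat.completions.create(model="gpt-4o-mini", messages=[...])
#   track("search_answers", resp)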

Key takeaway

Measure before you optimize. Instrument token usage per endpoint, track cost per request, and let the data tell you where the biggest savings are. The highest-impact optimization is almost never where you expect it.
