The model router that actually works

Stop paying for LLM calls that an ML model can handle.

Other routers pick between LLMs. TRACER routes the easy 90% to a lightweight ML surrogate trained on your own traces, gated by measured agreement so quality never drops. Open source, or hosted.

  • Auditable: every routing decision explained
  • End-to-end: traces in, endpoint out
  • Parity-gated: never silently degrades
Incoming query: "How do I change my PIN?"
predicted intent → change_pin · accept_score 0.96

Without TRACER (teacher LLM): cost $0.005 · latency ~820 ms
With TRACER (ML surrogate): cost $0.000001 · latency <10 ms
Saved: ~5,000× cheaper · ~80× faster · comparable accuracy (0.96 vs 0.98)
What is TRACER?
TRACER is a routing layer that sits above your LLM stack. It trains a lightweight ML surrogate on your LLM's own production traces, routes the easy 90% of classification calls to the surrogate for free, and defers only the hard 10% back to the LLM, gated by measured agreement so quality never silently drops. Run the open-source SDK locally, or deploy a hosted endpoint end to end with the 6-step setup wizard.

What it does

Three lanes, one router.

TRACER carves your incoming traffic into routable lanes by learned difficulty and gates each lane by measured agreement with your teacher. Easy goes to a free surrogate. Mid-difficulty can use a smaller LLM. The rest defers back to the teacher.

Of 100% incoming traffic:
  • Predictable: 62% → surrogate (local ML) · ~$0 / call
  • Mid-difficulty: 24% → smaller LLM (optional tier)
  • Ambiguous / OOD: 14% → teacher LLM · full cost

Distribution above is illustrative · actual coverage is measured per workload.
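As a sketch of what that lane logic looks like in code (illustrative only: the threshold values and helper names below are assumptions, not the TRACER API, whose decisions live behind router.predict):

# illustrative three-lane policy, not the actual TRACER internals
def route(query, surrogate, small_llm, teacher_llm,
          accept_threshold=0.95, mid_threshold=0.80):
    pred = surrogate(query)                    # local ML, ~$0 per call
    if pred.accept_score >= accept_threshold:  # predictable lane
        return pred.label
    if pred.accept_score >= mid_threshold:     # mid-difficulty lane
        return small_llm(query)                # optional cheaper LLM tier
    return teacher_llm(query)                  # ambiguous / OOD lane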

Cost & latency

The two metrics that matter, both crushed.

Per-call cost and per-call latency are how a routing layer is judged in production. TRACER cuts both by orders of magnitude on the predictable slice, without giving up accuracy, because the parity gate is doing the work.

Cost per call on the predictable slice
  • teacher LLM: $0.005 (~$5 / 1,000 calls)
  • ML surrogate: $0.000001 (~$1 / 1,000,000 calls)
  → ~5,000× cheaper per call on the routed 90%

Latency per call (P50, no network jitter)
  • teacher LLM: ~820 ms (API round-trip + generation)
  • ML surrogate: <10 ms (CPU classifier · co-located)
  → ~80× faster on the routed 90%, often sub-10 ms

Numbers above are demo-workload measurements (Banking77, gpt-5.5 teacher, BGE-M3 + linear surrogate). Your mileage scales with how predictable your traffic actually is.
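A quick sanity check on how those per-call numbers blend at the advertised coverage (illustrative arithmetic, assuming 90% of calls land on the surrogate):

# blended per-call cost under an assumed 90% routed share
teacher_cost, surrogate_cost = 0.005, 0.000001
coverage = 0.90
blended = coverage * surrogate_cost + (1 - coverage) * teacher_cost
# blended ≈ $0.0005 per call, roughly 10× cheaper overall;
# the routed slice itself is teacher_cost / surrogate_cost = 5,000× cheaper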

How it works

Your LLM traces become free training data.

Every classification call your LLM already makes is a labeled input-output pair. TRACER fits a surrogate on those, calibrates a confidence gate, and only routes when the gate is happy.

01

Log traces

Every LLM classification call produces a labeled input-output pair, already in your logs. No manual labeling.

02

Fit a surrogate

tracer.fit() trains ML candidates, picks the best, calibrates a confidence gate against the teacher.

03

Route and save

Easy inputs handled by the surrogate (free). Hard inputs deferred to the teacher (paid). Coverage compounds.
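For concreteness, a minimal sketch of what those traces and the fit step can look like (the JSONL field names are an assumed schema; the fit call mirrors the snippet in the Safety section below):

# traces.jsonl: one LLM classification call per line (assumed schema)
# {"input": "How do I change my PIN?", "label": "change_pin"}
import tracer

result = tracer.fit(
    "traces.jsonl",
    embeddings=X,  # placeholder: precomputed embeddings for the trace inputs
    config=tracer.FitConfig(target_teacher_agreement=0.95),
)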

Hosted Tracer

How does it work?

setup wizard · 6 steps

Like a typical model router, but routing isn't only between LLMs: it's between an ML surrogate trained on your own traces and the LLMs of your choice. Hosted Tracer gives you an HTTPS endpoint plus an audit dashboard. Endpoints are callable directly, or as agentic skills you plug into your agent framework.

Typical model router: picks which LLM to call
  query → frontier LLM or smaller LLM

TRACER: routes the easy 90% out of the LLM stack
  query → ML surrogate, smaller LLM, or frontier LLM
01 · Connect

Point to your traces

Upload JSONL, connect a warehouse, or start from a sample. Pick task and embeddings.

02 · Configure

Pick models & quality

Choose your routing menu (ML + LLMs) and a quality target. Tracer builds the partition in the background.

03 · Deploy

Your endpoint is live

Production HTTPS endpoint, API key, audit dashboard, monitoring. cURL it from anywhere.

your endpoint, after setup
curl -X POST https://api.tracer.deeprecall.io/{project}/classify \
  -H "Authorization: Bearer trc_..." \
  -d '{"input":"How do I change my PIN?"}'

 {"label": "change_pin", "decision": "handled", "accept_score": 0.96, "model": "tracer_surrogate"}

What it routes

Two surfaces. One router.

TRACER works wherever your LLM makes a discrete decision over and over. That's classification calls (intents, moderation, triage), and the tool-selection step inside agentic workflows.

Classification · live

LLM classification calls

Intent classification, content moderation, support triage, document extraction, eval pipelines. The router sends the predictable slice to a local ML surrogate and defers the rest to your teacher LLM.

endpoint: POST /classify · returns label + accept_score + decision
Agentic · live

Agentic tool selection

Inside multi-step agents, tool selection is a classification problem in disguise. TRACER routes the easy tool-call decisions to a lightweight ML classifier and defers ambiguous ones to the LLM. Plug it in as an agentic skill.

skill: tracer.skill(project) · drop into LangGraph, Hermes, your own
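A hedged sketch of what that looks like inside an agent loop (tracer.skill is named above; the call signature, project id, and LLM fallback below are assumptions):

# illustrative tool-selection step; exact skill interface is an assumption
import tracer

select_tool = tracer.skill("your-agent-project")  # agentic skill for this project
step = "look up the user's last three transactions"
decision = select_tool(step)

if decision["decision"] == "handled":
    tool_name = decision["label"]      # predictable tool call, no LLM needed
else:
    tool_name = llm_select_tool(step)  # hypothetical LLM fallback for ambiguous steps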

In production

We replaced tool-calling with ML in Hermes. Cost dropped 50%.

We rewired Hermes (an open-source agent framework) to route tool selection through a TRACER classifier instead of the LLM. End-to-end agent cost dropped about 50% with no degradation on the traces we measured. Same surrogate, same parity gate, just plugged in as an agentic skill.

live · hermes + tracer
Case · Hermes (agentic)

Same agent. ~50% cheaper.

Tool selection inside an agent is a classification problem. Once the agent has run for a while, the same tool-call decisions repeat. TRACER learns them, gates by parity, and routes the predictable ones to a local classifier. For free.

  • E2E cost: −50%
  • Tool-call latency: sharply down
  • Running in prod inference at getclaw.sh · Read the thread →

Trust

A model router that finally works.

Most model routers give you a black-box score. TRACER shows you the map. Every routing decision is grounded in a cluster you can read and a measured track record you can verify.

Cluster card · one of 60 clusters in this workload

Read what the surrogate handles.

Every cluster card shows what kind of queries it covers, how each model performs on them, and which model TRACER picked. A domain expert can read it and predict the routing decision without running the system.

Handled by ML surrogate · PIN & card operations
Example query: "How do I change my PIN?"
Per-model accuracy on this cluster (how often each model picks the right label on these queries; the cheapest model that meets the parity bar wins):
  • surrogate: 96%
  • mini: 97%
  • teacher: 98%
Similar queries in this cluster:
  • "How do I reset my PIN?"
  • "I forgot the PIN of my card"
  • "Can I pick a new PIN online?"
  • "My PIN is wrong, what do I do?"
  • "Update PIN for my Apple Pay card"
Partition map · 60 clusters · grouped by topic

See where each query lands.

Every cluster projected to 2D, colored by routing tier and labeled by topic. Live queries pin onto the map. Out-of-scope queries land in the red region and escalate to the teacher automatically.

  • surrogate: 38 clusters · 62%
  • mini: 14 clusters · 24%
  • teacher: 8 clusters · 14%

Two ways to ship

Run it yourself, or let us host it.

Same routing core. Pick the deployment that fits your team. Start with the SDK, graduate to the hosted endpoint when you want to skip the infra.

01 · Open source

TRACER SDK

MIT-licensed Python package. Fits, calibrates, and serves a two-tier routing policy (1 ML surrogate + 1 LLM) from your laptop or your own infra.

  • pip install tracer-llm
  • 1 ML surrogate + 1 teacher LLM · two-tier routing
  • Ships with the parity gate, OOD check, qualitative reports
  • Self-host on any box, any cloud
02 · Hosted (paid)

Tracer Cloud

Managed routing endpoint with a multi-tier model menu. Routes across ML surrogate, smaller LLMs, and frontier LLMs for fine-grained cost control and bigger savings on long-tail traffic.

  • Multi-tier routing: ML + many LLMs (mini, mid, frontier)
  • Callable as API or as an agentic skill
  • Audit dashboard with cluster cards and partition map
  • Hosted BGE-M3 embeddings, monitoring, support

vs. everything else

How is this different from caching, smaller LLMs, or model routers?

Most LLM cost tools keep the request inside the LLM cost structure. TRACER routes predictable slices out of it entirely, gated by measured parity so you never silently degrade quality.

  • Caching: reuses identical responses, but only works when requests repeat exactly.
  • Prompt optimization: cuts tokens per call, but the request still goes through the LLM.
  • Smaller LLMs: cheaper per call, but still orders of magnitude more expensive than CPU-class ML at high volume.
  • Fine-tuning: specializes one model, but is heavier to maintain and still inside the LLM cost structure.
  • Model routers: pick which LLM to call, but never ask "do we need an LLM at all?"
  • TRACER: routes predictable slices to local ML. Customer-trained, parity-gated, interpretable.

Safety

Parity gate · deploy only when safe

The surrogate goes live only when its agreement with the teacher exceeds your threshold on held-out data. If the task is too hard (hallucination detection, compositional NLI), TRACER correctly refuses to route. No silent degradation.

# the gate checks before promoting
result = tracer.fit(
    "traces.jsonl",
    embeddings=X,
    config=tracer.FitConfig(
        target_teacher_agreement=0.95
    ),
)
# result.manifest.coverage_cal -> 0.92
# result.manifest.method        -> "l2d"

Continual learning

The teacher-trace flywheel

Every deferred LLM call produces a new labeled trace at no extra cost. tracer.update() refits the surrogate on the expanded dataset. Coverage compounds: by day 4 the surrogate handles 100% of in-distribution traffic.

Coverage by day (demo workload):
  • Day 1: 43%
  • Day 2: 98%
  • Day 4: 100%
Each deferred LLM call becomes a new training trace. The cold-start gap closes itself.
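A minimal sketch of that refit loop (the exact tracer.update signature is an assumption; load_router mirrors the Quickstart below):

# illustrative continual-learning loop; tracer.update() signature is assumed
import tracer

router = tracer.load_router(".tracer", embedder=embedder)  # embedder: your embedding model
# ...serve traffic; each deferred call is logged as a new (input, label) trace...
result = tracer.update("deferred_traces.jsonl", embeddings=X_new)  # X_new: embeddings for new traces
# the surrogate refits on the expanded dataset and coverage compounds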

Quickstart · OSS

Five minutes to your first routing policy.

Install, run the demo, fit on your traces, serve. No labeling pipeline, no fine-tuning job. Open source, MIT-licensed.

01 · Install
pip install tracer-llm
02 · Demo
tracer demo
03 · Fit
tracer fit traces.jsonl --target 0.95
04 · Serve
tracer serve .tracer --port 8000
Python
import tracer

result = tracer.fit("traces.jsonl", embeddings=X)           # X: precomputed embeddings for the traces
router = tracer.load_router(".tracer", embedder=embedder)   # embedder: your embedding model (e.g. BGE-M3)
out = router.predict("What is my balance?")
# {"label": "check_balance", "decision": "handled", "accept_score": 0.96}
JavaScript / Node.js
let { label, decision } = await fetch('http://localhost:8000/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ embedding })  // embedding: precomputed vector for the query text
}).then(r => r.json())

// 'deferred' means the gate didn't accept; fall back to your LLM
if (decision === 'deferred') label = await callYourLLM(text)

FAQ

Common questions.

What is TRACER?

TRACER is an open-source routing layer that trains a lightweight machine-learning surrogate on your LLM's own production classification traces. It routes the predictable 90 percent of traffic to the surrogate (near-zero cost) and defers only the hard 10 percent back to the LLM. Available as a Python SDK (pip install tracer-llm) or as a one-click hosted endpoint.

How do I reduce LLM costs?

To reduce LLM costs in production, route only the requests that genuinely need an LLM. Most production LLM workloads are repetitive classification tasks (intent detection, content moderation, support triage, tool selection). TRACER trains a small ML surrogate on your existing LLM traces and routes the predictable 90% of traffic to that surrogate at near-zero cost, deferring only the hard 10% back to the LLM. Typical impact: 5,000× cheaper per call on the routed slice and 80× lower latency, with a parity gate guaranteeing quality stays above your threshold. No fine-tuning, no manual labeling required.

What is LLM routing?

LLM routing sends each request to the cheapest model that can answer it correctly, instead of always hitting the same frontier LLM. Most model routers pick which LLM to call (frontier vs smaller LLM). TRACER goes further: it routes predictable requests out of the LLM stack entirely, into a lightweight ML surrogate trained on your own production traces. Routing is gated by measured agreement with your teacher LLM, so quality never silently drops. Available as tracer-llm on PyPI or as a hosted multi-tier routing endpoint.

How much does TRACER reduce LLM cost?

On the Banking77 benchmark with 10,000 daily classification calls, TRACER offloaded 92.2 percent of traffic to a local ML surrogate at 0.961 teacher agreement, cutting per-day cost from $44.50 to $3.47, about $14,976 saved per year. Actual savings depend on your workload's predictability; the more repetitive the traffic, the larger the saving.
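For reference, the arithmetic behind the yearly figure: ($44.50 - $3.47) saved per day × 365 days ≈ $14,976.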

How is TRACER different from a model router or smaller LLM?

Most LLM cost tools keep the request inside the LLM cost structure: caching only works on exact repeats, prompt optimization shaves tokens, smaller LLMs are still orders of magnitude more expensive than CPU-class ML, and model routers only pick which LLM to call. TRACER routes predictable slices out of the LLM stack entirely, gated by measured agreement (parity) with your teacher LLM so quality never silently degrades.

How does TRACER guarantee quality on the routed traffic?

TRACER deploys a parity gate: the surrogate goes live only when its agreement with the teacher LLM exceeds your threshold (for example 0.95) on held-out calibration data. If a workload is too hard, TRACER refuses to route it and everything stays on the LLM. Every routing decision exposes the matched cluster, the per-model accuracy on that cluster, and the confidence bound, making each decision fully auditable.

What kinds of workloads does TRACER work for?

TRACER targets repetitive LLM classification workloads: intent classification, content moderation, compliance scanning, support triage, document extraction, eval pipelines, and per-step tool selection in agentic workflows. Anywhere the same kinds of decisions happen many times a day, TRACER finds the predictable slices.

How long does it take to deploy TRACER?

On the hosted version, the setup wizard is six steps: pick your task, point to your traces, choose embeddings, pick your model menu, set a quality target, and get a live HTTPS endpoint. The build runs in the background and takes minutes (not days) depending on dataset size. With the open-source SDK, the equivalent is pip install tracer-llm followed by tracer fit traces.jsonl --target 0.95 and tracer serve.

Is TRACER open source?

Yes. The TRACER routing core is MIT-licensed and available on GitHub at github.com/adrida/tracer and on PyPI as tracer-llm. The hosted version layers managed infrastructure (managed embeddings, hosted endpoint, monitoring, audit dashboard) on top of the same OSS core.

Do I need to label my training data?

No. Every classification call your LLM makes is already a labeled (input, output) pair sitting in your logs. TRACER fits the surrogate directly on these traces with no manual labeling. As traces accumulate, the surrogate refits and coverage compounds: 43% on day 1, 98% on day 2, 100% by day 4 in the demo workload.

Deploying TRACER at scale?

Tracer Cloud handles managed embeddings, multi-model routing, audit dashboards, monitoring, and reliable hosting, so your team focuses on the model menu, not the infra.

Book a call →