The model router that actually works

Stop paying for LLM calls that an ML model can handle.

Other routers pick between LLMs. TRACER routes the easy 90% to a lightweight ML surrogate trained on your own traces, gated by measured agreement so quality never drops. Open source, or hosted.

  • Auditable: every routing decision explained
  • End-to-end: traces in, endpoint out
  • Parity-gated: never silently degrades
Incoming query: "How do I change my PIN?"
predicted intent → change_pin · accept_score 0.96

Without TRACER (teacher LLM): cost $0.005 · latency ~820 ms
With TRACER (ML surrogate): cost $0.000001 · latency <10 ms
Saved: ~5,000× cheaper · ~80× faster · comparable accuracy (0.96 vs 0.98)
What is TRACER?
TRACER is a routing layer that sits above your LLM stack. It trains a lightweight ML surrogate on your LLM's own production traces, routes the easy 90% of classification calls to the surrogate for free, and defers only the hard 10% back to the LLM, gated by measured agreement so quality never silently drops. Run the open-source SDK locally, or deploy a hosted endpoint end to end with the 6-step setup wizard.

What it does

Three lanes, one router.

TRACER carves your incoming traffic into routable lanes by learned difficulty and gates each lane by measured agreement with your teacher. Easy goes to a free surrogate. Mid-difficulty can use a smaller LLM. The rest defers back to the teacher.

Of 100% incoming traffic:
  • Predictable: 62% → surrogate (local ML) · ~$0 / call
  • Mid-difficulty: 24% → smaller LLM (optional tier)
  • Ambiguous / OOD: 14% → teacher LLM · full cost

Distribution above is illustrative · actual coverage is measured per workload.
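As a sketch of what that lane logic looks like in code (illustrative only: the threshold values and helper names below are assumptions, not the TRACER API, whose decisions live behind router.predict):

# illustrative three-lane policy, not the actual TRACER internals
def route(query, surrogate, small_llm, teacher_llm,
          accept_threshold=0.95, mid_threshold=0.80):
    pred = surrogate(query)                    # local ML, ~$0 per call
    if pred.accept_score >= accept_threshold:  # predictable lane
        return pred.label
    if pred.accept_score >= mid_threshold:     # mid-difficulty lane
        return small_llm(query)                # optional cheaper LLM tier
    return teacher_llm(query)                  # ambiguous / OOD lane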

Cost & latency

The two metrics that matter, both crushed.

Per-call cost and per-call latency are how a routing layer is judged in production. TRACER cuts both by orders of magnitude on the predictable slice, without giving up accuracy, because the parity gate is doing the work.

Cost per call on the predictable slice
  • teacher LLM: $0.005 (~$5 / 1,000 calls)
  • ML surrogate: $0.000001 (~$1 / 1,000,000 calls)
  → ~5,000× cheaper per call on the routed 90%

Latency per call (P50, no network jitter)
  • teacher LLM: ~820 ms (API round-trip + generation)
  • ML surrogate: <10 ms (CPU classifier · co-located)
  → ~80× faster on the routed 90%, often sub-10 ms

Numbers above are demo-workload measurements (Banking77, gpt-5.5 teacher, BGE-M3 + linear surrogate). Your mileage scales with how predictable your traffic actually is.
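A quick sanity check on how those per-call numbers blend at the advertised coverage (illustrative arithmetic, assuming 90% of calls land on the surrogate):

# blended per-call cost under an assumed 90% routed share
teacher_cost, surrogate_cost = 0.005, 0.000001
coverage = 0.90
blended = coverage * surrogate_cost + (1 - coverage) * teacher_cost
# blended ≈ $0.0005 per call, roughly 10× cheaper overall;
# the routed slice itself is teacher_cost / surrogate_cost = 5,000× cheaper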

How it works

Your LLM traces become free training data.

Every classification call your LLM already makes is a labeled input-output pair. TRACER fits a surrogate on those, calibrates a confidence gate, and only routes when the gate is happy.

01

Log traces

Every LLM classification call produces a labeled input-output pair, already in your logs. No manual labeling.

02

Fit a surrogate

tracer.fit() trains ML candidates, picks the best, calibrates a confidence gate against the teacher.

03

Route and save

Easy inputs handled by the surrogate (free). Hard inputs deferred to the teacher (paid). Coverage compounds.
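For concreteness, a minimal sketch of what those traces and the fit step can look like (the JSONL field names are an assumed schema; the fit call mirrors the snippet in the Safety section below):

# traces.jsonl: one LLM classification call per line (assumed schema)
# {"input": "How do I change my PIN?", "label": "change_pin"}
import tracer

result = tracer.fit(
    "traces.jsonl",
    embeddings=X,  # placeholder: precomputed embeddings for the trace inputs
    config=tracer.FitConfig(target_teacher_agreement=0.95),
)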

Hosted Tracer

How does it work?

setup wizard · 6 steps

Like a typical model router, but routing isn't only between LLMs: it's between an ML surrogate trained on your own traces and the LLMs of your choice. Hosted Tracer gives you an HTTPS endpoint plus an audit dashboard. Endpoints are callable directly, or as agentic skills you plug into your agent framework.

Typical model router: picks which LLM to call
  query → frontier LLM or smaller LLM

TRACER: routes the easy 90% out of the LLM stack
  query → ML surrogate, smaller LLM, or frontier LLM
01 · Connect

Point to your traces

Upload JSONL, connect a warehouse, or start from a sample. Pick task and embeddings.

02 · Configure

Pick models & quality

Choose your routing menu (ML + LLMs) and a quality target. Tracer builds the partition in the background.

03 · Deploy

Your endpoint is live

Production HTTPS endpoint, API key, audit dashboard, monitoring. cURL it from anywhere.

your endpoint, after setup
curl -X POST https://api.tracer.deeprecall.io/{project}/classify \
  -H "Authorization: Bearer trc_..." \
  -d '{"input":"How do I change my PIN?"}'

 {"label": "change_pin", "decision": "handled", "accept_score": 0.96, "model": "tracer_surrogate"}

What it routes

Two surfaces. One router.

TRACER works wherever your LLM makes a discrete decision over and over. That's classification calls (intents, moderation, triage), and the tool-selection step inside agentic workflows.

Classification · live

LLM classification calls

Intent classification, content moderation, support triage, document extraction, eval pipelines. The router sends the predictable slice to a local ML surrogate and defers the rest to your teacher LLM.

endpoint: POST /classify · returns label + accept_score + decision
Agentic · live

Agentic tool selection

Inside multi-step agents, tool selection is a classification problem in disguise. TRACER routes the easy tool-call decisions to a lightweight ML classifier and defers ambiguous ones to the LLM. Plug it in as an agentic skill.

skill: tracer.skill(project) · drop into LangGraph, Hermes, your own
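A hedged sketch of what that looks like inside an agent loop (tracer.skill is named above; the call signature, project id, and LLM fallback below are assumptions):

# illustrative tool-selection step; exact skill interface is an assumption
import tracer

select_tool = tracer.skill("your-agent-project")  # agentic skill for this project
step = "look up the user's last three transactions"
decision = select_tool(step)

if decision["decision"] == "handled":
    tool_name = decision["label"]      # predictable tool call, no LLM needed
else:
    tool_name = llm_select_tool(step)  # hypothetical LLM fallback for ambiguous steps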

In production

We replaced tool-calling with ML in Hermes. Cost dropped 50%.

We rewired Hermes (an open-source agent framework) to route tool selection through a TRACER classifier instead of the LLM. End-to-end agent cost dropped about 50% with no degradation on the traces we measured. Same surrogate, same parity gate, just plugged in as an agentic skill.

live · hermes + tracer
Case · Hermes (agentic)

Same agent. ~50% cheaper.

Tool selection inside an agent is a classification problem. Once the agent has run for a while, the same tool-call decisions repeat. TRACER learns them, gates by parity, and routes the predictable ones to a local classifier. For free.

  • E2E cost: −50%
  • Tool-call latency: sharply down
  • Running in prod inference at getclaw.sh · Read the thread →

Trust

A model router that finally works.

Most model routers give you a black-box score. TRACER shows you the map. Every routing decision is grounded in a cluster you can read and a measured track record you can verify.

Cluster card · one of 60 clusters in this workload

Read what the surrogate handles.

Every cluster card shows what kind of queries it covers, how each model performs on them, and which model TRACER picked. A domain expert can read it and predict the routing decision without running the system.

Handled by ML surrogate · PIN & card operations
Example query: "How do I change my PIN?"
Per-model accuracy on this cluster (how often each model picks the right label on these queries; the cheapest model that meets the parity bar wins):
  • surrogate: 96%
  • mini: 97%
  • teacher: 98%
Similar queries in this cluster:
  • "How do I reset my PIN?"
  • "I forgot the PIN of my card"
  • "Can I pick a new PIN online?"
  • "My PIN is wrong, what do I do?"
  • "Update PIN for my Apple Pay card"
Partition map · 60 clusters · grouped by topic

See where each query lands.

Every cluster projected to 2D, colored by routing tier and labeled by topic. Live queries pin onto the map. Out-of-scope queries land in the red region and escalate to the teacher automatically.

  • surrogate: 38 clusters · 62%
  • mini: 14 clusters · 24%
  • teacher: 8 clusters · 14%

Two ways to ship

Run it yourself, or let us host it.

Same routing core. Pick the deployment that fits your team. Start with the SDK, graduate to the hosted endpoint when you want to skip the infra.

01 · Open source

TRACER SDK

MIT-licensed Python package. Fits, calibrates, and serves a two-tier routing policy (1 ML surrogate + 1 LLM) from your laptop or your own infra.

  • pip install tracer-llm
  • 1 ML surrogate + 1 teacher LLM · two-tier routing
  • Ships with the parity gate, OOD check, qualitative reports
  • Self-host on any box, any cloud
02 · Hosted (paid)

Tracer Cloud

Managed routing endpoint with a multi-tier model menu. Routes across ML surrogate, smaller LLMs, and frontier LLMs for fine-grained cost control and bigger savings on long-tail traffic.

  • Multi-tier routing: ML + many LLMs (mini, mid, frontier)
  • Callable as API or as an agentic skill
  • Audit dashboard with cluster cards and partition map
  • Hosted BGE-M3 embeddings, monitoring, support

vs. everything else

How is this different from caching, smaller LLMs, or model routers?

Most LLM cost tools keep the request inside the LLM cost structure. TRACER routes predictable slices out of it entirely, gated by measured parity so you never silently degrade quality.

  • Caching: reuses identical responses, but only works when requests repeat exactly.
  • Prompt optimization: cuts tokens per call, but the request still goes through the LLM.
  • Smaller LLMs: cheaper per call, but still orders of magnitude more expensive than CPU-class ML at high volume.
  • Fine-tuning: specializes one model, but is heavier to maintain and still inside the LLM cost structure.
  • Model routers: pick which LLM to call, but never ask "do we need an LLM at all?"
  • TRACER: routes predictable slices to local ML. Customer-trained, parity-gated, interpretable.

Safety

Parity gate · deploy only when safe

The surrogate goes live only when its agreement with the teacher exceeds your threshold on held-out data. If the task is too hard (hallucination detection, compositional NLI), TRACER correctly refuses to route. No silent degradation.

# the gate checks before promoting
result = tracer.fit(
    "traces.jsonl",
    embeddings=X,
    config=tracer.FitConfig(
        target_teacher_agreement=0.95
    ),
)
# result.manifest.coverage_cal -> 0.92
# result.manifest.method        -> "l2d"

Continual learning

The teacher-trace flywheel

Every deferred LLM call produces a new labeled trace at no extra cost. tracer.update() refits the surrogate on the expanded dataset. Coverage compounds: by day 4 the surrogate handles 100% of in-distribution traffic.

Coverage by day (demo workload):
  • Day 1: 43%
  • Day 2: 98%
  • Day 4: 100%
Each deferred LLM call becomes a new training trace. The cold-start gap closes itself.
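A minimal sketch of that refit loop (the exact tracer.update signature is an assumption; load_router mirrors the Quickstart below):

# illustrative continual-learning loop; tracer.update() signature is assumed
import tracer

router = tracer.load_router(".tracer", embedder=embedder)  # embedder: your embedding model
# ...serve traffic; each deferred call is logged as a new (input, label) trace...
result = tracer.update("deferred_traces.jsonl", embeddings=X_new)  # X_new: embeddings for new traces
# the surrogate refits on the expanded dataset and coverage compounds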

Quickstart · OSS

Five minutes to your first routing policy.

Install, run the demo, fit on your traces, serve. No labeling pipeline, no fine-tuning job. Open source, MIT-licensed.

01 · Install
pip install tracer-llm
02 · Demo
tracer demo
03 · Fit
tracer fit traces.jsonl --target 0.95
04 · Serve
tracer serve .tracer --port 8000
Python
import tracer

result = tracer.fit("traces.jsonl", embeddings=X)           # X: precomputed embeddings for the traces
router = tracer.load_router(".tracer", embedder=embedder)   # embedder: your embedding model (e.g. BGE-M3)
out = router.predict("What is my balance?")
# {"label": "check_balance", "decision": "handled", "accept_score": 0.96}
JavaScript / Node.js
let { label, decision } = await fetch('http://localhost:8000/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ embedding })  // embedding: precomputed vector for the query text
}).then(r => r.json())

// 'deferred' means the gate didn't accept; fall back to your LLM
if (decision === 'deferred') label = await callYourLLM(text)

FAQ

Common questions.

What is TRACER?

TRACER is an open-source routing layer that trains a lightweight machine-learning surrogate on your LLM's own production classification traces. It routes the predictable 90 percent of traffic to the surrogate (near-zero cost) and defers only the hard 10 percent back to the LLM. Available as a Python SDK (pip install tracer-llm) or as a one-click hosted endpoint.

How do I reduce LLM costs?

To reduce LLM costs in production, route only the requests that genuinely need an LLM. Most production LLM workloads are repetitive classification tasks (intent detection, content moderation, support triage, tool selection). TRACER trains a small ML surrogate on your existing LLM traces and routes the predictable 90% of traffic to that surrogate at near-zero cost, deferring only the hard 10% back to the LLM. Typical impact: 5,000× cheaper per call on the routed slice and 80× lower latency, with a parity gate guaranteeing quality stays above your threshold. No fine-tuning, no manual labeling required.

What is LLM routing?

LLM routing sends each request to the cheapest model that can answer it correctly, instead of always hitting the same frontier LLM. Most model routers pick which LLM to call (frontier vs smaller LLM). TRACER goes further: it routes predictable requests out of the LLM stack entirely, into a lightweight ML surrogate trained on your own production traces. Routing is gated by measured agreement with your teacher LLM, so quality never silently drops. Available as tracer-llm on PyPI or as a hosted multi-tier routing endpoint.

How much does TRACER reduce LLM cost?

On the Banking77 benchmark with 10,000 daily classification calls, TRACER offloaded 92.2 percent of traffic to a local ML surrogate at 0.961 teacher agreement, cutting per-day cost from $44.50 to $3.47, about $14,976 saved per year. Actual savings depend on your workload's predictability; the more repetitive the traffic, the larger the saving.
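For reference, the arithmetic behind the yearly figure: ($44.50 - $3.47) saved per day × 365 days ≈ $14,976.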

How is TRACER different from a model router or smaller LLM?

Most LLM cost tools keep the request inside the LLM cost structure: caching only works on exact repeats, prompt optimization shaves tokens, smaller LLMs are still orders of magnitude more expensive than CPU-class ML, and model routers only pick which LLM to call. TRACER routes predictable slices out of the LLM stack entirely, gated by measured agreement (parity) with your teacher LLM so quality never silently degrades.

How does TRACER guarantee quality on the routed traffic?

TRACER deploys a parity gate: the surrogate goes live only when its agreement with the teacher LLM exceeds your threshold (for example 0.95) on held-out calibration data. If a workload is too hard, TRACER refuses to route it and everything stays on the LLM. Every routing decision exposes the matched cluster, the per-model accuracy on that cluster, and the confidence bound, making each decision fully auditable.

What kinds of workloads does TRACER work for?

TRACER targets repetitive LLM classification workloads: intent classification, content moderation, compliance scanning, support triage, document extraction, eval pipelines, and per-step tool selection in agentic workflows. Anywhere the same kinds of decisions happen many times a day, TRACER finds the predictable slices.

How long does it take to deploy TRACER?

On the hosted version, the setup wizard is six steps: pick your task, point to your traces, choose embeddings, pick your model menu, set a quality target, and get a live HTTPS endpoint. The build runs in the background and takes minutes (not days) depending on dataset size. With the open-source SDK, the equivalent is pip install tracer-llm followed by tracer fit traces.jsonl --target 0.95 and tracer serve.

Is TRACER open source?

Yes. The TRACER routing core is MIT-licensed and available on GitHub at github.com/adrida/tracer and on PyPI as tracer-llm. The hosted version layers managed infrastructure (managed embeddings, hosted endpoint, monitoring, audit dashboard) on top of the same OSS core.

Do I need to label my training data?

No. Every classification call your LLM makes is already a labeled (input, output) pair sitting in your logs. TRACER fits the surrogate directly on these traces with no manual labeling. As traces accumulate, the surrogate refits and coverage compounds: 43% on day 1, 98% on day 2, 100% by day 4 in the demo workload.

Deploying TRACER at scale?

Tracer Cloud handles managed embeddings, multi-model routing, audit dashboards, monitoring, and reliable hosting, so your team focuses on the model menu, not the infra.

Book a call →