The harness for agents
you're actually shipping
Trace every call, eval every output, catch regressions before your users do. Works with any LLM, any framework, any modality — text, voice, image, video.
llm_call → tool_call (search) → llm_call → tool_call (summarize) · 4 spans · 1.4s · 3.1k tokens
12/14 rows passed · naturalness 0.91 · instruction-following 0.83 · 2 failures flagged
One trace per agent run
Every step — tool calls, LLM calls, DB reads — stitched into a single waterfall. No fragmented logs.
Tool Call: stripe.refund
{
  "order_id": "ORD-8429",
  "amount": 49.99,
  "reason": "customer_requested"
}
{
  "status": "refund_processed",
  "receipt_url": "https://stripe.com/..."
}
Agents break in ways you don't expect
Multi-step, multi-modal, multi-model — the more capable your agent, the harder it is to know when it's failing.
Fragmented Traces
Multi-step agents spawn traces across threads and services. You get 10 disconnected logs instead of one run.
No Quality Signal
The agent returns a response. You have no idea if it's right until a user files a complaint.
Manual QA Doesn't Scale
Spot-checking outputs by hand works for 10 test cases. It breaks at 1,000.
Invisible Cost
Token usage and latency blow up silently across sessions and devices. You find out on the invoice.
Everything in the harness
Trace, evaluate, and ship agents across text, voice, image, and video.
Agent Tracing
Every LLM call, tool invocation, HTTP hop, and DB query lands as a typed span — one waterfall per run.
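To make "typed span" concrete, here is a minimal sketch of how the spans of one run could roll up into the headline numbers of a waterfall. The Span fields and the summarize helper are assumptions made for illustration, not the SDK's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical span shape; field names are assumptions for this sketch.
    kind: str        # e.g. "llm_call", "tool_call", "http", "db_query"
    name: str
    start_ms: int    # offset from the start of the run
    end_ms: int
    tokens: int = 0  # only meaningful for LLM spans


def summarize(spans):
    """Collapse the spans of one run into span count, wall time, and tokens."""
    if not spans:
        return {"spans": 0, "duration_s": 0.0, "tokens": 0}
    start = min(s.start_ms for s in spans)
    end = max(s.end_ms for s in spans)
    return {
        "spans": len(spans),
        "duration_s": (end - start) / 1000,
        "tokens": sum(s.tokens for s in spans),
    }


run = [
    Span("llm_call", "plan", 0, 400, tokens=1200),
    Span("tool_call", "search", 400, 700),
    Span("llm_call", "draft", 700, 1200, tokens=1900),
    Span("tool_call", "summarize", 1200, 1400),
]
summary = summarize(run)
```

With the sample run above, the summary has 4 spans, a duration of 1.4 seconds, and 3,100 tokens.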
Zero-Config SDK
One init() call. All providers patched automatically.
Online Evaluation
Every production trace scored automatically as it arrives. No cron jobs.
Voice Agent Eval
Generate audio, score naturalness + instruction-following, replay rows. Connect live in-browser.
Batch Evaluation
Run a dataset through your agent, score per row, regenerate failures individually.
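The batch loop can be sketched in a few lines. This is an illustration under stated assumptions, not the harness's implementation: the row shape, the exact_match scorer, the 0.8 threshold, and both toy agents are hypothetical.

```python
def evaluate_row(row, agent, scorer, threshold):
    """Run one row through the agent and score its output."""
    output = agent(row["input"])
    score = scorer(row, output)
    return {"id": row["id"], "output": output, "score": score,
            "passed": score >= threshold}


def run_batch(dataset, agent, scorer, threshold=0.8):
    """Score every row of the dataset."""
    return [evaluate_row(row, agent, scorer, threshold) for row in dataset]


def regenerate_failures(dataset, results, agent, scorer, threshold=0.8):
    """Re-run only the rows that failed; passing rows are kept untouched."""
    rows_by_id = {row["id"]: row for row in dataset}
    return [
        r if r["passed"]
        else evaluate_row(rows_by_id[r["id"]], agent, scorer, threshold)
        for r in results
    ]


def exact_match(row, output):
    # Hypothetical scorer: 1.0 when the output equals the expected string.
    return 1.0 if output == row["expected"] else 0.0


dataset = [
    {"id": 1, "input": "ok", "expected": "OK"},
    {"id": 2, "input": "refund", "expected": "REFUND"},
    {"id": 3, "input": " hi ", "expected": "HI"},
]
agent_v1 = str.upper                      # toy agent that mishandles whitespace
agent_v2 = lambda text: text.strip().upper()  # toy agent with the fix

first = run_batch(dataset, agent_v1, exact_match)
second = regenerate_failures(dataset, first, agent_v2, exact_match)
```

Here the first pass fails only row 3, and the second pass re-runs that single row instead of the whole dataset.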
Image & Video Eval
Generate, then score with an LLM-as-judge across visual quality dimensions.
Session & Device
session_id groups the spans of one conversation. device_id identifies the originating client, APM-style.
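As a rough sketch of what grouping by these two IDs buys you, the snippet below sums token usage per session and collects the devices seen in each. Only session_id and device_id come from the text above; the span dictionaries and the usage_by_session helper are assumptions for illustration.

```python
from collections import defaultdict


def usage_by_session(spans):
    """Sum token usage per session_id and collect the device_ids seen in it."""
    totals = defaultdict(lambda: {"tokens": 0, "devices": set()})
    for span in spans:
        entry = totals[span["session_id"]]
        entry["tokens"] += span.get("tokens", 0)  # non-LLM spans carry no tokens
        entry["devices"].add(span["device_id"])
    return dict(totals)


spans = [
    {"session_id": "s1", "device_id": "ios-1", "tokens": 900},
    {"session_id": "s1", "device_id": "web-7", "tokens": 400},
    {"session_id": "s2", "device_id": "web-7", "tokens": 250},
    {"session_id": "s1", "device_id": "ios-1"},  # a tool call with no tokens
]
usage = usage_by_session(spans)
```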
Model Catalog
Central registry for all providers. Set access grants, test models inline.
Evaluator Collections
Group LLM-as-judge prompts into reusable collections. Apply to any project or set globally. Three rule types: model-graded, custom prompt, statistical.
SDK, harness, dashboard — one stack
Everything your agent needs from dev to production, without stitching together five tools.
SDK
One init() call instruments your entire agent stack — LLM clients, HTTP, SQL, logging. Python and TypeScript, zero deps.
- Auto-patches OpenAI, Anthropic, Gemini, Bedrock
- W3C traceparent propagation
- Session + device context
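Auto-patching generally means wrapping a client's methods at import time so each call records a span. The sketch below shows that general technique only. FakeLLMClient, patch_method, init_sketch, and RECORDED_SPANS are all hypothetical names, and nothing here is taken from the SDK's real init().

```python
import functools
import time

RECORDED_SPANS = []  # stand-in for a real span exporter


def patch_method(cls, method_name, kind):
    """Replace cls.method_name with a wrapper that records one span per call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        try:
            return original(self, *args, **kwargs)
        finally:
            # Record the span even if the wrapped call raises.
            RECORDED_SPANS.append({
                "kind": kind,
                "name": f"{cls.__name__}.{method_name}",
                "duration_s": time.perf_counter() - start,
            })

    setattr(cls, method_name, wrapper)


class FakeLLMClient:
    """Hypothetical provider client, used only for this illustration."""

    def complete(self, prompt):
        return f"echo: {prompt}"


def init_sketch():
    """Hypothetical stand-in for an init() that patches known clients."""
    patch_method(FakeLLMClient, "complete", kind="llm_call")


init_sketch()
reply = FakeLLMClient().complete("hello")
```

The caller's code does not change: the client is used exactly as before, and the span shows up as a side effect of the call.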
Agent Harness
The evaluation core. Run batch jobs, score traces online, compare models pairwise — text, voice, image, and video.
- Online + batch evaluation
- Voice agent eval via LiveKit
- Image / video quality scoring
Dashboard
Trace waterfall, session explorer, eval results, cost breakdown, and model catalog — everything in one place.
- Trace waterfall per agent run
- Per-session cost analytics
- Regression comparison across deploys
One call.
Everything traced.
init() patches every LLM client, HTTP layer, and database adapter your agent touches. Traces start flowing in seconds — no manual spans, no config files.
- pip install syntropylabs / npm install @syntropylabs/sdk
- Auto-instruments OpenAI, Anthropic, Gemini, Bedrock, HTTP, SQL, Redis, Mongoose
- Online eval: attach evaluators to a project, every trace gets scored automatically
- W3C traceparent — frontend and backend stitched into one trace, no extra work
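For readers unfamiliar with traceparent, the header carries a version, a 32-hex trace ID, a 16-hex parent span ID, and 2-hex flags. A minimal sketch of parsing it and deriving a child header follows. It handles version 00 only, and the helper names are assumptions, not SDK functions.

```python
import re
import secrets

# version 00: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>"
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the header's fields, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming=None):
    """Continue the incoming trace with a new span ID, or start a new trace."""
    parsed = parse_traceparent(incoming) if incoming else None
    trace_id = parsed["trace_id"] if parsed else secrets.token_hex(16)
    flags = parsed["flags"] if parsed else "01"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


frontend = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
backend = child_traceparent(frontend)
```

Because the backend header reuses the frontend's trace ID and only swaps in a fresh span ID, both sides land in the same trace.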
Works with your entire AI stack
Simple, Transparent Pricing
Scale your evaluation pipeline as your product grows.
Enterprise
For organizations with custom security and scale needs.
- Custom model adapters
- On-premise deployment
- Dedicated account manager
- SLA & security audits
- SSO / SAML