Agent Harness — now in beta

The harness for agents you're actually shipping

Trace every call, eval every output, catch regressions before your users do. Works with any LLM, any framework, any modality — text, voice, image, video.

agent_run_#1047.trace

EVALUATING

00:00:38

research_agent

0.94

llm_call → tool_call (search) → llm_call → tool_call (summarize) · 4 spans · 1.4s · 3.1k tokens

voice_eval_job #22

0.87

12/14 rows passed · naturalness 0.91 · instruction-following 0.83 · 2 failures flagged

Avg Latency

1.2s

Token Usage

4.2k

Pass Rate

98.2%

One trace per agent run

Every step — tool calls, LLM calls, DB reads — stitched into a single waterfall. No fragmented logs.

Agent Trace Viewer

trace_#1094_support · PASSED EVAL

Execution Waterfall

Support Agent

3.4s

db.get_order

45ms

Planner ReAct

800ms

stripe.refund

412ms

llm.generate

1.2s

Tool Call: stripe.refund

Latency 412ms

Tokens 1,402

// Agent Tool Input

{
  "order_id": "ORD-8429",
  "amount": 49.99,
  "reason": "customer_requested"
}

// API Output

{
  "status": "refund_processed",
  "receipt_url": "https://stripe.com/..."
}

Agents break in ways you don't expect

Multi-step, multi-modal, multi-model — the more capable your agent, the harder it is to know when it's failing.

Fragmented Traces

Multi-step agents spawn traces across threads and services. You get 10 disconnected logs instead of one run.

No Quality Signal

The agent returns a response. You have no idea if it's right until a user files a complaint.

Manual QA Doesn't Scale

Spot-checking outputs by hand works for 10 test cases. It breaks at 1,000.

Invisible Cost

Token usage and latency blow up silently across sessions and devices. You find out on the invoice.

Everything in the harness

Trace, evaluate, and ship agents across text, voice, image, and video.

Agent Tracing

Every LLM call, tool invocation, HTTP hop, and DB query lands as a typed span — one waterfall per run.

support_agent

3.4s

db.get_order

45ms

llm.plan

0.8s

stripe.refund

412ms

llm.reply

1.2s

llm_call · tool_call · http_call · db_query
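As a rough illustration of what "typed spans in one waterfall" can look like as data, here is a minimal sketch. The Span class, the render_waterfall helper, the "agent" kind for the root span, and the start offsets are all assumptions made for this example. Only the span names, kinds, and durations come from the panel above. This is not the SDK's data model.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    kind: str         # llm_call, tool_call, http_call, db_query ("agent" is an assumed root kind)
    start_ms: int     # offset from the start of the run (assumed values)
    duration_ms: int


def render_waterfall(spans, width=40):
    """Return one text row per span: a bar placed by start time and sized by duration."""
    total = max(s.start_ms + s.duration_ms for s in spans)
    rows = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = s.start_ms * width // total
        length = max(1, s.duration_ms * width // total)  # always show at least one cell
        bar = " " * offset + "#" * length
        rows.append(f"{s.name:<14} {s.kind:<10} {bar:<{width}} {s.duration_ms}ms")
    return rows


# Durations mirror the support_agent example; start offsets are made up.
run = [
    Span("support_agent", "agent", 0, 3400),
    Span("db.get_order", "db_query", 100, 45),
    Span("llm.plan", "llm_call", 200, 800),
    Span("stripe.refund", "tool_call", 1100, 412),
    Span("llm.reply", "llm_call", 1600, 1200),
]

for row in render_waterfall(run):
    print(row)
```

Keeping every span in one list with a shared time origin is what lets a single run render as one waterfall instead of separate logs.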

Zero-Config SDK

One init() call. All providers patched automatically.

$ pip install syntropylabs

$ npm i @syntropylabs/sdk

Online Evaluation

Every production trace scored automatically as it arrives. No cron jobs.

Scoring live traces…

Voice Agent Eval

Generate audio, score naturalness + instruction-following, replay rows. Connect live in-browser.

naturalness 0.91 · instruction 0.87 · 12 / 14 passed

Batch Evaluation

Run a dataset through your agent, score per row, regenerate failures individually.

row_001

94%

row_002

72%

row_003

38%

row_004

91%
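The batch flow described above (run a dataset, score per row, regenerate only the failures) can be sketched in a few lines. Everything here is hypothetical: run_batch, regenerate_failures, FlakyAgent, keyword_scorer, and the 0.8 pass threshold are illustrative stand-ins, not the harness API.

```python
def run_batch(dataset, agent, scorer, threshold=0.8):
    """Score every row; return {row_id: (output, score, passed)}."""
    results = {}
    for row_id, prompt in dataset.items():
        output = agent(prompt)
        score = scorer(prompt, output)
        results[row_id] = (output, score, score >= threshold)
    return results


def regenerate_failures(results, dataset, agent, scorer, threshold=0.8):
    """Re-run only the rows that failed; passing rows are left untouched."""
    updated = dict(results)
    for row_id, (_, _, passed) in results.items():
        if not passed:
            output = agent(dataset[row_id])
            score = scorer(dataset[row_id], output)
            updated[row_id] = (output, score, score >= threshold)
    return updated


class FlakyAgent:
    """Toy agent: returns an empty answer the first time it sees a 'refund' prompt."""

    def __init__(self):
        self.seen = set()

    def __call__(self, prompt):
        if "refund" in prompt and prompt not in self.seen:
            self.seen.add(prompt)
            return ""
        return f"Handled: {prompt}"


def keyword_scorer(prompt, output):
    """Toy scorer: share of prompt words that reappear in the output."""
    words = prompt.split()
    return sum(w in output for w in words) / len(words)


dataset = {"row_001": "track order", "row_002": "refund order", "row_003": "cancel order"}
agent = FlakyAgent()
first = run_batch(dataset, agent, keyword_scorer)
second = regenerate_failures(first, dataset, agent, keyword_scorer)
```

Regenerating row by row keeps the passing results stable, so a rerun changes only the rows that needed it.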

Image & Video Eval

Generate outputs, then score them with an LLM-as-judge across visual quality dimensions.

Visual quality 88%

Prompt alignment 74%

Coherence 91%

Session & Device

session_id groups the traces of one conversation. device_id tags each trace with its originating client, APM-style.

sess_4f2a $0.031

sess_8c1b $0.089

sess_2d9e $0.214
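To make the grouping concrete, here is a small sketch of rolling span costs up by session_id or device_id. The record shape, the cost_usd field, the device ids, and the per-span amounts are assumptions chosen so the session totals match the figures above. The real SDK may store cost differently.

```python
from collections import defaultdict
from decimal import Decimal


def group_cost(spans, key):
    """Sum cost_usd per value of `key` (for example "session_id" or "device_id")."""
    totals = defaultdict(Decimal)  # Decimal() starts at 0 and avoids float rounding drift
    for span in spans:
        totals[span[key]] += Decimal(span["cost_usd"])
    return dict(totals)


# Hypothetical span records; only the session totals come from the page.
spans = [
    {"session_id": "sess_4f2a", "device_id": "dev_a", "cost_usd": "0.011"},
    {"session_id": "sess_4f2a", "device_id": "dev_a", "cost_usd": "0.020"},
    {"session_id": "sess_8c1b", "device_id": "dev_b", "cost_usd": "0.089"},
    {"session_id": "sess_2d9e", "device_id": "dev_a", "cost_usd": "0.114"},
    {"session_id": "sess_2d9e", "device_id": "dev_a", "cost_usd": "0.100"},
]

by_session = group_cost(spans, "session_id")
by_device = group_cost(spans, "device_id")
```

The same function serves both views because session and device are just two keys on every span.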

Model Catalog

Central registry for all providers. Set access grants, test models inline.

OpenAI

Anthropic

Gemini

Bedrock

Cohere

Azure

Evaluator Collections

Group LLM-as-judge prompts into reusable collections. Apply to any project or set globally. Three rule types: model-graded, custom prompt, statistical.

Relevancy · Groundedness · Safety · PII Detection · Coherence · Custom Prompt
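One way to picture a collection with the three rule types is a list of named scoring functions. The sketch below is an assumption throughout: Evaluator, model_graded, custom_prompt, token_overlap, apply_collection, and the stub judge are invented for illustration, and a real judge would call an LLM instead of returning a constant.

```python
from dataclasses import dataclass
from typing import Callable

Judge = Callable[[str], float]  # takes a grading prompt, returns a score in [0, 1]


@dataclass
class Evaluator:
    name: str
    rule_type: str                      # "model_graded", "custom_prompt", or "statistical"
    score: Callable[[str, str], float]  # (question, answer) -> score


def model_graded(name, criterion, judge):
    """Built-in grading prompt; the caller only names the criterion."""
    template = "Rate 0 to 1 how well the answer meets '{c}'.\nQuestion: {q}\nAnswer: {a}"
    return Evaluator(name, "model_graded",
                     lambda q, a: judge(template.format(c=criterion, q=q, a=a)))


def custom_prompt(name, template, judge):
    """Caller supplies the full grading prompt with {q} and {a} placeholders."""
    return Evaluator(name, "custom_prompt", lambda q, a: judge(template.format(q=q, a=a)))


def token_overlap(name):
    """Statistical rule: share of question words that appear in the answer. No model call."""
    def score(q, a):
        answer_words = set(a.lower().split())
        words = q.lower().split()
        return sum(w in answer_words for w in words) / len(words)
    return Evaluator(name, "statistical", score)


def apply_collection(collection, question, answer):
    """Run every evaluator in the collection and return {name: score}."""
    return {ev.name: ev.score(question, answer) for ev in collection}


def stub_judge(prompt):
    return 0.9  # placeholder for an LLM-as-judge call


collection = [
    model_graded("relevancy", "relevancy", stub_judge),
    custom_prompt("tone", "Is this answer polite?\nQ: {q}\nA: {a}", stub_judge),
    token_overlap("overlap"),
]
scores = apply_collection(collection, "where is my order", "your order has shipped")
```

Because a collection is only a list, the same one can be attached to any project or applied everywhere.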

The Platform

SDK, harness, dashboard — one stack

Everything your agent needs from dev to production, without stitching together five tools.

SDK

One init() call instruments your entire agent stack — LLM clients, HTTP, SQL, logging. Python and TypeScript, zero deps.

Auto-patches OpenAI, Anthropic, Gemini, Bedrock
W3C traceparent propagation
Session + device context

Agent Harness

The evaluation core. Run batch jobs, score traces online, compare models pairwise — text, voice, image, and video.

Online + batch evaluation
Voice agent eval via LiveKit
Image / video quality scoring

Dashboard

Trace waterfall, session explorer, eval results, cost breakdown, and model catalog — everything in one place.

Trace waterfall per agent run
Per-session cost analytics
Regression comparison across deploys

Built for developers

One call. Everything traced.

init() patches every LLM client, HTTP layer, and database adapter your agent touches. Traces start flowing in seconds — no manual spans, no config files.

pip install syntropylabs / npm install @syntropylabs/sdk
Auto-instruments OpenAI, Anthropic, Gemini, Bedrock, HTTP, SQL, Redis, Mongoose
Online eval: attach evaluators to a project, every trace gets scored automatically
W3C traceparent — frontend and backend stitched into one trace, no extra work
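For readers unfamiliar with the header behind the last bullet, here is a minimal sketch of the W3C traceparent format: version, 32-hex trace id, 16-hex parent id, and 2-hex flags, joined by dashes. The helper names are hypothetical and the parser is deliberately strict (version 00 layout only). The SDK handles this for you, and its internals may differ.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with random ids."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the header is invalid."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero ids.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Keep the incoming trace id, mint a new parent id for the outgoing call."""
    parent = parse_traceparent(header)
    if parent is None:
        return new_traceparent()
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{parent['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The trace id stays constant across the browser and the backend while each hop gets its own parent id, which is what allows both sides to land in one trace.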

agent_setup.py

import syntropylabs as stl

# user_session_id, device_fingerprint, and prompt come from your application
stl.init(
    subscription_key="sk_live_...",
    service_name="my-agent",
    environment="production",
    session_id=user_session_id,
    device_id=device_fingerprint,
)

# OpenAI, Anthropic, HTTP, SQL: all patched automatically
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
# ↑ captured: model, prompt, completion, tokens, latency
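The page does not show how init() patches clients. A common technique for this kind of auto-instrumentation is to wrap a client method at runtime so every call records a span. The sketch below shows that general idea only: instrument, CAPTURED, and FakeCompletions are hypothetical names, and nothing here is the syntropylabs implementation.

```python
import functools
import time

CAPTURED = []  # stand-in for an exporter that would ship spans to a backend


def instrument(owner, method_name, span_name):
    """Replace owner.method_name with a wrapper that records latency and arguments."""
    original = getattr(owner, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            # Runs on success and on exceptions, so failed calls are traced too.
            CAPTURED.append({
                "span": span_name,
                "latency_ms": (time.perf_counter() - start) * 1000,
                "kwargs": kwargs,
            })

    setattr(owner, method_name, wrapper)


class FakeCompletions:
    """Toy stand-in for an LLM client; it makes no network call."""

    def create(self, model, messages):
        return {"model": model, "content": "ok"}


instrument(FakeCompletions, "create", "llm_call")

reply = FakeCompletions().create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hi"}],
)
```

The calling code is unchanged after patching, which is the property that makes "no manual spans" possible.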

Works with your entire AI stack

OpenAI

Anthropic

Gemini

Bedrock

Cohere

Azure

< 5ms

ingest p99 latency

90d

trace retention

LLM providers

SDK languages

Simple, Transparent Pricing

Scale your evaluation pipeline as your product grows.

Enterprise

Custom

For organizations with custom security and scale needs.

Custom model adapters
On-premise deployment
Dedicated account manager
SLA & security audits
SSO / SAML

Learn how teams ship agents

Guides on agent observability, evaluation techniques, voice and multimodal pipelines, and running evals in production.

Ship agents you can trust

Instrument in minutes. Catch failures before your users do.
