Agent Harness — now in beta

    The harness for agents
    you're actually shipping

    Trace every call, eval every output, catch regressions before your users do. Works with any LLM, any framework, any modality — text, voice, image, video.

    agent_run_#1047.trace
    EVALUATING
    00:00:38
    research_agent · 0.94

    llm_call → tool_call (search) → llm_call → tool_call (summarize) · 4 spans · 1.4s · 3.1k tokens

    voice_eval_job #22 · 0.87

    12/14 rows passed · naturalness 0.91 · instruction-following 0.83 · 2 failures flagged

    Avg Latency · 1.2s
    Token Usage · 4.2k
    Pass Rate · 98.2%

    One trace per agent run

    Every step — tool calls, LLM calls, DB reads — stitched into a single waterfall. No fragmented logs.

    PASSED EVAL
    Execution Waterfall
    Support Agent · 3.4s
    db.get_order · 45ms
    Planner ReAct · 800ms
    stripe.refund · 412ms
    llm.generate · 1.2s

    Tool Call: stripe.refund

    Latency 412ms
    Tokens 1,402
    // Agent Tool Input
    {
      "order_id": "ORD-8429",
      "amount": 49.99,
      "reason": "customer_requested"
    }
    // API Output
    {
      "status": "refund_processed",
      "receipt_url": "https://stripe.com/..."
    }

    Agents break in ways you don't expect

    Multi-step, multi-modal, multi-model — the more capable your agent, the harder it is to know when it's failing.

    Fragmented Traces

    Multi-step agents spawn traces across threads and services. You get 10 disconnected logs instead of one run.

    No Quality Signal

    The agent returns a response. You have no idea if it's right until a user files a complaint.

    Manual QA Doesn't Scale

    Spot-checking outputs by hand works for 10 test cases. It breaks at 1,000.

    Invisible Cost

    Token usage and latency blow up silently across sessions and devices. You find out on the invoice.

    Everything in the harness

    Trace, evaluate, and ship agents across text, voice, image, and video.

    Agent Tracing

    Every LLM call, tool invocation, HTTP hop, and DB query lands as a typed span — one waterfall per run.

    support_agent · 3.4s
    db.get_order · 45ms
    llm.plan · 0.8s
    stripe.refund · 412ms
    llm.reply · 1.2s
    llm_call · tool_call · http_call · db_query
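To make the idea of typed spans in one waterfall concrete, here is a minimal sketch in plain Python. It is not the SDK's data model: the `Span` and `Trace` classes, their field names, and the start offsets are all hypothetical, chosen only to mirror the span names and durations shown in the card above. Because the offsets are invented, the computed total will not match the card's 3.4s.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    kind: str         # one of: llm_call, tool_call, http_call, db_query
    start_ms: int     # offset from the start of the run (hypothetical field)
    duration_ms: int


@dataclass
class Trace:
    run_name: str
    spans: List[Span] = field(default_factory=list)

    def total_ms(self) -> int:
        """Wall-clock extent of the run: latest end minus earliest start."""
        if not self.spans:
            return 0
        first = min(s.start_ms for s in self.spans)
        last = max(s.start_ms + s.duration_ms for s in self.spans)
        return last - first

    def waterfall(self, width: int = 40) -> List[str]:
        """Render one text row per span, shifted right by its start offset."""
        if not self.spans:
            return []
        total = self.total_ms() or 1
        first = min(s.start_ms for s in self.spans)
        rows = []
        for s in sorted(self.spans, key=lambda s: s.start_ms):
            offset = (s.start_ms - first) * width // total
            length = max(1, s.duration_ms * width // total)
            rows.append(f"{s.name:<14}{' ' * offset}{'#' * length} {s.duration_ms}ms")
        return rows


# Durations come from the card above; start offsets assume strictly sequential steps.
trace = Trace("support_agent", [
    Span("db.get_order", "db_query", 0, 45),
    Span("llm.plan", "llm_call", 45, 800),
    Span("stripe.refund", "tool_call", 845, 412),
    Span("llm.reply", "llm_call", 1257, 1200),
])
print("\n".join(trace.waterfall()))
```

Keeping every span in one list with a shared time origin is what lets a single run render as one waterfall instead of separate logs.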

    Zero-Config SDK

    One init() call. All providers patched automatically.

    $ pip install syntropylabs
    $ npm i @syntropylabs/sdk

    Online Evaluation

    Every production trace scored automatically as it arrives. No cron jobs.

    Scoring live traces…

    Voice Agent Eval

    Generate audio, score naturalness + instruction-following, replay rows. Connect live in-browser.

    naturalness 0.91 · instruction 0.87 · 12 / 14 passed

    Batch Evaluation

    Run a dataset through your agent, score per row, regenerate failures individually.

    row_001 · 94%
    row_002 · 72%
    row_003 · 38%
    row_004 · 91%
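The batch loop described above (run a dataset, score each row, regenerate only the failures) can be sketched in a few lines. This is an illustration under stated assumptions, not the harness API: `run_batch`, `failing_rows`, `overlap_score`, the 0.7 threshold, and both toy agents are hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]            # {"input": ..., "expected": ...}
Agent = Callable[[str], str]    # hypothetical: prompt in, answer out
Scorer = Callable[[str, str], float]


def overlap_score(expected: str, output: str) -> float:
    """Toy scorer: fraction of expected words that appear in the output."""
    want = set(expected.lower().split())
    got = set(output.lower().split())
    return len(want & got) / len(want) if want else 0.0


def score_row(row: Row, agent: Agent, scorer: Scorer) -> Dict[str, object]:
    output = agent(row["input"])
    return {"output": output, "score": scorer(row["expected"], output)}


def run_batch(rows: Dict[str, Row], agent: Agent, scorer: Scorer) -> Dict[str, Dict[str, object]]:
    """Score every row of the dataset once."""
    return {row_id: score_row(row, agent, scorer) for row_id, row in rows.items()}


def failing_rows(results: Dict[str, Dict[str, object]], threshold: float) -> List[str]:
    return [row_id for row_id, r in results.items() if r["score"] < threshold]


# Tiny made-up dataset and two canned "agents" so the example is deterministic.
rows = {
    "row_001": {"input": "capital of France?", "expected": "Paris"},
    "row_002": {"input": "2 + 2?", "expected": "4"},
}
first_agent = lambda prompt: {"capital of France?": "Paris"}.get(prompt, "unknown")
fixed_agent = lambda prompt: {"capital of France?": "Paris", "2 + 2?": "4"}.get(prompt, "unknown")

results = run_batch(rows, first_agent, overlap_score)
failures = failing_rows(results, threshold=0.7)

# Regenerate only the rows that failed, leaving passing rows untouched.
for row_id in failures:
    results[row_id] = score_row(rows[row_id], fixed_agent, overlap_score)

remaining = failing_rows(results, threshold=0.7)
```

Re-running individual rows rather than the whole dataset keeps regeneration cost proportional to the number of failures.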

    Image & Video Eval

    Generate, then LLM-as-judge score across visual quality dimensions.

    Visual quality 88%
    Prompt alignment 74%
    Coherence 91%

    Session & Device

    session_id groups a conversation. device_id tracks originating clients APM-style.

    sess_4f2a · $0.031
    sess_8c1b · $0.089
    sess_2d9e · $0.214
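As a rough illustration of how session_id and device_id allow cost to be rolled up, the sketch below groups span records by either key. The record shape, the `cost_usd` field, and the `cost_by` helper are assumptions made for this example, not the platform's schema.

```python
from collections import defaultdict
from typing import Dict, Iterable


def cost_by(spans: Iterable[dict], key: str) -> Dict[str, float]:
    """Sum cost_usd over spans, grouped by the given context key."""
    totals: Dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span[key]] += span["cost_usd"]
    return dict(totals)


# Made-up span records carrying both context identifiers.
spans = [
    {"session_id": "sess_a", "device_id": "dev_1", "cost_usd": 0.010},
    {"session_id": "sess_a", "device_id": "dev_1", "cost_usd": 0.021},
    {"session_id": "sess_b", "device_id": "dev_1", "cost_usd": 0.089},
    {"session_id": "sess_c", "device_id": "dev_2", "cost_usd": 0.005},
]

per_session = cost_by(spans, "session_id")
per_device = cost_by(spans, "device_id")
most_expensive = max(per_session, key=per_session.get)
```

The same records answer both questions: which conversation was expensive, and which client generated the spend.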

    Model Catalog

    Central registry for all providers. Set access grants, test models inline.

    OpenAI
    Anthropic
    Gemini
    Bedrock
    Cohere
    Azure

    Evaluator Collections

    Group LLM-as-judge prompts into reusable collections. Apply to any project or set globally. Three rule types: model-graded, custom prompt, statistical.

    Relevancy · Groundedness · Safety · PII Detection · Coherence · Custom Prompt
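One way to picture a reusable evaluator collection is as a named list of scoring functions applied to the same input and output. The sketch below is hypothetical throughout: the `Evaluator` class, the `judged` helper, and the reading of the three rule types (model-graded as a built-in judge prompt, custom prompt as a user-supplied template, statistical as a rule that needs no model) are assumptions. The judge is a stub that returns a fixed score, standing in for a real LLM call.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List

Judge = Callable[[str], float]  # hypothetical: judge prompt in, score in [0, 1] out


@dataclass
class Evaluator:
    name: str
    rule_type: str                       # "model_graded", "custom_prompt", or "statistical"
    score: Callable[[str, str], float]   # (agent input, agent output) -> score


def judged(template: str, judge: Judge) -> Callable[[str, str], float]:
    """Build a scorer that fills a prompt template and asks the judge."""
    def score(agent_input: str, agent_output: str) -> float:
        return judge(template.format(input=agent_input, output=agent_output))
    return score


def no_email_addresses(_agent_input: str, agent_output: str) -> float:
    """Statistical-style rule: 0.0 if the output contains an email address."""
    return 0.0 if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", agent_output) else 1.0


def run_collection(evaluators: List[Evaluator], agent_input: str, agent_output: str) -> Dict[str, float]:
    return {e.name: e.score(agent_input, agent_output) for e in evaluators}


stub_judge: Judge = lambda prompt: 0.9  # stand-in for a real LLM-as-judge call

collection = [
    Evaluator("relevancy", "model_graded",
              judged("Rate how relevant the answer is.\nQ: {input}\nA: {output}", stub_judge)),
    Evaluator("tone", "custom_prompt",
              judged("Is this reply polite? Reply: {output}", stub_judge)),
    Evaluator("pii_detection", "statistical", no_email_addresses),
]

scores = run_collection(collection, "Where is my refund?", "Email me at a@b.com")
```

Because each evaluator exposes the same call shape, a collection defined once can be pointed at any project's traces.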

    The Platform

    SDK, harness, dashboard — one stack

    Everything your agent needs from dev to production, without stitching together five tools.

    SDK

    One init() call instruments your entire agent stack — LLM clients, HTTP, SQL, logging. Python and TypeScript, zero deps.

    • Auto-patches OpenAI, Anthropic, Gemini, Bedrock
    • W3C traceparent propagation
    • Session + device context
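For readers unfamiliar with W3C traceparent propagation, the header is a single string of the form `version-traceid-parentid-flags`, and a downstream service continues the trace by keeping the trace ID and minting a new parent ID. The sketch below shows that mechanism using only the standard library. It does not show how the SDK implements it; the function names are hypothetical, and it handles only version `00` of the header.

```python
import re
import secrets
from typing import Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace with random trace and span identifiers."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header fields, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero identifiers.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: same trace ID, fresh span ID."""
    parent = parse_traceparent(incoming)
    if parent is None:
        return new_traceparent()  # malformed header: start a new trace
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{parent['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Since the trace ID survives every hop, a browser request and the backend calls it triggers can be joined into one trace.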

    Agent Harness

    The evaluation core. Run batch jobs, score traces online, compare models pairwise — text, voice, image, and video.

    • Online + batch evaluation
    • Voice agent eval via LiveKit
    • Image / video quality scoring

    Dashboard

    Trace waterfall, session explorer, eval results, cost breakdown, and model catalog — everything in one place.

    • Trace waterfall per agent run
    • Per-session cost analytics
    • Regression comparison across deploys

    Built for developers

    One call.
    Everything traced.

    init() patches every LLM client, HTTP layer, and database adapter your agent touches. Traces start flowing in seconds — no manual spans, no config files.

    • pip install syntropylabs / npm install @syntropylabs/sdk
    • Auto-instruments OpenAI, Anthropic, Gemini, Bedrock, HTTP, SQL, Redis, Mongoose
    • Online eval: attach evaluators to a project, every trace gets scored automatically
    • W3C traceparent — frontend and backend stitched into one trace, no extra work

    agent_setup.py
    import syntropylabs as stl

    # user_session_id, device_fingerprint, and prompt come from your application
    stl.init(
        subscription_key="sk_live_...",
        service_name="my-agent",
        environment="production",
        session_id=user_session_id,
        device_id=device_fingerprint,
    )

    # OpenAI, Anthropic, HTTP, SQL: all patched automatically
    from openai import OpenAI
    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # ↑ captured: model, prompt, completion, tokens, latency

    Works with your entire AI stack

    OpenAI
    Anthropic
    Gemini
    Bedrock
    Cohere
    Azure

    < 5ms ingest p99 latency
    90d trace retention
    6+ LLM providers
    2 SDK languages

    Simple, Transparent Pricing

    Scale your evaluation pipeline as your product grows.

    Enterprise

    Custom

    For organizations with custom security and scale needs.

    • Custom model adapters
    • On-premise deployment
    • Dedicated account manager
    • SLA & security audits
    • SSO / SAML

    Learn how teams ship agents

    Guides on agent observability, evaluation techniques, voice and multimodal pipelines, and running evals in production.

    Ship agents
    you can trust

    Instrument in minutes. Catch failures before your users do.