Realtime voice evaluation

The bench for realtime voice agents.

VoxArena is a reproducible evaluation harness that runs identical scripted conversations against Gemini Live, OpenAI Realtime, and your own Pipecat-based providers — and scores them on latency, tool-call accuracy, and hallucinations.

Web control panel

Configure credentials, edit prompts and utterances, kick off side-by-side runs, and browse results — no code.

Headless CLI

File-driven runner for CI. Consumes JSON / YAML scripts, emits JUnit XML, and exits non-zero on regression.

Reproducible by design

Every run pins prompt hash, tool schema hash, model id, and transport — and persists turn-level traces to SQLite.
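
One way to pin those inputs is to hash a canonical serialization of each. The sketch below is illustrative only: the function names, the field names, and the choice of SHA-256 over sorted-key JSON are assumptions, not VoxArena's actual implementation.

```python
import hashlib
import json


def stable_hash(value) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, no extra whitespace)."""
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_fingerprint(prompt: str, tool_schemas: list, model_id: str, transport: str) -> dict:
    """Collect the values a run would pin; the field names here are hypothetical."""
    return {
        "prompt_hash": stable_hash(prompt),
        "tool_schema_hash": stable_hash(tool_schemas),
        "model_id": model_id,
        "transport": transport,
    }


if __name__ == "__main__":
    tools = [{"name": "check_availability", "parameters": {"day": "string", "time": "string"}}]
    print(run_fingerprint("You are the Saffron Leaf host.", tools,
                          "gemini-3.1-flash-live-preview", "direct"))
```

Because keys are sorted before hashing, two tool schemas that differ only in key order produce the same hash, so a cosmetic reordering does not register as a changed run.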

How it works

Both modes share the same evaluation core. A scripted utterance is injected as raw PCM audio into a Pipecat pipeline; the provider's realtime LLM streams back text and audio frames; observers along the pipeline timestamp every event and score the turn against the script's expect block.

Pipeline architecture

The output of any run — text transcript, captured audio, per-turn timings, tool-call validation — is mirrored to both a human-readable manifest.json on disk and a SQLite database for the UI and the CLI to query.
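
The dual write can be pictured as follows. This is a minimal sketch under stated assumptions: the `turns` table, its columns, and the shape of the `run` dictionary are invented for illustration and are not VoxArena's real schema.

```python
import json
import sqlite3
from pathlib import Path


def persist_run(run: dict, out_dir: Path, db_path: Path) -> None:
    """Write one run to manifest.json and mirror its turns into SQLite."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "manifest.json").write_text(json.dumps(run, indent=2), encoding="utf-8")

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS turns ("
                "run_id TEXT, utterance_id TEXT, ttfa_ms REAL, transcript TEXT, "
                "PRIMARY KEY (run_id, utterance_id))"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO turns VALUES (?, ?, ?, ?)",
                [(run["run_id"], t["id"], t["ttfa_ms"], t["transcript"]) for t in run["turns"]],
            )
    finally:
        conn.close()
```

The composite primary key makes the write idempotent: persisting the same run twice replaces rows instead of duplicating them.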

Quick start

Install from PyPI. The wheel ships the Python runtime and the compiled React control panel.

        
shell
pip install voxarena

Launch the control panel in any working directory. On first run it bootstraps a sample restaurant-reservation script (Saffron Leaf) plus pre-recorded audio.

        
shell
voxarena ui                # opens http://127.0.0.1:8000

Or run a single headless evaluation that exits non-zero when a threshold is breached:

        
shell
voxarena run --provider gemini \
  --script ./script/utterances.json \
  --min-tool-accuracy 0.95 \
  --max-avg-ttfa-ms 1800

API keys. Set GOOGLE_API_KEY and / or OPENAI_API_KEY in your environment, in a local .env file, or via the Settings page of the control panel. The UI persists them to SQLite so a packaged install needs no source checkout.

Web control panel

The control panel is a self-contained React app served from the same FastAPI process as the evaluation runtime. It's the right mode for exploratory work: tuning prompts, comparing providers head-to-head, and inspecting failed turns.

Control panel flow

What the run report shows

Each completed run renders a card like this for quick at-a-glance comparison:

gemini · gemini-3.1-flash-live-preview (completed)
Avg TTFA: 412ms
Tool accuracy: 19 / 20
Hallucinations: 0
Interruption: 284ms

openai · gpt-realtime-2 (1 turn failed)
Avg TTFA: 687ms
Tool accuracy: 18 / 20
Hallucinations: 1
Interruption: 198ms

Realistic workflow

Paste your API keys into the Settings page — they persist to SQLite.
Open the Utterances editor and write 5–20 turns with expect blocks.
Pick a transport (direct injection for speed, WebRTC for production realism), then hit Run comparison.
Watch transcripts stream in for both providers, then drill into any failed turn to see the captured audio and the exact tool-call payload.

Headless CLI

The CLI is the right mode for CI and nightly regression. It is fully file-driven, never prompts for input, and exits non-zero the moment a threshold is breached. The same evaluation core powers both modes — runs from CI show up in the control panel verbatim.
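
The pass / fail gate can be sketched as a comparison of aggregated metrics against the thresholds passed on the command line. The code below is an illustration, not the CLI's source: the function names are hypothetical, and the metric keys are borrowed from the voxarena-result.json example in this document.

```python
def evaluate_thresholds(metrics: dict, min_tool_accuracy: float,
                        max_avg_ttfa_ms: float, max_hallucinations: int) -> list:
    """Return one result row per threshold, shaped like the thresholds array in the result file."""
    checks = [
        ("min_tool_accuracy", min_tool_accuracy,
         metrics["tool_call_accuracy_rate"], lambda actual, required: actual >= required),
        ("max_avg_ttfa_ms", max_avg_ttfa_ms,
         metrics["average_ttfa_ms"], lambda actual, required: actual <= required),
        ("max_hallucinations", max_hallucinations,
         metrics["hallucination_count"], lambda actual, required: actual <= required),
    ]
    return [
        {"name": name, "required": required, "actual": actual, "passed": ok(actual, required)}
        for name, required, actual, ok in checks
    ]


def exit_code(results: list) -> int:
    """0 when every threshold passed, 1 otherwise (the exact non-zero value is an assumption)."""
    return 0 if all(r["passed"] for r in results) else 1


if __name__ == "__main__":
    metrics = {"tool_call_accuracy_rate": 0.90, "average_ttfa_ms": 687.4, "hallucination_count": 1}
    results = evaluate_thresholds(metrics, 0.9, 1800, 0)
    print(results, exit_code(results))
```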

CI integration flow

Compare two providers in CI

        
.github/workflows/voice-regression.yml
name: voice-regression
on: [pull_request]

jobs:
  bench:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }

      - run: pip install voxarena

      - name: Compare Gemini vs OpenAI
        env:
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          voxarena compare \
            --providers gemini,openai \
            --script ./script/utterances.json \
            --min-tool-accuracy 0.9 \
            --max-avg-ttfa-ms 1800 \
            --max-hallucinations 0 \
            --output voxarena-result.json \
            --junit voxarena-junit.xml

      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: voxarena-report
          path: |
            voxarena-result.json
            voxarena-junit.xml

What you get back

        
voxarena-result.json
{
  "passed": false,
  "runs": [
    {
      "run_id": "run_1718294412_8f3a1c92",
      "provider": "gemini",
      "model": "gemini-3.1-flash-live-preview",
      "status": "completed",
      "metrics": {
        "total_turns": 20,
        "average_ttfa_ms": 412.3,
        "average_interruption_stop_latency_ms": 284.1,
        "tool_call_accuracy_rate": 0.95,
        "hallucination_count": 0
      },
      "thresholds": [
        { "name": "min_tool_accuracy", "required": 0.9, "actual": 0.95, "passed": true },
        { "name": "max_avg_ttfa_ms",   "required": 1800, "actual": 412.3, "passed": true },
        { "name": "max_hallucinations", "required": 0, "actual": 0, "passed": true }
      ],
      "passed": true
    },
    {
      "run_id": "run_1718294412_8f3a1c92_openai",
      "provider": "openai",
      "model": "gpt-realtime-2",
      "status": "completed",
      "metrics": {
        "total_turns": 20,
        "average_ttfa_ms": 687.4,
        "tool_call_accuracy_rate": 0.90,
        "hallucination_count": 1
      },
      "thresholds": [
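        { "name": "min_tool_accuracy", "required": 0.9, "actual": 0.90, "passed": true },
        { "name": "max_avg_ttfa_ms",   "required": 1800, "actual": 687.4, "passed": true },
        { "name": "max_hallucinations", "required": 0, "actual": 1, "passed": false }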
      ],
      "passed": false
    }
  ]
}
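
A downstream job can read this file to report which provider broke which threshold. The helper below is a small sketch that relies only on the fields shown in the example above; the function name is made up for illustration.

```python
import json


def failed_thresholds(result: dict) -> list:
    """List (provider, threshold name, required, actual) for every failed check."""
    failures = []
    for run in result.get("runs", []):
        for check in run.get("thresholds", []):
            if not check["passed"]:
                failures.append((run["provider"], check["name"],
                                 check["required"], check["actual"]))
    return failures


if __name__ == "__main__":
    # Inline sample mirroring the structure above; in CI you would json.load the output file.
    sample = json.loads(
        '{"passed": false, "runs": [{"provider": "openai", "thresholds": '
        '[{"name": "max_hallucinations", "required": 0, "actual": 1, "passed": false}]}]}'
    )
    for provider, name, required, actual in failed_thresholds(sample):
        print(f"{provider}: {name} required {required}, got {actual}")
```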

        
voxarena-junit.xml
<?xml version="1.0" encoding="utf-8"?>
<testsuites>
  <testsuite name="voxarena.gemini" tests="3" failures="0">
    <testcase name="min_tool_accuracy"  classname="voxarena.gemini" />
    <testcase name="max_avg_ttfa_ms"    classname="voxarena.gemini" />
    <testcase name="max_hallucinations" classname="voxarena.gemini" />
  </testsuite>
  <testsuite name="voxarena.openai" tests="3" failures="1">
    <testcase name="min_tool_accuracy"  classname="voxarena.openai" />
    <testcase name="max_avg_ttfa_ms"    classname="voxarena.openai" />
    <testcase name="max_hallucinations" classname="voxarena.openai">
      <failure message="required 0, got 1" />
    </testcase>
  </testsuite>
</testsuites>
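
The JUnit file maps one testsuite to each provider and one testcase to each threshold. The following sketch shows how such a mapping could be produced with the standard library; it mirrors the layout of the example above but is not the CLI's actual writer, and `to_junit` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def to_junit(runs: list) -> str:
    """Render per-provider threshold results as a JUnit-style XML string."""
    suites = ET.Element("testsuites")
    for run in runs:
        suite_name = f"voxarena.{run['provider']}"
        checks = run["thresholds"]
        failures = sum(1 for c in checks if not c["passed"])
        suite = ET.SubElement(suites, "testsuite", name=suite_name,
                              tests=str(len(checks)), failures=str(failures))
        for c in checks:
            case = ET.SubElement(suite, "testcase", name=c["name"], classname=suite_name)
            if not c["passed"]:
                # A failed threshold becomes a <failure> child, which CI dashboards flag.
                ET.SubElement(case, "failure",
                              message=f"required {c['required']}, got {c['actual']}")
    return ET.tostring(suites, encoding="unicode")
```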

Script format

A script is an ordered list of utterances. Each utterance has a stable id (used to look up the matching {id}.wav audio file), the spoken text, and an optional expect block that defines pass / fail for that turn. Both JSON and YAML are accepted.

        
script/utterances.json
[
  {
    "id": "u01",
    "text": "Hi, is this the Saffron Leaf restaurant?",
    "expect": {
      "response_contains": ["Saffron Leaf"]
    }
  },
  {
    "id": "u02",
    "text": "Are you open this Friday at 8pm?",
    "expect": {
      "tool": "check_availability",
      "args": { "day": "Friday", "time": "20:00" }
    }
  },
  {
    "id": "u03",
    "text": "Book a table for four on Friday at 8pm under Keyur.",
    "expect": {
      "tool": "book_table",
      "args": { "day": "Friday", "time": "20:00", "guests": 4, "name": "Keyur" },
      "response_contains": ["confirmed", "Friday"]
    }
  }
]

        
script/utterances.yaml
- id: u01
  text: "Hi, is this the Saffron Leaf restaurant?"
  expect:
    response_contains: ["Saffron Leaf"]

- id: u02
  text: "Are you open this Friday at 8pm?"
  expect:
    tool: check_availability
    args:
      day: Friday
      time: "20:00"

- id: u03
  text: "Book a table for four on Friday at 8pm under Keyur."
  expect:
    tool: book_table
    args:
      day: Friday
      time: "20:00"
      guests: 4
      name: Keyur
    response_contains: ["confirmed", "Friday"]
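
To make the format concrete, here is a sketch of a loader that accepts either file type and checks the basic shape of each utterance. The validation rules (required `id` and `text`, unique ids) and the audio directory layout are assumptions for illustration; YAML parsing assumes the third-party PyYAML package is installed.

```python
import json
from pathlib import Path


def load_script(path: Path) -> list:
    """Load a JSON or YAML script and check the basic shape of each utterance."""
    text = path.read_text(encoding="utf-8")
    if path.suffix in (".yaml", ".yml"):
        import yaml  # PyYAML, third-party; only needed for YAML scripts
        utterances = yaml.safe_load(text)
    else:
        utterances = json.loads(text)

    seen = set()
    for utterance in utterances:
        if "id" not in utterance or "text" not in utterance:
            raise ValueError(f"utterance missing id or text: {utterance!r}")
        if utterance["id"] in seen:
            raise ValueError(f"duplicate utterance id: {utterance['id']}")
        seen.add(utterance["id"])
    return utterances


def audio_path(audio_dir: Path, utterance: dict) -> Path:
    """Resolve the {id}.wav file for an utterance; the directory is an assumption."""
    return audio_dir / f"{utterance['id']}.wav"
```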

Expectation keys

tool — the function the agent must call. A different (or no) tool fails the turn; null means no tool is allowed and any call counts as a hallucination.
args — argument key/value pairs. Compared with case-insensitive, type-coercing matching so "4" matches 4 and "friday" matches "Friday".
response_contains — substrings that must appear in the transcript (case-insensitive). Useful for sanity-checking that the agent named the restaurant, confirmed the booking, etc.
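
The three keys can be read as a small scoring function. The sketch below follows the rules described above, but the function names, the shape of `tool_call`, and the decision to ignore extra arguments the agent supplies are assumptions rather than VoxArena's actual matcher.

```python
def _norm(value) -> str:
    """Coerce to a lowercase string so "4" matches 4 and "friday" matches "Friday"."""
    if isinstance(value, bool):
        return str(value).lower()
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip().lower()


def args_match(expected: dict, actual: dict) -> bool:
    """Every expected key must be present with an equivalent value; extras are ignored."""
    return all(k in actual and _norm(actual[k]) == _norm(v) for k, v in expected.items())


def score_turn(expect: dict, tool_call, transcript: str) -> dict:
    """tool_call is None or a dict like {"name": ..., "args": {...}} (hypothetical shape)."""
    result = {"passed": True, "hallucination": False}
    if "tool" in expect:
        wanted = expect["tool"]
        if wanted is None:
            # No tool is allowed on this turn, so any call counts as a hallucination.
            if tool_call is not None:
                result.update(passed=False, hallucination=True)
        elif tool_call is None or tool_call["name"] != wanted:
            result["passed"] = False
        elif not args_match(expect.get("args", {}), tool_call.get("args", {})):
            result["passed"] = False
    for needle in expect.get("response_contains", []):
        if needle.lower() not in transcript.lower():
            result["passed"] = False
    return result
```

For example, an expected `{"guests": 4}` is satisfied by a call carrying `{"guests": "4"}`, while a call made on a turn whose `tool` is null both fails the turn and is flagged as a hallucination.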

Metrics

Each metric below is captured per turn and aggregated per run. All four can be wired into CI as a threshold via the corresponding --max-* / --min-* flag.

Time to first audio

Milliseconds from the end of the user's audio injection to the first audio frame emitted by the provider. A sub-second TTFA usually feels instant to the caller.
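
As a sketch of the calculation, assuming observers record timestamps in seconds on a shared clock (the function names and the choice to skip turns with no audio when averaging are assumptions):

```python
from typing import Optional


def ttfa_ms(injection_end_s: float, audio_frame_times_s: list) -> Optional[float]:
    """Milliseconds from the end of user audio to the first provider audio frame after it."""
    later = [t for t in audio_frame_times_s if t >= injection_end_s]
    if not later:
        return None  # the provider never produced audio for this turn
    return (min(later) - injection_end_s) * 1000.0


def average_ttfa_ms(values: list) -> Optional[float]:
    """Mean over turns that produced audio; turns with no audio are skipped."""
    measured = [v for v in values if v is not None]
    return sum(measured) / len(measured) if measured else None
```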

Interruption stop latency

How quickly the agent stops speaking after the user starts talking over its response, in milliseconds. Critical for natural turn-taking.
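
One plausible way to compute this, assuming the agent counts as stopped at its last audio frame emitted after the user began speaking (both the definition and the function name are assumptions):

```python
from typing import Optional


def interruption_stop_latency_ms(user_speech_start_s: float,
                                 agent_audio_frame_times_s: list) -> Optional[float]:
    """Milliseconds from user barge-in to the last agent audio frame emitted after it."""
    trailing = [t for t in agent_audio_frame_times_s if t >= user_speech_start_s]
    if not trailing:
        return None  # no agent audio after the user started, so nothing to measure
    return (max(trailing) - user_speech_start_s) * 1000.0
```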

Tool-call accuracy

Fraction of turns where the agent called the expected tool with arguments that match the script's expect.args block.

Hallucinations

Count of tool calls the agent made when none was expected, plus optional LLM-graded fact violations against a known-good knowledge base.
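
Per-turn results roll up into the run-level numbers roughly as follows. This sketch assumes each turn carries three hypothetical flags, takes the accuracy denominator to be the turns whose expect block names a tool, and leaves out the optional LLM-graded fact checks.

```python
def aggregate(turns: list) -> dict:
    """turns: dicts with 'expects_tool', 'tool_ok' and 'hallucination' flags (hypothetical)."""
    tool_turns = [t for t in turns if t["expects_tool"]]
    correct = sum(1 for t in tool_turns if t["tool_ok"])
    return {
        "total_turns": len(turns),
        # With no tool-expecting turns there is nothing to get wrong, so report 1.0.
        "tool_call_accuracy_rate": correct / len(tool_turns) if tool_turns else 1.0,
        "hallucination_count": sum(1 for t in turns if t["hallucination"]),
    }
```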

CLI command builder

The builder assembles a voxarena invocation you can paste into a shell or a CI job from these settings:

Mode
Provider
Transport
Script path
Min tool accuracy (0.90)
Max avg TTFA (1800 ms)
Max hallucinations
JUnit output path (optional)
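
The same assembly can be done in a few lines of Python. This sketch uses only flags that appear elsewhere in this document. It assumes `run` accepts the same threshold and JUnit flags as `compare`, and it omits the transport setting because no transport flag is shown here.

```python
import shlex
from typing import Optional


def build_command(providers: list, script: str, min_tool_accuracy: float = 0.90,
                  max_avg_ttfa_ms: int = 1800, max_hallucinations: Optional[int] = None,
                  junit: Optional[str] = None) -> str:
    """Assemble a shell-safe voxarena command line from the builder's settings."""
    if len(providers) == 1:
        argv = ["voxarena", "run", "--provider", providers[0]]
    else:
        argv = ["voxarena", "compare", "--providers", ",".join(providers)]
    argv += ["--script", script,
             "--min-tool-accuracy", str(min_tool_accuracy),
             "--max-avg-ttfa-ms", str(max_avg_ttfa_ms)]
    if max_hallucinations is not None:
        argv += ["--max-hallucinations", str(max_hallucinations)]
    if junit:
        argv += ["--junit", junit]
    return shlex.join(argv)  # quotes any argument that needs it


if __name__ == "__main__":
    print(build_command(["gemini", "openai"], "./script/utterances.json",
                        max_hallucinations=0, junit="voxarena-junit.xml"))
```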

Custom adapter

VoxArena adapters are thin wrappers around Pipecat services. To register a new provider — say a local LLM or a private endpoint — implement BaseProviderAdapter, return a configured Pipecat service, and register the class under a name.

        
voxarena/providers/my_provider.py
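from voxarena.providers.base import BaseProviderAdapter
from voxarena.providers import register_adapter

# MyPipecatLLMService stands in for your own Pipecat LLM service class;
# import it from wherever you define it.


class MyProviderAdapter(BaseProviderAdapter):
    def get_llm_service(self):
        # Return a configured Pipecat service for the evaluation pipeline.
        return MyPipecatLLMService(
            api_key=self.api_key,
            model=self.config.model,
            tools=self.agent.tool_schemas,
        )


# Make the adapter selectable by name; the key comes from MY_PROVIDER_API_KEY.
register_adapter("my-provider", MyProviderAdapter, api_key_env="MY_PROVIDER_API_KEY")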

Once registered, your provider is selectable everywhere — control panel dropdown, CLI --provider my-provider, and compare --providers gemini,openai,my-provider.
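
Name-based selection of this kind is usually backed by a simple registry. The sketch below is a generic stand-in for the pattern, not VoxArena's code: `register`, `create`, and the constructor signature are hypothetical and only illustrate how a name and an environment variable could resolve to an adapter instance.

```python
import os

_ADAPTERS = {}  # provider name -> (adapter class, env var holding its API key)


def register(name: str, adapter_cls: type, api_key_env: str) -> None:
    """Record an adapter class under a provider name."""
    if name in _ADAPTERS:
        raise ValueError(f"adapter already registered: {name}")
    _ADAPTERS[name] = (adapter_cls, api_key_env)


def create(name: str, **kwargs):
    """Instantiate the adapter registered under name, reading its key from the environment."""
    try:
        adapter_cls, api_key_env = _ADAPTERS[name]
    except KeyError:
        raise ValueError(f"unknown provider {name!r}; known: {sorted(_ADAPTERS)}") from None
    return adapter_cls(api_key=os.environ.get(api_key_env), **kwargs)
```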

Read the source. The two reference adapters in voxarena/providers/gemini.py and voxarena/providers/openai.py are each ~40 lines — they are the best blueprint for a third.