Voxli Voxli

Comparing Agents

Run one agent across the scenarios and personalities you care about and get a pass-rate verdict for each slice of your coverage. Use it to check that an agent is ready to ship, or as a coverage audit across different customer segments.

Comparing Agent Performance

Creating a comparison

A comparison checks one agent against a structured slice of your coverage. You organize the work into groups, where each group picks a set of scenarios and one or more personalities to run them with. Every scenario in a group is run against every personality in the group. Groups let you tag different slices of the work (for example “New customers” vs “Existing customers”) and see a pass-rate verdict for each.

  1. Go to the Compare page in the left sidebar and click New comparison.
  2. Pick the agent being evaluated.
  3. Add one or more groups. For each group, give it an optional name and pick the scenarios and personalities it covers.
  4. Optionally set Repeat each test (1-10) to surface flaky behavior.
  5. Click Create comparison.
The New comparison dialog showing the agent, groups, scenarios, and personalities

Voxli runs every scenario in each group against every configuration in that group and generates one test result per pair.

Reading the results

The results page opens with an insights row at the top showing the average score, how many tests have passed, and the number of blocker failures and warnings. Below that, each group is rendered as its own matrix of scenarios and personalities.

  • Green pass rates (90% or above) indicate the group is ready to ship.
  • Yellow (70-89%) indicates review-worthy issues.
  • Red (below 70%, or any blocker failure) indicates a critical problem in that group.

Group names can be edited inline by clicking the title. Click a scenario row to expand it into its individual tests. Click any cell to open the full test result with the conversation transcript and assertion outcomes.

Comparison results showing per-group matrices of scenarios and personalities

Choosing which metrics to show

Open the Settings (gear) menu above the matrix to control what each row reports. Under Metrics, pick one or more measurements - each one you select becomes its own column:

  • Score - the weighted pass rate per test. Always available.
  • Hallucinations - the number of unsupported or contradicted claims, when hallucination detection is on.
  • Custom metrics - anything you have defined under Settings > Metrics, such as response time, token usage, or cost. Each metric and each of its aggregates (for example, an average and a p95) is offered separately.

The selector lists every metric you have configured, so you can read the comparison along one dimension at a time. To add more measurements, define them first - see Custom Metrics. The same Settings menu also switches whether personalities are laid out as columns or rows.

Your choices - the selected metrics, expanded scenarios, and layout - are stored in the page URL. That makes a configured view shareable, and it is restored when you come back to the report.

Repetitions and flaky cells

When a comparison repeats each test more than once, the matrix stays calm where the repetitions agree and only adds detail where they disagree.

A repeated cell shows the mean as its headline value. If its repetitions disagree on whether the test passed, the cell is flaky - click it to open a breakdown listing each repetition with its score, pass or blocker status, and a link to that result. A test counts as passed in the insights row only when every one of its repetitions passed.

To see every repetition as its own row, open the Settings menu and turn on Show repetitions. Each test then expands into one row per repetition, each linking to its result. This option becomes available once a test has run more than once.

Comparing two reports

Once you have more than one finished comparison, you can lay two of them side by side to answer “did readiness improve since last time?” or “how does agent X stack up against agent Y on the same scenarios?”.

  1. On a report, click Compare to (next to Re-run).
  2. Pick another completed report. The picker only lists reports with the same configuration - the same scenarios and personalities - since the agent is the thing being compared. Each option leads with the report name, with its agent and run time beneath. Search by agent or report name to narrow the list.

The view becomes a side-by-side diff. The report you picked is the reference (A) and the current report is the subject (B). Each metric becomes a three-column group: the A value, the B value, and a colored B - A delta - green where B improved, red where it got worse, and neutral for a change within the metric’s margin. Personalities are shown as rows in this mode. Anything that exists in only one report is kept and badged New or Removed.

Two comparison reports side by side with A, B, and delta columns per metric

Clear the comparison with the ✕ next to the picker to return to the single-report view.

Re-running comparisons

After making changes to an agent (or the underlying scenarios) you can re-run an existing comparison to get fresh results. Re-run opens the New comparison dialog pre-filled with the same configuration, so you can adjust it if needed and track progress over time.

What’s next

Running Tests

Custom Metrics