Comparing Agents
A/B test two agent configurations to see which one performs better across your scenarios.
What Compare does
Compare runs the same scenarios against two different agent configurations and shows side-by-side results. Use it to evaluate prompt changes, model upgrades, or any other configuration tweak, so you can ship with confidence instead of guessing which version performs better.
Setting up a comparison
- Go to the Compare page in the left sidebar.
- Fill in the form:
  - Agent A - your baseline or current agent configuration.
  - Agent B - the new or experimental configuration.
  - Personality - optionally select a personality to use for both sides.
  - Scenarios - select one or more scenarios to include. Leave empty to run all scenarios.
- Click Run comparison.

Voxli runs each selected scenario against both agents and generates results.
Reading comparison results
Results are organized per scenario. Each scenario shows a table with:
- Test - each test in the scenario.
- Score A - how Agent A scored on this test.
- Score B - how Agent B scored on this test.
- Difference - the gap between the two scores (Score B minus Score A). A positive difference means Agent B did better; a negative difference means Agent A did better (see the sketch below).

Click on any score to open the full test result with the conversation transcript and assertion outcomes.
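If it helps to see the arithmetic behind the Difference column, here is a minimal, illustrative sketch. The field names and sample scores are made up for the example and are not Voxli's actual data model.

```python
# Illustrative only: field names and scores are made up, not Voxli's schema.
results = [
    {"test": "greets the caller", "score_a": 0.80, "score_b": 0.95},
    {"test": "handles interruption", "score_a": 0.70, "score_b": 0.65},
]

for row in results:
    diff = row["score_b"] - row["score_a"]  # positive -> Agent B did better
    verdict = "Agent B better" if diff > 0 else "Agent A better" if diff < 0 else "tie"
    print(f"{row['test']}: A={row['score_a']:.2f}  B={row['score_b']:.2f}  "
          f"diff={diff:+.2f}  ({verdict})")
```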
Additional metrics
Click Settings above the results table to toggle additional metric columns:
- Response time - average agent response time per test.
- Token usage - total tokens consumed per test.
- Cost - estimated cost per test.
Each metric shows the value for Agent A, Agent B, and the difference. Lower values are highlighted as improvements for these metrics.
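Note that the improvement direction flips for these columns: a higher score is better, but lower response time, token usage, and cost are better. The sketch below simply restates that rule; the metric names mirror the column labels above and are not a Voxli API.

```python
# Illustrative rule of thumb; metric names mirror the column labels, not a Voxli API.
LOWER_IS_BETTER = {"response time", "token usage", "cost"}

def b_is_improvement(metric: str, value_a: float, value_b: float) -> bool:
    """True if Agent B's value counts as an improvement over Agent A's."""
    if metric in LOWER_IS_BETTER:
        return value_b < value_a  # faster, fewer tokens, or cheaper
    return value_b > value_a      # e.g. test score: higher is better

print(b_is_improvement("cost", value_a=0.012, value_b=0.009))   # True: B is cheaper
print(b_is_improvement("score", value_a=0.80, value_b=0.75))    # False: B scored lower
```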
Re-running comparisons
After making changes to either agent, you can re-run an existing comparison to get fresh results. This creates a new comparison with the same configuration so you can track progress over time.
What’s next
- Agents - learn about the different agent types and how to connect them.