How Metrics Are Summarized

Understand how the summary and average numbers in a report are calculated from the test results beneath them.

Two directions of rollup

A report rolls many individual results into single numbers in two directions.

Across a row - the summary column on the right of a scenario combines the configurations or personalities in that row into one number.
Down a column - the bottom average row, and any scenario rows, combine the tests and scenarios beneath them into one number.

In every case, the summary is computed from the underlying test results, not by averaging the cells you see on screen. This matters because a summary can legitimately differ from the simple average of the cells it summarizes.

Score

The summary score is the average of the scores of all test results in scope. Every result counts equally. A completed result keeps the score it earned, a failed result counts as zero, and results that are still pending, running, or canceled are left out.
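
As a rough illustration of that rule, the sketch below averages a handful of results. The status names, the tuple layout, and the 0 to 1 score scale are assumptions made for the example, not Voxli's actual data model.

```python
from statistics import mean

# Hypothetical status labels; the real product may name them differently.
EXCLUDED = {"pending", "running", "canceled"}

def summary_score(results):
    """Average the scores of (status, score) pairs.

    Failed results count as 0; pending, running, and canceled results are skipped.
    Returns None when no result qualifies.
    """
    scores = []
    for status, score in results:
        if status in EXCLUDED:
            continue  # left out of the average entirely
        scores.append(0.0 if status == "failed" else score)
    return mean(scores) if scores else None

# Assumed example data on a 0 to 1 scale.
results = [
    ("completed", 0.9),
    ("completed", 0.7),
    ("failed", 0.4),     # counts as zero regardless of any partial score
    ("running", None),   # ignored
]
print(summary_score(results))  # roughly 0.53, the mean of 0.9, 0.7, and 0.0
```

Note that the failed result still adds to the divisor, while the running result does not.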

Each result’s own score comes from its weighted assertion outcome. See Assertions for how that weighting works.

Hallucinations

The summary for hallucinations is the total count across all conversations in view. Each unsupported or contradicted claim adds to the running total, so the summary is a sum rather than an average.
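
A small sketch of the difference, using made-up per-conversation counts:

```python
# Hypothetical counts of unsupported or contradicted claims, one per conversation.
hallucinations_per_conversation = [0, 2, 1, 0, 3]

# The summary described above is a plain sum over every conversation in view.
summary_total = sum(hallucinations_per_conversation)

# Shown only for contrast: an average would report a much smaller number.
average_per_conversation = summary_total / len(hallucinations_per_conversation)

print(summary_total)             # 6
print(average_per_conversation)  # 1.2
```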

Time, tokens, cost, and custom metrics

These metrics come from the messages and conversations inside each result, and how they are combined depends on the metric.

Averages, minimums, maximums, and percentiles (like a 95th percentile response time) are calculated over every message across all the conversations in view. So a summary “max” is the single slowest response found anywhere in scope, not the average of the per-column maxima. A summary “average” is the mean over every message, so it leans toward conversations that produced more messages.
Totals and counts (such as tokens used or estimated cost) are taken per conversation first and then averaged across conversations. This keeps the numbers comparable when different configurations ran different numbers of tests.
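
The two approaches can be sketched as follows. The data layout (a list of conversations, each a list of per-message response times and token counts) is an assumption for illustration only.

```python
from statistics import mean

# Hypothetical data: each conversation is a list of (response_seconds, tokens) per message.
conversations = [
    [(1.0, 100), (2.0, 150)],
    [(4.0, 300)],
    [(1.5, 120), (2.5, 130), (3.5, 200)],
]

# Averages, minimums, and maximums: pool every message in scope, then aggregate once.
all_times = [seconds for conv in conversations for seconds, _ in conv]
pooled_avg = mean(all_times)   # the three-message conversation contributes three values
pooled_max = max(all_times)    # single slowest response anywhere in scope

# Totals and counts: sum within each conversation first, then average across conversations.
tokens_per_conversation = [sum(tokens for _, tokens in conv) for conv in conversations]
avg_total_tokens = mean(tokens_per_conversation)

print(pooled_max)               # 4.0
print(tokens_per_conversation)  # [250, 300, 450]
print(round(pooled_avg, 2), round(avg_total_tokens, 1))
```

Because the token total is averaged per conversation, adding more conversations does not inflate it the way a grand total would.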

Why a summary can differ from the cells

Because summaries are computed from the underlying results, a summary number will not always match the average of the cells it summarizes. That is expected.

Imagine that two personality columns each report a slowest response time: 17.4 seconds and 17.0 seconds. The summary slowest is 17.4 seconds - the genuine maximum found anywhere in scope - not 17.2 seconds, which is just the average of the two cells. Averaging two maxima has no real meaning, so Voxli does not do it.
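
The same example as a short sketch. Only the 17.4 and 17.0 values come from the text; the other response times are invented to fill out the two columns.

```python
from statistics import mean

# Hypothetical response times (seconds) behind two personality columns.
column_a = [3.2, 17.4, 5.1]
column_b = [17.0, 4.8]

cell_maxima = [max(column_a), max(column_b)]  # what the two cells display
summary_max = max(column_a + column_b)        # maximum over every message in scope
average_of_cells = mean(cell_maxima)          # about 17.2; NOT how the summary is computed

print(cell_maxima)  # [17.4, 17.0]
print(summary_max)  # 17.4
```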

The same logic applies to averages. A summary average gives every message or result equal weight, so a row where one configuration ran far more tests will pull the summary toward that configuration’s numbers.
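
To make the pull concrete, here is a sketch with two hypothetical configurations that ran different numbers of tests. The scores and counts are invented, and a 0 to 1 score scale is assumed.

```python
from statistics import mean

# Hypothetical scores: configuration A ran eight tests, configuration B ran two.
scores_a = [0.9] * 8
scores_b = [0.4] * 2

cell_a = mean(scores_a)                       # cell shown for A, about 0.9
cell_b = mean(scores_b)                       # cell shown for B, about 0.4
midpoint_of_cells = (cell_a + cell_b) / 2     # about 0.65
summary_average = mean(scores_a + scores_b)   # about 0.80, pulled toward A

print(round(midpoint_of_cells, 2), round(summary_average, 2))
```

With equal test counts the two numbers would coincide; the gap appears only because A contributes four times as many results as B.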

Repetitions and uneven test counts

Every result and every message is weighted equally, so anything that changes how many results or messages a report contains will shift these numbers. Running repetitions, mixing personalities, or comparing configurations that covered different numbers of tests all change the mix. This is why a summary can move between two runs of the same scenario even when each individual result looks similar. See Advanced Run Options for how repetitions and multiple personalities multiply the number of results.

What’s next

Comparing Agents - read a side-by-side comparison and its summary numbers.
Advanced Run Options - how repetitions and personalities change the number of results.
Assertions - the weighting behind each test result’s score.