Running Tests
This guide shows you how to run a complete test suite against your AI agent using Voxli’s REST API.
Complete Example
Here’s a full Python script that executes all tests in a scenario:
"""Run Voxli tests against your chatbot or AI agent.1. Create a test run2. Get all tests for the scenario3. Simulate each test"""import osimport timeimport requestsdef poll_next_message(endpoint: str, headers: dict, timeout: int = 30) -> dict | None:"""Poll next-message until the tester is ready or the chat ends.Returns None when the chat is over, otherwise a dict with either a`message` (free text) or an `action` (an ActionInvocation) field."""start_time = time.time()while True:response = requests.post(endpoint, headers=headers)response.raise_for_status()data = response.json()if data["ready"]:if data.get("end_chat"):return Nonereturn {"message": data.get("message"), "action": data.get("action")}if time.time() - start_time > timeout:raise TimeoutError("Timed out waiting for message")time.sleep(1)# --- Configuration ---api_key = os.getenv("VOXLI_API_KEY")base_url = os.getenv("VOXLI_API_URL", "https://api.voxli.io")scenario_id = os.getenv("VOXLI_SCENARIO_ID")agent_id = os.getenv("VOXLI_AGENT_ID")headers = {"Authorization": f"Bearer {api_key}"}# 1. Create a test runrun = requests.post(f"{base_url}/runs/", headers=headers, json={"scenario": scenario_id,"agent": agent_id,"status": "running"}).json()run_id = run["id"]# 2. Get all tests for this scenariotests = requests.get(f"{base_url}/scenarios/{scenario_id}/tests", headers=headers).json()["data"]# 3. Simulate each testfor test in tests:# 3a. Create a test result entryresult = requests.post(f"{base_url}/test-results/", headers=headers, json={"test": test["id"],"run": run_id,"agent": agent_id}).json()result_id = result["id"]generate_endpoint = f"{base_url}/test-results/{result_id}/next-message"conversation_endpoint = f"{base_url}/test-results/{result_id}/conversation"# 3b. Get first turn from Voxliturn = poll_next_message(generate_endpoint, headers)# 3c. Conversation loopwhile turn is not None:# TODO: Replace with your agent's responsestart = time.monotonic()if turn.get("action"):# Tester invoked a registered action instead of typing. Apply# it in your system, then record the chatbot's follow-up.agent_response = your_agent.apply_action(turn["action"]["name"],turn["action"].get("arguments", {}),)else:agent_response = your_agent.process(turn["message"])response_time_ms = round((time.monotonic() - start) * 1000)# 3d. Record agent response (include metadata for performance tracking)requests.post(conversation_endpoint,headers=headers, json={"type": "message","content": agent_response,"metadata": {"responseTime": response_time_ms,"inputTokens": input_tokens,"outputTokens": output_tokens,"cost": cost,}})# 3e. Get next turn from Voxliturn = poll_next_message(generate_endpoint, headers)print(f"Test run {run_id} completed.")
How It Works
1. Create a Test Run: Initialize a new test run for your scenario with `status: "running"`.
2. Fetch Tests: Retrieve all tests associated with the scenario.
3. Execute Each Test:
   - Create a result entry by posting to `/test-results/` with the test, run, and agent IDs
   - Start the conversation by calling `next-message` to get the first tester turn
   - Enter a conversation loop where you relay each turn between Voxli and your agent
   - Continue until Voxli signals `end_chat: true`
Each `next-message` response may return `ready: false` if the next turn is not yet available. Poll the endpoint with a short delay until it returns `ready: true`.
Each ready turn contains either `message` (free text from the tester) or `action` (an invocation of one of the actions your chatbot registered for this turn). Branch on whichever field is populated; the two are mutually exclusive. See Tools, Events, and Actions for how to register actions.
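For reference, a `next-message` response takes roughly one of the following shapes. These are illustrative Python dicts based on the fields used in the script above; the example message text and action name are hypothetical, and real responses may include additional fields.

```python
# Not ready yet: keep polling
{"ready": False}

# Tester sent a free-text message (hypothetical content)
{"ready": True, "message": "Hi, I'd like to change my delivery address."}

# Tester invoked a registered action (hypothetical name and arguments)
{"ready": True, "action": {"name": "update_address", "arguments": {"street": "123 Main St"}}}

# The conversation is over
{"ready": True, "end_chat": True}
```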
The run is automatically marked as completed once all tests finish.
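If you want to confirm the final state programmatically, you can fetch the run after the loop finishes. The GET route used below is an assumption based on the run resource created earlier; check the API reference for the exact endpoint.

```python
# Assumption: a GET endpoint exists at /runs/{run_id}; verify in the API reference.
run = requests.get(f"{base_url}/runs/{run_id}", headers=headers).json()
print(run["status"])  # expected to read "completed" once all tests have finished
```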
Message Metadata
When posting agent messages to the conversation endpoint, you can include a `metadata` object with performance metrics. Voxli recognizes the following keys:
| Key | Type | Description |
|---|---|---|
| `responseTime` | number | Time in milliseconds for the agent to respond |
| `inputTokens` | number | Input/prompt token count for the LLM call |
| `outputTokens` | number | Output/completion token count for the LLM call |
| `cost` | number | Cost of the LLM call in USD |
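For example, you can time your agent call and attach these metrics when posting each message. The token counts and cost are values you obtain from your own LLM client's usage data.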
```python
import time

start = time.monotonic()
agent_response = get_agent_response(tester_message)  # your agent logic
response_time_ms = round((time.monotonic() - start) * 1000)

requests.post(conversation_endpoint, headers=headers, json={
    "type": "message",
    "content": agent_response,
    "metadata": {
        "responseTime": response_time_ms,
        "inputTokens": input_tokens,
        "outputTokens": output_tokens,
        "cost": cost,
    }
})
```
When available, these metrics are displayed as averages in test result details and comparison views. They help identify performance regressions and cost differences across agent configurations.