
Digital Event Horizon

A New Framework for Evaluating Voice Agents: EVA


EVA, a new end-to-end framework for evaluating conversational voice agents, assesses two distinct objectives at once: accuracy and conversational experience. With its realistic bot-to-bot architecture and multi-dimensional scoring system, EVA offers a more complete picture of voice agent performance than frameworks that measure only one of the two.

  • Evaluating conversational voice agents requires addressing both accuracy and conversational experience.
  • The EVA framework provides a comprehensive evaluation system for conversational voice agents, assessing complete, multi-turn spoken conversations.
  • EVA produces two high-level scores: Accuracy (Task Completion, Faithfulness, Speech Fidelity) and Experience (Turn-Taking, Conversation Progression, Conciseness).
  • The framework simulates real-world conversations between two agents using a realistic bot-to-bot architecture.
  • Agents that perform well on task completion tend to deliver worse user experiences, and vice versa.
  • EVA has limitations, including reliance on a single commercial provider and potential differences in latency measurements across providers and infrastructure.



  • Evaluating conversational voice agents is a complex task that requires addressing two distinct objectives: accuracy and conversational experience. Traditional frameworks often focus on evaluating task success or conversational dynamics, but not both, which can lead to incomplete assessments of the agent's performance.

    The newly introduced framework, EVA (End-to-End Evaluation), aims to address this limitation by providing a comprehensive evaluation system for conversational voice agents. Developed by the ServiceNow-AI team, EVA is designed to assess complete, multi-turn spoken conversations using a realistic bot-to-bot architecture.

    EVA produces two high-level scores: EVA-A (Accuracy) and EVA-X (Experience). The Accuracy score aggregates three dimensions: Task Completion, Faithfulness, and Speech Fidelity. The Experience score aggregates three more: Turn-Taking, Conversation Progression, and Conciseness.
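
    To make the two composite scores concrete, here is a minimal sketch of how they might be aggregated. The six dimension names come from the announcement; the equal weighting, 0-to-1 scale, and class names are assumptions, not EVA's published implementation.

        from dataclasses import dataclass

        @dataclass
        class AccuracyScores:
            task_completion: float    # was the scenario's goal achieved?
            faithfulness: float       # did the agent stay grounded in the facts?
            speech_fidelity: float    # were names and entities heard/spoken correctly?

        @dataclass
        class ExperienceScores:
            turn_taking: float               # timing of turn hand-offs
            conversation_progression: float  # does each turn advance the task?
            conciseness: float               # avoids rambling replies

        def eva_a(s: AccuracyScores) -> float:
            # Composite accuracy score; equal weighting is an assumption.
            return (s.task_completion + s.faithfulness + s.speech_fidelity) / 3

        def eva_x(s: ExperienceScores) -> float:
            # Composite experience score; equal weighting is an assumption.
            return (s.turn_taking + s.conversation_progression + s.conciseness) / 3

        # An agent that completes tasks reliably but rambles:
        print(eva_a(AccuracyScores(0.95, 0.90, 0.80)))    # ~0.88
        print(eva_x(ExperienceScores(0.60, 0.70, 0.40)))  # ~0.57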

    The framework's bot-to-bot architecture pits a simulated user against the agent under test, approximating a real-world conversation. Each evaluation run follows a predefined scenario, such as flight rebooking, cancellation handling, or voucher requests, and the scoring system weighs aspects of the conversation such as turn-taking timing, conversation progression, and response conciseness.
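
    The loop below sketches what a bot-to-bot evaluation harness of this shape could look like. The UserSimulator and VoiceAgent interfaces, the turn limit, and the Scenario fields are hypothetical stand-ins, not EVA's actual API.

        from dataclasses import dataclass, field

        @dataclass
        class Scenario:
            goal: str    # e.g. "rebook flight UA123 onto the 6pm departure"
            facts: dict  # ground truth the agent must remain faithful to
            transcript: list = field(default_factory=list)

        class UserSimulator:
            """Plays the customer side of the call (hypothetical interface)."""
            def next_utterance(self, scenario, agent_reply):
                raise NotImplementedError

        class VoiceAgent:
            """The system under test (hypothetical interface)."""
            def respond(self, utterance):
                raise NotImplementedError

        def run_conversation(user, agent, scenario, max_turns=20):
            reply = None
            for _ in range(max_turns):
                utterance = user.next_utterance(scenario, reply)
                if utterance is None:  # the simulator decides the call is over
                    break
                reply = agent.respond(utterance)
                scenario.transcript.append((utterance, reply))
            return scenario  # the transcript is then scored on all six dimensions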

    EVA has been run on 20 systems, spanning proprietary and open-source models and both cascade and audio-native architectures. The results reveal a consistent accuracy-experience tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa. The evaluation also identified named entity transcription as a dominant failure mode, where mishearing a single character can cause an authentication failure.
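
    That failure mode is worth illustrating: unlike word error rate, where one wrong character costs a small penalty, a critical entity such as a confirmation code either matches exactly or it does not. The check below is illustrative only, not EVA's metric definition.

        def entity_transcription_errors(expected: dict, heard: dict) -> dict:
            """Compare critical entities exactly, character by character.

            A single wrong character in a confirmation code or name is a
            hard failure, not a fractional penalty.
            """
            errors = {}
            for key, truth in expected.items():
                value = heard.get(key, "")
                if value != truth:
                    diffs = sum(a != b for a, b in zip(truth, value))
                    diffs += abs(len(truth) - len(value))
                    errors[key] = {"expected": truth, "heard": value, "char_diffs": diffs}
            return errors

        # Mishearing a single character ("O" as "0") breaks authentication:
        print(entity_transcription_errors(
            {"confirmation_code": "X7Q9O2", "last_name": "Nguyen"},
            {"confirmation_code": "X7Q902", "last_name": "Nguyen"},
        ))
        # {'confirmation_code': {'expected': 'X7Q9O2', 'heard': 'X7Q902', 'char_diffs': 1}}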

    Despite its comprehensive evaluation capabilities, EVA has several limitations. The user simulator relies on a single commercial provider, which may favor certain ASR systems, and the bot-to-bot pipeline may not fully represent production deployments. Additionally, full reproduction requires commercial API access, and latency measurements will vary across providers and infrastructure.
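
    The latency caveat is easy to see once you consider how turn-taking latency is usually measured: as the gap between the end of the user's speech and the start of the agent's speech, with both endpoints detected somewhere in a provider-specific streaming stack. A rough sketch of that measurement follows; the event names and their placement are assumptions.

        import time

        class TurnTimer:
            """Records user-stops-speaking -> agent-starts-speaking gaps.

            Where these two events are detected (client, server, or network
            edge) shifts the numbers, which is why latency comparisons across
            providers and infrastructure need identical measurement points.
            """
            def __init__(self):
                self._user_done_at = None
                self.gaps = []

            def on_user_speech_end(self):
                self._user_done_at = time.monotonic()

            def on_agent_speech_start(self):
                if self._user_done_at is not None:
                    self.gaps.append(time.monotonic() - self._user_done_at)
                    self._user_done_at = None

        timer = TurnTimer()
        timer.on_user_speech_end()
        time.sleep(0.3)  # stand-in for model thinking time + TTS startup
        timer.on_agent_speech_start()
        print(f"turn gap: {timer.gaps[0]:.2f}s")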

    To address these limitations, the ServiceNow-AI team plans to release additional domain datasets, continuously expand the leaderboard, and develop a results and error analysis application that automatically identifies errors per metric and model, with the aim of sharpening the field's understanding of voice agent capabilities.



    Related Information:
  • https://www.digitaleventhorizon.com/articles/A-New-Framework-for-Evaluating-Voice-Agents-EVA-deh.shtml
  • https://huggingface.co/blog/ServiceNow-AI/eva
  • https://github.com/saharmor/voice-lab
  • https://hamming.ai/resources/how-to-evaluate-voice-agents


  • Published: Mon Mar 23 23:53:36 2026 by llama3.2 3B Q4_K_M

    © Digital Event Horizon. All rights reserved.
