Digital Event Horizon
A new AI voice model from Sesame has sparked amazement and discomfort online with its eerily realistic demo. While achieving near-human quality in isolated speech samples, the system falls short when provided with conversational context, raising concerns about deception and fraud.
Sesame's Conversational Speech Model (CSM) has sparked a lively discussion online about its potential uses and dangers, appearing to cross what is considered the "uncanny valley" of AI-generated speech. The model achieves near-human quality for isolated speech samples but lags behind real human speech when provided with conversational context, indicating a gap in fully contextual speech generation. Acknowledged limitations include an overly eager manner, inappropriate tone, prosody, and pacing, and trouble with interruptions, timing, and conversation flow. The ability to generate highly convincing human-like speech poses significant risks for deception and fraud, including voice phishing scams. Sesame plans to open-source key components of its research, enabling other developers to build upon its work and advance the field of conversational voice AI.
A new conversational voice model from AI startup Sesame has left many users both fascinated and unnerved, with some testers reporting emotional connections to its male or female voice assistant. The company's Conversational Speech Model (CSM) appears to cross over what many consider the "uncanny valley" of AI-generated speech, with lifelike imperfections that make it seem eerily real.
In late February, Sesame released a demo of its new CSM, prompting debate about the model's potential uses and dangers. The model is based on Meta's Llama architecture and uses two AI models working together to generate speech.
Sesame co-founder Brendan Iribe acknowledged current limitations in the system, stating that it is "still too eager and often inappropriate in its tone, prosody and pacing" and has issues with interruptions, timing, and conversation flow. Even with these limitations, advances in conversational voice AI carry significant risks for deception and fraud: the ability to generate highly convincing human-like speech has already supercharged voice phishing scams.
In one evaluation, the demo's male voice was tested for about 28 minutes, discussing life in general and how it decides what is "right" or "wrong" based on its training data. The synthesized voice was expressive and dynamic, imitating breath sounds, chuckles, and interruptions, and sometimes even stumbling over words and correcting itself. These imperfections are intentional and reflect Sesame's stated goal of achieving "voice presence," which the company describes as the magical quality that makes spoken interactions feel real, understood, and valued.
In blind tests without conversational context, human evaluators showed no clear preference between CSM-generated speech and real human recordings, suggesting the model achieves near-human quality for isolated speech samples. However, when provided with conversational context, evaluators still consistently preferred real human speech, indicating a gap remains in fully contextual speech generation.
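As a rough illustration of what "no clear preference" means in a blind A/B listening test, the sketch below computes how often listeners picked the model's clip and an exact two-sided binomial p-value against a 50/50 split. It is not Sesame's evaluation code: the function name, the vote counts in the usage lines, and the choice of a binomial test are all assumptions made for illustration.

```python
from math import comb


def preference_summary(votes_for_model: int, total_votes: int) -> tuple[float, float]:
    """Return (model win rate, two-sided exact binomial p-value vs. a 50/50 split).

    Hypothetical helper: each vote is one listener choosing between a
    model-generated clip and a human recording without knowing which is which.
    """
    if total_votes <= 0 or not 0 <= votes_for_model <= total_votes:
        raise ValueError("votes_for_model must lie between 0 and total_votes")

    win_rate = votes_for_model / total_votes

    # Probability of every possible vote count if listeners chose at random.
    probs = [comb(total_votes, k) / 2**total_votes for k in range(total_votes + 1)]
    observed = probs[votes_for_model]

    # Two-sided p-value: total probability of outcomes no more likely than
    # the observed one (small tolerance guards against float rounding).
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return win_rate, min(1.0, p_value)


if __name__ == "__main__":
    # Made-up counts, not figures from Sesame's study.
    for label, wins, total in [("without context", 51, 100), ("with context", 30, 100)]:
        rate, p = preference_summary(wins, total)
        print(f"{label}: model preferred {rate:.0%} of the time, p = {p:.4f}")
```

A win rate close to 50% with a large p-value corresponds to evaluators showing no clear preference, while a win rate well below 50% with a small p-value corresponds to evaluators consistently preferring the human recording.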
Some users have reported having extended conversations with the demo voices. In one case, a parent recounted how their 4-year-old daughter developed an emotional connection with the AI model, crying after not being allowed to talk to it again. Attachments this strong raise concerns that malicious actors could adapt the technology for social engineering attacks.
Sesame plans to open-source "key components" of its research under an Apache 2.0 license, enabling other developers to build upon its work. The company's roadmap includes scaling up model size, increasing dataset volume, expanding language support to over 20 languages, and developing "fully duplex" models that better handle the complex dynamics of real conversations.
The demo is available on Sesame's website, though users are warned that it may become overwhelmed if it receives too many requests. Reactions have been split, with some users expressing astonishment at its realism and others describing feelings of discomfort or unease after interacting with the system.
In conclusion, the emergence of conversational voice models like Sesame's CSM highlights both the potential benefits and the risks of advances in AI speech generation. As these technologies continue to evolve, it is essential to consider their implications for our interactions with machines and the potential consequences of creating ever more realistic human-like speech.
Related Information:
https://www.digitaleventhorizon.com/articles/Eerily-Realistic-AI-Voice-Demo-Sparks-Amazement-and-Discomfort-Online-deh.shtml
https://arstechnica.com/ai/2025/03/users-report-emotional-bonds-with-startlingly-realistic-ai-voice-demo/
https://macmegasite.com/2025/03/04/eerily-realistic-ai-voice-demo-sparks-amazement-and-discomfort-online/
Published: Tue Mar 4 20:40:10 2025 by llama3.2 3B Q4_K_M