Avoid Being Fooled by Parlor Tricks
The Necessity of Real-World Environment Testing for Automatic Speech Recognition
- by S. Hamid Nawab, PhD, originally published in “Speech Technology”
Voice assistants developed by Amazon, Google, Xiaomi, Alibaba, and others are poised to take over the world. A report by Juniper Research estimates that 70 million U.S. households will have at least one voice assistant-enabled speaker by 2022. The same report says that the majority of voice-assisted activities will occur on smartphones, with voice assistants installed on over 5 billion smartphones worldwide by 2022.
High Stakes for Voice Assistants
In terms of using voice for commerce, the user experience has to work really, really well for consumers to keep using voice assistants for more than searches and dictation.
That’s why the voice assistant’s ability to perform in varied, and often difficult, sound environments will be a key pillar for the sector’s success. The sheer scale of distribution means voice assistants will be used in many different situations and environments, many of which will require them to adapt to changing conditions. A failure to adapt is a huge risk for this emerging market.
When Sound Environment Models are Not Enough
When companies are developing their voice assistants, they create synthetic environments that mimic what the product may actually face in the real world. Companies mimic real-world situations because testing requires control over quantifiable environmental factors. The approach generally depends on the device matching an environmental sound profile to the scene when it is activated. The device then uses that sound profile to direct signal processing and noise cancellation activities to produce a clean signal for the automatic speech recognition (ASR) software to convert into commands and actions.
In real-world situations, the device, the target speaker, multiple sources of background noise, and other voices will all be present and often moving relative to one another. A sound profile that was effective at the beginning of an interaction may be inadequate a moment later as the scene shifts, again and again. In the current generation of devices, the user is expected to control this environment for the voice assistant. Given that billions of users will be operating voice assistants with no training, the assistants will likely deliver sub-optimal results, which will greatly hinder the widespread adoption and use of voice interfaces.
Voice Assistants Will Need to be Ready for Any Situation
To prevail in natural environments, voice assistants will need their own ability to evaluate a soundscape and intelligently adapt to it as it changes in real time, without human assistance.
An apt comparison would be with driverless cars. In the case of regular, human-operated cars, testers evaluate acceleration, braking, handling, and crashworthiness, all relatively constrained scenarios. Even modern cars, with lane sensing and blind spot detection built in, essentially depend on the perceptual and cognitive abilities of their operators to successfully and safely get from A to B. With a driverless car, the computer is responsible for detecting potholes, other cars, street signals, weather and road conditions, jaywalking pedestrians, and more. The real-world complexity of public roads is essentially impossible to mimic in an artificial setting.
While limited aspects of driverless cars can be tested under artificial conditions, no one would consider a driverless car “fully vetted” without rigorous testing against real-world streets and highways. The same applies to the evaluation of voice processing software in real-world environments, and the software’s ability to navigate the complex environments therein.
Testing for Real-World Environments
If voice assistants must undertake the responsibility for navigating complex soundscapes, then it is necessary to change the way voice assistants are tested. Evaluation must now assess the perceptual and cognitive capabilities around voice recognition and signal processing that can no longer be expected to be the responsibility of the human operator.
The perceptual tasks voice assistants will be expected to master are:
- Follow the targeted voice over time, orientation, and distance
- Classify the genre and orientation of non-targeted sound sources
- Track the probability of correctness of the hypothesis about the audible scene
- Update any changes inferred in the audible scene
Voice assistants will need to perform the following cognitive tasks:
- Use the latest information about the audible scene to adapt signal processing for source separation and acoustic echo cancellation
- Use trial-and-error signal processing to deal with situations with inadequate perceptual information about the scene
- Move the voice of the targeted speech into the perceptual foreground while pushing others into the perceptual background
- Repair targeted speech when damaged by aggressive signal processing
The implicit bar for performance when using voice is that if a colleague should be able to hear and understand the speaker’s voice in a given situation, the voice assistant should understand as well.
Therefore, the voice assistant should succeed in any such circumstance, or the user will be disappointed and frustrated.
Rather than attempting to simulate countless situations in synthetic environments, it’s much easier to simply record an enormous variety of situations. Therefore, the natural solution is to configure a large database of real-world recordings of targeted speech in everyday noisy environments with a mixture of far-field, near-field, and ambient sound sources. The database can then be used to assess individual voice assistants and compare their performance to each other.
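As a rough illustration of how such a database might be used to compare assistants, the sketch below scores transcripts by word error rate (WER), grouped by recording environment. The article does not name a metric; WER is assumed here because it is a common ASR measure, and the scene labels, sample sentences, and function names are all hypothetical.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_scene(results):
    """Aggregate WER per scene.

    results: iterable of (scene_label, reference_text, assistant_transcript).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for scene, ref, hyp in results:
        e, n = word_errors(ref, hyp)
        errors[scene] += e
        words[scene] += n
    return {s: errors[s] / words[s] for s in errors if words[s] > 0}


# Hypothetical transcripts from one assistant on three recordings.
sample_results = [
    ("cafe", "turn on the kitchen lights", "turn on the kitchen light"),
    ("cafe", "add milk to my list", "add milk to my list"),
    ("car", "call mom", "call tom"),
]
print(wer_by_scene(sample_results))  # {'cafe': 0.1, 'car': 0.5}
```

Running the same recordings through each assistant and comparing the per-scene numbers would show where one product degrades faster than another. A single aggregate score would hide exactly the environment-specific failures the database is meant to expose.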
It is vital that the database be able to challenge the capabilities of the voice assistant to handle the important perceptual and cognitive tasks for real-world environments listed above. Testing against such a database of varied, real-world scenarios is the only way to ensure a voice assistant is properly vetted and ready to consistently deliver satisfying consumer experiences in the real world.
- S. Hamid Nawab, PhD is Co-Founder and Chief Scientist of Yobe