One of the biggest problems facing voice assistants like Amazon (NASDAQ: AMZN) Echo, Apple (NASDAQ: AAPL) Siri, and Microsoft (NASDAQ: MSFT) Cortana is their struggle with conversational speech recognition. It's one thing to slowly speak into your phone, device, or computer to ask something simple. Speaking conversationally, as you normally do, and having your artificial intelligence (AI) voice assistant help is something entirely different.
Microsoft has been one of the companies trying to improve the language ability of AI-powered devices. Last year, it was able to achieve the same error rate as human transcribers on a standard test known as Switchboard. Now the company has further improved its speech recognition, matching human performance under an even tougher comparison with professional transcribers.
Improving the machines
Last year, Microsoft reported that its transcription system reached a 5.9% word error rate, the level it determined human transcribers achieve. A second group of researchers, using "a more involved multi-transcriber process," measured a 5.1% error rate for humans, according to an Aug. 20 blog post by Microsoft Technical Fellow Xuedong Huang.
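For context, word error rate is conventionally computed as the word-level edit distance (insertions, deletions, and substitutions) between a system's transcript and a reference transcript, divided by the number of words in the reference. Here's a minimal sketch of that calculation; this is a standard textbook formulation, not Microsoft's implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("a" for "the") out of six reference words ~ 16.7%.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A 5.1% rate means roughly one word in twenty is transcribed wrong, which is why matching it is considered human parity.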
Microsoft's software has now equaled those results, according to the company. Huang gave some background:
We reduced our error rate by about 12% compared to last year's accuracy level, using a series of improvements to our neural net-based acoustic and language models. We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long-short-term memory) model for improved acoustic modeling. Additionally, our approach to combine predictions from multiple acoustic models now does so at both the frame/senone and word levels.
Microsoft used Switchboard, a collection of recorded telephone conversations that has been used by the speech recognition community for more than two decades, as its benchmark. Tests using the system involve transcribing conversations between strangers discussing topics such as sports and politics.
This is a step toward making Cortana and other Microsoft voice-powered technology work better, but it's only one piece of the puzzle.
Huang noted that the speech research community "still has many challenges to address." These include learning how to recognize speech in noisy environments, recognizing speech with accents, and understanding speaking styles where limited data is available.
"We have much work to do in teaching computers not just to transcribe the words spoken, but also to understand their meaning and intent," Huang said. "Moving from recognizing to understanding speech is the next major frontier for speech technology."
What happens next?
While Apple has an edge with Siri because of the popularity of the iPhone, and Amazon has achieved early success with its Echo, which dominates the smart-speaker market, Microsoft could still catch up. Cortana comes with Windows 10, and the operating system is now installed on 400 million devices.
If Microsoft can translate success in the lab into a smarter AI that people can talk to as they normally converse, it could catch up to, or even supplant, its AI-powered voice assistant rivals. That's not an easy task, but the company has improved its ability to understand normal, conversational speech. The next step is bringing that to Cortana and making voice assistants less a semi-useful novelty and more an indispensable talking sidekick.