Baidu Research has unveiled new results from its Silicon Valley AI Lab (SVAIL), including a single learning algorithm that accurately recognizes both English and Mandarin speech. The results are detailed in the paper “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin.”
SVAIL’s Deep Speech system, announced last year, initially focused on improving English speech recognition accuracy in noisy environments (for example, restaurants, cars and public transportation). Over the past year, SVAIL researchers improved Deep Speech’s performance in English and also trained it to transcribe Mandarin. The Mandarin version achieves high accuracy in many scenarios and is ready to be deployed on a large scale in real-world applications, such as web searches on mobile devices.
“SVAIL has demonstrated that our end-to-end deep learning approach can be used to recognize very different languages,” said Dr. Andrew Ng, Chief Scientist at Baidu. “Key to our approach is our use of high-performance computing techniques, which resulted in a 7x speedup compared to last year at this time. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly.”
Commenting on Deep Speech’s high-performance computing architecture, Dr. Bill Dally, Chief Scientist at Nvidia, said: “I am very impressed by the efficiency Deep Speech achieves by using batching to deploy DNNs for speech recognition on GPUs. Deep Speech also achieves remarkable throughput while training RNNs on clusters of 16 GPUs.”
In the paper, SVAIL also reported that Deep Speech is learning to process English spoken with a variety of accents from around the world. Such speech is currently challenging for the popular speech recognition systems used on mobile devices. Deep Speech has improved rapidly on a range of English accents, including Indian-accented English and the accents of European countries where English is not the first language.
“I had a glimpse of Deep Speech’s potential when I previewed it in its infancy last year,” said Dr. Ian Lane, Assistant Research Professor of Engineering, Carnegie Mellon University. “Today, after a relatively short time, Deep Speech has made significant progress. Using a single end-to-end system, it handles not only English but also Mandarin, and it is on its way to being released into production. I’m intrigued by Baidu’s Batch Dispatch process and its capacity to shape the way large deep neural networks are deployed on GPUs in the cloud.”