Leading breakthroughs in speech recognition software at Microsoft, Google, IBM
Groundbreaking work on speech recognition software by the University of Toronto’s Department of Computer Science (DCS) is transforming Microsoft, Google and IBM.
At a conference in Asia recently, Microsoft’s Chief Research Officer, Rick Rashid, demonstrated an almost instantaneous translation of spoken English into Chinese speech – with software that maintained the sound of the speaker’s voice. It was the latest in a series of breakthroughs in the field involving U of T faculty and students.
“A few years ago, researchers at Microsoft Research and the University of Toronto came together to develop another breakthrough in the field of speech recognition,” Rick Rashid told the crowd. “The idea that they had was to use a technology in a way patterned after the way the human brain works – it’s called deep neural networks.
“That one change, that particular breakthrough increased recognition rates by approximately thirty percent. That’s a big deal.”
The breakthrough lets the computer better recognize what are called phonemes – the small units of sound that make up speech – and it has reduced the number of errors the computer makes, said Rashid.
“That’s the difference between going from 20 to 25 per cent errors - or about one out of every five words - to roughly 15 per cent less errors or roughly one out of every seven or perhaps one out of every eight words,” Rashid said. “It’s still not perfect, there’s still a long way to go but I think you can see that we have already made a significant amount of progress in the recognition of speech.”
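As a rough check on how these figures fit together, the short sketch below converts the "one word in N" rates from the quote into a relative reduction in errors. It assumes a starting rate of one error in every five words; the function name `relative_error_reduction` is illustrative only and does not come from Microsoft or U of T.

```python
def relative_error_reduction(before: float, after: float) -> float:
    """Return the fraction of errors removed when the rate drops from before to after."""
    return (before - after) / before


# Error rates written as "one word in N", following the quote above.
one_in_five = 1 / 5    # 20 per cent of words wrong (assumed starting point)
one_in_seven = 1 / 7   # roughly 14.3 per cent
one_in_eight = 1 / 8   # 12.5 per cent

reduction_to_one_in_seven = relative_error_reduction(one_in_five, one_in_seven)
reduction_to_one_in_eight = relative_error_reduction(one_in_five, one_in_eight)

print(f"1 in 5 -> 1 in 7: {reduction_to_one_in_seven:.1%} fewer errors")
print(f"1 in 5 -> 1 in 8: {reduction_to_one_in_eight:.1%} fewer errors")
```

Under that assumption, moving from one error in five words to one in seven removes a little under 30 per cent of the errors, which is in line with the "approximately thirty percent" figure Rashid cites.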
DCS research in speech recognition is conducted by Professors Geoffrey E. Hinton (Machine Learning) and Gerald Penn (Computational Linguistics), with this latest breakthrough drawing on Hinton’s deep neural networks.
Graduate students Abdel-rahman Mohamed and George Dahl began collaborating in 2009, applying deep neural networks to speech recognition. (Artificial neural networks are simplified mathematical models of neural circuits in the human brain.)
“Even before I started my PhD at U of T with Gerald Penn, I was always thinking about how I might make a breakthrough in the speech recognition field,” said Mohamed, “bringing Automatic Speech Recognition (ASR) technology closer to the end users.”
Inspired by one of Hinton’s lectures on deep neural networks, Mohamed began applying them to speech. But deep neural networks required too much computing power for conventional computers, so Hinton and Mohamed enlisted Dahl. A student in Hinton’s lab, Dahl had discovered how to train and simulate neural networks efficiently using the same high-end graphics cards that make vivid computer games feasible on personal computers.
“They applied the same method to the problem of recognizing fragments of phonemes in very short windows of speech,” said Hinton. “They got significantly better results than previous methods on a standard three-hour benchmark.”
Dahl and Mohamed presented the results of their work at a 2009 Neural Information Processing Systems (NIPS) workshop to a mixed reaction.
“Many participants in the workshop were excited about our results,” recalled Dahl, “but at the time there was a lot of healthy skeptical concern that our results might not translate into similar gains on more realistic speech recognition problems.”
Researchers at Microsoft, however, were interested enough to invite both students to internships at Microsoft Research in Redmond the following year. There, Mohamed and Dahl successfully applied their methods to larger speech tasks, involving much larger vocabularies.
Fellow CS graduate student Navdeep Jaitly also became involved in the research, and worked with Google to implement it in the company’s system. Google now uses a deep neural network for voice search in the Android 4.1 operating system, its answer to the iPhone’s Siri conversational agent.
“I was expecting this move,” said Mohamed, “given the great results our model achieved consistently on so many benchmarks.”
Dahl continued: “It is very gratifying, particularly because there was a lot of initial resistance from the speech community to using deep neural networks for acoustic modeling.”
Today, most top speech labs are embracing the technology, including IBM, a long-time leader in speech recognition research, with whom Mohamed has also worked on this topic. Penn’s speech lab has since developed an alternative neural network model in collaboration with York University Professor Hui Jiang and graduate student Ossama Abdel-Hamid, who has also worked on neural networks at Microsoft Research.
And the U of T researchers say the new business opportunities they’ve helped create are just the beginning. Hinton’s lab has already applied deep neural networks to several other pattern recognition problems. And Penn’s speech lab is in the process of digitizing the last 23 years of CBC NewsWorld video to develop search algorithms for large collections of speech.
Unlike Google voice search, which uses voice queries for searching web pages of text, this work uses text queries to search through speech data for related news coverage or interviews.
“This is important not just for speech researchers,” said Penn, “but for journalists, historians and anyone else who is interested in documenting the Canadian perspective on world affairs. Having all of this data around is great, but it’s of limited application if we can’t somehow navigate or search through it for topics of interest.”