Speech Recognition Is About To Get Frighteningly Good.
by Jack Clark
Speech recognition is at the same point today as image recognition was in 2011, someone told me recently.
So hold on tight, because that means computers are about to become adept at hearing, understanding and anticipating what we say to them. In 2011/2012 Google was able to use 16,000 computer processors to train a neural network system to view pictures and develop representations of the concepts in them, like a cat. Today, Google web search lets me ask for pictures of “a dog in front of a sunset” and its system will find a picture like that even if it hasn’t been annotated properly. And it does this about a hundred times more efficiently than before.
Speech is about to go through the same evolution, so expect the rise of voice-activated shopping assistants, devices that work well in situations packed with large amounts of ambient noise (edit – July 30 2015: IBM has developed a record-setting system that can transcribe conversational speech) and large knowledge graphs wired into large-scale speech recognition models.
Many companies beyond the usual suspects of Google, Microsoft, and IBM are doing this research. Chinese search giant Baidu published a paper this autumn on “Deep Speech: Scaling up end-to-end speech recognition” (PDF), a fast, state-of-the-art speech recognition system that works well in noisy environments. It outperformed systems from Apple, Microsoft, Google and a startup named Wit.AI. (Part of its gains came from the use of a new dataset created by Baidu, in which 9,600 people contributed a total of 5,000 hours of speech.)
“You’re going to see speech recognition systems that have human or better-than-human accuracy become commercialized,” says Tim Tuttle, a former AI specialist at MIT and now startup founder.
More on that in this article I wrote for Bloomberg: “Speech Recognition Better Than a Human’s Exists. You Just Can’t Use It Yet” 
Please, let's bring the hype down a bit. You say “So hold on tight, because that means computers are about to become adept at hearing, understanding and anticipating what we say to them.” Hearing what was said does not in any way lead to or imply that we UNDERSTOOD what was said. Thus speech recognition, even if perfect, does not IN ANY WAY imply that we have solved the “language understanding” problem, which involves commonsense reasoning and making inferences to resolve all sorts of references and ambiguities. That, by the way, is something that deep networks and all forms of data-driven and machine learning approaches will never be able to achieve. Language cannot be learned from observing patterns in data.