The Power Of Voice: A Conversation With The Head Of Google's Speech Technology

Jason Kincaid

February 13, 2011 at 4:55 PM

For all the whiz-bang graphics and nifty apps appearing on smart phones these days, there are still few things that feel more futuristic than pulling out your phone, uttering the words, "find directions to the Exploratorium", and having Google immediately do your bidding. The technology is becoming widely available via apps on the iPhone and deep integration into Android, and this is really only the beginning.

Earlier this month I had the chance to sit down with Mike Cohen, the man who leads all of Google's speech technology efforts, to get a look behind the curtain at why Google has invested so much into voice, and where things are going from here.

A Look Back

Before we discuss where we stand now, it's worth looking at Cohen's past, which also serves as a good history lesson on speech technology. Cohen has been at Google since 2004, but he's been straddling the intersection of voice and technology for decades, getting his start at the Stanford Research Institute in the early 1980s.

Cohen says that in the 1970s there were two main camps working on speech: linguists and engineers. The linguists were all about rules — they'd identify various trends in grammar and pronunciation and how each phoneme interacted with the others. The engineers were taking a different approach: rather than trying to painstakingly identify each rule manually, they set out to build complex statistical models that improved as more speech data was fed into them.

By the late 70s and early 80s, when Cohen started doing research at SRI, the engineers were in the lead. But there was a problem: the improvements seen in their models were starting to asymptote. Cohen explains that because the models themselves never changed, feeding them more data was eventually going to provide diminishing returns (for example, their models were bad at recognizing how pronunciation depends not only on which words are being said, but also on their context). The engineers needed to find a way to build richer models, so they finally began to collaborate with the linguists. And a research boom ensued.

By the early 90s speech technology had gotten sufficiently advanced that researchers could create the DARPA-funded Air Travel Information System (ATIS) — where a user could walk up to a terminal, say, "Show me the flights from Boston", and the computer would spit back the relevant data. The system could understand countless variations on such commands (you didn't have to memorize certain keywords) — pretty amazing given the fact that this system was built around the time Windows 95 came out.

Based on the success of the ATIS, Cohen decided that the technology was ready for commercialization, so he and three cofounders left to start Nuance. The company focused on building automated enterprise call systems, which it then sold to major businesses that had to deal with high inbound call volume — things like an automated stock quote system for Charles Schwab, and customer service for phone companies.

Given his history as a researcher, it isn't surprising that Cohen was looking at ways to improve Nuance's speech recognition software. And, as it turned out, the huge number of call recordings coming in were even more useful than the data he'd had access to while a researcher at SRI. He explains that there are things that can't be reproduced in a lab environment — a dog barking in the background, a child crying, and so on — that were present in these inbound phone calls, exposing Nuance to important new challenges in speech analysis.

But there was one big problem: despite the fact that its technology was dealing with a huge volume of data, Nuance would have to approach each of its enterprise customers and ask for access to this data for research purposes. Enterprises stood to gain because they'd reap any improvements in the technology, but some of them were wary anyway. Which set the stage for Cohen to finally make the jump to Google.

GOOG-411 and Beyond

In 2004 Google's voice efforts were basically non-existent. But Cohen saw an opportunity: even then it was clear that mobile was going to have a big impact on the future of technology. And because Google faces the end-user directly, any incoming voice data would be immediately accessible for research purposes. So he made the switch to the search giant, and began what became Google's free 411 voice service, GOOG-411.

The service launched in 2007, offering a straightforward and handy feature set: you'd call in, ask for some basic information like a business's phone number, and it would immediately give you that information free of charge. Cohen says the main motivation for launching GOOG-411 was the fact that it's useful, but it had an important secondary function: it allowed Google to begin building up a massive corpus of voice data. Remember the data models discussed earlier? Google's speech systems use similar concepts, but at a much larger scale.

GOOG-411 was killed off in October, but Google now has more sources of voice data, including the microphone button seen throughout Android and the Google Mobile application for iPhone. And Google can look at text-based search queries to identify which terms most often follow one another. All of which means Google can train its language models relatively quickly.
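To make the idea of counting which terms follow one another concrete, here is a minimal sketch of a bigram counter in Python. It is an illustration only, not a description of Google's system: the function names `train_bigrams` and `most_likely_next` and the sample queries are all hypothetical, and a production language model would be far larger and more sophisticated.

```python
from collections import Counter, defaultdict


def train_bigrams(queries):
    """Count how often each word is immediately followed by another word."""
    counts = defaultdict(Counter)
    for query in queries:
        words = query.lower().split()
        # Pair each word with the word that comes right after it.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def most_likely_next(counts, word):
    """Return the word seen most often after `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical sample queries, invented for this illustration.
sample_queries = [
    "directions to the exploratorium",
    "directions to the airport",
    "flights from boston",
    "directions to the exploratorium",
]

bigram_counts = train_bigrams(sample_queries)
best_after_the = most_likely_next(bigram_counts, "the")
```

With counts like these, a recognizer that is unsure what it heard after "the" can lean toward the word that has followed it most often in past queries.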

These days, Cohen says that Google uses 230 billion search queries to train the language model used by Google's speech recognizer. To give an idea of how large that volume of data is, he says the training would take 70 years to be completed on a single CPU (though Google obviously has far greater resources).

The technology is now used across a variety of products. YouTube automatically captions millions of videos. Google Voice attempts to transcribe inbound voice messages (with some pretty hilarious results). And voice search is going to play a much bigger role on mobile devices — don't be surprised if we start seeing cars with media centers running Android in the not-so-distant future. You can bet they'll be voice-enabled.

Cohen was happy to talk in broad terms about Google's voice efforts, but he was opaque when it came to sharing stats, upcoming features, and predictions. He wouldn't discuss the kind of voice search volume that Google sees, though he did acknowledge that it fluctuates widely depending on whether a new voice-enabled feature has launched and whether there has been recent coverage in the press.

When I asked him how long it would be before voice search becomes accurate to the point where we take it for granted (and don't have to check for typos), he declined to offer a projection (he noted that he could say something like "five years", but that that's just research terminology for "I have no idea").

I also asked him what he thought about Apple's voice efforts — the company acquired Siri last year, and it seems obvious that it's going to begin incorporating voice into iOS. Again, Cohen didn't have much to say here (though this wasn't really surprising). He did say that Google has the natural advantage of having already released a product that gives it a massive volume of data, but ultimately it will come down to what Apple builds and who they partner with.

But while he wouldn't get into specifics, Cohen did share Google's long-term vision for this technology: it wants speech input to be completely ubiquitous. "We don't ever want there to be a scenario where speech would be valuable, if only it had been available — just like you can enter text with a keyboard anywhere, you should be able to do it with speech." And accuracy is a big part of that: "It needs to work so close to perfect that the choice isn't based on performance, but on end-user preference."
