AI’s Next Big Step: Detecting Human Emotion and Expression

The AI field has made remarkable progress with incomplete data. Leading generative models like Claude, Gemini, GPT-4, and Llama can understand text but not emotion. These models can’t process your tone of voice, rhythm of speech, or emphasis on words. They can’t read your facial expressions. They are effectively unable to process any of the non-verbal information at the heart of communication. And to advance further, they’ll need to learn it.

Though much of the AI sector is currently focused on making generative models larger via more data, compute, and energy, the field’s next leap may come from teaching emotional intelligence to the models. The problem is already captivating Mark Zuckerberg and attracting millions in startup funding, and there’s good reason to believe progress may be close.

“So much of the human brain is just dedicated to understanding people and understanding your expressions and emotions, and that’s its own whole modality, right?” Zuckerberg told podcaster Dwarkesh Patel last month. “You could say, okay, maybe it’s just video or an image. But it’s clearly a very specialized version of those two.”

One of Zuckerberg’s former employees might be the furthest along in teaching emotion to AI. Alan Cowen, CEO of Hume AI, is a former Meta and Google researcher who’s built AI technology that can read the tune, timbre, and rhythm of your voice, as well as your facial expressions, to discern your emotions.

As you speak with Hume’s bot, EVI, it processes the emotions you’re showing — like excitement, surprise, joy, anger, and awkwardness — and expresses its responses with ‘emotions’ of its own. Yell at it, for instance, and it will get sheepish and try to defuse the situation. It will display its calculations on screen, indicating what it’s reading in your voice and what it’s giving back. And it’s quite sticky. Across 100,000 unique conversations, the average interaction between humans and EVI is 10 minutes long, a company spokesperson said.

“Every word carries not just the phonetics, but also a ton of detail in its tune, rhythm, and timbre that is very informative in a lot of different ways,” Cowen told me on Big Technology Podcast last week. “You can predict a lot of things. You can predict whether somebody has depression or Parkinson’s to some extent, not perfectly… You can predict in a customer service call, whether somebody’s having a good or bad call much more accurately.”

Hume, which raised $50 million in March, already offers its voice emotion-reading technology via an API, and it has working facial-expression-reading tech that it has yet to release. The idea is to deliver much more data to AI models than they would get by simply transcribing text, enabling them to do a better job of making the end user happy. “Pretty much any outcome,” Cowen said, “it benefits to include measures of voice modulation and not just language.”
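For developers, wiring that kind of signal into an existing text-based assistant is conceptually simple: measure expression scores from the audio, then hand the strongest ones to the language model alongside the transcript. The sketch below illustrates the pattern in Python; the endpoint URL, request fields, and response shape are hypothetical placeholders, not Hume’s actual API.

```python
# Hypothetical sketch: send an audio clip to an emotion-measurement API and fold
# the top expression scores into a chatbot prompt. The endpoint, request fields,
# and response shape are illustrative placeholders, not Hume's real interface.
import requests

API_URL = "https://api.example.com/v1/voice/expressions"  # placeholder endpoint
API_KEY = "YOUR_API_KEY"


def measure_voice_expressions(audio_path: str) -> dict:
    """Upload a short audio clip and return emotion scores, e.g. {'joy': 0.72, ...}."""
    with open(audio_path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"audio": f},
            timeout=30,
        )
    resp.raise_for_status()
    return resp.json()["expressions"]  # assumed response field


def build_prompt(transcript: str, expressions: dict, top_k: int = 3) -> str:
    """Attach the strongest detected emotions to the transcript so a text-only
    language model can condition its reply on more than the words alone."""
    top = sorted(expressions.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    tone = ", ".join(f"{name} ({score:.2f})" for name, score in top)
    return f"User said: {transcript}\nDetected vocal tone: {tone}\nRespond appropriately."


if __name__ == "__main__":
    scores = measure_voice_expressions("caller.wav")
    print(build_prompt("I've been on hold for an hour.", scores))
```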

Text is indeed a limited communication medium. Whenever anything gets somewhat complicated in text interactions, humans tend to get on a call, send a voice note, or meet in person. We use emojis or write things like “heyy” in a text to connote some emotion, but those tricks have their limits, Cowen said. Text is a good way to convey complex thoughts (as we’re doing here, for instance) but not to exchange them. To communicate effectively, we need non-verbal signals.

Meta CEO Mark Zuckerberg (Getty Images)

Voice assistants like Siri and Alexa have been so disappointing, for instance, because they transcribe what people say and strip out all the emotion when digesting the meaning. That generative AI bots can deliver quality experiences in their current form is notable, but it also shows how much better they could get, given how much information they’re missing.

To program ‘emotional intelligence’ into machine learning models, the Hume team had more than 1 million people use survey platforms to rate how they were feeling, then connected those ratings to their facial expressions and speech. “We had people recording themselves and rating their expressions, and what they’re feeling, and responding to music, and videos, and talking to other participants,” Cowen said. “Across all of this data, we just look at what’s consistent between different people.”
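That “what’s consistent between different people” step can be pictured as a simple supervised-learning setup: average many raters’ scores for each clip to filter out individual noise, then fit a model from acoustic features to that consensus. The sketch below uses synthetic data and scikit-learn purely for illustration; it is not Hume’s actual pipeline, and the feature counts and rating shapes are invented.

```python
# Illustrative sketch (not Hume's actual pipeline): each clip is rated on several
# emotions by multiple participants. Averaging across raters keeps the signal
# different people agree on; a regressor then maps acoustic features to it.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_clips, n_raters, n_features, n_emotions = 500, 5, 40, 4  # toy sizes

# Stand-ins for extracted acoustic features (pitch, rhythm, timbre statistics).
features = rng.normal(size=(n_clips, n_features))

# Simulated ratings: a shared underlying signal plus per-rater noise.
true_signal = features @ rng.normal(size=(n_features, n_emotions)) * 0.1
ratings = true_signal[None, :, :] + rng.normal(scale=0.5, size=(n_raters, n_clips, n_emotions))

# "What's consistent between different people": the per-clip consensus rating.
consensus = ratings.mean(axis=0)

X_train, X_test, y_train, y_test = train_test_split(features, consensus, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)
print("held-out R^2:", round(model.score(X_test, y_test), 3))
```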

Today, Hume’s technology can predict how people will respond before it replies, and uses that to modulate its response. “This model basically acquires all of the abilities that come with understanding and predicting expression,” Cowen said. “It can predict if you’re going to laugh at something — which means it has to understand something about humor that it didn’t understand before — or it can predict if you’re going to be frustrated or if you’re going to be confused.”

The current set of AI products has been understandably limited given the incomplete information they’re working with, but that could change with emotional intelligence. AI friends or companions could become less painful to speak with, though a New York Times columnist has already found a way to make friends with 18 of them. Elderly care, Cowen suggested, could improve with AI that looks out for people’s everyday problems and is also there as a companion.

Ultimately, Cowen’s vision is to build AI into products, allowing an AI assistant to read your speech, emotion, and expressions, and guide you through the experience. Imagine a banking app, for instance, that takes you to the correct pages to transfer money, or adjusts your financial plan, as you speak with it. “When it’s really customer service, and it’s really about a product,” Cowen said, “the product should be part of the conversation, should be integrated with it.”

Increasingly, AI researchers are discussing the likelihood of slamming into a resource wall given the limits on the amount of data, compute, and energy they can throw at the problem. Model innovation, at least in the short term, seems like the most likely way around some of those constraints. And while programming emotional intelligence into AI may not turn out to be the path that advances the field, it has a good chance to be, and it points a way forward: toward building deeper intelligence into this already impressive technology.

This article is from Big Technology, a newsletter by Alex Kantrowitz.
