MIT Neural Net Predicts What Happens Next on TV Shows

From Popular Mechanics

Researchers at MIT have developed an algorithm that allows an artificial intelligence to predict what it will see next. It's a major step forward in the field of robotic predictive vision.

How did the algorithm learn to predict human interactions? By watching TV, of course. Using the same deep-learning techniques MIT has already used to teach machines to recreate sounds, researchers at the Computer Science and Artificial Intelligence Laboratory (CSAIL) sat their algorithm in front of Desperate Housewives and The Office, as well as various YouTube videos. It watched more than 600 hours of video, with researchers hoping it would learn to predict four basic actions: a hug, a kiss, a handshake, or a high five.

Predictive vision has previously relied on pixel-level interpretations of what comes next, breaking scenes down to their smallest components in the hope that a future composite can be built up. Recent years have seen attempts to think bigger, looking at whole objects, and now at bodies and faces. "Rather than saying that one pixel value is blue, the next one is red, and so on, visual representations reveal information about the larger image, such as a certain collection of pixels that represents a human face," says CSAIL Ph.D. student Carl Vondrick, lead author of the new study.
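For readers who want a concrete picture of what "a representation rather than pixels" means, here is a minimal Python sketch. It is an illustration only, not CSAIL's code: the choice of a pretrained ResNet backbone, the layer kept, and the file name frame.jpg are all assumptions made for demonstration.

```python
# Illustrative sketch (not the paper's code): extract a learned visual
# representation from a single video frame with a pretrained CNN,
# instead of working with raw per-pixel color values.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pretrained network and drop its classification head,
# keeping only the feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
feature_extractor.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

frame = Image.open("frame.jpg").convert("RGB")  # hypothetical video frame
with torch.no_grad():
    # A 512-dimensional vector summarizing the whole image -- e.g. "this
    # collection of pixels looks like a face" -- rather than pixel colors.
    representation = feature_extractor(preprocess(frame).unsqueeze(0)).flatten(1)

print(representation.shape)  # torch.Size([1, 512])
```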

A deep-learning system like this one runs several neural networks independently of one another, then weighs their individual predictions against each other to reach a conclusion. Based on facial recognition, for example, three networks might assume a kiss is imminent, but a fourth could contextualize the scene by recognizing that a character just entered the room and opt to predict a hug instead.
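A minimal sketch of that ensemble idea, again an illustration rather than CSAIL's implementation: several independently trained networks each score the four actions from the same frame representation, and their outputs are averaged into one prediction. The network sizes and the randomly initialized weights here are stand-ins.

```python
# Sketch of combining independent networks' predictions (assumed setup,
# not the study's actual architecture).
import torch
import torch.nn as nn

ACTIONS = ["hug", "kiss", "handshake", "high five"]

class ActionHead(nn.Module):
    """One small network mapping a frame representation to action scores."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, len(ACTIONS)),
        )

    def forward(self, features):
        return self.net(features)

# Several heads that would be trained independently; randomly
# initialized here purely for illustration.
heads = [ActionHead() for _ in range(4)]

features = torch.randn(1, 512)  # stand-in for the representation above
with torch.no_grad():
    # Averaging the per-network probabilities lets a dissenting network
    # (say, one that noticed a character just entered the room) pull the
    # final prediction toward a different action.
    probs = torch.stack(
        [head(features).softmax(dim=-1) for head in heads]
    ).mean(0)

prediction = ACTIONS[probs.argmax().item()]
print(prediction, probs.squeeze().tolist())
```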

It's a complex procedure, and the CSAIL researchers recognize that they are in the early days. Shown video of people one second away from completing an action, the system correctly guessed what would happen 43 percent of the time. That might not sound like much, but it's a major step above the 36 percent that previous models achieved. (Humans given the same test predict correctly 71 percent of the time.)

A predictive algorithm could have repercussions beyond giving you a robot buddy for your next Netflix binge. A security camera could recognize a crime in progress, or an emergency where first responders are needed. Speaking both to how this technology works and to its future uses, Vondrick notes that the "future is inherently ambiguous, so it's exciting to challenge ourselves to develop a system that uses these representations to anticipate all of the possibilities."

Source: MIT