Microsoft's VASA-1 takes AI-generated video one step closer to 'aw hell, we're all doomed'

A promotional image for Microsoft's VASA-1 generative AI model.

With generative AI being a key feature of all its new software and hardware projects, it should be no surprise that Microsoft has been developing its own machine learning models. VASA-1 is one such example: give it a single image of a person and an audio track, and it produces a convincing video clip of that person delivering the recorded speech.

Just a few years ago, anything created via generative AI was instantly identifiable by several telltale signs. With still images, it would be things like the wrong number of fingers on a person's hand, or even something as basic as people not having the correct number of legs. AI-generated video was even worse, but at least it was very meme-worthy.

However, a research report from Microsoft shows that those obvious tells are rapidly disappearing. VASA-1 is a machine learning model that turns a single static image of a person's face into a short, realistic video, driven by a speech audio track. The model analyses changes in the audio's tone and pace, then generates a sequence of new frames in which the face's lip movements and expression are altered to match the speech.
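Microsoft hasn't released VASA-1's code, so to give a rough feel for the general idea (audio features in, per-frame facial motion out), here's a deliberately crude Python sketch. Every name and detail below is a made-up stand-in, not the actual model: real systems like VASA-1 predict rich motion latents for lips, gaze, and head pose with a trained network, where this toy uses nothing but loudness.

```python
# Toy illustration of audio-driven face animation: map per-frame audio
# features to a motion parameter, then redraw the source image to match.
# Hypothetical stand-in code only; this is NOT Microsoft's VASA-1.

import numpy as np

FRAME_RATE = 25  # video frames per second

def audio_to_motion(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Turn speech audio into one 'mouth openness' value per video frame."""
    hop = sample_rate // FRAME_RATE
    n = len(waveform) // hop
    frames = waveform[: n * hop].reshape(n, hop)
    loudness = np.abs(frames).mean(axis=1)        # crude stand-in feature
    return loudness / (loudness.max() + 1e-8)     # normalise to [0, 1]

def render_frame(face: np.ndarray, openness: float) -> np.ndarray:
    """Redraw the face with a mouth region scaled by the motion parameter."""
    frame = face.copy()
    h, w = frame.shape
    mouth_h = max(1, int(openness * h * 0.15))
    y, x = int(h * 0.75), w // 2
    frame[y : y + mouth_h, x - w // 8 : x + w // 8] = 0  # dark open mouth
    return frame

def animate(face: np.ndarray, waveform: np.ndarray, sample_rate: int):
    """Produce one video frame per audio frame, mouth tracking the speech."""
    return [render_frame(face, o) for o in audio_to_motion(waveform, sample_rate)]

# Usage: one second of synthetic 'speech' driving a blank grey 'face'.
rng = np.random.default_rng(0)
speech = rng.normal(size=16000) * np.sin(np.linspace(0, 8 * np.pi, 16000)) ** 2
frames = animate(np.full((256, 256), 200, dtype=np.uint8), speech, 16000)
print(f"{len(frames)} frames at {FRAME_RATE} fps")
```

The gulf between that sketch and VASA-1 is, of course, the entire point: where the toy flaps a rectangle in time with the volume, Microsoft's model produces photorealistic frames with synchronised lips, expressions, and head movement.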

I'm not doing it any justice with that description, because some of the examples posted by Microsoft are startlingly good. Others aren't so hot, though, and it's clear that the researchers selected the best examples to showcase what they've achieved. In particular, a short video demonstrating the use of the model in real-time highlights that it still has a long way to go before it becomes impossible to distinguish real footage from computer-generated footage.

But even so, the fact that this was all done on a desktop PC, albeit one using an RTX 4090, rather than a massive supercomputer, shows that with access to such software, pretty much anyone could use generative AI to create a highly convincing deepfake. The researchers acknowledge as much in the report.

"It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection."

This is probably why Microsoft's research remains behind closed doors for now. That said, I can't imagine it will be long before someone manages to not only replicate the work but improve on it, and potentially put it to some nefarious purpose. On the other hand, if VASA-1's techniques can be turned to detecting deepfakes, and implemented as a simple desktop application, then this would be a big step forward. Or rather, a step away from a world where AI dooms us all. Yay!