New York (CNN)
The Mona Lisa can now do more than smile, thanks to new artificial intelligence technology from Microsoft.
Last week, Microsoft researchers detailed a new AI model they’ve developed that can take a still image of a face and an audio clip of someone speaking and automatically create a realistic-looking video of that person speaking. The videos, which can be made from photorealistic faces as well as cartoons or artwork, are complete with compelling lip syncing and natural face and head movements.
In one demo video, researchers showed how they animated the Mona Lisa to recite a comedic rap by actor Anne Hathaway.
Outputs from the AI model, called VASA-1, are both entertaining and a bit jarring in their realness. Microsoft said the technology could be used for education or “improving accessibility for individuals with communication challenges,” or potentially to create virtual companions for humans. But it’s also easy to see how the tool could be abused and used to impersonate real people.
It’s a concern that goes beyond Microsoft: as more tools to create convincing AI-generated images, videos and audio emerge, experts worry that their misuse could lead to new forms of misinformation. Some also worry the technology could further disrupt creative industries from film to advertising.
For now, Microsoft said it doesn’t plan to release the VASA-1 model to the public immediately. The move is similar to how Microsoft partner OpenAI is handling concerns around its AI-generated video tool, Sora: OpenAI teased Sora in February but has so far made it available only to some professional users and cybersecurity researchers for testing purposes.
“We are opposed to any behavior to create misleading or harmful contents of real persons,” Microsoft researchers said in a blog post. But, they added, the company has “no plans to release” the product publicly “until we are certain that the technology will be used responsibly and in accordance with proper regulations.”
Microsoft’s new AI model was trained on numerous videos of people’s faces as they spoke, and it’s designed to recognize natural face and head movements, including “lip motion, (non-lip) expression, eye gaze and blinking, among others,” researchers said. The result is a more lifelike video when VASA-1 animates a still photo.
For example, in one demo video set to a clip of someone sounding agitated, apparently while playing video games, the speaking face has furrowed brows and pursed lips.
The AI tool can also be directed to produce a video where the subject is looking in a certain direction or expressing a specific emotion.
On close inspection, there are still signs that the videos are machine-generated, such as infrequent blinking and exaggerated eyebrow movements. But Microsoft said it believes its model “significantly outperforms” other, similar tools and “paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.”