The algorithm works by first converting audio of an individual’s speech into realistic basic mouth shapes, having trained by ‘watching’ many hours of video of that person talking. Then, using a new mouth synthesis technique, the system grafts and blends the mouth shapes onto the head of the person in an existing reference video.
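The two stages described above can be illustrated with a toy sketch. This is not the researchers' code: the linear map standing in for the trained network, the flat-brightness mouth renderer, and every function name here are hypothetical, and frames are simplified to small grayscale grids. It only shows the general shape of the pipeline: audio features become mouth-shape coefficients, which are rendered as a patch and blended into each frame of a reference video.

```python
from typing import List, Sequence

Frame = List[List[float]]  # grayscale image: rows of pixel intensities in [0, 1]


def audio_to_mouth_shape(features: Sequence[float],
                         weights: Sequence[Sequence[float]]) -> List[float]:
    """Assumed stand-in for the trained network: a plain linear map
    from per-frame audio features to mouth-shape coefficients."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def render_mouth(shape: Sequence[float], height: int, width: int) -> Frame:
    """Hypothetical renderer: a flat patch whose brightness is the first
    coefficient (treated as mouth openness), clamped to [0, 1]."""
    level = min(1.0, max(0.0, shape[0]))
    return [[level] * width for _ in range(height)]


def blend_patch(frame: Frame, patch: Frame,
                top: int, left: int, alpha: float) -> Frame:
    """Alpha-blend the patch into a copy of the frame at (top, left)."""
    out = [row[:] for row in frame]  # copy so the reference frame is untouched
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            old = out[top + r][left + c]
            out[top + r][left + c] = alpha * value + (1.0 - alpha) * old
    return out


def synthesize(audio_frames: Sequence[Sequence[float]],
               weights: Sequence[Sequence[float]],
               reference_video: Sequence[Frame],
               top: int, left: int, size: int,
               alpha: float = 0.8) -> List[Frame]:
    """Pair each audio frame with a reference frame and graft a mouth patch."""
    result = []
    for features, frame in zip(audio_frames, reference_video):
        shape = audio_to_mouth_shape(features, weights)
        patch = render_mouth(shape, size, size)
        result.append(blend_patch(frame, patch, top, left, alpha))
    return result
```

In the real system, the mapping is learned from many hours of video of one person and the mouth is rendered as realistic texture; the sketch replaces both with the simplest possible placeholders.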
In a demonstration of the new algorithm, the researchers generated highly realistic video of former President Barack Obama talking about various topics, using audio clips from speeches and video addresses that originally accompanied different footage.
According to the researchers, the new machine learning tool makes significant progress in overcoming the “uncanny valley” problem, which has thwarted previous efforts to create realistic video from audio. In particular, says Supasorn Suwajanakorn, lead author of the paper on the algorithm, “People are particularly sensitive to any areas of your mouth that don’t look realistic. So you have to render the mouth region perfectly to get beyond the uncanny valley.”
“These type of results have never been shown before,” says Ira Kemelmacher-Shlizerman, an assistant professor at UW’s Paul G. Allen School of Computer Science & Engineering. “Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio. This is the kind of breakthrough that will help enable those next steps.”
The former president was chosen as a subject because of the large amount of video of him available for the algorithm to learn from. Looking ahead, says Kemelmacher-Shlizerman, video chat tools will let anyone collect videos that could be used to train computer models.
Currently, the neural network trains on only one individual at a time, and that person’s speaking voice is the only input used to control the synthesized video. In the future, the researchers hope to enable the algorithm to recognize a person’s voice and speech patterns with less data.
The researchers caution, however: “You can’t just take anyone’s voice and turn it into an Obama video.” They intentionally decided against going down that path, says paper co-author and Allen School professor Steve Seitz. “We’re simply taking real words that someone spoke and turning them into realistic video of that individual.”
For more, see “Synthesizing Obama: Learning Lip Sync from Audio.” (PDF)