Microsoft’s VASA-1 Could Enable Webcam-Free Video Calls
Get ready for AI to make videos from just your picture.
Microsoft Research recently unveiled VASA, a new AI framework demonstration capable of generating “hyper-realistic” talking faces from a single portrait and speech audio, possibly reducing the reliance on webcams.
The new technology introduces a shift in video conferencing, potentially making webcams obsolete by synthesizing lifelike facial expressions and speech. As experts delve into the practical applications of this technology, they also raise concerns about its possible misuse in creating deepfakes.
“According to research, more than half (66%) of organizations are eager to use AI for 2024 video projects, and though AI holds a lot of promise for video creators, hyper-realistic AI-generated avatars challenge the boundaries of ethical AI,” Chris Savage, the CEO of Wistia, a video marketing platform, told PYMNTS.
“Today, most of these AI-generated videos are used for internal educational purposes, which ultimately can improve communications for businesses,” he added. “However, removing the human element within videos challenges the trust and integrity of the content within.”
Pictures to Video
The VASA system allows users to adjust the subject’s eye movements, perceived distance and expressed emotions. VASA-1, the first in a series of AI tools, can create specific facial expressions, sync lip movements accurately and mimic human-like head movements. Additionally, it offers a broad range of emotions and can generate subtle facial details. Microsoft said the system is a research demonstration only and that it has no plans to release it.
“Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications,” Microsoft wrote on its website. “It is not intended to create content that is used to mislead or deceive.
“However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection,” Microsoft added. “Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows that there’s still a gap to achieve the authenticity of real videos.”
On its research website, Microsoft describes how the technology works. The key advancement is a model that generates holistic facial dynamics and head movements within a learned representation of faces, trained on video data. The method produces high-quality videos with realistic facial movements and can generate video in real time at 512×512 resolution, running at up to 40 frames per second with minimal latency. The technology allows for real-time conversations with avatars that behave like humans.
Growing Concerns About Authenticity
AI-powered video tools are fueling concerns about deepfakes. PYMNTS reported in February that the Federal Trade Commission (FTC) is considering a new set of regulations aimed at banning the impersonation of individuals. This initiative is in response to a rise in complaints regarding impersonation fraud. The FTC has expressed its determination to use all available resources to identify and prevent such fraud.
The agency has also highlighted that new technologies, like AI-generated deepfakes, could exacerbate fraud issues. This announcement follows another PYMNTS report that consumers faced a record $10 billion in fraud losses in 2023, marking a 14% increase from the previous year, based on FTC data.
Systems like VASA mean that organizations will have to be more careful during the hiring process, Savage said.
“AI replacing webcams is a very real possibility, and with virtual interviews being a common practice for organizations, how can we be sure the potential employee is who they say they are?” Savage noted. “Or, on the other hand, that the hiring company is legitimate? I foresee this being a bigger conversation over the next few years, along with the amount of trust folks place in everyday content.”