AI Advances Video Analysis for eCommerce
Artificial intelligence (AI) is cutting through the noise in online videos, helping shoppers quickly find the information they need.
Researchers at MIT and IBM have developed an AI method that could help viewers navigate directly to the most relevant parts of a video. At the same time, Video Summarizer AI and Mindstamp focus on providing interactive and multilingual summaries of educational videos to improve learning productivity and accessibility.
“By submitting the audio transcript for a video to AI and augmenting that AI with additional metadata, viewers can have a ‘conversation’ with the video that results in immediate answers to their questions and dynamic links directly to relevant content,” Brett Lindenberg, CEO and founder of Mindstamp, a software company that makes interactive videos, told PYMNTS.
PYMNTS previously reported that, as Amazon and Walmart seek to drive sales through content, Amazon Live has launched an interactive, shoppable free ad-supported streaming TV (FAST) channel on Prime Video and Amazon Freevee. The channel allows viewers to shop and engage with the content they watch on their TVs using their mobile devices.
Tackling Challenges
The MIT researchers have created a novel approach to teach AI models to perform spatio-temporal grounding, which involves locating a specific action in both space and time, including when it starts and ends within a video. Traditional methods for this task require extensive human annotation, which is costly, time-consuming and can be subjective. The challenge lies in determining the precise boundaries of an action, such as deciding when the action of “cooking a pancake” begins — is it when the chef starts mixing the batter or when the batter is poured into the pan?
To overcome these issues, the MIT team uses unlabeled instructional videos and their accompanying text transcripts from websites such as YouTube as training data. The training process is divided into two parts: first, a machine-learning model is taught to understand what actions occur at specific times throughout the video, creating a global representation. Second, the model is trained to focus on the particular regions where an action occurs, generating a local representation. This allows the model to concentrate on relevant objects and actions rather than the entire scene.
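To make the two-part training idea concrete, here is a minimal, hypothetical sketch in PyTorch: a global clip-level embedding and a local region-weighted embedding are each aligned with the narration text through a standard contrastive loss. The encoders, dimensions and loss are placeholders chosen for illustration, not the MIT and IBM team's actual architecture or code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of global + local video-text alignment with a
# contrastive loss. All layers and sizes are invented for illustration.

class GlobalLocalGrounder(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=256):
        super().__init__()
        # Global head: pools per-frame features into one clip-level embedding.
        self.global_head = nn.Linear(feat_dim, embed_dim)
        # Local head: weights spatial regions so the model focuses on where
        # the narrated action happens, not the whole scene.
        self.region_attn = nn.Linear(feat_dim, 1)
        self.local_head = nn.Linear(feat_dim, embed_dim)
        self.text_head = nn.Linear(feat_dim, embed_dim)

    def forward(self, frame_feats, region_feats, text_feats):
        # frame_feats:  (batch, frames, feat_dim)           per-frame features
        # region_feats: (batch, frames, regions, feat_dim)  per-region features
        # text_feats:   (batch, feat_dim)                    narration features
        global_emb = F.normalize(self.global_head(frame_feats.mean(dim=1)), dim=-1)
        attn = F.softmax(self.region_attn(region_feats), dim=2)  # weight regions
        local_emb = F.normalize(
            self.local_head((attn * region_feats).sum(dim=2).mean(dim=1)), dim=-1)
        text_emb = F.normalize(self.text_head(text_feats), dim=-1)
        return global_emb, local_emb, text_emb

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Standard InfoNCE: matching video/text pairs in the batch are positives,
    # every other pairing is a negative.
    logits = video_emb @ text_emb.T / temperature
    targets = torch.arange(len(video_emb))
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors standing in for real video/text encoders.
model = GlobalLocalGrounder()
frames = torch.randn(4, 32, 512)
regions = torch.randn(4, 32, 16, 512)
text = torch.randn(4, 512)
g, l, t = model(frames, regions, text)
loss = contrastive_loss(g, t) + contrastive_loss(l, t)
loss.backward()
```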
The researchers also incorporate an additional component to mitigate misalignments between narration and video, such as when a chef talks about an action before actually performing it. To develop a more realistic solution, they focus on uncut videos that span several minutes, in contrast to most AI techniques that train using few-second, trimmed clips showing only one action.
Evaluating their approach required the MIT researchers to create a new benchmark dataset using a novel annotation technique that effectively identifies multistep actions. Instead of drawing boxes around important objects, annotators mark the point where objects interact, such as where a knife edge cuts a tomato. This method enables the model to learn from more natural, uncut videos and accurately pinpoint the start and end times of complex actions.
The MIT team’s approach has significant implications for various domains, from eCommerce to education. By eliminating the need for costly and time-consuming human annotation, their method enables AI models to learn from a vast array of unlabeled instructional videos, making the training process more efficient and allowing the models to generalize across different tasks and domains.
In eCommerce, this technology could help shoppers quickly find the information they need in product videos, such as demonstrations of specific features or assembly instructions. By identifying critical moments within the video, the AI model can provide users with links to relevant content, which may enhance the overall shopping experience.
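As a rough illustration of how grounding output could power shoppable video, the sketch below assumes a model has already produced labeled (start, end) segments for a product video; a shopper's question is matched to the closest segment and turned into a timestamped link. The segment data, scoring rule and URL are invented for the example.

```python
# Hypothetical illustration: map a shopper's question to a timestamped
# deep link, given (start, end) segments from a grounding model.

def best_segment(query, segments):
    """Pick the segment whose label shares the most words with the query."""
    query_words = set(query.lower().split())
    return max(segments,
               key=lambda s: len(query_words & set(s["label"].lower().split())))

segments = [
    {"label": "unboxing and parts overview", "start": 12, "end": 55},
    {"label": "attaching the stand and assembly", "start": 56, "end": 140},
    {"label": "demonstration of the timer feature", "start": 141, "end": 200},
]

hit = best_segment("how do I assemble the stand", segments)
print(f"https://example.com/product-video?t={hit['start']}")  # jumps to assembly
```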
Video Summaries for Education
Video Summarizer AI and Mindstamp are focusing on educational video content by providing multilingual and interactive summaries that aim to improve learning productivity and accessibility.
Video Summarizer AI’s creator, Klym Zhuravlov-Iuzefovych, said the tool boosts the productivity of video-based learning by allowing students to interact with video lectures in their native language, potentially removing language barriers and promoting inclusivity.
Mindstamp’s AI-powered platform aims to create interactive elements within the video. Lindenberg explained, “By using AI to analyze videos, the AI can produce a series of interactive elements within the video, including questions to verify understanding, links to third-party data sources to add additional insight, links to further AI explanations of topics, and more. Effectively, the video becomes an interactive educational or training resource.”
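As a generic sketch of the workflow Lindenberg describes, not Mindstamp's actual pipeline, the snippet below sends a timestamped transcript to a large language model and asks for interactive elements anchored to moments in the video. The model name, prompt and output schema are assumptions for illustration only.

```python
# Generic sketch (not Mindstamp's implementation): ask an LLM to propose
# interactive elements anchored to a timestamped transcript. Model name,
# prompt and schema are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

transcript = """\
00:15 First, whisk the dry ingredients together.
01:02 Pour the batter onto a preheated pan.
02:40 Flip the pancake once bubbles form on the surface.
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "You turn video transcripts into interactive elements. "
                    "Return a JSON list of objects with 'timestamp', 'type' "
                    "('question' or 'link') and 'content'."},
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)  # proposed questions/links per timestamp
```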
Moreover, Lindenberg noted that “AI can identify critical pieces of a video and dynamically create chaptering, links, references and branching between videos,” which may further enhance the educational value of video content.
Video Summarizer AI is built on a custom GPT (generative pre-trained transformer) model designed to comprehend and summarize complex, lengthy educational material across subjects and academic levels. The tool’s integration with the ChatGPT interface and OpenAI’s technology makes it available on both desktop and mobile devices.
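In the same spirit, a multilingual lecture summary can be sketched in a few lines against OpenAI's public chat API; this is not Video Summarizer AI's implementation, and the model, prompt, sample transcript and target language are placeholders.

```python
# Hypothetical sketch: summarize a lecture transcript in the viewer's
# native language. Model, prompt and transcript are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

lecture_transcript = (
    "Today we cover supply and demand curves, market equilibrium, "
    "and how price ceilings create shortages."
)

summary = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Summarize the lecture transcript as concise bullet points. "
                    "Write the summary in Spanish."},
        {"role": "user", "content": lecture_transcript},
    ],
)
print(summary.choices[0].message.content)
```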
Beyond shopping and education, MIT research and tools like Mindstamp could streamline video-based employee training and telemedicine. As video content becomes increasingly central to online life, AI innovations from MIT, IBM, Video Summarizer AI, and Mindstamp could impact customer experience, learning productivity, and inclusivity.
While these technologies show potential, claims about their effectiveness should be approached cautiously until further research and real-world testing clarify their impact on eCommerce, education and other domains. As these technologies evolve and integrate, they may usher in a new era of user-friendly, efficient and inclusive video-based experiences across industries.