Looking for a specific action in a video? This AI-based method can find it for you | MIT News

The Internet is full of instructional videos that can teach curious viewers everything from cooking the perfect pancake to performing the life-saving Heimlich maneuver.

But pinpointing when and where a particular action takes place in a long video can be tedious. To streamline the process, scientists are trying to teach computers to perform this task. Ideally, a user could simply describe the action they're looking for, and an AI model would skip to its location in the video.

However, training a machine learning model to do this typically requires a lot of expensive video data that has been laboriously labeled by hand.

A new, more efficient approach from researchers at MIT and the MIT-IBM Watson AI Lab trains a model to perform this task, known as spatio-temporal grounding, using only videos and their automatically generated transcripts.
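As a rough illustration, a training example in this setting might look like the following: just a raw, uncut video paired with the time-stamped transcript produced by automatic speech recognition, with no human-drawn boxes or hand-labeled start and end times. The field names below are hypothetical, not the paper's actual data format.

```python
# Hypothetical sketch of an unlabeled training example: a video plus its
# automatically generated (ASR) transcript. No human annotation is involved.
training_example = {
    "video_path": "videos/pancake_tutorial.mp4",  # raw, uncut instructional video
    "transcript": [                               # ASR output with timestamps (seconds)
        {"text": "first whisk the eggs and milk",    "start": 12.4, "end": 16.9},
        {"text": "now pour the batter into the pan", "start": 48.2, "end": 52.0},
    ],
}
```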

The researchers teach a model to understand unlabeled video in two distinct ways: by looking at small details to figure out where objects are located (spatial information) and by looking at the bigger picture to understand when an action occurs (temporal information).

Compared with other AI approaches, their method more accurately identifies actions in long videos that contain multiple activities. Interestingly, they found that training on spatial and temporal information simultaneously makes the model better at identifying each one individually.

In addition to streamlining online learning and virtual training processes, the technique could also be useful in healthcare settings, for example by quickly finding key moments in videos of diagnostic procedures.

“We eliminate the challenge of trying to encode spatial and temporal information together and instead think of it as two experts working independently, which is a more intuitive way of encoding information. Our model, which combines these two separate branches, leads to excellent performance,” says Brian Chen, lead author of a paper on the technique.

Chen, a 2023 graduate of Columbia University who conducted this research while a visiting student at the MIT-IBM Watson AI Lab, is joined on the paper by James Glass, senior research scientist, member of the MIT-IBM Watson AI Lab, and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Hilde Kuehne, a member of the MIT-IBM Watson AI Lab who is also affiliated with Goethe University Frankfurt; and others at MIT, Goethe University, the MIT-IBM Watson AI Lab, and Quality Match GmbH. The research will be presented at the Conference on Computer Vision and Pattern Recognition.

Global and local learning

Researchers typically teach models to perform spatio-temporal grounding using videos in which humans have annotated the start and end times of particular tasks.

Not only is this data expensive to generate, but it can be difficult for humans to know exactly what to label. If the task is “baking a pancake,” does it begin when the chef starts mixing the batter or when she pours it into the pan?

“These tasks might be about cooking this time, but next time they might be about fixing a car. There are so many different domains for people to annotate. But if we can learn everything without labels, it's a more general solution,” Chen says.

For their approach, the researchers use unlabeled instructional videos and their text transcripts from websites such as YouTube as training data. These need no special preparation.

They split the training process into two parts. For one, they teach a machine-learning model to look at the entire video to understand what actions occur at certain times. This high-level information is called a global representation.

For the other, they teach the model to focus on a specific region in the parts of the video where the action is taking place. In a large kitchen, for example, the model may need to focus only on the wooden spoon the chef is using to mix the pancake batter, rather than the entire counter. This fine-grained information is called a local representation.
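To make the two representations concrete, here is a minimal sketch in PyTorch of a two-branch encoder, not the authors' code: one head pools over the whole video to produce a global, temporal summary, while the other keeps per-frame, per-region features for local, spatial detail. The module names, feature shapes, and pooling choices are illustrative assumptions rather than the paper's architecture.

```python
# Minimal two-branch sketch: a global (temporal) summary of the whole video
# and local (spatial) features per frame region. Shapes are assumptions.
import torch
import torch.nn as nn


class TwoBranchVideoEncoder(nn.Module):
    def __init__(self, feat_dim: int = 512, embed_dim: int = 256):
        super().__init__()
        # Global branch: summarizes the full frame sequence (when things happen).
        self.global_head = nn.Sequential(
            nn.Linear(feat_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        # Local branch: keeps per-region features within each frame (where things are).
        self.local_head = nn.Sequential(
            nn.Linear(feat_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, region_feats: torch.Tensor):
        # region_feats: (batch, frames, regions, feat_dim) from any video backbone.
        frame_feats = region_feats.mean(dim=2)                    # pool regions per frame
        global_repr = self.global_head(frame_feats.mean(dim=1))   # (batch, embed_dim)
        local_repr = self.local_head(region_feats)                # (batch, frames, regions, embed_dim)
        return global_repr, local_repr


# Example: 2 videos, 64 sampled frames, 49 spatial regions, 512-d features.
encoder = TwoBranchVideoEncoder()
g, l = encoder(torch.randn(2, 64, 49, 512))
print(g.shape, l.shape)  # torch.Size([2, 256]) torch.Size([2, 64, 49, 256])
```

In the paper's framing, both outputs would then be aligned with the narration text; the sketch only shows how the two kinds of information can be kept in separate branches.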

The researchers add an additional component to their framework to mitigate misalignments between the narration and the video. Perhaps the chef talks about cooking the pancake first and performs the action later.

To develop a more realistic solution, the researchers focused on uncut videos several minutes long. In contrast, most AI techniques train on few-second clips that someone has trimmed to show only one action.

A new benchmark

But when it came time to evaluate their approach, the researchers couldn't find an effective benchmark for testing the model on these long, uncut videos, so they created one.

To build their benchmark dataset, the researchers devised a new annotation technique that works well for identifying multistep actions. They asked users to mark the intersection of objects, such as where the edge of a knife cuts a tomato, rather than drawing a box around important objects.

“It's more clearly defined and speeds up the annotation process, reducing human labor and cost,” Chen says.

In addition, having multiple people do point annotations on the same video can better capture actions that occur over time, such as milk being poured. All the annotators won't mark exactly the same point in the flow of liquid.
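A hedged sketch of how such point annotations might be stored, alongside a traditional bounding box for contrast; the field names and schema here are illustrative assumptions, not the benchmark's actual format.

```python
# Illustrative comparison of box-style vs. point-style annotation records.
from dataclasses import dataclass


@dataclass
class BoxAnnotation:
    action: str
    frame_sec: float
    box: tuple[float, float, float, float]  # x_min, y_min, x_max, y_max around an object


@dataclass
class PointAnnotation:
    action: str
    frame_sec: float
    point: tuple[float, float]  # single (x, y) where the interaction happens
    annotator_id: int           # several annotators can mark the same moment


# e.g., the spot where the knife edge meets the tomato at second 93.5
ann = PointAnnotation(action="cut tomato", frame_sec=93.5,
                      point=(0.41, 0.62), annotator_id=3)
```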

When they used this benchmark to test their approach, the researchers found that it was more accurate at identifying actions than other AI techniques.

Their approach was also better at focusing on human-object interaction. For example, if the action is “serving pancakes,” many other methods may focus only on key items, such as a stack of pancakes sitting on the counter. Instead, their method focuses on the actual moment when the chef flips the pancake onto the plate.

Next, the researchers plan to enhance their approach so the model can automatically detect when the text and narration are not aligned, and switch focus from one modality to the other. They also want to extend their framework to audio data, since there are usually strong correlations between actions and sounds.

This research is funded, in part, by the MIT-IBM Watson AI Lab.
