Exploration-focused training allows robotics AI to quickly take on new tasks.



Reinforcement learning algorithms in systems like ChatGPT or Google's Gemini can work wonders, but they usually need millions of attempts before they get good at a task. That has always made transferring this performance to robots difficult: you can't let a self-driving car crash 3,000 times before it learns that crashing is bad.

But now a team of researchers at Northwestern University has found a way around this. “This is what we think is going to be transformative in the development of embodied AI in the real world,” says Thomas Berrueta, who led the development of Maximum Diffusion Reinforcement Learning (MaxDiff RL), an algorithm designed specifically for robots.

Introduction to Chaos

The problem with deploying most reinforcement learning algorithms on robots starts with their built-in assumption that the data they learn from is independent and identically distributed (i.i.d.). In this context, independence means that the value of one sample does not depend on the value of another sample in the dataset—when you flip a coin twice, the outcome of the second flip does not depend on the outcome of the first. Identically distributed means that every sample is drawn from the same probability distribution—in the coin flip example, every flip has the same 50 percent chance of heads and 50 percent chance of tails.
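To make the distinction concrete, here is a small illustrative sketch in Python (not from the study): coin flips are independent and identically distributed, while a robot's trajectory, modeled here as a simple random walk, produces consecutive samples that strongly depend on one another.

```python
import numpy as np

rng = np.random.default_rng(0)

# i.i.d. data, as most RL algorithms assume: each coin flip is independent
# of the last and drawn from the same 50/50 distribution.
coin_flips = rng.integers(0, 2, size=1000)

# Robot-style data: each new position depends on the previous one (modeled
# here as a random walk), so consecutive samples are strongly correlated.
robot_positions = np.cumsum(rng.normal(0.0, 0.1, size=1000))

# Correlation between consecutive samples: near zero for the coin flips,
# near one for the robot trajectory.
print(np.corrcoef(coin_flips[:-1], coin_flips[1:])[0, 1])
print(np.corrcoef(robot_positions[:-1], robot_positions[1:])[0, 1])
```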

In virtual, disembodied systems, such as YouTube's recommendation algorithm, this kind of data is easy to obtain because most of the time it meets these requirements right off the bat. “You have a bunch of users of a website, and you get data from one of them, and then you get data from another one. Most likely, those two users are not from the same household, so they're not highly correlated. They could be, but it's very unlikely,” says Todd Murphey, a professor of mechanical engineering at Northwestern.

The problem is that if those two users actually were related and living in the same household, one of them may have watched a video only because a family member saw it and recommended it. That would violate the independence requirement and compromise learning.

“In a robot, it's usually not possible to gather this kind of independent, identically distributed data. You exist at a particular point in space and time when you're embodied, so your experiences are not independent of one another.” To solve this, the Northwestern team designed an algorithm that, put simply, forces the robot to be as randomly adventurous as possible, so that it learns from the widest collection of experiences available.

Two flavors of entropy

The idea itself is not new. About two decades ago, AI researchers came up with algorithms such as Maximum Entropy Reinforcement Learning (MaxEnt RL), which rely on randomizing actions during training. “The hope was that by taking as diverse a set of actions as possible, you'd explore a more diverse set of possible futures. The problem is that those actions don't exist in a vacuum,” Berrueta says. Every action a robot takes affects both its environment and its own state, and ignoring those effects entirely tends to cause trouble. Simply put, an autonomous car teaching itself to drive with this approach might park gracefully in your driveway, but it would be just as likely to hit a wall at full speed.
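As a rough picture of the MaxEnt RL idea, and not the exact objective of any particular implementation, the sketch below adds a bonus for keeping the action distribution random (its entropy) to the ordinary sum of rewards; the function names and the alpha weight are purely illustrative.

```python
import numpy as np

def action_entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12))

def maxent_objective(rewards, action_probs_per_step, alpha=0.1):
    """Sum of task rewards plus an entropy bonus at every step.

    alpha trades off task reward against how random the policy stays.
    """
    entropy_bonus = sum(action_entropy(p) for p in action_probs_per_step)
    return sum(rewards) + alpha * entropy_bonus

# Example: a toy three-step episode with a 4-action policy.
rewards = [1.0, 0.0, 2.0]
policies = [[0.25, 0.25, 0.25, 0.25],   # maximally random step
            [0.7, 0.1, 0.1, 0.1],       # fairly committed step
            [0.4, 0.3, 0.2, 0.1]]
print(maxent_objective(rewards, policies))
```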

To solve this, Berrueta's team moved away from maximizing the diversity of actions and toward maximizing the diversity of state transitions. Robots powered by MaxDiff RL don't flail their robotic limbs around at random just to see what happens. Instead, they set themselves goals such as “Can I reach this point ahead of me?” and then work out which actions will get them there safely.

Berrueta and his colleagues achieved this through something called ergodicity, a mathematical concept which says that a point in a moving system will eventually visit every part of the space the system moves through. Basically, MaxDiff RL encourages robots to visit every available state in their environment. And the results of the first tests in a simulated environment were quite surprising.
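One toy way to picture this shift, and emphatically not the authors' implementation, is to score a trajectory by how evenly it covers the state space rather than by how random its individual actions were. The crude histogram-based coverage score below stands in for that idea: an agent that stays put scores low, while one that spreads out scores high.

```python
import numpy as np

def state_coverage_entropy(states, bins=10, low=0.0, high=1.0):
    """Entropy of the empirical distribution of visited (1-D) states."""
    hist, _ = np.histogram(states, bins=bins, range=(low, high))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(1)
stuck = 0.5 + 0.01 * rng.standard_normal(100)   # barely moves from one spot
explorer = rng.random(100)                       # visits states everywhere

# The exploring trajectory earns a much higher coverage score.
print(state_coverage_entropy(stuck), state_coverage_entropy(explorer))
```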

Racing Pool Noodles

“Reinforcement learning has standard benchmarks against which people run their algorithms, so we have a good way to compare different algorithms in a standard framework,” says Allison Pinosky, a researcher at Northwestern and co-author of the MaxDiff RL study. One of these benchmarks is a simulated swimmer: a three-link body resting on the ground in a viscous environment that needs to learn to swim as fast as possible in a certain direction.

In the swimmer test, MaxDiff RL outperformed two other state-of-the-art reinforcement learning algorithms (NN-MPPI and SAC). Both of those needed several resets to figure out how to move the swimmer. They followed the standard AI learning process, which is divided into a training phase, where an algorithm goes through multiple failed attempts to gradually improve its performance, and a testing phase, where it tries to perform the learned task. MaxDiff RL, by contrast, nailed it, adapting its learned behaviors to new tasks instantly.
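For readers curious what that conventional train-then-test loop looks like in practice, here is a hedged sketch assuming the Gymnasium MuJoCo Swimmer environment and the off-the-shelf SAC implementation from stable-baselines3 (one of the baselines mentioned above). The timestep counts are arbitrary, and this shows the baseline workflow the article contrasts MaxDiff RL with, not MaxDiff RL itself.

```python
import gymnasium as gym
from stable_baselines3 import SAC

train_env = gym.make("Swimmer-v4")

# Training phase: the algorithm improves over many (often failed) attempts.
model = SAC("MlpPolicy", train_env, verbose=0)
model.learn(total_timesteps=100_000)

# Testing phase: run the learned policy without further updates.
test_env = gym.make("Swimmer-v4")
obs, _ = test_env.reset(seed=0)
total_reward = 0.0
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = test_env.step(action)
    total_reward += reward
    if terminated or truncated:
        break
print("episode return:", total_reward)
```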

Earlier algorithms failed to learn because they got stuck trying the same options and never progressed to the point where they could learn that alternatives worked. “They tested the same data over and over again because they were doing something locally, and they assumed that was all they could do and stopped learning,” Pinosky explains. MaxDiff RL, on the other hand, continued to change states, explore, get more and more data to learn from, and eventually succeeded. And because, by design, it tries to capture every possible state, it can potentially complete all possible tasks within the environment.

But does that mean we can take MaxDiff RL, upload it to a self-driving car, and let it out on the road to figure everything out on its own? Not really.

