Reinforcement Learning via Motion Planning and Control

Motion planning underlies the navigational and manipulation capabilities of most robots in production today. However, many motion planners operate under strong assumptions that generally do not hold in the real world. In this project, you will investigate whether a reinforcement learning policy trained from motion planner demonstrations can generate robust and reliable movements under realistic real-world conditions.


An interactive software framework for real-time Model Predictive Control: MJPC

Motion planners also tend to be slower than real-time in the presence of obstacles, limiting their practicality for high-throughput applications.

On the other hand, the field of reinforcement learning (RL) has sought to directly synthesize motion controllers through environment interactions, without any explicit planning, sampling or trajectory optimization. Although controllers trained with RL have demonstrated impressive results, the training process is subject to several challenges such as sample inefficiency.

These two approaches can be compared to Systems 1 and 2, introduced by the psychologist Daniel Kahneman and popularized in his book Thinking, Fast and Slow: motion planning is the slow, deliberate, methodical type of thinking (System 2), while RL policies are the fast, habitual, reactive type of thinking (System 1).

Several studies have attempted to combine the two approaches by extracting a reactive policy from demonstrations generated by a motion planner. Such policies have several advantages over conventional motion planners:

  • They are fast, capable of generating actions at 20 Hz or more.

  • They can, in principle, handle partial observability (for example due to occlusions) as well as stochastic dynamics.

  • They do not require a perception module for each observation modality.

Below are some suggested research directions:

  • Offline RL via Model Predictive Control. Model Predictive Control (MPC) is a motion planning and control method that re-optimizes a trajectory over a short time horizon at every time step, executing only the first action of the optimized plan. Training the RL policy entirely offline from demonstrations obtained through MPC can be much more efficient than training it online (see the first sketch after this list).
  • Refining the RL policy via Data Aggregation. Data aggregation is a technique in online RL where demonstration data is incorporated into the experience buffer. Providing examples of high-reward behavior can bias the algorithm towards better-performing policies (see the second sketch after this list).
  • Learning to choose between RL and Motion Planning. Once an RL policy is trained, it can still be advantageous to fall back on the motion planner in certain circumstances, such as during highly risky maneuvers. This decision mechanism can itself be trained and implemented as a high-level controller (see the third sketch after this list).
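
As a concrete starting point for the first direction, the sketch below shows how MPC demonstrations could be collected for offline training, using the MuJoCo Python bindings and a naive random-shooting planner. The model file, cost function, horizon, sample count, and action range are placeholder assumptions; a real project would more likely rely on MJPC or a stronger trajectory optimizer.

```python
import numpy as np
import mujoco

# Placeholder model path -- substitute the MJCF file of the task of interest.
model = mujoco.MjModel.from_xml_path("task.xml")
data = mujoco.MjData(model)

HORIZON = 20      # planning horizon in time steps (assumed)
N_SAMPLES = 256   # candidate action sequences per re-plan (assumed)
EPISODE_LEN = 200

def cost(d):
    """Hypothetical stage cost: drive qpos towards zero with small control effort."""
    return float(np.sum(d.qpos ** 2) + 1e-3 * np.sum(d.ctrl ** 2))

def plan(qpos, qvel):
    """Naive random-shooting MPC: sample action sequences, roll each one out
    from the current state, and return the first action of the cheapest one."""
    scratch = mujoco.MjData(model)
    best_cost, best_action = np.inf, np.zeros(model.nu)
    for _ in range(N_SAMPLES):
        seq = np.random.uniform(-1.0, 1.0, size=(HORIZON, model.nu))  # assumed ctrl range
        scratch.qpos[:] = qpos
        scratch.qvel[:] = qvel
        mujoco.mj_forward(model, scratch)
        total = 0.0
        for a in seq:
            scratch.ctrl[:] = a
            mujoco.mj_step(model, scratch)
            total += cost(scratch)
        if total < best_cost:
            best_cost, best_action = total, seq[0]
    return best_action

# Roll out the planner and log (observation, action) pairs as demonstrations
# for offline policy training (e.g. behavior cloning or offline RL).
observations, actions = [], []
mujoco.mj_resetData(model, data)
for _ in range(EPISODE_LEN):
    obs = np.concatenate([data.qpos, data.qvel])
    act = plan(data.qpos.copy(), data.qvel.copy())
    observations.append(obs)
    actions.append(act)
    data.ctrl[:] = act
    mujoco.mj_step(model, data)

np.savez("mpc_demos.npz", obs=np.stack(observations), act=np.stack(actions))
```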
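
For the second direction, one possible loop is DAgger-style aggregation: roll out the current policy, relabel the visited states with planner actions, and retrain on the growing dataset. The sketch below uses PyTorch and replaces the environment and the planner with toy stand-ins (env_reset, env_step, expert_action); these names, the network, and all hyperparameters are hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 8, 2  # assumed dimensions

# Toy stand-ins for the task; in the project these would be the MuJoCo
# environment and the MPC planner from the previous sketch.
def env_reset():
    return np.random.randn(OBS_DIM)

def env_step(obs, act):
    # Hypothetical dynamics: the action nudges the first ACT_DIM coordinates.
    return obs + 0.05 * np.pad(act, (0, OBS_DIM - ACT_DIM))

def expert_action(obs):
    # Hypothetical planner query: a proportional controller towards the origin.
    return np.clip(-obs[:ACT_DIM], -1.0, 1.0)

policy = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, ACT_DIM))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Aggregated dataset; in the project it would be seeded with the
# pre-collected MPC demonstrations.
buffer_obs, buffer_act = [], []

for iteration in range(10):
    # 1) Roll out the *current policy*, but label every visited state with
    #    the planner's action (DAgger-style data aggregation).
    obs = env_reset()
    for _ in range(100):
        buffer_obs.append(obs)
        buffer_act.append(expert_action(obs))
        with torch.no_grad():
            act = policy(torch.as_tensor(obs, dtype=torch.float32)).numpy()
        obs = env_step(obs, act)

    # 2) Retrain the policy on the aggregated dataset (supervised regression).
    X = torch.as_tensor(np.array(buffer_obs), dtype=torch.float32)
    Y = torch.as_tensor(np.array(buffer_act), dtype=torch.float32)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(policy(X), Y)
        loss.backward()
        opt.step()
```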
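
For the third direction, the decision mechanism could be as simple as a gating function that defers to the motion planner whenever an estimated risk is high. In the sketch below, the risk proxy (disagreement within an ensemble of trained policies) and the threshold are illustrative assumptions only.

```python
import numpy as np

RISK_THRESHOLD = 0.2  # assumed tuning parameter

def ensemble_disagreement(obs, policies):
    """Cheap risk proxy: the spread of an ensemble of trained policies.
    High disagreement suggests the learned policy is out of distribution."""
    acts = np.stack([p(obs) for p in policies])
    return float(np.mean(np.std(acts, axis=0)))

def select_action(obs, policies, planner):
    """High-level controller: use the fast RL policy when it is confident,
    otherwise fall back to the slower but more reliable motion planner."""
    if ensemble_disagreement(obs, policies) > RISK_THRESHOLD:
        return planner(obs), "planner"
    return np.mean([p(obs) for p in policies], axis=0), "policy"
```

Such a gate keeps the reactive policy in charge most of the time while reserving the expensive planner for states the policy appears not to have mastered.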

We imagine these projects being carried out mainly in simulation within the MuJoCo framework, but we are also open to other ideas. Testing the developed methods on a physical robotic platform is also a possibility.

Published 9 Oct. 2023 15:58 - Last modified 9 Oct. 2023 15:58

Supervisor(s)

Scope (credits)

60