Transfer in multi-objective environments
Transfer learning using successor features interpreted as a multi-objective problem
Course of study:
Kind of thesis:
Programming languages:
Keywords:
Problem:
The combination of reinforcement learning (RL) with deep learning is a promising approach to tackle important sequential decision-making problems that are currently intractable. One obstacle to overcome is the amount of data needed by learning systems of this type. Complex decision problems can be naturally decomposed into multiple tasks that unfold in sequence or in parallel. By associating each task with a reward function, this decomposition can be seamlessly accommodated within the standard reinforcement-learning formalism. If the reward function of a new task can be well approximated as a linear combination of the reward functions of tasks previously solved, the reinforcement-learning problem reduces to a simpler linear regression [1].
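To make this reduction concrete, the sketch below (in Python, the language of the MO-Gymnasium library used later) assumes each transition yields a feature vector phi(s, a, s') and that the new task's reward is approximately linear in these features, so the task's weight vector can be recovered by ordinary least squares. All names, dimensions, and data are illustrative and not taken from [1].

import numpy as np

# Assume each transition yields a feature vector phi(s, a, s') of dimension d
# and a scalar reward r for the new task. If r(s, a, s') ~ phi(s, a, s') . w
# for some weight vector w, then w can be recovered by least squares from a
# batch of observed transitions.

rng = np.random.default_rng(0)
d, n = 4, 500                                  # feature dimension, number of transitions

phi = rng.normal(size=(n, d))                  # stacked feature vectors, one per transition
w_true = np.array([1.0, -0.5, 0.0, 2.0])       # illustrative "true" task weights
rewards = phi @ w_true + 0.01 * rng.normal(size=n)   # noisy observed rewards

# Linear regression: w_hat = argmin_w ||phi w - rewards||^2
w_hat, *_ = np.linalg.lstsq(phi, rewards, rcond=None)
print("recovered task weights:", np.round(w_hat, 2))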
Successor features (SFs) are a value-function representation that decouples the dynamics of the environment from the rewards, and generalized policy improvement (GPI) is a generalization of dynamic programming's policy-improvement operation that considers a set of policies rather than a single one. Put together, the two ideas lead to an approach that integrates seamlessly within the RL framework and allows the free exchange of information across tasks [3].
If reward functions are expressed linearly, and the agent has previously learned a set of policies for different tasks, successor features can be exploited to combine these policies and identify reasonable solutions for new problems [2]. The method of [2] allows RL agents to combine existing policies and directly identify optimal policies for arbitrary new problems, without requiring any further interactions with the environment, and shows that the transfer-learning problem tackled by SFs is equivalent to the problem of learning to optimize multiple objectives in RL.
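The following sketch illustrates the GPI step under the assumption that the value of each stored policy on the new task factorizes as psi_i(s, a) . w_new; the array shapes and the helper gpi_action are illustrative, not an implementation from [2] or [3].

import numpy as np

def gpi_action(psi, w_new):
    """Generalized policy improvement over a set of previously learned policies.

    psi   : array of shape (n_policies, n_actions, d) -- successor features
            psi_i(s, a) of each stored policy, evaluated at the current state.
    w_new : array of shape (d,) -- reward weights of the new task.

    Returns the action maximizing max_i psi_i(s, a) . w_new, i.e. the action
    chosen by the GPI policy built from the stored policies, without further learning.
    """
    q_values = psi @ w_new          # Q-values of each stored policy, shape (n_policies, n_actions)
    return int(q_values.max(axis=0).argmax())

# Toy usage: 3 stored policies, 2 actions, feature dimension 4 (all illustrative).
rng = np.random.default_rng(1)
psi_at_state = rng.normal(size=(3, 2, 4))
w_new = np.array([0.5, 0.0, -1.0, 1.0])
print("GPI action:", gpi_action(psi_at_state, w_new))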
We would like to investigate the relation between SFs and multi-objective problems further, identifying the minimum set of policies that can deliver reasonable performance in different types of environments. In other words, we want to build empirical intuition about which characteristics of an environment most strongly affect the size of this set of policies.
Goal:
Empirically identify how the size of the set of policies needed to deliver reasonable performance with SFs and GPI relates to the characteristics of different multi-objective RL environments from the MO-Gymnasium library [4].
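As a starting point, a minimal interaction loop with an MO-Gymnasium environment might look as follows. The environment id "deep-sea-treasure-v0" and the weight vector w are only examples; the vector reward returned by env.step is scalarized manually with a dot product.

import numpy as np
import mo_gymnasium as mo_gym

# Example environment id; any MO-Gymnasium environment with a vector reward
# can be used in the same way.
env = mo_gym.make("deep-sea-treasure-v0")

w = np.array([0.7, 0.3])            # illustrative task weights, one per objective

obs, info = env.reset(seed=0)
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()                  # random policy as a placeholder
    obs, vec_reward, terminated, truncated, info = env.step(action)
    episode_return += float(np.dot(vec_reward, w))      # scalarize the vector reward
    done = terminated or truncated

print("scalarized return under w:", episode_return)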
Preliminary work:
[1] investigates how to perform transfer learning using SFs and GPI, and [2] focuses on an algorithm for finding the set of policies that delivers the optimal solution when using SFs and GPI.
Tasks:
The exact tasks this project can include will be discussed with the supervisor. Please feel free to get in contact.
References
Supervision
Supervisor: Rafael Fernandes Cunha
Room: 5161.0438 (Bernoulliborg)
Email: r.f.cunha@rug.nl