Transfer learning in multi-agent RL settings

Finding and simulating a set of policies for transfer learning in multi-agent reinforcement learning settings

Course of study:
Artificial Intelligence
Kind of thesis:
Theoretical analysis and numerical simulation
Programming languages:
Python
Keywords:
Multi-agent Reinforcement Learning, Transfer Learning, Successor Features, Decentralized Partially Observable Markov Decision Process (Dec-POMDP)


Problem:

Decentralized partially observable Markov decision processes (Dec-POMDPs) provide a general model for decision-making under uncertainty in decentralized settings, but are difficult to solve optimally. By transforming a Dec-POMDP into a continuous-state deterministic MDP with a piecewise-linear and convex value function, and exploiting the fact that planning can be done in a centralized offline manner while execution remains decentralized, the occupancy-MDP framework [3] can be used to solve this family of problems.
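To make the occupancy-MDP view concrete, the sketch below propagates an occupancy state, i.e. a distribution over hidden states and joint action-observation histories, one step forward under a fixed joint decision rule. This is a minimal illustration only; the dictionary-based dynamics (P, O) and the function names are placeholders, not the algorithm of [3].

    from collections import defaultdict

    def next_occupancy(occupancy, decision_rule, P, O):
        """One-step occupancy-state update for a toy Dec-POMDP.

        occupancy     : dict {(state, joint_history): probability},
                        where joint_history is a tuple of (action, observation) pairs
        decision_rule : function mapping a joint history to a joint action
        P[s][a]       : dict {next_state: probability}
        O[a][s2]      : dict {joint_observation: probability}
        """
        nxt = defaultdict(float)
        for (s, hist), p in occupancy.items():
            a = decision_rule(hist)
            for s2, p_s2 in P[s][a].items():
                for o, p_o in O[a][s2].items():
                    nxt[(s2, hist + ((a, o),))] += p * p_s2 * p_o
        return dict(nxt)

Planning operates centrally on such occupancy states, while at execution time each agent only needs its own local history.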

Successor features (SF) decouple environment dynamics from rewards, while generalized policy improvement (GPI) performs policy improvement over a set of policies. Together, they facilitate the exchange of information across tasks within the RL framework [2].
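In the setting of [2], rewards are linear in shared features, r(s, a, s') = phi(s, a, s') . w, and each policy pi_i has successor features psi_i(s, a), so that its value on a task with weights w is Q_i(s, a) = psi_i(s, a) . w. GPI then acts greedily with respect to the best policy in the library. A minimal numpy sketch, assuming the psi tables are already available (e.g. learned on earlier tasks):

    import numpy as np

    def gpi_action(psi_library, state, w):
        """Generalized policy improvement over a library of successor features.

        psi_library : list of arrays of shape (n_states, n_actions, n_features)
        w           : reward weight vector of the current task
        Returns the action maximising  max_i  psi_i(state, a) . w
        """
        # Q-values of every library policy on the new task: (n_policies, n_actions)
        q = np.stack([psi[state] @ w for psi in psi_library])
        return int(q.max(axis=0).argmax())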

We can consider an environment with different tasks in which only the reward function changes. By combining SF with the occupancy-MDP framework, we may be able to transfer learning across tasks in Dec-POMDP problems.
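As a toy example, in the spirit of the triangle-collecting scenario described below, two tasks can share the same dynamics and feature function and differ only in their weight vector. The feature map here is made up purely for illustration:

    import numpy as np

    # Shared features: e.g. how many blue / red triangles a transition collects
    def phi(transition):
        return np.array([transition["blue_collected"],
                         transition["red_collected"]], dtype=float)

    w_task1 = np.array([1.0, 0.0])    # task 1 rewards only blue triangles
    w_task2 = np.array([0.0, 1.0])    # task 2 rewards only red triangles

    def reward(transition, w):
        return float(phi(transition) @ w)   # r_w(s, a, s') = phi(s, a, s') . w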

This project will focus on the simulation aspects of these ideas, aiming to gain insight into whether this is possible by analysing the results of numerical experiments with algorithms that combine the tools described above.
In the top figure, two agents must choose the path that yields the higher return, depending on the rewards given for collecting triangles of different colors. The bottom figure gives a schematic representation of the algorithm, suggesting how new policies are added to the set of optimal policies by solving different tasks.


Goal:

Propose and run simulations of a new cooperative multi-agent reinforcement learning algorithm with a joint reward signal that combines the ideas of VDN, successor features, and the strategy for finding a set of optimal policies described in [4].
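As a starting point for the value-decomposition part, a minimal VDN-style sketch in PyTorch is given below. Network sizes, observation shapes and the training loop are placeholders, not the design of [1] or [4]: each agent keeps a local Q-network and the team value is their sum, trained against the joint reward.

    import torch
    import torch.nn as nn

    class AgentQNet(nn.Module):
        """Per-agent utility network Q_i(o_i, .)."""
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, obs):
            return self.net(obs)

    def vdn_q_tot(agent_nets, observations, actions):
        """Additive value decomposition: Q_tot = sum_i Q_i(o_i, a_i)."""
        q_tot = 0.0
        for net, obs, act in zip(agent_nets, observations, actions):
            q_i = net(obs)                                    # (batch, n_actions)
            q_tot = q_tot + q_i.gather(1, act.unsqueeze(1)).squeeze(1)
        return q_tot                                           # (batch,)

    # Training sketch: TD error on the joint reward, e.g.
    # loss = (r + gamma * q_tot_next_max - vdn_q_tot(nets, obs, acts)).pow(2).mean()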


Preliminary work:

[1] proposes an algorithm for solving cooperative multi-agent reinforcement learning problems, [2] investigates how to perform transfer learning using SF and GPI, and [4] focuses on an algorithm for finding the set of policies that delivers the optimal solution when using SF and GPI.


Tasks:

This project can include:

  • Read the literature on SF, GPI, MARL, and multi-objective environments.
  • Choose a multi-agent environment and run some simulations using VDN.
  • Slightly modify a multi-agent environment to treat it as a multi-objective problem (a wrapper sketch is given after this list).
  • Propose an algorithm that combines VDN, SF, and the strategy used in [4] to identify the set of optimal policies.
  • Run simulations with the proposed algorithm and assess the results.

  • The final tasks will be discussed with the supervisor. Please feel free to get in touch.
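For the multi-objective modification mentioned above, one simple option is a wrapper that replaces the scalar reward with a vector of shared features, in the spirit of MO-Gymnasium [6]. The sketch below assumes a Gymnasium-style environment and a hypothetical extract_features callback; both are illustrative, not part of any of the referenced works.

    import numpy as np
    import gymnasium as gym

    class MultiObjectiveWrapper(gym.Wrapper):
        """Return a feature vector phi(s, a, s') instead of the scalar reward.

        A scalar task is then recovered by choosing a weight vector w and
        computing phi . w, either inside the wrapper or in the agent.
        """
        def __init__(self, env, extract_features, n_objectives):
            super().__init__(env)
            self.extract_features = extract_features   # hypothetical callback
            self.reward_space = gym.spaces.Box(-np.inf, np.inf, shape=(n_objectives,))

        def step(self, action):
            obs, _, terminated, truncated, info = self.env.step(action)
            phi = self.extract_features(info)          # e.g. triangle color counts
            return obs, phi, terminated, truncated, info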


    References

  • [1] Barreto, André, et al. "Fast reinforcement learning with generalized policy updates." Proceedings of the National Academy of Sciences 117.48 (2020): 30079-30087.
  • [3] Dibangoye, Jilles Steeve, et al. "Optimally solving Dec-POMDPs as continuous-state MDPs." Journal of Artificial Intelligence Research 55 (2016): 443-497.
  • [5] Barreto, André, et al. "Successor features for transfer in reinforcement learning." Advances in Neural Information Processing Systems 30 (2017).
  • [6] Alegre, Lucas N., et al. MO-Gymnasium (software): multi-objective Gymnasium-style environments, 2022.

    Supervision

    Supervisor: Rafael Fernandes Cunha
    Room: 5161.0438 (Bernoulliborg)
    Email: r.f.cunha@rug.nl
