Task-agnostic exploration for sparse reward environments

Combine task-agnostic exploration techniques with RL algorithms to solve sparse reward environments

Course of study:
Artificial Intelligence
Kind of thesis:
Theoretical analysis and Numerical Simulation
Programming languages:
Python
Keywords:
Reinforcement Learning, Transfer Learning, Sparse Rewards, Exploration


Problem:

Exploration is a fundamental aspect of reinforcement learning (RL), and its effectiveness largely determines how well RL algorithms perform, particularly when extrinsic rewards are sparse. Exploration is generally divided into two settings. The first, task-driven exploration, assumes a well-defined reward, and the agent explores in order to maximize long-term reward. However, in real-life scenarios, external rewards are often sparse or completely unknown [1]. The second, task-agnostic exploration, occurs when an agent must explore a new environment without the guidance of any external reward.

Parisi et al. [1] describe a setup in which the agent first learns to explore multiple environments in a task-agnostic manner, without any extrinsic goal. The agent then transfers this learned exploration policy to explore new environments more effectively while solving tasks. The authors introduce the Change-Based Exploration Transfer (C-BET) algorithm, which merges exploration with both extrinsic and intrinsic goals. C-BET can be integrated into various RL algorithms; in the paper, IMPALA [3] was the chosen method.

This project will analyze the C-BET algorithm and run simulations to understand its performance with both value-based and policy-based RL algorithms. How do changes in hyperparameters affect the agent's performance?

Change-Based Exploration Transfer (C-BET) trains task-agnostic exploration agents that transfer to new environments. Here the agent learns that keys are interesting, as they allow further interaction with the environment (opening doors). Later, when tasked with reaching a box behind a door, the agent starts by picking up the key. (Extracted from [1])


On the left, the C-BET pre-training scheme: the agent interacts with environments and learns using intrinsic rewards computed from state and change counts. On the right, C-BET transfer: the pre-trained exploration policy is fixed and guides task-specific policy learning in new environments. (Extracted from [1])
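
To make the idea of "intrinsic rewards computed from state and change counts" concrete, here is a minimal sketch. It assumes the reward is the inverse of the sum of the visit count of the next state and the count of the observed change; the exact form used in [1] should be checked against the paper. The class name `ChangeCountReward` and the definition of a change as an element-wise difference between tuple-valued states are illustrative assumptions, and refinements such as count resets are not modeled.

```python
from collections import Counter


class ChangeCountReward:
    """Toy count-based intrinsic reward (illustrative, not the reference code)."""

    def __init__(self) -> None:
        self.state_counts = Counter()   # how often each next state was reached
        self.change_counts = Counter()  # how often each change was observed

    def __call__(self, state: tuple, next_state: tuple) -> float:
        # Assumption: a "change" is the element-wise difference of two states.
        change = tuple(b - a for a, b in zip(state, next_state))
        self.state_counts[next_state] += 1
        self.change_counts[change] += 1
        # Rare states and rare changes yield a larger reward.
        return 1.0 / (self.state_counts[next_state] + self.change_counts[change])


reward_fn = ChangeCountReward()
first = reward_fn((0, 0), (0, 1))   # first visit: 1 / (1 + 1) = 0.5
repeat = reward_fn((0, 0), (0, 1))  # same transition again: smaller reward
```

Because both counts only grow, the reward for a repeated transition shrinks over time, which pushes the agent toward states and changes it has seen less often.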

Goal:

Integrate the C-BET algorithm with both a value-based and a policy-based RL algorithm within the MiniGrid environment [2], and then analyze the outcomes. How do these results compare to those obtained with the IMPALA algorithm [3] used in the paper?
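
One possible way to picture the integration is sketched below. It assumes that the fixed, pre-trained exploration model and the task-specific model each output one score per action, and that the behavior policy is a softmax over their sum. Whether this matches the transfer rule in [1] should be verified against the paper. The scores could be Q-values for a value-based learner or logits for a policy-based one; the function names `transfer_policy` and `sample_action` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transfer_policy(exploration_scores, task_scores):
    """Combine fixed exploration scores with learnable task scores.

    Assumption: the scores are added per action before the softmax,
    so only task_scores would be updated during task training.
    """
    combined = [fi + fe for fi, fe in zip(exploration_scores, task_scores)]
    return softmax(combined)


def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Before task learning, the task scores are flat and exploration dominates.
probs = transfer_policy([2.0, 0.0, 0.0], [0.0, 0.0, 0.0])
action = sample_action(probs, random.Random(0))
```

Under this assumption, the pre-trained scores act as a prior over actions: early in training the agent behaves like the exploration policy, and the task scores gradually override it as extrinsic reward is found.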


Preliminary work:

[1] introduces an algorithm designed to address sparse reward environments by employing task-agnostic exploration techniques. Simulations were conducted in the MiniGrid environment using a combination of the C-BET and IMPALA algorithms. The study also presented a benchmark and metrics for evaluating task-agnostic exploration strategies.


Tasks:

This project can include:

  • Review literature addressing solutions for sparse reward environments.
  • Understand the details of the C-BET algorithm and how it relates to task-agnostic exploration.
  • Select both a value-based and a policy-based RL algorithm for integration with the C-BET approach.
  • Run simulations with the resulting algorithms and evaluate the outcomes.

The final tasks will be discussed with the supervisor. Please feel free to get in touch.


References:

  • [1] Parisi, Simone, et al. Interesting object, curious agent: Learning task-agnostic exploration. Advances in Neural Information Processing Systems 34, (2021): 20516-20530.
  • [2] Chevalier-Boisvert, Maxime, et al. Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks. arXiv preprint arXiv:2306.13831, 2023. See also the MiniGrid documentation.
  • [3] Espeholt, Lasse, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. International conference on machine learning. PMLR, 2018.
  • [4] Wan, Shanchuan, et al. DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards. arXiv preprint arXiv:2304.10770 (IJCAI), 2023.

Supervision:

    Supervisor: Rafael Fernandes Cunha
    Room: 5161.0438 (Bernoulliborg)
    Email: r.f.cunha@rug.nl