Faithfulness of explanations in deep reinforcement learning

Are input attribution methods applied to deep reinforcement learning agents faithful to the policy learned?

Course of study:
Artificial Intelligence
Kind of thesis:
Theoretical analysis and Numerical Simulation
Programming languages:
Python
Keywords:
Explainable Reinforcement Learning (XRL), Explainable AI (XAI), Input Attribution (IA)


Problem:

Neural Networks (NNs) are often described as black boxes, i.e., models whose function, learned from the data, is not directly interpretable by human cognition. For this reason, several techniques have been proposed to probe one or more aspects of the inner workings of NNs; this field has been dubbed Explainable AI (XAI). Input Attribution (IA) methods are tools that approximate which parts of an input are important in determining the model's prediction.
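
As a concrete illustration, below is a minimal sketch of one of the simplest IA methods, a gradient-based saliency map, written in PyTorch. The model, input, and function names are illustrative placeholders and not part of the project.

    import torch

    def saliency_map(model: torch.nn.Module, x: torch.Tensor, target: int) -> torch.Tensor:
        """Approximate input importance as |d score_target / d x| (vanilla gradient saliency)."""
        model.eval()
        x = x.clone().requires_grad_(True)        # track gradients w.r.t. the input
        score = model(x.unsqueeze(0))[0, target]  # score of the output (class/action) of interest
        score.backward()                          # populate x.grad
        return x.grad.abs()                       # per-feature attribution magnitude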

NNs can also be employed as controllers of Reinforcement Learning (RL) agents, where the model learns a policy that maps the state of an environment to the action to take. In this setting, IA methods, or any other explainability technique, can help build intuitions about why an agent executes a specific action. The field that seeks explainability of RL agent behavior is called explainable RL (XRL); its goal is to elucidate the decision-making process of learning agents in sequential decision-making settings.
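
For an RL agent, the same idea can be applied to the network that scores actions. The sketch below reuses the hypothetical saliency_map helper from the previous sketch and attributes the value of a DQN-style agent's greedy action to the pixels of the current observation; q_network and obs are placeholders for a trained agent and a preprocessed observation.

    import torch

    def explain_action(q_network: torch.nn.Module, obs: torch.Tensor):
        """Return the greedy action for `obs` and a saliency map explaining it."""
        with torch.no_grad():
            action = q_network(obs.unsqueeze(0)).argmax(dim=1).item()  # action the agent would take
        attribution = saliency_map(q_network, obs, target=action)      # which pixels drive that choice?
        return action, attribution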

One of the most pressing challenges of XAI is assessing the quality of the explanations provided: many XAI tools are mere approximations of the underlying decision process of the NN, and the resulting explanations can be widely inaccurate. A straightforward way of evaluating an explanation is to consider its faithfulness: for instance, we could ask whether the input parts identified via IA are actually important for the NN, since perturbing or masking these areas should, if the attribution is faithful, change the action of the underlying RL agent.
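
A minimal sketch of such a faithfulness check follows, again using the hypothetical q_network, obs, and attribution objects from the sketches above: the k most-attributed pixels are masked and the greedy action is recomputed. A faithful attribution should change the action more often than masking randomly chosen pixels.

    import torch

    def action_changes_after_masking(q_network, obs, attribution, k=100, fill=0.0):
        """Mask the k most-attributed pixels of `obs` and test whether the greedy action changes."""
        with torch.no_grad():
            original = q_network(obs.unsqueeze(0)).argmax(dim=1).item()
            importance = attribution.sum(dim=0).flatten()      # aggregate attribution over channels
            top_idx = importance.topk(k).indices               # indices of the k most important pixels
            masked = obs.clone()
            masked.view(obs.shape[0], -1)[:, top_idx] = fill   # mask those pixels in every channel
            perturbed = q_network(masked.unsqueeze(0)).argmax(dim=1).item()
        return perturbed != original
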
The top figure shows the taxonomy of XRL and its relationship to the RL process [4]; the bottom figure lists some motivations for studying explainable AI.


Goal:

Are input attribution methods applied to deep reinforcement learning agents faithful to the policy learned?


Preliminary work:

In [1], you can find an introductory course on XAI. [2] offers a comprehensive introduction to the evaluation of XAI techniques. [3] provides an example where input attribution methods exhibit low faithfulness, and [4] presents a literature review of some XRL methods.


Tasks:

This project can include:

  • Literature review on XAI and XRL.
  • Literature review on assessing quality of XAI tools.
  • Run simulations in a chosen Atari Gym environment (see the environment-setup sketch after this list).
  • Analyze results.

  • The final tasks will be discussed with the supervisor. Please feel free to get in touch.
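
Regarding the simulation task above, the snippet below is a minimal environment-setup sketch, assuming the gymnasium package with its Atari extras (ale-py, ROMs licensed) is installed; "ALE/Breakout-v5" is only one possible choice of environment, and the random policy is a stand-in for a trained agent.

    import gymnasium as gym

    env = gym.make("ALE/Breakout-v5")                # any Atari environment id works here
    obs, info = env.reset(seed=0)
    for _ in range(1000):
        action = env.action_space.sample()           # placeholder: replace with the trained agent's policy
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            obs, info = env.reset()
    env.close()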


    References

  • [1] Course on XAI.
  • [2] Nauta, Meike, et al. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable AI. ACM Computing Surveys 55(13s): 1-42, 2023.
  • [3] Arrighi, Leonardo, et al. Explainable Automated Anomaly Recognition in Failure Analysis: Is Deep Learning Doing it Correctly? XAI Conference, 2023 (under publication).
  • [4] Milani, Stephanie, et al. A survey of explainable reinforcement learning. arXiv preprint arXiv:2202.08434, 2022.
  • [5] A complete list of Atari Gym environments.

  • Supervision

    Supervisor: Rafael Fernandes Cunha
    Room: 5161.0438 (Bernoulliborg)
    Email: r.f.cunha@rug.nl

    Supervisor: Marco Zullich
    Room: 5161.0438 (Bernoulliborg)
    Email: m.zullich@rug.nl