Rafael Fernandes Cunha

I am a Lecturer in Artificial Intelligence at the University of Groningen, the Netherlands, where I teach reinforcement learning, among other AI topics, and supervise student research projects. I am also a PhD Candidate with a research focus on multi-agent reinforcement learning.

Over the past three years as a lecturer, I have supervised more than 20 bachelor's and master's thesis projects, several of which have been published at international venues such as NLDL 2025 and ECAI 2025 workshops. These projects span topics from multi-agent coordination to applications in cybersecurity and large language model post-training.

My PhD research focuses on multi-agent reinforcement learning in decentralized partially observable settings (Dec-POMDPs and POSGs). I analyze the mathematical structure of these problems to develop algorithms with improved convergence and performance guarantees. In previous work, I applied deep RL to optimize switching control in vehicle platoon systems, demonstrating the use of reinforcement learning for the control of dynamical systems.

During my master's studies in Electrical Engineering at UNICAMP, with a focus on control systems, I gained experience in the mathematical modeling of dynamical systems and in solving convex optimization problems. This groundwork has helped me understand RL problems at the algorithmic level and see their connection to output-feedback control problems, for which established mathematical analysis tools exist.

Here’s a broad overview of my current research interests:

Deep Reinforcement Learning
Multi-Agent Reinforcement Learning
Transfer Learning in RL
RL-based Post-training of LLMs with Analytical/Verifiable Rewards

Check out the list of open projects that you can enroll in for your bachelor’s or master’s thesis.

Selected publications

2025

AAAI

Optimally solving simultaneous-move dec-POMDPs: The sequential central planning approach

Johan Peralez, Aurélien Delage, Jacopo Castellini, and 2 more authors

In Proceedings of the AAAI Conference on Artificial Intelligence, 2025

The centralized training for decentralized execution paradigm emerged as the state-of-the-art approach to ϵ-optimally solving decentralized partially observable Markov decision processes. However, scalability remains a significant issue. This paper presents a novel and more scalable alternative, namely the sequential-move centralized training for decentralized execution. This paradigm further pushes the applicability of Bellman’s principle of optimality, raising three new properties. First, it allows a central planner to reason upon sufficient sequential-move statistics instead of prior simultaneous-move ones. Next, it proves that ϵ-optimal value functions are piecewise linear and convex in such sufficient sequential-move statistics. Finally, it drops the complexity of the backup operators from double exponential to polynomial at the expense of longer planning horizons. Besides, it makes it easy to use single-agent methods, e.g., the SARSA algorithm enhanced with these findings, while still preserving convergence guarantees. Experiments on two- as well as many-agent domains from the literature against ϵ-optimal simultaneous-move solvers confirm the superiority of our novel approach. This paradigm opens the door for efficient planning and reinforcement learning methods for multi-agent systems.

TMLR

Sparsity-Driven Plasticity in Multi-Task Reinforcement Learning

Aleksandar Todorov, Juan Cardenas-Cartagena, Rafael F. Cunha, and 2 more authors

Transactions on Machine Learning Research, 2025

Plasticity loss, a diminishing capacity to adapt as training progresses, is a critical challenge in deep reinforcement learning. We examine this issue in multi-task reinforcement learning (MTRL), where higher representational flexibility is crucial for managing diverse and potentially conflicting task demands. We systematically explore how sparsification methods, particularly Gradual Magnitude Pruning (GMP) and Sparse Evolutionary Training (SET), enhance plasticity and consequently improve performance in MTRL agents. We evaluate these approaches across distinct MTRL architectures (shared backbone, Mixture of Experts, Mixture of Orthogonal Experts) on standardized MTRL benchmarks, comparing against dense baselines and a comprehensive range of alternative plasticity-inducing or regularization methods. Our results demonstrate that both GMP and SET effectively mitigate key indicators of plasticity degradation, such as neuron dormancy and representational collapse. These plasticity improvements often correlate with enhanced multi-task performance, with sparse agents frequently outperforming dense counterparts and achieving competitive results against explicit plasticity interventions. Our findings offer insights into the interplay between plasticity, network sparsity, and MTRL designs, highlighting dynamic sparsification as a robust but context-sensitive tool for developing more adaptable MTRL systems.

NLDL

World model agents with change-based intrinsic motivation

Jeremias Ferrao and Rafael Cunha

In 2025 Northern Lights Deep Learning Conference (NLDL), 2025

Sparse reward environments pose a significant challenge for reinforcement learning due to the scarcity of feedback. Intrinsic motivation and transfer learning have emerged as promising strategies to address this issue. Change Based Exploration Transfer (CBET), a technique that combines these two approaches for model-free algorithms, has shown potential in addressing sparse feedback, but its effectiveness with modern algorithms remains understudied. This paper provides an adaptation of CBET for world model algorithms like DreamerV3 and compares the performance of DreamerV3 and IMPALA agents, both with and without CBET, in the sparse reward environments of Crafter and Minigrid. Our tabula rasa results highlight the possibility of CBET improving DreamerV3’s returns in Crafter, but the algorithm attains a suboptimal policy in Minigrid, with CBET further reducing returns. In the same vein, our transfer learning experiments show that pre-training DreamerV3 with intrinsic rewards does not immediately lead to a policy that maximizes extrinsic rewards in Minigrid. Overall, our results suggest that CBET provides a positive impact on DreamerV3 in more complex environments like Crafter but may be detrimental in environments like Minigrid. In the latter case, the behaviours promoted by CBET in DreamerV3 may not align with the task objectives of the environment, leading to reduced returns and suboptimal policies.

SPAIML/ECAI WS

Seed Scheduling in Fuzz Testing as a Markov Decision Process

Rafael F Cunha, Luca Müller, Thomas Rooijakkers, and 2 more authors

In 1st International Workshop on Security and Privacy-Preserving AI/ML (SPAIML) at ECAI, 2025

Coverage-guided Greybox Fuzzing (CGF) is an effective method for discovering software vulnerabilities. Traditional fuzzers, such as American Fuzzy Lop (AFL), rely on heuristics for critical tasks like seed scheduling, which often lack adaptability and may not optimally balance exploration with exploitation. This paper presents a novel approach to enhance seed scheduling in CGF by formalizing it as a Markov Decision Process (MDP). We detail the design of this MDP, including the state representation derived from fuzzer and coverage data, the action space encompassing seed selection and power assignment, and a reward function geared towards maximizing coverage and bug discovery. A Proximal Policy Optimization (PPO) agent is then trained to learn a scheduling policy from this MDP within the AFL++ fuzzer. Our investigation into this Deep Reinforcement Learning (DRL) based approach reveals that while the MDP formulation provides a structured framework, practical application faces significant challenges, including high computational demands for training and intensive hyperparameter tuning. The key contributions of this work are: (1) a concrete MDP formulation for the complex task of fuzzer seed scheduling, (2) an analysis of the inherent difficulties and trade-offs in applying DRL to this specific domain, and (3) insights gained from the agent’s learning process (or lack thereof), which inform the discussion on the suitability of DRL for this type of optimization problem in fuzzing. This research provides a foundational exploration of DRL for seed scheduling and highlights critical considerations for future advancements in intelligent fuzzing agents.

2023

TITS

Fuel-Efficient Switching Control for Platooning Systems With Deep Reinforcement Learning

Tiago R Goncalves, Rafael F Cunha, Vineeth S Varma, and 1 more author

IEEE Transactions on Intelligent Transportation Systems, 2023

The wide appeal of fuel-efficient transport solutions is constantly increasing due to the major impact of the transportation industry on the environment. Platooning systems represent a relatively simple approach in terms of deployment toward fuel-efficient solutions. This paper addresses the reduction of fuel consumption in platooning systems attainable by dynamically switching between two control policies: Adaptive Cruise Control (ACC) and Cooperative Adaptive Cruise Control (CACC). The switching rule is dictated by a Deep Reinforcement Learning (DRL) technique to overcome unpredictable platoon disturbances and to learn appropriate transient shift times while maximizing fuel efficiency. However, due to safety and convergence issues of DRL, our algorithm establishes transition times and minimum periods of operation of ACC and CACC controllers instead of directly controlling vehicles. Numerical experiments show that the DRL agent outperforms both static ACC and CACC versions and the threshold logic control in terms of fuel efficiency while also being robust to perturbations and satisfying safety requirements.
@article{tiago2023fuel,
  title     = {Fuel-Efficient Switching Control for Platooning Systems With Deep Reinforcement Learning},
  author    = {Goncalves, Tiago R and Cunha, Rafael F and Varma, Vineeth S and Elayoubi, Salah E},
  journal   = {IEEE Transactions on Intelligent Transportation Systems},
  year      = {2023},
  doi       = {10.1109/TITS.2023.3304977},
  url       = {https://ieeexplore.ieee.org/document/10229987},
  publisher = {IEEE}
}

2019

IJSS

Robust partial sampled-data state feedback control of Markov jump linear systems

Rafael F Cunha, Gabriela W Gabriel, and José C Geromel

International Journal of Systems Science, 2019

This paper proposes new conditions for the design of a robust partial sampled-data state feedback control law for Markov jump linear systems (MJLS). Although, as usual, the control structure depends on the Markov mode, only the state variable is sampled in order to cope with a specific network control structure. For analysis, an equivalent hybrid system is proposed and a two-point boundary value problem (TPBVP) that ensures minimum H∞ or H2 cost is defined. For control synthesis, it is rewritten as a convex set of sufficient conditions leading to minimum guaranteed cost of the mentioned performance classes. The optimality conditions are expressed through differential linear matrix inequalities (DLMIs), a useful mathematical device that can be handled by means of any available LMI solver. Examples are included for illustration.
@article{cunha2019robust,
  title     = {Robust partial sampled-data state feedback control of Markov jump linear systems},
  author    = {Cunha, Rafael F and Gabriel, Gabriela W and Geromel, Jos{\'e} C},
  journal   = {International Journal of Systems Science},
  volume    = {50},
  number    = {11},
  pages     = {2142--2152},
  year      = {2019},
  publisher = {Taylor \& Francis}
}