Publications
2025
- [AAAI] Optimally solving simultaneous-move dec-POMDPs: The sequential central planning approach. Johan Peralez, Aurélien Delage, Jacopo Castellini, and 2 more authors. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
The centralized training for decentralized execution paradigm has emerged as the state-of-the-art approach to ϵ-optimally solving decentralized partially observable Markov decision processes. However, scalability remains a significant issue. This paper presents a novel and more scalable alternative, namely sequential-move centralized training for decentralized execution. This paradigm further extends the applicability of Bellman’s principle of optimality, yielding three new properties. First, it allows a central planner to reason upon sufficient sequential-move statistics instead of prior simultaneous-move ones. Next, it proves that ϵ-optimal value functions are piecewise linear and convex in such sufficient sequential-move statistics. Finally, it drops the complexity of the backup operators from double exponential to polynomial at the expense of longer planning horizons. Moreover, it makes it easy to use single-agent methods, e.g., the SARSA algorithm enhanced with these findings, while still preserving convergence guarantees. Experiments on two- as well as many-agent domains from the literature against ϵ-optimal simultaneous-move solvers confirm the superiority of our novel approach. This paradigm opens the door to efficient planning and reinforcement learning methods for multi-agent systems.
- [TMLR] Sparsity-Driven Plasticity in Multi-Task Reinforcement Learning. Aleksandar Todorov, Juan Cardenas-Cartagena, Rafael F. Cunha, and 2 more authors. Transactions on Machine Learning Research, 2025.
Plasticity loss, a diminishing capacity to adapt as training progresses, is a critical challenge in deep reinforcement learning. We examine this issue in multi-task reinforcement learning (MTRL), where higher representational flexibility is crucial for managing diverse and potentially conflicting task demands. We systematically explore how sparsification methods, particularly Gradual Magnitude Pruning (GMP) and Sparse Evolutionary Training (SET), enhance plasticity and consequently improve performance in MTRL agents. We evaluate these approaches across distinct MTRL architectures (shared backbone, Mixture of Experts, Mixture of Orthogonal Experts) on standardized MTRL benchmarks, comparing against dense baselines and a comprehensive range of alternative plasticity-inducing or regularization methods. Our results demonstrate that both GMP and SET effectively mitigate key indicators of plasticity degradation, such as neuron dormancy and representational collapse. These plasticity improvements often correlate with enhanced multi-task performance, with sparse agents frequently outperforming dense counterparts and achieving competitive results against explicit plasticity interventions. Our findings offer insights into the interplay between plasticity, network sparsity, and MTRL designs, highlighting dynamic sparsification as a robust but context-sensitive tool for developing more adaptable MTRL systems.
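Of the two sparsification methods discussed, Gradual Magnitude Pruning lends itself to a compact illustration. The sketch below assumes the common cubic sparsity schedule; the function names and schedule parameters are illustrative, not taken from the paper:

```python
# Sketch of Gradual Magnitude Pruning (GMP): ramp a target sparsity over
# training with a cubic schedule, then zero the smallest-magnitude weights.
import numpy as np

def gmp_sparsity(step, start, end, final_sparsity, initial_sparsity=0.0):
    """Target sparsity at `step` under the cubic GMP schedule."""
    if step < start:
        return initial_sparsity
    if step >= end:
        return final_sparsity
    frac = (step - start) / (end - start)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - frac) ** 3

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
for t in range(0, 10001, 1000):
    s = gmp_sparsity(t, start=1000, end=9000, final_sparsity=0.9)
    w = prune_by_magnitude(w, s)  # in practice applied as a mask during training
```

In a real MTRL agent the mask would be applied per layer at each pruning step; here a single flat weight vector stands in for the network.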
- [NLDL] World model agents with change-based intrinsic motivation. Jeremias Ferrao and Rafael Cunha. In 2025 Northern Lights Deep Learning Conference (NLDL), 2025.
Sparse reward environments pose a significant challenge for reinforcement learning due to the scarcity of feedback. Intrinsic motivation and transfer learning have emerged as promising strategies to address this issue. Change Based Exploration Transfer (CBET), a technique that combines these two approaches for model-free algorithms, has shown potential in addressing sparse feedback, but its effectiveness with modern algorithms remains understudied. This paper provides an adaptation of CBET for world model algorithms like DreamerV3 and compares the performance of DreamerV3 and IMPALA agents, both with and without CBET, in the sparse reward environments of Crafter and Minigrid. Our tabula rasa results highlight the possibility of CBET improving DreamerV3’s returns in Crafter, but the algorithm attains a suboptimal policy in Minigrid, with CBET further reducing returns. In the same vein, our transfer learning experiments show that pre-training DreamerV3 with intrinsic rewards does not immediately lead to a policy that maximizes extrinsic rewards in Minigrid. Overall, our results suggest that CBET provides a positive impact on DreamerV3 in more complex environments like Crafter but may be detrimental in environments like Minigrid. In the latter case, the behaviours promoted by CBET in DreamerV3 may not align with the task objectives of the environment, leading to reduced returns and suboptimal policies.
- [SLLM/ECAI WS] Boosting Accuracy and Efficiency of Budget Forcing in LLMs via Reinforcement Learning for Mathematical Reasoning. Ravindra Aribowo Tarunokusumo and Rafael Fernandes Cunha. In SLLM - Buck The Trend: Make LLMs Specific and Reduce the Cost of Intelligence, at ECAI, 2025.
Test-time scaling methods have seen a rapid increase in popularity due to their computational efficiency and parameter-independent training for improving reasoning performance of Large Language Models. One such method is budget forcing, a decoding intervention strategy that allocates extra compute budget for thinking and elicits the model’s inherent self-correcting behavior. However, this relies on supervised fine-tuning (SFT) on long-context reasoning traces, which causes performance degradation on smaller models due to verbose responses. For this reason, we offer a framework integrating reinforcement learning (RL) to improve token efficiency and boost the performance of a 1.5B model on mathematical reasoning. We demonstrate this using only 1.5K training samples and find that our SFT+RL model performs better on the GSM8K dataset under varying compute budgets. Our main findings show an overall higher accuracy while reducing token usage by over 40% compared to the SFT model, revealing how RL can recover the losses due to long-context training and improve overall performance in mathematical reasoning.
- [SLLM/ECAI WS] No Supervision, No Problem: Pure Reinforcement Learning Improves Mathematical Reasoning in Small Language Models. In SLLM - Buck The Trend: Make LLMs Specific and Reduce the Cost of Intelligence, at ECAI, 2025.
This study explores whether pure reinforcement learning (RL), without supervised fine-tuning (SFT), can improve the mathematical reasoning ability of small language models. Using Group Relative Policy Optimization (GRPO), four pre-trained Qwen model variants were post-trained using only RL on a subset of the GSM8K dataset. Models specialized in mathematical reasoning, such as Qwen2.5-Math-1.5B and Qwen2-Math-1.5B, achieved significant improvements in pass@1 accuracy compared to their baselines. General-purpose models showed modest improvements, while a smaller 0.5B model suffered a performance drop, revealing capacity limitations when optimizing multiple objectives. Notably, a direct comparison showed that pure RL outperformed the conventional SFT-to-RL approach in both accuracy and training efficiency under a fixed maximum output token limit. The experimental results demonstrate that pure RL can effectively improve reasoning ability when sufficient domain specialization and model capacity are present, potentially eliminating the need for costly SFT in resource-limited settings.
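The group-relative normalization at the heart of GRPO can be illustrated in a few lines. This is a minimal sketch of the advantage computation only, not the full clipped policy objective; names are illustrative:

```python
# Sketch of GRPO's group-relative advantage: for each prompt, sample a
# group of completions, score them, and normalize the rewards within the
# group. Completions above the group mean get positive advantage.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: scalar rewards for G completions of a single prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. binary correctness rewards for 4 sampled GSM8K solutions
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are computed relative to the group rather than a learned value baseline, no critic network is needed, which is part of what makes the method attractive for small models.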
- [SPAIML/ECAI WS] Seed Scheduling in Fuzz Testing as a Markov Decision Process. Rafael F Cunha, Luca Müller, Thomas Rooijakkers, and 2 more authors. In 1st International Workshop on Security and Privacy-Preserving AI/ML (SPAIML) at ECAI, 2025.
Coverage-guided Greybox Fuzzing (CGF) is an effective method for discovering software vulnerabilities. Traditional fuzzers, such as American Fuzzy Lop (AFL), rely on heuristics for critical tasks like seed scheduling, which often lack adaptability and may not optimally balance exploration with exploitation. This paper presents a novel approach to enhance seed scheduling in CGF by formalizing it as a Markov Decision Process (MDP). We detail the design of this MDP, including the state representation derived from fuzzer and coverage data, the action space encompassing seed selection and power assignment, and a reward function geared towards maximizing coverage and bug discovery. A Proximal Policy Optimization (PPO) agent is then trained to learn a scheduling policy from this MDP within the AFL++ fuzzer. Our investigation into this Deep Reinforcement Learning (DRL) based approach reveals that while the MDP formulation provides a structured framework, practical application faces significant challenges, including high computational demands for training and intensive hyperparameter tuning. The key contributions of this work are: (1) a concrete MDP formulation for the complex task of fuzzer seed scheduling, (2) an analysis of the inherent difficulties and trade-offs in applying DRL to this specific domain, and (3) insights gained from the agent’s learning process (or lack thereof), which inform the discussion on the suitability of DRL for this type of optimization problem in fuzzing. This research provides a foundational exploration of DRL for seed scheduling and highlights critical considerations for future advancements in intelligent fuzzing agents.
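As a rough illustration of such a formulation, the sketch below models the state from per-seed statistics, the action as a (seed, power) pair, and the reward as newly covered edges. All field names, the toy reward, and the random coverage outcome are assumptions for exposition, not the paper's design:

```python
# Toy seed-scheduling MDP: state = flattened per-seed fuzzer/coverage
# statistics, action = (seed index, power), reward = new coverage edges.
from dataclasses import dataclass
import random

@dataclass
class SeedStats:
    execs: int = 0          # times this seed has been fuzzed
    edges: int = 0          # coverage edges reached from this seed
    last_new_cov: int = 0   # step at which it last found new coverage

@dataclass
class SchedulerMDP:
    seeds: list
    step: int = 0

    def state(self):
        # Flatten per-seed statistics into an observation vector
        return [x for s in self.seeds for x in (s.execs, s.edges, s.last_new_cov)]

    def act(self, seed_idx, power):
        """Fuzz seeds[seed_idx] with `power` executions; return the reward."""
        s = self.seeds[seed_idx]
        s.execs += power
        new_edges = random.randint(0, 2)  # stand-in for a real fuzzing outcome
        s.edges += new_edges
        if new_edges:
            s.last_new_cov = self.step
        self.step += 1
        return new_edges  # reward = newly covered edges
```

In the paper's setting a PPO agent would consume `state()` and emit the action; here the environment dynamics are a random stub, which is exactly where the real integration with AFL++ (and the computational cost noted above) comes in.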
- [SPAIML/ECAI WS] Collaborative Reinforcement Learning for Cyber Defense: Analysis of Strategies and Policies. Davide Rigoni, Rafael F Cunha, Frank Fransen, and 3 more authors. In 1st International Workshop on Security and Privacy-Preserving AI/ML (SPAIML) at ECAI, 2025.
As cybersecurity threats grow in scale and sophistication, traditional defenses increasingly struggle to detect and counter them. Recent work applies reinforcement learning (RL) to develop adaptive defensive agents, but challenges remain, particularly in how agents learn, the environments used, and the strategies they adopt. These issues are amplified in multi-agent settings, where coordination becomes especially complex. This paper presents an empirical analysis of collaborative RL for cybersecurity defense, focusing on environment models, RL methods, and agent policies. Specifically, it compares several multi-agent RL algorithms in the context of CAGE Challenge 4 to identify effective defense configurations. The study also evaluates the learned policies to assess their real-world applicability and highlight gaps between agent behavior and practical defense strategies.
- [FDG] Bridging Faithfulness of Explanations and Deep Reinforcement Learning: A Grad-CAM Analysis of Space Invaders. Leendert Johannes Tanis, Rafael Fernandes Cunha, and Marco Zullich. In Proceedings of the 20th International Conference on the Foundations of Digital Games, 2025.
With the increasing pervasiveness of artificial neural networks in real-world applications, regulations have started enforcing greater transparency in the predictive dynamics of these models. In recent years, research in Explainable AI (XAI) has gained traction, with the goal of enhancing the interpretability of black box models. However, while most of the effort is aimed at supervised learning on tabular or image data, comparatively little attention has been paid to reinforcement learning (RL). One of the main issues in explainability is the potential misalignment between the explanations and the models’ true decision-making process, which can yield, e.g., misleading feature importance outputs. In the present work, we investigate the faithfulness of the explanations generated for an Advantage Actor-Critic model trained to play the Space Invaders Atari 2600 game. We generate feature importance heatmaps by means of Grad-CAM, a popular XAI tool designed for convolutional neural networks with image-based inputs. We then evaluate the faithfulness of these explanations via the Iterative Removal of Features (IROF) metric. While the absence of a baseline for IROF scores complicates result interpretation, our findings indicate that faithfulness estimates exhibit considerable noise. We further observe an overall decrease in faithfulness as the game progresses towards its end. Our study highlights limitations of current XAI techniques in RL contexts and suggests that future work should explore alternative methodologies better suited to RL tasks.
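The IROF evaluation loop can be sketched as follows: rank input segments by their attribution, remove them most-important-first, and record how the model's score for the chosen action degrades (a faithful explanation should produce a fast drop). The `model`, segmentation, and fill value below are toy stand-ins, not the paper's Atari setup:

```python
# Sketch of the IROF degradation curve for a saliency explanation.
import numpy as np

def irof_curve(model, x, segments, attribution, fill=0.0):
    """Return model scores as segments are removed, most-important first."""
    seg_ids = np.unique(segments)
    # Rank segments by mean attribution, descending
    order = sorted(seg_ids, key=lambda i: -attribution[segments == i].mean())
    scores = [model(x)]
    x_cur = x.copy()
    for seg in order:
        x_cur[segments == seg] = fill  # "remove" the segment
        scores.append(model(x_cur))
    return np.array(scores)

# Toy example: the "model" scores the pixel sum and the attribution
# equals the input, so removal order follows pixel magnitude.
x = np.array([[4.0, 1.0], [2.0, 3.0]])
segments = np.array([[0, 1], [2, 3]])  # one segment per pixel
curve = irof_curve(lambda im: im.sum(), x, segments, attribution=x)
```

The IROF score is then derived from the area under (or over) this curve, normalized against a random-removal baseline; the lack of such a baseline is precisely the interpretation difficulty the abstract mentions.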
2023
- Fuel-Efficient Switching Control for Platooning Systems With Deep Reinforcement Learning. Tiago R Goncalves, Rafael F Cunha, Vineeth S Varma, and 1 more author. IEEE Transactions on Intelligent Transportation Systems, 2023.
The wide appeal of fuel-efficient transport solutions is constantly increasing due to the major impact of the transportation industry on the environment. Platooning systems represent a relatively simple approach in terms of deployment toward fuel-efficient solutions. This paper addresses the reduction of fuel consumption in platooning systems attainable by dynamically switching between two control policies: Adaptive Cruise Control (ACC) and Cooperative Adaptive Cruise Control (CACC). The switching rule is dictated by a Deep Reinforcement Learning (DRL) technique to overcome unpredictable platoon disturbances and to learn appropriate transient shift times while maximizing fuel efficiency. However, due to safety and convergence issues of DRL, our algorithm establishes transition times and minimum periods of operation of ACC and CACC controllers instead of directly controlling vehicles. Numerical experiments show that the DRL agent outperforms both static ACC and CACC versions and the threshold logic control in terms of fuel efficiency while also being robust to perturbations and satisfying safety requirements.
@article{tiago2023fuel,
  title = {Fuel-Efficient Switching Control for Platooning Systems With Deep Reinforcement Learning},
  author = {Goncalves, Tiago R and Cunha, Rafael F and Varma, Vineeth S and Elayoubi, Salah E},
  journal = {IEEE Transactions on Intelligent Transportation Systems},
  year = {2023},
  doi = {10.1109/TITS.2023.3304977},
  url = {https://ieeexplore.ieee.org/document/10229987},
  publisher = {IEEE}
}
- [arXiv] On convex optimal value functions for POSGs. Rafael F Cunha, Jacopo Castellini, Johan Peralez, and 1 more author. arXiv, 2023.
Multi-agent planning and reinforcement learning can be challenging when agents cannot see the state of the world or communicate with each other due to communication costs, latency, or noise. Partially Observable Stochastic Games (POSGs) provide a mathematical framework for modelling such scenarios. This paper aims to improve the efficiency of planning and reinforcement learning algorithms for POSGs by identifying the underlying structure of optimal state-value functions. The approach involves reformulating the original game from the perspective of a trusted third party who plans on behalf of the agents simultaneously. From this viewpoint, the original POSGs can be viewed as Markov games where states are occupancy states, i.e., posterior probability distributions over the hidden states of the world and the streams of actions and observations that agents have experienced so far. This study mainly proves that the optimal state-value function is a convex function of occupancy states expressed on an appropriate basis in all zero-sum, common-payoff, and Stackelberg POSGs.
@inproceedings{cunha2023convex,
  title = {On convex optimal value functions for POSGs},
  author = {Cunha, Rafael F and Castellini, Jacopo and Peralez, Johan and Dibangoye, Jilles S},
  year = {2023}
}
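The occupancy-state reformulation described in the abstract can be sketched as follows; the notation here is assumed for exposition and may differ from the paper's exact definitions. The occupancy state is the posterior over hidden states and joint action-observation histories, and the convexity result says the optimal value is a convex function of it, i.e., representable as a supremum of linear functionals:

```latex
% Occupancy state: distribution over hidden states x and joint
% action-observation histories \theta induced by the past joint policy.
s_t(x, \theta) = \Pr\!\left(x_t = x,\; \theta_t = \theta \;\middle|\; \pi_{0:t-1}\right)

% Convexity: the optimal value is a supremum of linear functionals of s_t
% (for some set \Gamma_t, on an appropriate basis).
V_t^{*}(s_t) = \sup_{\alpha \in \Gamma_t} \langle \alpha, s_t \rangle
```

This mirrors the classical representation of convex functions as pointwise suprema of affine functions, which is what makes the structure algorithmically exploitable.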
2022
- Reducing fuel consumption in platooning systems through reinforcement learning. Rafael F Cunha, Tiago R Goncalves, Vineeth S Varma, and 2 more authors. In IFAC Conference on Intelligent Control and Automation Sciences (ICONS), 2022.
Fuel efficiency in platooning systems is a central topic of interest because of its significant economic and environmental impact on the transportation industry. In platoon systems, Adaptive Cruise Control (ACC) is widely adopted because it can guarantee string stability while requiring only radar or lidar measurements. A key parameter in ACC is the desired time gap between the platoon’s neighboring vehicles. A small time gap results in a short inter-vehicular distance, which is fuel efficient when the vehicles are moving at constant speeds due to air drag reductions. On the other hand, when the vehicles accelerate and brake frequently, a larger time gap is more fuel efficient. This motivates us to find a policy that minimizes fuel consumption by conveniently switching between two desired time gap parameters. Thus, one can interpret this formulation as a dynamic system controlled by a switching ACC, and the learning problem reduces to finding a switching rule that is fuel efficient. We apply a Reinforcement Learning (RL) algorithm to find a time switching policy between two desired time gap parameters of an ACC controller to reach our goal. We adopt the proximal policy optimization (PPO) algorithm to learn the appropriate transient shift times that minimize the platoon’s fuel consumption when it faces stochastic traffic conditions. Numerical simulations show that the PPO algorithm outperforms both static time gap ACC and a threshold-based switching control in terms of average fuel efficiency.
@inproceedings{cunha2022reducing,
  title = {Reducing fuel consumption in platooning systems through reinforcement learning},
  author = {Cunha, Rafael F and Goncalves, Tiago R and Varma, Vineeth S and Elayoubi, Salah E and Cao, Ming},
  booktitle = {IFAC Conference on Intelligent Control and Automation Sciences (ICONS)},
  volume = {55},
  number = {15},
  pages = {99--104},
  year = {2022},
  publisher = {Elsevier}
}
2021
- On imitation dynamics in population games with Markov switching. Rafael F Cunha, Lorenzo Zino, and Ming Cao. In 2021 European Control Conference (ECC), 2021.
Imitation dynamics in population games are a class of evolutionary game-theoretic models, widely used to study decision-making processes in social groups. Unlike other models, imitation dynamics allow players to have minimal information on the structure of the game they are playing, and are thus suitable for many applications, including traffic management, marketing, and disease control. In this work, we study a general case of imitation dynamics where the structure of the game and the imitation mechanisms change in time due to external factors, such as weather conditions or social trends. These changes are modeled using a continuous-time Markov jump process. We present tools to identify the dominant strategy that emerges from the dynamics through analysis of the model parameters. Numerical simulations are provided to support our theoretical findings.
@inproceedings{cunha2021imitation,
  title = {On imitation dynamics in population games with Markov switching},
  author = {Cunha, Rafael F and Zino, Lorenzo and Cao, Ming},
  booktitle = {2021 European Control Conference (ECC)},
  pages = {722--727},
  year = {2021},
  organization = {IEEE}
}
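A toy version of such switching dynamics can be simulated directly. The sketch below uses a two-strategy, replicator-type imitation update whose payoff matrix jumps according to a two-state continuous-time Markov chain; the matrices, switching rates, and update rule are illustrative, not the paper's model:

```python
# Toy simulation: imitation dynamics whose payoff matrix switches via a
# two-state Markov jump process (Euler discretization).
import numpy as np

A = [np.array([[2.0, 0.0], [0.0, 1.0]]),   # environment 0 favors strategy 1
     np.array([[0.5, 0.0], [0.0, 3.0]])]   # environment 1 favors strategy 2
rates = np.array([[-1.0, 1.0], [2.0, -2.0]])  # CTMC generator matrix

def simulate(x0=0.5, T=20.0, dt=1e-3, seed=0):
    """Evolve the fraction x of strategy-1 players under Markov switching."""
    rng = np.random.default_rng(seed)
    x, env, traj = x0, 0, []
    for _ in range(int(T / dt)):
        p = np.array([x, 1.0 - x])
        # Payoff difference between strategy 1 and strategy 2
        d = (A[env] @ p)[0] - (A[env] @ p)[1]
        # Replicator-type imitation update: flow toward the better strategy
        x += dt * x * (1.0 - x) * d
        # Environment jumps with probability (exit rate) * dt
        if rng.random() < -rates[env, env] * dt:
            env = 1 - env
        traj.append(x)
    return np.array(traj)

traj = simulate()
```

Depending on the switching rates, the population can track the currently dominant strategy or settle near the strategy that dominates on average, which is the kind of behavior the paper's parameter analysis characterizes.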
2019
- [Benelux Meeting] Changing the Perception of Social Power in Opinion Dynamics. Rafael Fernandes Cunha, Ben Ye, and Ming Cao. In 38th Benelux Meeting on Systems and Control, 2019.
@inproceedings{cunha2019changing,
  title = {Changing the Perception of Social Power in Opinion Dynamics},
  author = {Cunha, Rafael Fernandes and Ye, Ben and Cao, Ming},
  booktitle = {38th Benelux Meeting on Systems and Control},
  year = {2019}
}
- Robust partial sampled-data state feedback control of Markov jump linear systems. Rafael F Cunha, Gabriela W Gabriel, and José C Geromel. International Journal of Systems Science, 2019.
This paper proposes new conditions for the design of a robust partial sampled-data state feedback control law for Markov jump linear systems (MJLS). Although, as usual, the control structure depends on the Markov mode, only the state variable is sampled in order to cope with a specific network control structure. For analysis, an equivalent hybrid system is proposed and a two-point boundary value problem (TPBVP) that ensures minimum H∞ or H2 cost is defined. For control synthesis, it is rewritten as a convex set of sufficient conditions leading to a minimum guaranteed cost for the mentioned performance classes. The optimality conditions are expressed through differential linear matrix inequalities (DLMIs), a useful mathematical device that can be handled by means of any available LMI solver. Examples are included for illustration.
@article{cunha2019robust,
  title = {Robust partial sampled-data state feedback control of Markov jump linear systems},
  author = {Cunha, Rafael F and Gabriel, Gabriela W and Geromel, Jos{\'e} C},
  journal = {International Journal of Systems Science},
  volume = {50},
  number = {11},
  pages = {2142--2152},
  year = {2019},
  publisher = {Taylor \& Francis}
}
2018
- Partial sampled-data state feedback control of Markov jump linear systems. Rafael F Cunha, Gabriela W Gabriel, and José C Geromel. IFAC-PapersOnLine, 2018.
This paper aims at designing a partial sampled-data state feedback control law for Markov jump linear systems (MJLS). The interesting feature of the control structure is that only the state variable is sampled, while the stochastic parameter that defines the Markov mode of the system used for control purposes is free to change at any time between samples. The main goal is to provide sufficient convex conditions for the existence of a solution for this class of control design problems in the context of H∞ and H2 performances, which are expressed through Differential Linear Matrix Inequalities (DLMI). The proposed method is implemented using LMI solver facilities and provides a minimum guaranteed cost control in one shot. An example is solved for illustration and comparison.
@article{cunha2018partial,
  title = {Partial sampled-data state feedback control of {Markov} jump linear systems},
  author = {Cunha, Rafael F and Gabriel, Gabriela W and Geromel, Jos{\'e} C},
  journal = {IFAC-PapersOnLine},
  volume = {51},
  number = {25},
  pages = {222--227},
  year = {2018},
  publisher = {Elsevier}
}