| dc.description.abstract | Safety is a critical concern in reinforcement learning (RL) and learning-based systems more broadly, as ensuring reliable and safe decision-making is essential for their deployment in real-world applications. Traditional approaches to safety often rely on techniques such as reward shaping, carefully curated training data, or explicit handcrafted rules to avoid unsafe actions. More recent approaches have adopted the Constrained Markov Decision Process (CMDP) framework, which trains agents while explicitly enforcing constraints on auxiliary measures such as safety or risk. However, these methods often suffer from significant constraint violations. This thesis traces the root cause of such violations to the pursuit of maximal task performance in every policy update. Given the inherent limitations of sample-based constraint estimates in RL, where data is limited and approximation errors are inevitable, these methods often fail near constraint boundaries, leading to excessive violations. To address this, we propose a novel constrained reinforcement learning algorithm that dynamically adjusts its conservativeness during policy updates. By incorporating the risk of constraint violation into the update process, our method shifts focus toward constraint satisfaction when violations are likely, while still striving to improve task performance whenever feasible. Our algorithm reduces constraint violations by up to 99% compared to state-of-the-art baselines while achieving comparable task performance. In the second part of this thesis, we extend CMDPs to address multi-goal, long-horizon problems. We augment the CMDP formulation to incorporate goals, enabling it to handle multiple goals while preserving the goal-independent constraint specification of the original CMDP. To tackle the complexity of long-horizon tasks with high-dimensional inputs (e.g., visual observations), we propose a method that integrates planning with safe reinforcement learning. By leveraging deep reinforcement learning, we acquire the essential components for planning, including a low-dimensional state-space representation and planning heuristics. The planning algorithm then decomposes long-horizon problems into a sequence of shorter, easier subgoal-reaching tasks, and the learned agents safely navigate toward these subgoals step by step, ultimately reaching the final goal. We evaluate our method on both single-agent and multi-agent tasks. In 2D navigation, our approach reduces risk by up to 74.2%, and in visual navigation by up to 49.3%, while achieving comparable or better success rates. | |