Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RL
Document Type
Conference Proceeding
Publication Date
1-1-2024
Abstract
We study constrained Markov decision processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. Existing approaches have primarily focused on soft constraint violation, which allows compensation across episodes, making it easier to satisfy the constraints. In contrast, we consider a stronger hard constraint violation metric, where only positive constraint violations are accumulated. Our main result is the development of the first model-free, simulator-free algorithm that achieves a sub-linear regret and a sub-linear hard constraint violation simultaneously, even in large-scale systems. In particular, we show that Õ(√(d³H⁴K)) regret and Õ(√(d³H⁴K)) hard constraint violation bounds can be achieved, where K is the number of episodes, d is the dimension of the feature mapping, and H is the length of each episode. Our results are achieved via novel adaptations of the primal-dual LSVI-UCB algorithm: it searches for the dual variable that balances regret and constraint violation within every episode, rather than updating it at the end of each episode. This turns out to be crucial for our theoretical guarantees when dealing with hard constraint violations.
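The within-episode dual-variable search described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a hypothetical one-step toy in which, given estimated reward and utility action-values (`q_r`, `q_g`) at a single state, we bisect over the dual variable λ to find the smallest penalty whose greedy action under the composite value q_r + λ·q_g meets a utility threshold. The function name, signature, and the monotonicity assumption behind the bisection are all illustrative assumptions.

```python
import numpy as np

def dual_search(q_r, q_g, threshold, lam_max=10.0, iters=30):
    """Hypothetical sketch: bisect over the dual variable lam.

    q_r, q_g: shape-(n_actions,) arrays of estimated reward / utility
    action-values at a single state. We look for the smallest lam in
    [0, lam_max] whose greedy action w.r.t. q_r + lam * q_g has
    estimated utility >= threshold (assumes this utility is monotone
    in lam, which holds here but not in general).
    """
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        a = int(np.argmax(q_r + mid * q_g))
        if q_g[a] >= threshold:
            hi = mid  # constraint met: try a smaller penalty
        else:
            lo = mid  # constraint violated: increase the penalty
    a = int(np.argmax(q_r + hi * q_g))
    return hi, a
```

In this toy, with `q_r = [1.0, 0.2]`, `q_g = [0.0, 1.0]`, and threshold 0.5, the search converges to λ ≈ 0.8, the point where the constrained action overtakes the reward-greedy one; tuning λ per episode (rather than after it) mirrors the paper's motivation that only positive violations count, so the penalty must be right before the episode is rolled out.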
Identifier
85194197257 (Scopus)
Publication Title
Proceedings of Machine Learning Research
e-ISSN
2640-3498
First Page
1054
Last Page
1062
Volume
238
Grant
CNS-2112471
Fund Ref
Ohio State University
Recommended Citation
Ghosh, Arnob; Zhou, Xingyu; and Shroff, Ness, "Towards Achieving Sub-linear Regret and Hard Constraint Violation in Model-free RL" (2024). Faculty Publications. 988.
https://digitalcommons.njit.edu/fac_pubs/988