Light-weight ensemble Q-network joint implicit constraints for offline reinforcement learning

Citation Author(s): Yesen Chen
Submitted by: Yesen Chen
Last updated: Wed, 09/25/2024 - 07:14
DOI: 10.21227/w2jq-zy36

Abstract 

Offline reinforcement learning aims to learn policies from a limited dataset without interacting with the environment. However, the restricted nature of the dataset limits the agent's understanding of the environment, leading to out-of-distribution (OOD) behavior and extrapolation errors. Conventional research can be categorized into four main approaches: Q-value penalties, policy constraints, uncertainty estimation, and importance sampling. Most existing methods impose overly strict penalties. Therefore, this paper proposes an algorithm that encourages agents to explore unknown state-action pairs, relying on precise evaluation of OOD actions. First, to address the challenge of assessing Q-values for OOD actions, we discuss the equivalence of uncertainty quantification based on an ensemble of Q-function networks, which avoids the additional computational overhead of simulating OOD sampling. Furthermore, because of the fitting errors inherent in neural networks and the inability to effectively leverage reward information, methods such as behavior cloning struggle to learn better policies. We propose an approach that uses a high-confidence Q-function derived from uncertainty quantification to encourage agents to exploit suboptimal datasets, while implicitly constraining policies and enhancing policy improvement. Specifically, we map the behavioral process into Q-space, thereby constraining the learned policy while guiding policy selection toward high-confidence, high-Q-value OOD actions based on the gradient of the prior Q-function. This enables the policy constraint to effectively utilize reward information and improves algorithm performance by addressing fitting errors. Ultimately, we develop two algorithm variants, SF (SCORE-FAST) and SB (SCORE-BETTER). Theoretical analysis and experimental results demonstrate that SF achieves high performance with rapid convergence, while SB attains state-of-the-art performance. Our code will be published at https://github.com/dksen/SF-SB/tree/main. If interested, please contact the corresponding author.
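
For readers who want a concrete picture of the ensemble-based uncertainty quantification described above, the following PyTorch sketch shows one common way to obtain a high-confidence (lower-confidence-bound) Q-value from an ensemble of Q-networks. It is a minimal illustration only; the class and parameter names (EnsembleQ, confident_q, beta) are assumptions made for this sketch and are not taken from the released SF/SB code.

    import torch
    import torch.nn as nn

    class EnsembleQ(nn.Module):
        """N independent Q-networks; their disagreement on a (state, action)
        pair is used as an uncertainty estimate (illustrative sketch only)."""
        def __init__(self, state_dim, action_dim, n_members=5, hidden=256):
            super().__init__()
            self.members = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                    nn.Linear(hidden, 1),
                )
                for _ in range(n_members)
            ])

        def forward(self, state, action):
            x = torch.cat([state, action], dim=-1)
            # Stack member predictions: shape (n_members, batch, 1).
            return torch.stack([m(x) for m in self.members], dim=0)

        def confident_q(self, state, action, beta=1.0):
            # Lower-confidence-bound estimate: mean minus beta * std.
            # OOD actions tend to have a large ensemble std, so they
            # receive a pessimistic (low-confidence) value.
            qs = self.forward(state, action)
            return qs.mean(dim=0) - beta * qs.std(dim=0)

    # Example: score a batch of candidate actions.
    q_ens = EnsembleQ(state_dim=17, action_dim=6)
    s = torch.randn(32, 17)
    a = torch.randn(32, 6)
    lcb = q_ens.confident_q(s, a)   # shape (32, 1)

In this kind of sketch, the ensemble spread replaces explicit OOD sampling: no extra rollouts or simulated actions are needed to penalize uncertain state-action pairs, which is the computational advantage the abstract refers to.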

Instructions: 

The experiment directory contains the training-process data obtained from the experiments on the D4RL benchmark.
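
For convenience, the sketch below shows how an offline dataset from the D4RL benchmark can be loaded with the standard d4rl/gym API when reproducing runs; the task name halfcheetah-medium-v2 is only an illustrative choice, and the experiment directory itself stores logged training-process data rather than raw datasets.

    import gym
    import d4rl  # importing d4rl registers the offline environments with gym

    # Illustrative task; the experiments may cover other D4RL tasks as well.
    env = gym.make("halfcheetah-medium-v2")

    # qlearning_dataset returns transitions as (s, a, r, s', done) arrays,
    # the format most offline RL training loops consume.
    dataset = d4rl.qlearning_dataset(env)

    print(dataset["observations"].shape,
          dataset["actions"].shape,
          dataset["rewards"].shape)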
