EMDSAC-ft: Bridging the Gap in Offline-to-Online Reinforcement Learning through Value Distribution Learning

Citation Author(s):
Yesen Chen
Submitted by:
Yesen Chen
Last updated:
Mon, 10/28/2024 - 03:07
DOI:
10.21227/yyer-m215
License:
Categories:
Keywords:

Abstract 

Offline-to-online learning is a key strategy for advancing reinforcement learning towards practical applications. This approach not only reduces the risks and costs associated with online exploration, but also accelerates the agent's adaptation to real-world environments. It consists of two phases, offline training and online fine-tuning, each of which poses distinct challenges. In offline training, the main difficulty is learning a strong policy from a limited dataset that covers the state-action space only partially. Conventional value-based reinforcement learning learns only the expectation of the return under the standard Bellman operator. With a limited offline dataset, however, learning only this expectation leaves the Q-function uncertain because of the inherent stochasticity of the environment. We therefore propose EMDSAC, an offline reinforcement learning algorithm that learns the value distribution with an ensemble: it uses an ensemble framework to penalize the uncertainty caused by out-of-distribution (OOD) actions, and distributional reinforcement learning to mitigate the uncertainty induced by the environment's stochasticity. Our theoretical analysis shows how suboptimality in offline reinforcement learning can be eliminated from a distributional perspective. For fine-tuning, we first analyze the causes of the catastrophic performance drop. We then propose to eliminate the uneven distribution of pessimism (UDP[1]) in the learned value distribution during the policy evaluation phase, and to employ the True Trust Region Policy Improvement (TTRPI) method during the policy improvement phase. The former reduces the bias between the learned value distribution and the true value distribution, while the latter dynamically controls the extent of policy updates based on the accuracy of the learned value distribution. Together these yield the EMDSAC-ft algorithm. Our experiments show that EMDSAC achieves state-of-the-art (SoTA) performance among model-free reinforcement learning methods on the D4RL benchmark. Compared with previous online fine-tuning algorithms, EMDSAC-ft is faster and improves average performance on suboptimal datasets by more than 45%. Our code will be published at dksen/EMDSAC-ft. If interested, please contact the corresponding author.
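
The two ingredients named above (an ensemble of critics that penalizes OOD uncertainty, and distributional value learning) can be illustrated with a minimal sketch. The code below is not the released EMDSAC implementation; the class and function names, the quantile parameterization, and the disagreement-based penalty are illustrative assumptions written in PyTorch.

```python
import torch
import torch.nn as nn

class QuantileCritic(nn.Module):
    """One critic head predicting N quantiles of the return distribution Z(s, a)."""
    def __init__(self, state_dim, action_dim, n_quantiles=25, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_quantiles),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))  # (batch, n_quantiles)


def pessimistic_distributional_target(critics, reward, next_state, next_action,
                                      gamma=0.99, beta=1.0):
    """Distributional Bellman target penalized by ensemble disagreement.

    The standard deviation of the ensemble members' mean values tends to grow
    for out-of-distribution actions, so subtracting beta * std pushes their
    targets down (a pessimism penalty), while the quantile outputs keep the
    return distribution rather than only its expectation.
    """
    with torch.no_grad():
        z = torch.stack([c(next_state, next_action) for c in critics])  # (E, B, N)
        disagreement = z.mean(dim=-1).std(dim=0)                        # (B,)
        target = reward.unsqueeze(-1) + gamma * (z.mean(dim=0)
                                                 - beta * disagreement.unsqueeze(-1))
    return target  # (B, N) target quantiles
```

In a full agent, this target would be regressed with a quantile (e.g. Huber) loss; the actual EMDSAC update may aggregate the ensemble and shape the penalty differently.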



[1] UDP (Uneven Distribution of Pessimism) arises when a value penalty is applied during pessimistic policy evaluation [22], with the degree of pessimism represented by Γ(s,a). UDP refers to the variation of Γ(s,a) across different state-action pairs. The value-penalty framework introduces Γ(s,a) as a regularization term for out-of-distribution (OOD) actions; this term is influenced by the uneven distribution of the dataset, which leads to the formation of UDP.
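
For concreteness, a pessimistic evaluation operator of the value-penalty kind referred to above is often written in the following form; the exact equation used in [22] and in EMDSAC-ft is not reproduced on this page, so the expression below is only an illustrative sketch:

```latex
\widehat{\mathcal{T}}^{\pi} Q(s,a) \;=\; r(s,a)
  \;+\; \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}\big[ Q(s',a') \big]
  \;-\; \Gamma(s,a)
```

where Γ(s,a) ≥ 0 is the pessimism penalty, typically larger for OOD actions; UDP is the variation of Γ(s,a) across state-action pairs.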

Instructions: 

Experimental data

Funding Agency: 
Natural Science Foundation of Zhejiang Province
Grant Number: 
No. LTGC23F030001