Offline-to-online reinforcement learning is a key strategy for moving reinforcement learning toward practical applications. The approach not only reduces the risks and costs associated with online exploration, but also accelerates the agent's adaptation to real-world environments. It consists of two phases, offline training and online fine-tuning, and each phase poses distinct challenges. During offline training, the main difficulty is learning a strong policy from a limited dataset whose coverage of the state-action distribution is incomplete.
Offline reinforcement learning aims to learn policies from a fixed dataset without further interaction with the environment. However, the limited coverage of the dataset restricts the agent's knowledge of the environment, leading to out-of-distribution (OOD) behavior and extrapolation errors. Prior work can be broadly categorized into four approaches: Q-value penalties, policy constraints, uncertainty estimation, and importance sampling. Most existing methods, however, impose overly strict penalties on OOD behavior.
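As one illustrative instance of the Q-value-penalty family (a sketch in the spirit of conservative Q-learning, not the formulation proposed here), the standard Bellman error can be augmented with a term that pushes down Q-values on actions sampled from the learned policy $\pi$ while pushing up Q-values on actions in the dataset $\mathcal{D}$, with a coefficient $\alpha$ controlling the degree of conservatism:
\[
\min_{Q}\;\; \alpha \Big( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big] - \mathbb{E}_{(s,a) \sim \mathcal{D}}\big[Q(s,a)\big] \Big) \;+\; \tfrac{1}{2}\, \mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[\big(Q(s,a) - \mathcal{B}^{\pi}\bar{Q}(s,a)\big)^{2}\Big],
\]
where $\mathcal{B}^{\pi}\bar{Q}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\big[\bar{Q}(s',a')\big]$ denotes the Bellman backup under a target network $\bar{Q}$. When $\alpha$ is large, the penalty term dominates and the learned values become pessimistic for any action outside the dataset, which is the sense in which such penalties can be overly strict.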