Citation Author(s):
Beihang University
Submitted by:
Last updated:
Mon, 06/17/2024 - 12:10
Research Article Link:


Because real samples and ground truth are difficult to obtain, generalization performance and fine-tuned performance are critical to the feasibility of stereo matching methods in real-world applications. However, datasets differ substantially in disparity distribution and density, which poses a formidable challenge to generalizing and fine-tuning a model. In this paper, we propose a novel stereo matching method, called SR-Stereo, which mitigates the distributional differences across datasets by predicting disparity clips, and which uses a loss weight related to the regression target scale to improve the accuracy of those clips. Moreover, this stepwise regression architecture can be easily extended to existing iteration-based methods to improve their performance without changing their structure. In addition, to mitigate the edge blurring of models fine-tuned on sparse ground truth, we propose Domain Adaptation Based on Pre-trained Edges (DAPE). Specifically, we use the predicted disparity and the RGB image to estimate the edge map of the target domain image. The edge map is filtered to generate edge map background pseudo-labels, which, together with the sparse ground truth disparity on the target domain, are used as supervision to jointly fine-tune the pre-trained stereo matching model. The proposed methods are extensively evaluated on SceneFlow, KITTI, Middlebury 2014 and ETH3D. SR-Stereo achieves competitive disparity estimation performance and state-of-the-art cross-domain generalization performance. Meanwhile, the proposed DAPE significantly improves the disparity estimation performance of fine-tuned models, especially in textureless and detail regions.
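The idea of regressing bounded disparity clips rather than the full disparity can be illustrated with a small sketch. This is not the authors' implementation: the function names, the clip range `max_clip`, the zero initial disparity, and the assumption of an ideal predictor that hits each target exactly are all hypothetical and chosen only for illustration. The loss weight mentioned in the abstract is not modeled here, since its form is not given.

```python
def clip_residual(gt_disp, cur_disp, max_clip):
    """Regression target for one step: the remaining error, clamped to
    [-max_clip, max_clip]. max_clip is a hypothetical parameter."""
    residual = gt_disp - cur_disp
    return max(-max_clip, min(max_clip, residual))


def stepwise_targets(gt_disp, num_steps, max_clip, init_disp=0.0):
    """Per-step clip targets and the final estimate, assuming (for
    illustration only) that each step predicts its target exactly."""
    cur = init_disp
    targets = []
    for _ in range(num_steps):
        clip = clip_residual(gt_disp, cur, max_clip)
        targets.append(clip)
        cur += clip  # ideal predictor: the estimate moves by the clip
    return targets, cur


# A small and a large ground-truth disparity yield targets in the same range.
small_targets, small_final = stepwise_targets(10.0, num_steps=4, max_clip=4.0)
large_targets, large_final = stepwise_targets(150.0, num_steps=4, max_clip=4.0)
```

In this sketch every per-step target lies in `[-max_clip, max_clip]` whether the true disparity is 10 or 150 pixels, which is one way to read the claim that predicting clips reduces the effect of dataset-specific disparity distributions. A large disparity simply needs more steps to reach.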


All ablation versions of SR-Stereo are trained on SceneFlow with a batch size of 4 for 50k steps, while the final version of SR-Stereo is trained on SceneFlow with a batch size of 8 for 200k steps. The final model and the ablation versions use a one-cycle learning rate schedule with learning rates of 0.0002 and 0.0001, respectively. We evaluate the generalization performance of the proposed method by testing directly on the 27 training pairs from ETH3D and the 15 training pairs from Middlebury 2014. For the experiments related to the edge estimator, we jointly train the stereo model and the proposed edge estimator on SceneFlow with a batch size of 4 for 50k steps, using a one-cycle learning rate schedule with a learning rate of 0.0001. We then use the pre-trained stereo model and edge estimator to generate edge pseudo-labels for the target domains. The fine-tuning settings differ by dataset. For KITTI, we use a batch size of 4 and fine-tune the model for 50k steps with an initial learning rate of 0.0001. For ETH3D, we use a batch size of 2 and fine-tune the model for 2k steps, also with an initial learning rate of 0.0001. For Middlebury 2014, we use a batch size of 2 and fine-tune the model for 4k steps with an initial learning rate of 0.00002.
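For convenience, the settings above can be collected into a small table in code, together with a rough sketch of a one-cycle schedule. The numbers in `SCHEDULES` come from the text. Everything else is an assumption made for illustration: the key names, the `Schedule` class, and the shape of `one_cycle_lr` (linear warm-up and linear decay, with `warmup_frac`, `div_factor` and `final_div_factor` as hypothetical parameters). The exact schedule shape used by the authors is not stated, and the text does not say whether the fine-tuning runs also use a one-cycle schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    dataset: str
    batch_size: int
    steps: int
    learning_rate: float  # "learning rate" / "initial learning rate" in the text


# Values taken from the description; the dictionary keys are hypothetical.
SCHEDULES = {
    "sceneflow_ablation": Schedule("SceneFlow", 4, 50_000, 1e-4),
    "sceneflow_final": Schedule("SceneFlow", 8, 200_000, 2e-4),
    "sceneflow_edge_estimator": Schedule("SceneFlow", 4, 50_000, 1e-4),
    "kitti_finetune": Schedule("KITTI", 4, 50_000, 1e-4),
    "eth3d_finetune": Schedule("ETH3D", 2, 2_000, 1e-4),
    "middlebury_finetune": Schedule("Middlebury 2014", 2, 4_000, 2e-5),
}


def one_cycle_lr(step, total_steps, peak_lr,
                 warmup_frac=0.3, div_factor=25.0, final_div_factor=1e4):
    """Simplified one-cycle schedule (assumed shape): linear warm-up from
    peak_lr / div_factor to peak_lr, then linear decay to a small final value."""
    initial_lr = peak_lr / div_factor
    final_lr = initial_lr / final_div_factor
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        t = step / warmup_steps
        return initial_lr + t * (peak_lr - initial_lr)
    decay_steps = max(1, total_steps - warmup_steps)
    t = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr + t * (final_lr - peak_lr)


cfg = SCHEDULES["sceneflow_final"]
lr_at_start = one_cycle_lr(0, cfg.steps, cfg.learning_rate)
samples_seen = cfg.batch_size * cfg.steps  # stereo pairs processed in total
```

With a framework such as PyTorch, the built-in `torch.optim.lr_scheduler.OneCycleLR` would normally replace the hand-written function; the pure-Python version is used here only to keep the sketch dependency-free.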