High-performance OpenCL-based GEMM Optimization

Citation Author(s):
Shengle Lin, Hunan University
Submitted by:
Shengle Lin
Last updated:
Tue, 04/16/2024 - 04:44
DOI:
10.21227/0cxd-6706

Abstract 

    OpenCL has become the favored framework for emerging heterogeneous devices and FPGAs, owing to its versatility and portability.

    However, OpenCL-based math libraries still face challenges in fully leveraging device performance.

    When deploying high-performance numerical applications on these devices, the most performance-critical hotspot is General Matrix-Matrix Multiplication (GEMM).
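
    For reference, GEMM here refers to the standard BLAS-style update C ← α·A·B + β·C, for an M×K matrix A, a K×N matrix B, and an M×N matrix C.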

    This study presents a meticulously optimized OpenCL GEMM kernel.

    Our enhanced GEMM kernel emphasizes two key improvements: 1) a three-level double-buffer pipeline that efficiently overlaps data fetching with floating-point computation;

    2) a fine-grained private-memory prefetching strategy that increases device occupancy by improving register utilization. A simplified kernel sketch of both ideas is shown below.
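
    Since the kernel source itself is not included on this page, the listing below is only a minimal OpenCL C sketch of the general idea: ping-pong double buffering in local memory combined with a private-memory (register) accumulator. It assumes TS×TS tiles, row-major storage, matrix dimensions divisible by TS, and one output element per work-item; the kernel name and tile size are illustrative, and the authors' actual three-level pipeline and prefetching scheme are more elaborate.

/* Hypothetical illustration only -- not the dataset's actual kernel.  */
/* Launch with global size (N, M) and local size (TS, TS); M, N, K are */
/* assumed to be multiples of TS, and all matrices are row-major.      */
#define TS 16

__kernel void gemm_double_buffered(const int M, const int N, const int K,
                                   const float alpha, const float beta,
                                   __global const float* A,  /* M x K */
                                   __global const float* B,  /* K x N */
                                   __global float* C)        /* M x N */
{
    const int col = get_local_id(0);                 /* fastest-varying id */
    const int row = get_local_id(1);
    const int globalCol = TS * get_group_id(0) + col;
    const int globalRow = TS * get_group_id(1) + row;

    /* Two local-memory tile pairs: while one is consumed, the next is loaded. */
    __local float Asub[2][TS][TS];
    __local float Bsub[2][TS][TS];

    float acc = 0.0f;                                /* private (register) accumulator */

    const int numTiles = K / TS;
    int buf = 0;

    /* Preload the first tile pair into buffer 0. */
    Asub[0][row][col] = A[globalRow * K + col];
    Bsub[0][row][col] = B[row * N + globalCol];

    for (int t = 0; t < numTiles; ++t) {
        /* One barrier per iteration: the current tile is now visible to all
           work-items, and everyone has finished reading the buffer that the
           next prefetch will overwrite. */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Prefetch the next tile pair into the other buffer while computing. */
        const int next = 1 - buf;
        if (t + 1 < numTiles) {
            const int k0 = TS * (t + 1);
            Asub[next][row][col] = A[globalRow * K + k0 + col];
            Bsub[next][row][col] = B[(k0 + row) * N + globalCol];
        }

        /* Multiply-accumulate over the current tile. */
        for (int k = 0; k < TS; ++k)
            acc += Asub[buf][row][k] * Bsub[buf][k][col];

        buf = next;
    }

    C[globalRow * N + globalCol] = alpha * acc + beta * C[globalRow * N + globalCol];
}

    In this sketch a single barrier per iteration both publishes the freshly prefetched tile and guarantees that every work-item has finished reading the tile about to be overwritten, so the global loads for tile t+1 can be issued while tile t is being consumed.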

    Furthermore, this work presents a Bayesian Optimization (BO) tuner for kernel auto-tuning.

    Experimental results demonstrate considerable performance gains across diverse OpenCL devices.

    Additionally, the BO tuner demonstrates superior efficiency and robustness, outperforming contemporary tuning methods.

Instructions: 

The results and figures are presented in the accompanying manuscript.