Standard Dataset
High-performance OpenCL-based GEMM Optimization
- Submitted by: Shengle Lin
- Last updated: Tue, 04/16/2024 - 04:44
- DOI: 10.21227/0cxd-6706
Abstract
OpenCL has become the favored framework for emerging heterogeneous devices and FPGAs, owing to its versatility and portability.
However, OpenCL-based math libraries still face challenges in fully leveraging device performance.
When deploying high-performance numerical applications on these devices, the most performance-critical routine is General Matrix-Matrix Multiplication (GEMM).
This study presents a meticulously optimized OpenCL GEMM kernel.
Our enhanced GEMM kernel emphasizes two key improvements: 1) a three-level double-buffer pipeline that efficiently overlaps data fetching with floating-point computation;
2) a fine-grained private-memory prefetching strategy that increases device occupancy by improving register utilization.
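As a rough illustration of these two ideas, the sketch below shows a simplified OpenCL C GEMM kernel with double-buffered local-memory tiles and register staging. It is not the dataset's kernel: the names (sgemm_db, TS) and the one-C-element-per-work-item layout are illustrative assumptions, whereas the manuscript's kernel uses a deeper three-level pipeline and auto-tuned tile shapes.

```c
/* Minimal illustrative OpenCL C kernel (not the dataset's kernel): a tiled
 * SGEMM with double-buffered local-memory tiles, so fetching tile t+1 from
 * global memory overlaps with the computation on tile t, and with operands
 * staged through private memory (registers) in the inner loop.
 * Assumptions: row-major A (MxK), B (KxN), C (MxN); M, N, K divisible by TS;
 * one C element per work-item; work-group size (TS, TS). */
#ifndef TS
#define TS 16          /* tile size; a tuning parameter in practice */
#endif

__kernel void sgemm_db(const int M, const int N, const int K,
                       __global const float *A,
                       __global const float *B,
                       __global float *C)
{
    const int row = get_local_id(0);
    const int col = get_local_id(1);
    const int globalRow = get_group_id(0) * TS + row;   /* row of C    */
    const int globalCol = get_group_id(1) * TS + col;   /* column of C */

    /* Two local buffers per operand: one is computed on while the other is
     * being filled with the next tile (ping-pong double buffering). */
    __local float Asub[2][TS][TS];
    __local float Bsub[2][TS][TS];

    const int numTiles = K / TS;
    float acc = 0.0f;

    /* Preload tile 0 into buffer 0. */
    Asub[0][row][col] = A[globalRow * K + col];
    Bsub[0][row][col] = B[row * N + globalCol];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int t = 0; t < numTiles; t++) {
        const int cur = t & 1;       /* buffer holding the current tile */
        const int nxt = 1 - cur;     /* buffer receiving the next tile  */

        /* Prefetch tile t+1 while the current tile is consumed below. */
        if (t + 1 < numTiles) {
            Asub[nxt][row][col] = A[globalRow * K + (t + 1) * TS + col];
            Bsub[nxt][row][col] = B[((t + 1) * TS + row) * N + globalCol];
        }

        /* Inner product over the current tile; operands are copied into
         * private memory (registers) before the multiply-add. */
        for (int k = 0; k < TS; k++) {
            const float a_reg = Asub[cur][row][k];
            const float b_reg = Bsub[cur][k][col];
            acc += a_reg * b_reg;
        }

        /* Ensure the prefetched tile is complete before switching buffers. */
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[globalRow * N + globalCol] = acc;
}
```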
Furthermore, this work presents a Bayesian Optimization (BO) tuner for kernel auto-tuning.
Experimental results demonstrate considerable performance improvements on diverse OpenCL devices.
Additionally, the BO tuner shows superior efficiency and robustness, outperforming contemporary tuning methods.
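To give context for what the tuner optimizes, the sketch below (illustrative, not taken from this dataset) shows the per-configuration objective that an auto-tuner such as a BO tuner evaluates repeatedly: rebuild the kernel with candidate tuning parameters passed as -D build options, launch it, and return the profiled kernel time. The function name time_config, the kernel name sgemm_db, and the single TS parameter are assumptions; context, queue, and buffer setup is omitted.

```c
/* Illustrative per-configuration objective for an OpenCL kernel auto-tuner.
 * Assumes the caller has already created the context, device, a command queue
 * with CL_QUEUE_PROFILING_ENABLE, and the device buffers for A, B, and C. */
#include <CL/cl.h>
#include <stdio.h>

double time_config(cl_context ctx, cl_device_id dev, cl_command_queue queue,
                   const char *kernel_src,            /* e.g. the kernel above */
                   cl_mem bufA, cl_mem bufB, cl_mem bufC,
                   cl_int M, cl_int N, cl_int K, int tile_size)
{
    /* Inject the candidate tuning parameter as a compile-time macro. */
    char options[64];
    snprintf(options, sizeof(options), "-DTS=%d", tile_size);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    if (clBuildProgram(prog, 1, &dev, options, NULL, NULL) != CL_SUCCESS) {
        clReleaseProgram(prog);
        return -1.0;                 /* invalid configuration for this device */
    }
    cl_kernel krn = clCreateKernel(prog, "sgemm_db", NULL);

    clSetKernelArg(krn, 0, sizeof(cl_int), &M);
    clSetKernelArg(krn, 1, sizeof(cl_int), &N);
    clSetKernelArg(krn, 2, sizeof(cl_int), &K);
    clSetKernelArg(krn, 3, sizeof(cl_mem), &bufA);
    clSetKernelArg(krn, 4, sizeof(cl_mem), &bufB);
    clSetKernelArg(krn, 5, sizeof(cl_mem), &bufC);

    /* One work-item per C element; M and N assumed divisible by tile_size. */
    size_t gws[2] = { (size_t)M, (size_t)N };
    size_t lws[2] = { (size_t)tile_size, (size_t)tile_size };

    cl_event ev;
    clEnqueueNDRangeKernel(queue, krn, 2, NULL, gws, lws, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    /* Kernel execution time from OpenCL profiling events. */
    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(cl_ulong), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(cl_ulong), &end, NULL);

    clReleaseEvent(ev);
    clReleaseKernel(krn);
    clReleaseProgram(prog);
    return (double)(end - start) * 1e-6;   /* ns -> ms */
}
```

A Bayesian Optimization tuner feeds such measured runtimes back into its surrogate model to choose the next configuration to evaluate, which is what makes it more sample-efficient than exhaustive search.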
The dataset provides the results and figures presented in the manuscript.
Dataset Files
- Results.zip (2.56 MB)
- CLBLASt-modified-master.zip (1.06 MB)