High-performance OpenCL-based GEMM Optimization
- Submitted by: Shengle Lin
- Last updated: Tue, 04/16/2024 - 04:44
- DOI: 10.21227/0cxd-6706
Abstract
OpenCL has become the favored framework for emerging heterogeneous devices and FPGAs, owing to its versatility and portability.
However, OpenCL-based math libraries still face challenges in fully leveraging device performance.
When deploying high-performance arithmetic applications on these devices, the most performance-critical hot function is General Matrix-Matrix Multiplication (GEMM).
This study presents a meticulously optimized OpenCL GEMM kernel.
Our enhanced GEMM kernel emphasizes two key improvements: 1) a three-level double-buffer pipeline that efficiently overlaps data fetching with floating-point computation;
2) a fine-grained private-memory prefetching strategy that improves register utilization and thereby increases device occupancy.
Furthermore, this work presents a Bayesian Optimization (BO) tuner for kernel auto-tuning.
Experimental results demonstrate considerable optimization gains and performance advantages on diverse OpenCL devices.
Additionally, the BO tuner demonstrates superior efficiency and robustness, outperforming contemporary tuning methods.
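The abstract does not spell out the three pipeline levels or the exact blocking scheme; as a rough illustration of double buffering at the local-memory level only, the minimal OpenCL C sketch below overlaps the prefetch of the next K-tile with computation on the current one. The kernel name (sgemm_db), the TILE macro, and the layout assumptions (row-major matrices, dimensions divisible by TILE, TILE×TILE work-groups) are illustrative choices, not taken from the dataset's modified CLBlast kernels.

```c
// gemm_double_buffer.cl -- illustrative sketch only; names and parameters
// are not taken from the dataset's kernels.
// Assumes: row-major A (MxK), B (KxN), C (MxN); M, N, K divisible by TILE;
// work-group size (TILE, TILE).
#define TILE 16

__kernel void sgemm_db(const int M, const int N, const int K,
                       __global const float *A,
                       __global const float *B,
                       __global float *C)
{
    const int row = get_local_id(1);                  // 0 .. TILE-1
    const int col = get_local_id(0);                  // 0 .. TILE-1
    const int globalRow = TILE * get_group_id(1) + row;
    const int globalCol = TILE * get_group_id(0) + col;

    // Two local-memory buffers per input: while the work-group computes on
    // buffer `cur`, the next K-tile is loaded into buffer `1 - cur`.
    __local float Asub[2][TILE][TILE];
    __local float Bsub[2][TILE][TILE];

    float acc = 0.0f;
    const int numTiles = K / TILE;

    // Preload the first tile into buffer 0.
    Asub[0][row][col] = A[globalRow * K + col];
    Bsub[0][row][col] = B[row * N + globalCol];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int t = 0; t < numTiles; t++) {
        const int cur = t & 1;
        const int nxt = 1 - cur;

        // Prefetch the next tile into the idle buffer; on devices that can
        // dual-issue memory and ALU instructions this overlaps with the
        // multiply-accumulate loop below.
        if (t + 1 < numTiles) {
            Asub[nxt][row][col] = A[globalRow * K + (t + 1) * TILE + col];
            Bsub[nxt][row][col] = B[((t + 1) * TILE + row) * N + globalCol];
        }

        // Compute on the current buffer.
        for (int k = 0; k < TILE; k++)
            acc += Asub[cur][row][k] * Bsub[cur][k][col];

        // Make the prefetched tile visible before switching buffers.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[globalRow * N + globalCol] = acc;
}
```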
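The BO tuner is likewise only summarized above. The host-side C sketch below shows the general shape of an auto-tuning loop over hypothetical kernel parameters (tile size, vector width, unroll factor); plain random sampling stands in for the BO surrogate and acquisition step, and benchmark_config() is a stub in place of compiling and timing the kernel, so none of these names come from the dataset's tuner.

```c
/* tuner_sketch.c -- illustrative auto-tuning loop only.  Random sampling
 * stands in for the Bayesian-Optimization surrogate/acquisition step, and
 * benchmark_config() is a stub in place of building and timing the OpenCL
 * kernel with the chosen parameters. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int tile;       /* work-group tile size, e.g. 8/16/32   */
    int vec_width;  /* vector width for loads, e.g. 1/2/4/8 */
    int unroll;     /* inner-loop unroll factor             */
} Config;

/* Placeholder objective: a real tuner would compile the kernel with -D
 * options for these parameters, run it, and return measured GFLOP/s.  */
static double benchmark_config(const Config *c)
{
    return c->tile * 1.0 + c->vec_width * 0.5 + c->unroll * 0.25;
}

int main(void)
{
    const int tiles[]   = { 8, 16, 32 };
    const int vecs[]    = { 1, 2, 4, 8 };
    const int unrolls[] = { 1, 2, 4 };

    Config best = { 0, 0, 0 };
    double best_score = -1.0;

    srand(42);
    for (int trial = 0; trial < 20; trial++) {
        /* A BO tuner would choose the next configuration by maximizing an
         * acquisition function over a surrogate model fitted to previous
         * (config, score) pairs; here we simply sample uniformly.       */
        Config c = {
            tiles[rand() % 3],
            vecs[rand() % 4],
            unrolls[rand() % 3],
        };

        double score = benchmark_config(&c);
        if (score > best_score) {
            best_score = score;
            best = c;
        }
    }

    printf("best: tile=%d vec=%d unroll=%d (score %.2f)\n",
           best.tile, best.vec_width, best.unroll, best_score);
    return 0;
}
```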
Dataset Files
- Results.zip (2.56 MB)
- CLBLASt-modified-master.zip (1.06 MB)