BLAS 1: Performance Auto-Tuning of a Memory-Bound CUDA BLAS Kernel


Toshiyuki Imamura

13:40:00 - 14:05:00

Room 101, Mathematics Research Center Building (originally New Math. Bldg.)

Development of a GPGPU numerical kernel code is a typical target for performance auto-tuning. The rapid turnover of GPU generations demands a tuning process that is economical (in human effort, power consumption, etc.) when moving from an old generation to a new one. In this study, we developed high-performance level-2 BLAS kernels, which compute the product of a matrix and a vector. For the SYMV kernel, the required memory throughput, Byte/flop — defined as 'bytes of memory to be transferred' divided by 'floating-point operations' — can be reduced to half that of the GEMV kernel by taking symmetry into account in the optimization. However, SYMV kernels implemented in CUDA still contain a large parameter space to be searched. Sieving the candidate parameter sets and selecting an appropriately fast kernel code is therefore essential. Based on a champion-ranking scheme and d-spline interpolation, we eventually obtained 110 and 53 GFLOPS for SSYMV and DSYMV computation, respectively, on a Tesla K20 (the new Kepler core architecture). These results correspond to about 80% of the upper bound estimated by a bandwidth benchmark. We conclude that automatic tuning achieved a high-quality optimization.
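The Byte/flop argument in the abstract can be checked with a short back-of-the-envelope calculation. This sketch (not from the talk; the matrix size and single-precision element size are illustrative assumptions) shows why exploiting symmetry halves the ratio: SYMV reads only about half of the matrix while performing the same 2n² flops as GEMV.

```python
def byte_per_flop(bytes_moved, flops):
    """Required memory throughput ratio: bytes transferred per floating-point op."""
    return bytes_moved / flops

n = 4096   # matrix dimension (arbitrary example size)
elem = 4   # bytes per element in single precision (float32)

# GEMV y = A*x: all n*n elements of A must be read; 2*n*n flops (multiply + add).
gemv = byte_per_flop(elem * n * n, 2 * n * n)

# SYMV with symmetry exploited: only ~n*n/2 elements of A need to be read,
# while the flop count stays 2*n*n, so the ratio is halved.
symv = byte_per_flop(elem * n * n / 2, 2 * n * n)

print(gemv, symv)  # 2.0 bytes/flop vs 1.0 bytes/flop
```

The vector traffic (O(n) bytes) is neglected here, which is reasonable for large n; the same halving applies in double precision, where both ratios double.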
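The abstract's champion-ranking scheme and d-spline interpolation are not detailed here, so the following is only a minimal, hypothetical sketch of the underlying idea: benchmark candidate parameter sets and keep the fastest ("champion") configuration. The parameter names and the stand-in workload are invented for illustration; the real method prunes the search with ranking and d-spline interpolation rather than exhaustive enumeration.

```python
import itertools
import time

def benchmark(kernel, reps=5):
    """Time a candidate kernel; return the best of several repetitions (seconds)."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        kernel()
        best = min(best, time.perf_counter() - t0)
    return best

def champion_search(make_kernel, param_grid):
    """Benchmark every parameter combination and keep the fastest configuration.
    (A full enumeration; the talk's scheme sieves this space instead.)"""
    champion, champion_time = None, float("inf")
    for values in itertools.product(*param_grid.values()):
        config = dict(zip(param_grid.keys(), values))
        t = benchmark(make_kernel(config))
        if t < champion_time:
            champion, champion_time = config, t
    return champion, champion_time

# Hypothetical CUDA-like tuning parameters (block size, unroll factor, ...).
grid = {"threads_per_block": [64, 128, 256], "unroll": [1, 2, 4]}

# Stand-in "kernel": a pure-Python loop whose cost shrinks with the unroll knob,
# so the search has a deterministic winner to find.
def make_kernel(cfg):
    work = 10000 // cfg["unroll"]
    return lambda: sum(range(work))

best_cfg, best_t = champion_search(make_kernel, grid)
print(best_cfg)
```

In a real tuner the benchmarked quantity would be the measured GFLOPS of each compiled CUDA kernel variant, and interpolation over the parameter space would let most candidates be skipped.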