Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks
Yaohung M. Tsai
2016-12-19 10:30 - 12:30
Room 430, Astronomy and Mathematics Building
We present a portable and highly optimized Deep Neural Network (DNN) algorithm and its implementation techniques. Our approach combines, in novel ways, existing HPC techniques such as autotuning, data layout, and low-level optimizations that, when applied simultaneously, achieve performance that matches or exceeds what is possible with either reverse engineering and manual assembly coding or proprietary vendor libraries. The former was done in the maxDNN implementation, and the latter is represented by cuDNN. Our work may be directly applied to the most time-consuming part of the DNN workflow, namely the training process, which often needs to be restarted when it stagnates due to, among other reasons, diminishing gradients and getting stuck in local minima. For our performance tests we used a consumer-grade GPU with the latest High Bandwidth Memory (HBM) stack, which can match a number of server-grade devices at a fraction of the price; this attests to the portability of our approach and implementation.