How to Avoid Global Synchronization by Domino Scheme for Sparse Triangular Solve, Incomplete Cholesky Factorization and Incomplete LU Factorization?
10:10:00 - 11:00:00
GPUs have shown their power in several applications, including FFT, BLAS3 and molecular dynamics, which are computationally intensive and have a regular structure that takes advantage of wide I/O. However, GPUs so far do not perform well on inherently sequential problems such as sparse linear algebra. This talk focuses on three operations: sparse triangular solve, incomplete Cholesky factorization and incomplete LU factorization, all of which are heavily used in preconditioning linear systems. We want to address several issues, including 1) how to obtain reproducible results without atomics, 2) how to track the dependence graph within a single kernel, and 3) how to keep the working space small, because a GPU has limited device memory.
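To make the sequential nature of sparse triangular solve concrete, here is a minimal CPU sketch in Python. It is not the Domino scheme from the talk; it only illustrates the dependence graph that such a scheme has to track. The function names `lower_solve_csr` and `dependency_levels` are hypothetical, and the sketch assumes a lower triangular matrix stored in CSR format with every diagonal entry present and nonzero.

```python
def lower_solve_csr(indptr, indices, data, b):
    """Solve L x = b by forward substitution, L lower triangular in CSR.

    Assumes every row stores its (nonzero) diagonal entry.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        acc = b[i]
        diag = None
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                # Row i cannot finish until x[j] is known: this is the dependence.
                acc -= data[k] * x[j]
            elif j == i:
                diag = data[k]
        x[i] = acc / diag
    return x


def dependency_levels(indptr, indices):
    """Return, for each row, the length of its longest chain of dependencies.

    Rows with the same level do not depend on each other and could be
    solved in parallel; rows at level k need all lower levels to be done.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            if j < i:
                level[i] = max(level[i], level[j] + 1)
    return level


# Example: L = [[2,0,0,0],[1,1,0,0],[0,0,4,0],[0,3,1,2]] in CSR form.
indptr = [0, 1, 3, 4, 7]
indices = [0, 0, 1, 2, 1, 2, 3]
data = [2.0, 1.0, 1.0, 4.0, 3.0, 1.0, 2.0]
b = [2.0, 3.0, 8.0, 12.0]

x = lower_solve_csr(indptr, indices, data, b)
levels = dependency_levels(indptr, indices)
```

In this example rows 0 and 2 have no dependencies, row 1 waits on row 0, and row 3 waits on rows 1 and 2. A straightforward parallel version would process one level at a time with a global synchronization between levels, which is the cost the talk title refers to avoiding.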