Matrix Multiplication Using CUDA, With and Without Tiling

Tiling improves performance by staging sub-blocks (tiles) of the input matrices in the device's shared memory, so that each value loaded from global memory is reused by several threads in the same block.

Some links for understanding tiling:
https://www.youtube.com/watch?v=tGu5DyIlofY
http://www.umiacs.umd.edu/~ramani/cmsc828e_gpusci/Lecture5.pdf
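
As a rough illustration of the tiling idea, here is a minimal sketch of a tiled kernel with a small host wrapper. It is not necessarily how the code in this repository is written. The names matmulTiledKernel, matmulTiled, and TILE are hypothetical, the tile width of 16 is an assumed value, matrices are assumed to be row-major floats, and error checking on the CUDA calls is left out to keep the example short.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Assumed tile width: 16 x 16 = 256 threads per block.
constexpr int TILE = 16;

// Computes C = A * B for row-major matrices.
// A is M x K, B is K x N, C is M x N.
__global__ void matmulTiledKernel(const float* A, const float* B, float* C,
                                  int M, int K, int N) {
    // One tile of A and one tile of B live in shared memory per block.
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C owned by this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C owned by this thread
    float acc = 0.0f;

    int numTiles = (K + TILE - 1) / TILE;
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Each thread loads one element of each tile; out-of-range
        // positions are padded with zero so they do not affect the sum.
        tileA[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();  // wait until both tiles are fully loaded

        // Partial dot product using only shared memory reads.
        for (int k = 0; k < TILE; ++k) {
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();  // wait before the tiles are overwritten
    }

    if (row < M && col < N) {
        C[row * N + col] = acc;
    }
}

// Hypothetical host wrapper: copies inputs to the device, launches the
// kernel, and copies the result back. CUDA error checks are omitted.
std::vector<float> matmulTiled(const std::vector<float>& A,
                               const std::vector<float>& B,
                               int M, int K, int N) {
    assert(A.size() == static_cast<size_t>(M) * K);
    assert(B.size() == static_cast<size_t>(K) * N);
    std::vector<float> C(static_cast<size_t>(M) * N);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    matmulTiledKernel<<<grid, block>>>(dA, dB, dC, M, K, N);

    // A blocking cudaMemcpy on the default stream waits for the kernel.
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

In this sketch each element of A and B is read from global memory once per tile and then reused TILE times from shared memory, which is where the speedup over the untiled version is expected to come from. Calling matmulTiled(A, B, M, K, N) from a main function and compiling with nvcc should be enough to try it out; the best tile width depends on the GPU.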