- Implement with C++ and OpenMP
- Use the Floyd-Warshall algorithm with multiple threads to improve performance
- Implement with CUDA on a single GPU
- Use the blocked Floyd-Warshall algorithm
- Coalesce global memory accesses to make better use of the cache
- Load data from global memory into shared memory to reduce access latency
- Prevent bank conflicts
- Same optimization technique as hw3-2
- Use multiple GPUs to share the computation
- Copy only a small amount of data at a time to reduce memory copy time
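
The two algorithms named in the list can be sketched on the CPU as follows. This is a minimal illustration, not the submitted homework code: the names `relax`, `relaxBlock`, `floydWarshall`, `blockedFloydWarshall`, and the `INF` sentinel are hypothetical, and the sketch assumes non-negative edge weights, a zero diagonal, and a row-major `n x n` distance matrix. The blocked version shows the three-phase structure that a CUDA kernel would map onto thread blocks; here only phase 3 is parallelized with OpenMP.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sentinel for "no edge" (assumption); INF + INF still fits in an int.
constexpr int INF = 1 << 29;

// Relax d[i][j] through pivot k on a row-major n x n matrix.
// Writing only on improvement leaves row k and column k untouched during
// step k (given d[k][k] >= 0), so parallel threads do not race on them.
inline void relax(std::vector<int>& d, int n, int i, int j, int k) {
    const int cand = d[i * n + k] + d[k * n + j];
    if (cand < d[i * n + j]) d[i * n + j] = cand;
}

// Plain Floyd-Warshall: the k loop is sequential, rows are split across threads.
void floydWarshall(std::vector<int>& d, int n) {
    for (int k = 0; k < n; ++k) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                relax(d, n, i, j, k);
    }
}

// Apply every pivot of round r to block (bi, bj); B is the block edge length.
static void relaxBlock(std::vector<int>& d, int n, int B, int r, int bi, int bj) {
    const int kEnd = std::min((r + 1) * B, n);
    const int iEnd = std::min((bi + 1) * B, n);
    const int jEnd = std::min((bj + 1) * B, n);
    for (int k = r * B; k < kEnd; ++k)
        for (int i = bi * B; i < iEnd; ++i)
            for (int j = bj * B; j < jEnd; ++j)
                relax(d, n, i, j, k);
}

// Blocked Floyd-Warshall: each round processes the pivot block, then the
// pivot row and column, then all remaining blocks.
void blockedFloydWarshall(std::vector<int>& d, int n, int B) {
    assert(B > 0);
    const int rounds = (n + B - 1) / B;
    for (int r = 0; r < rounds; ++r) {
        // Phase 1: the pivot block depends only on itself.
        relaxBlock(d, n, B, r, r, r);

        // Phase 2: blocks sharing a row or column with the pivot block.
        for (int b = 0; b < rounds; ++b) {
            if (b == r) continue;
            relaxBlock(d, n, B, r, r, b);
            relaxBlock(d, n, B, r, b, r);
        }

        // Phase 3: remaining blocks read only phase-2 results and write only
        // their own cells, so they can run in parallel.
        #pragma omp parallel for schedule(static)
        for (int bi = 0; bi < rounds; ++bi) {
            if (bi == r) continue;
            for (int bj = 0; bj < rounds; ++bj)
                if (bj != r) relaxBlock(d, n, B, r, bi, bj);
        }
    }
}
```

Compiling with `-fopenmp` (GCC/Clang) enables the pragmas; without it they are ignored and the code runs serially. Both functions update the matrix in place and should produce identical results for any positive block size `B`, including one that does not divide `n`.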