2023-11-12 10:33:12 PM
alxhill committed Nov 13, 2023
1 parent ecfea12 commit 9f53f56
Showing 1 changed file with 21 additions and 0 deletions.
21 changes: 21 additions & 0 deletions topics/metal-splats.md
@@ -6,9 +6,28 @@
- [Metal Sample Code](https://developer.apple.com/metal/sample-code/)
- [Performing Calculations on a GPU - sample code](https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu)
- [Simple GPU Sorting Tutorial in O(n^2)](https://www.alanzucconi.com/2017/12/13/gpu-sorting-1/)
- [NVIDIA GPU Gems 2, Chapter 46: Improved GPU Sorting](https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting)

## 2023-11-12

### Algorithms / Reference

The [NVIDIA book chapter](https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting) mentions two algorithms for sorting. The first is the O(n^2) technique covered in the [other blog post](https://www.alanzucconi.com/2017/12/13/gpu-sorting-1), which can be implemented trivially with a single shader. It can also be done with two shaders that alternate, but I eventually realised you can just shift the offsets and grid size and re-use the same kernel the whole time.

```metal
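#include <metal_stdlib>
using namespace metal;

// One compare-swap pass: each thread orders one adjacent pair in place.
kernel void slow_sort(device unsigned int* data, uint index [[thread_position_in_grid]])
{
    uint idx = index * 2;        // first element of this thread's pair
    uint left = data[idx];
    uint right = data[idx + 1];

    data[idx] = min(left, right);      // smaller value goes first
    data[idx + 1] = max(left, right);  // larger value goes second
}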
```
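
To make the "shift the offsets and grid size" idea concrete, here is a minimal CPU sketch in C++ of how a host loop could drive a single compare-swap kernel. It is not the GPU host code from this repo: `slow_sort_pass` and `slow_sort_cpu` are hypothetical names, and the assumption is that `offset` stands in for however the start of the buffer is shifted on the GPU side, while `thread_count` stands in for the grid size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU stand-in for one dispatch of the slow_sort kernel (hypothetical helper).
// Each loop iteration plays the role of one GPU thread and orders the pair
// at (offset + 2*i, offset + 2*i + 1).
void slow_sort_pass(std::vector<std::uint32_t>& data,
                    std::size_t offset,
                    std::size_t thread_count)
{
    // Every pair touched by this pass must lie inside the buffer.
    assert(offset + 2 * thread_count <= data.size());
    for (std::size_t i = 0; i < thread_count; ++i) {
        const std::size_t idx = offset + 2 * i;
        const std::uint32_t left = data[idx];
        const std::uint32_t right = data[idx + 1];
        data[idx] = std::min(left, right);
        data[idx + 1] = std::max(left, right);
    }
}

// Odd-even transposition sort: n passes over n elements, alternating between
// pairs starting at index 0 and pairs starting at index 1.
void slow_sort_cpu(std::vector<std::uint32_t>& data)
{
    const std::size_t n = data.size();
    for (std::size_t pass = 0; pass < n; ++pass) {
        const std::size_t offset = pass % 2;                // shifted start on odd passes
        const std::size_t thread_count = (n - offset) / 2;  // "grid size" for this pass
        slow_sort_pass(data, offset, thread_count);
    }
}
```

Calling `slow_sort_cpu(values)` on a `std::vector<std::uint32_t>` sorts it in place. The per-pass body is the same every time, which is why one kernel is enough: only the starting offset and the number of threads change between passes.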
### Development Log

* Found a simple O(n^2) parallel sort that compare-swaps adjacent values in each pass. Well suited to GPUs and a good sorting litmus test.
* Had some fun getting the implementation working, with the usual C++ sharp edges. For example, I had an m_data_buffer private member that hid the value in the parent class, causing segfaults, plus some other general structure / who-does-what kind of problems. Overall, it works! However, it's currently hilariously slow compared to my CPU radix sort implementation:
@@ -44,6 +63,8 @@ sort_radix() execution time: 96485 µs
```
slow_sort_gpu() execution time: 26948016 µs
```
Looks like the vast majority of the time is the compute: the encoding time goes from 9 µs to 112 µs, a little over 12x, for a 16x increase in data size.
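
The hidden `m_data_buffer` problem from the log above is easy to reproduce in isolation. The sketch below is an assumption about the general shape of the bug, not the actual classes in this project: `GpuFunction`, `ShadowingSort` and `FixedSort` are hypothetical names, and only `m_data_buffer` comes from the log.

```cpp
#include <vector>

// Hypothetical base class that owns the buffer pointer the rest of the
// code is expected to use.
class GpuFunction {
public:
    bool base_buffer_ready() const { return m_data_buffer != nullptr; }

protected:
    std::vector<unsigned int>* m_data_buffer = nullptr;
};

// Re-declaring m_data_buffer in the derived class hides the base member.
// The constructor writes the derived copy, so the base pointer stays null.
class ShadowingSort : public GpuFunction {
public:
    explicit ShadowingSort(std::vector<unsigned int>* buf) { m_data_buffer = buf; }
    bool own_buffer_ready() const { return m_data_buffer != nullptr; }

private:
    std::vector<unsigned int>* m_data_buffer = nullptr;  // hides GpuFunction::m_data_buffer
};

// Without the duplicate declaration, the same assignment reaches the
// inherited member.
class FixedSort : public GpuFunction {
public:
    explicit FixedSort(std::vector<unsigned int>* buf) { m_data_buffer = buf; }
};
```

With a `ShadowingSort`, any base-class code that dereferences its own `m_data_buffer` sees a null pointer, which matches the segfault symptom. Removing the duplicate declaration, as in `FixedSort`, is enough to fix it.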
## 2023-11-11
* Got the code into a more usable structure going forward: I can now easily test many separate GPU functions without starting from scratch each time. Also having fun hitting the many rough edges of C++, e.g. [you can't call virtual methods from a constructor](https://stackoverflow.com/questions/14549489/how-to-fix-pure-virtual-function-called-runtime-error). If you do, it just fails at runtime with no compiler warnings, despite this being a Known Limitation.
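
A small sketch of the constructor limitation mentioned above, together with one common workaround (two-phase initialisation through a factory). The names `GpuTask`, `SlowSortTask`, `kernel_name`, `init` and `make_task` are all hypothetical. The base method is deliberately not pure virtual so the example stays well defined: with a pure virtual method, the same call from the constructor is undefined behaviour and typically aborts at runtime.

```cpp
#include <memory>
#include <string>

// Hypothetical base class for a GPU function wrapper.
class GpuTask {
public:
    // While the base constructor runs, the object is still a GpuTask, so this
    // call resolves to GpuTask::kernel_name() and ignores any override.
    GpuTask() { m_name_seen_in_ctor = kernel_name(); }
    virtual ~GpuTask() = default;

    virtual std::string kernel_name() const { return "base"; }

    // Second phase: called after construction, so the override is used.
    void init() { m_name_seen_in_init = kernel_name(); }

    const std::string& name_seen_in_ctor() const { return m_name_seen_in_ctor; }
    const std::string& name_seen_in_init() const { return m_name_seen_in_init; }

private:
    std::string m_name_seen_in_ctor;
    std::string m_name_seen_in_init;
};

class SlowSortTask : public GpuTask {
public:
    std::string kernel_name() const override { return "slow_sort"; }
};

// Factory that constructs the object fully, then runs the virtual-dependent setup.
template <typename T>
std::unique_ptr<T> make_task()
{
    auto task = std::make_unique<T>();
    task->init();
    return task;
}
```

After `auto task = make_task<SlowSortTask>();`, the name recorded in the constructor is `"base"` while the name recorded in `init()` is `"slow_sort"`. Routing construction through a factory like this keeps the virtual call out of the constructor.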