2023-11-12 10:33:12 PM

alxhill · Nov 13, 2023 · 9f53f56
1 parent ecfea12
commit 9f53f56
Showing 1 changed file with 21 additions and 0 deletions.
diff --git a/topics/metal-splats.md b/topics/metal-splats.md
@@ -6,9 +6,28 @@
 - [Metal Sample Code](https://developer.apple.com/metal/sample-code/)
 - [Performing Calculations on a GPU - sample code](https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu)
 - [Simple GPU Sorting Tutorial in O(n^2)](https://www.alanzucconi.com/2017/12/13/gpu-sorting-1/)
+- [NVIDIA GPU Gems 2, Chapter 46: Improved GPU Sorting](https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting)
 
 ## 2023-11-12
 
+### Algorithms / Reference
+
+The [NVIDIA book chapter](https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting) mentions two algorithms for sorting. The first is the O(n^2) technique covered in the [other blog post](https://www.alanzucconi.com/2017/12/13/gpu-sorting-1), which can be implemented trivially with a single shader. You could also use two shaders that alternate, but I eventually realised you can just shift the offsets and grid size and re-use the same kernel the whole time.
+
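+```metal
+// One thread per adjacent pair: orders data[2*index] and data[2*index + 1].
+kernel void slow_sort(device unsigned int* data, uint index [[thread_position_in_grid]])
+{
+    uint idx = index*2;
+    uint left = data[idx];
+    uint right = data[idx+1];
+
+    // Write the smaller value first so the pair ends up in ascending order.
+    data[idx] = min(left, right);
+    data[idx+1] = max(left, right);
+}
+```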
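
To make the "shift the offsets and re-use the same kernel" idea concrete, here is a minimal CPU-side sketch in C++ of the same odd-even scheme. The names `compare_swap` and `slow_sort_cpu` and the explicit `offset` parameter are assumptions for illustration only; this is not the actual host code, and the Metal kernel above takes no offset argument.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU stand-in for one GPU thread of slow_sort: puts the pair
// starting at (offset + 2 * index) into ascending order.
void compare_swap(std::vector<std::uint32_t>& data, std::size_t offset, std::size_t index)
{
    const std::size_t idx = offset + index * 2;
    const std::uint32_t left = data[idx];
    const std::uint32_t right = data[idx + 1];
    data[idx] = std::min(left, right);
    data[idx + 1] = std::max(left, right);
}

// Odd-even transposition sort: n passes, alternating between pairs that
// start at offset 0 and pairs that start at offset 1.
void slow_sort_cpu(std::vector<std::uint32_t>& data)
{
    const std::size_t n = data.size();
    for (std::size_t pass = 0; pass < n; ++pass) {
        const std::size_t offset = pass % 2;
        if (n < offset + 2) {
            continue; // no complete pair exists at this offset
        }
        // The "grid size" shrinks when the shifted pass leaves fewer pairs.
        const std::size_t pairs = (n - offset) / 2;
        for (std::size_t i = 0; i < pairs; ++i) {
            compare_swap(data, offset, i); // on the GPU these run in parallel
        }
    }
}
```

Calling `slow_sort_cpu(v)` on a `std::vector<std::uint32_t>` sorts it in place. On the GPU the inner loop would correspond to a single dispatch of `pairs` threads, with the outer loop issuing one dispatch per pass, which is where the O(n^2) total work comes from.
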
+
+### Development Log
+
 * Found a simple O(n^2) parallel sort that iteratively swaps values in each pass. Well suited to GPUs and a good sorting litmus test.
 * Had some fun getting the implementation working, with the usual C++ sharp edges: e.g. I had an `m_data_buffer` private member that hid the one in the parent class, causing segfaults, plus some other general structure / who-does-what kind of problems. Overall, it works! However, it's currently hilariously slow compared to my CPU radix sort implementation:
 
@@ -44,6 +63,8 @@ sort_radix() execution time: 96485 µs
 slow_sort_gpu() execution time: 26948016 µs
 ```
 
+Looks like the vast majority of the time is the compute: the encoding time goes from 9 microseconds to 112 microseconds, which is a little over 10x for a 16x increase in data size.
+
 ## 2023-11-11
 
 * Got the code into a more usable structure moving forwards - can now easily test many separate GPU functions without starting from scratch each time. Also having fun hitting the many rough edges of C++ - e.g. [you can't call virtual methods from a constructor](https://stackoverflow.com/questions/14549489/how-to-fix-pure-virtual-function-called-runtime-error), but if you do, it just fails at runtime with no compiler warnings, despite this being a Known Limitation.
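
The constructor limitation mentioned above can be shown with a small C++ sketch. The class names `GpuFunctionBase` and `SlowSort`, and the `kernel_name()` / `init()` methods, are hypothetical and not taken from the actual project. The base class here uses a non-pure virtual so the behaviour is well defined: inside the base constructor the call resolves to the base version, because the derived part of the object does not exist yet. With a pure virtual, the same pattern is what can lead to the "pure virtual function called" runtime error.

```cpp
#include <string>

// Hypothetical base class for a GPU function wrapper.
struct GpuFunctionBase {
    std::string name_seen_in_ctor;

    GpuFunctionBase()
    {
        // During base construction the dynamic type is still GpuFunctionBase,
        // so this calls GpuFunctionBase::kernel_name(), never the override.
        name_seen_in_ctor = kernel_name();
    }

    virtual ~GpuFunctionBase() = default;

    virtual std::string kernel_name() const { return "base"; }

    // Two-phase initialisation: called after construction has finished,
    // so virtual dispatch reaches the derived override as expected.
    std::string init() { return kernel_name(); }
};

// Hypothetical derived class that names the kernel it wants to run.
struct SlowSort : GpuFunctionBase {
    std::string kernel_name() const override { return "slow_sort"; }
};
```

Constructing a `SlowSort` leaves `name_seen_in_ctor` set to `"base"`, while calling `init()` afterwards returns `"slow_sort"`. A separate `init()` step (or a factory function that calls it) is one common way around the limitation; whether it suits this codebase is a judgment call.
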