From 9f53f56ee61bf06cbf25780109e65a187384d360 Mon Sep 17 00:00:00 2001
From: Alexander Hill
Date: Sun, 12 Nov 2023 22:33:12 -0500
Subject: [PATCH] 2023-11-12 10:33:12 PM

---
 topics/metal-splats.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/topics/metal-splats.md b/topics/metal-splats.md
index da68a2c..adb73b0 100644
--- a/topics/metal-splats.md
+++ b/topics/metal-splats.md
@@ -6,9 +6,28 @@
 - [Metal Sample Code](https://developer.apple.com/metal/sample-code/)
 - [Performing Calculations on a GPU - sample code](https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu)
 - [Simple GPU Sorting Tutorial in O(n^2)](https://www.alanzucconi.com/2017/12/13/gpu-sorting-1/)
+- [NVIDIA Improved GPU Sorting chapter from a book](https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting)

 ## 2023-11-12

+### Algorithms / Reference
+
+The [NVIDIA book chapter](https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting) mentions two algorithms for sorting. The first is the O(n^2) technique covered in the [other blog post](https://www.alanzucconi.com/2017/12/13/gpu-sorting-1), which can be implemented trivially with a single shader (or with two that alternate, but I eventually realised you can just shift the offsets and grid size and re-use the same kernel the whole time.)
+
+```metal
+kernel void slow_sort(device unsigned int* data, uint index [[thread_position_in_grid]])
+{
+    uint idx = index*2;
+    uint left = data[idx];
+    uint right = data[idx+1];
+
+    data[idx] = min(left, right);
+    data[idx+1] = max(left, right);
+}
+```
+
+### Development Log
+
 * Found a simple O(n^2) parallel sort that iteratively swaps values in each pass. Well suited to GPUs and a good sorting litmus test.
 * Had some fun getting the implementation working - usual C++ sharp edges, e.g. had an m_data_buffer private member that hid the value in the parent class, causing segfaults, and some other general structure / who-does-what kind of problems. Overall, it works! However, it's currently hilariously slow compared to my CPU radix sort implementation:

@@ -44,6 +63,8 @@ sort_radix() execution time: 96485 µs
 slow_sort_gpu() execution time: 26948016 µs
 ```

+Looks like the vast majority of the time is the compute: the encoding difference, 9 microseconds vs 112 microseconds, is a little over 10x for a data scale increase of 16.
+
 ## 2023-11-11

 * Got the code into a more usable structure moving forwards - can now easily test many separate GPU functions without starting from scratch each time. Also having fun hitting the many rough edges of C++ - e.g. [you can't call virtual methods from a constructor](https://stackoverflow.com/questions/14549489/how-to-fix-pure-virtual-function-called-runtime-error), but it just fails at runtime with no compiler warnings if you do, despite this being a Known Limitation.
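
A minimal sketch of the constructor pitfall from the 2023-11-11 entry, with one common workaround: construct the object first, then call a virtual `init()` from a factory function. The names `GpuFunction`, `SlowSort`, `make` and `demo` are hypothetical and not taken from this repository. While the base-class constructor runs, the object is still treated as the base type, so a virtual call there does not reach the derived override. When the function is pure virtual, that is the situation that typically ends in the "pure virtual function called" runtime error.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical base class, for illustration only.
class GpuFunction {
public:
    GpuFunction() {
        // During base construction the dynamic type is still GpuFunction,
        // so this resolves to GpuFunction::name(), not a derived override.
        name_seen_in_ctor = name();
    }
    virtual ~GpuFunction() = default;

    virtual std::string name() const { return "GpuFunction"; }

    // Second-phase setup: virtual calls dispatch as expected here because
    // it only runs after the most-derived constructor has finished.
    void init() { name_seen_in_init = name(); }

    std::string name_seen_in_ctor;
    std::string name_seen_in_init;
};

// Hypothetical derived class that overrides the virtual method.
class SlowSort : public GpuFunction {
public:
    std::string name() const override { return "SlowSort"; }
};

// Hypothetical factory: fully construct, then run the virtual-dependent setup.
template <typename T>
std::unique_ptr<T> make() {
    auto obj = std::make_unique<T>();
    obj->init();
    return obj;
}

// Checks both observations: the constructor saw the base version,
// while init() saw the derived override.
inline void demo() {
    auto sorter = make<SlowSort>();
    assert(sorter->name_seen_in_ctor == "GpuFunction");
    assert(sorter->name_seen_in_init == "SlowSort");
}
```

Calling `demo()` from a `main` exercises both cases. The base class here uses a non-pure virtual so that the sketch runs without aborting; the two-phase pattern is the same when `name()` is pure virtual, as long as the constructor itself never calls it.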