From 9f53f56ee61bf06cbf25780109e65a187384d360 Mon Sep 17 00:00:00 2001
From: Alexander Hill <me@alxhill.com>
Date: Sun, 12 Nov 2023 22:33:12 -0500
Subject: [PATCH] 2023-11-12 10:33:12 PM

---
 topics/metal-splats.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/topics/metal-splats.md b/topics/metal-splats.md
index da68a2c..adb73b0 100644
--- a/topics/metal-splats.md
+++ b/topics/metal-splats.md
@@ -6,9 +6,28 @@
 - [Metal Sample Code](https://developer.apple.com/metal/sample-code/)
 - [Performing Calculations on a GPU - sample code](https://developer.apple.com/documentation/metal/performing_calculations_on_a_gpu)
 - [Simple GPU Sorting Tutorial in O(n^2)](https://www.alanzucconi.com/2017/12/13/gpu-sorting-1/)
+- [GPU Gems 2, Chapter 46: Improved GPU Sorting (NVIDIA)](https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting)
 
 ## 2023-11-12
 
+### Algorithms / Reference
+
+The [NVIDIA book chapter](https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting) mentions two algorithms for sorting. The first is the O(n^2) technique covered in the [other blog post](https://www.alanzucconi.com/2017/12/13/gpu-sorting-1), which can be implemented trivially with a single shader. (Two alternating shaders also work, but I eventually realised you can just shift the offsets and grid size and re-use the same kernel the whole time.)
+
+```metal
+kernel void slow_sort(device unsigned int* data, uint index [[thread_position_in_grid]])
+{
+    uint idx = index*2;              // each thread owns one adjacent pair
+    uint left = data[idx];
+    uint right = data[idx+1];
+
+    data[idx] = min(left, right);    // smaller value goes left
+    data[idx+1] = max(left, right);  // larger value goes right
+}
+```
+
+### Development Log
+
 * Found a simple O(n^2) parallel sort that iteratively swaps values in each pass. Well suited to GPUs and a good sorting litmus test.
 * Had some fun getting the implementation working - usual C++ sharp edges, e.g had an m_data_buffer private member that hid the value in the parent class causing segfaults, and some other general structure / who-does-what kind of problems. Overall, it works! However, it's currently hilariously slow compared to my CPU radix sort implementation:
 
@@ -44,6 +63,8 @@ sort_radix() execution time: 96485 µs
 slow_sort_gpu() execution time: 26948016 µs
 ```
 
+Looks like the vast majority of the time is spent in compute rather than encoding: the encoding time going from 9 µs to 112 µs is a little over 10x for a 16x increase in data size.
+
 ## 2023-11-11
 
 * Got the code into a more usable structure moving forwards - can now easily test many separate GPU functions without starting from scratch each time. Also having fun hitting the many rough edges of C++ - e.g (you can't call virtual methods from a constructor)[https://stackoverflow.com/questions/14549489/how-to-fix-pure-virtual-function-called-runtime-error], but it just fails at runtime with no compiler warnings if you do despite this being a Known Limitation.