Precompute AST arity #17234
base: branch-25.02
Conversation
I'm realizing that by pre-computing the arity, we can move a huge amount of code from being host-device to host-only. I think this entire section can be made host-only. (Partly done in b69dfc2 but there's more to do.) https://github.com/rapidsai/cudf/blob/branch-24.12/cpp/include/cudf/ast/detail/operators.hpp#L958-L1230
Does that remove a whole level of dispatching as well? If so, that could be a big plus for runtime performance, binary size, and compile time.
The arity precomputation on host removes an operator dispatcher from the device side, but it was just a lookup (a single dispatch), not an operator dispatch nested before a type dispatch (a double dispatch). The other changes I noted above clean up our dispatching logic so we don't have to support the operator dispatch on device at all, and can make it host-only code instead of host-device. I am hopeful this may reduce compile time, binary size, and kernel complexity, but I haven't gotten far enough to measure that yet.
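To make the precomputation concrete, here is a minimal, self-contained sketch of the idea (the names and the tiny operator set are illustrative, not cudf's actual code): arity is resolved once on the host while linearizing the expression, and stored alongside each operator so device code only reads a value.

```cpp
// Illustrative sketch, not the actual cudf implementation: arity is looked up
// once on the host during linearization, so device code never performs an
// operator -> arity dispatch of its own.
#include <cstddef>
#include <vector>

enum class ast_operator { ADD, SUB, NOT };  // illustrative subset of operators

// Host-only lookup; this function never needs to be compiled for device.
constexpr std::size_t ast_operator_arity(ast_operator op)
{
  switch (op) {
    case ast_operator::ADD:
    case ast_operator::SUB: return 2;
    case ast_operator::NOT: return 1;
  }
  return 0;  // unreachable for valid operators
}

// One entry of the linearized "program" handed to the kernel.
struct device_operation {
  ast_operator op;
  std::size_t arity;  // precomputed on host, read directly on device
};

std::vector<device_operation> linearize(std::vector<ast_operator> const& ops)
{
  std::vector<device_operation> plan;
  plan.reserve(ops.size());
  for (auto op : ops) { plan.push_back({op, ast_operator_arity(op)}); }
  return plan;
}
```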
I think 5598f68 should build successfully. We'll need to verify whether build times have changed at all with this PR. I shifted a fair bit of code into a narrower scope, in an anonymous namespace of a C++ file, so hopefully it's not worse. However, I also did some refactoring of the AST machinery to make it possible for dispatched functors to return non-void values.
```diff
 {
   switch (op) {
     case ast_operator::ADD:
-      f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
-      break;
+      return f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
```
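The diff above replaces the call-then-break pattern with a return. Below is a self-contained sketch of what that refactor enables (illustrative names, not cudf's actual dispatcher): when the dispatcher's return type is deduced from the functor, callers can get a value back instead of relying on output parameters.

```cpp
// Sketch only: a dispatcher whose return type follows the functor, so each case
// can `return` the functor's result instead of calling it and then breaking.
#include <utility>

enum class ast_operator { ADD, SUB };

template <typename F, typename... Ts>
decltype(auto) dispatch_operator(ast_operator op, F&& f, Ts&&... args)
{
  switch (op) {
    case ast_operator::ADD:
      return f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
    case ast_operator::SUB:
      return f.template operator()<ast_operator::SUB>(std::forward<Ts>(args)...);
  }
  // All enumerators are handled above; this keeps compilers happy about fall-through.
  return f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
}

// Example functor that returns a value (here: the operator's arity in this toy example).
struct arity_functor {
  template <ast_operator Op>
  int operator()() const
  {
    return 2;  // both ADD and SUB are binary operators here
  }
};

int main() { return dispatch_operator(ast_operator::ADD, arity_functor{}) == 2 ? 0 : 1; }
```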
is one thread per row a good utilization of the waves?
could we also benefit from using indices into function pointers instead of the switch statements for dynamic operator dispatch?
> is one thread per row a good utilization of the waves?
Would you propose that each thread takes on multiple rows? My hope was that loads of adjacent rows would be coalesced across threads, but I don't know if that is occurring. We would have to read the kernel PTX/SASS to know. It would be worth trying just to see what we get. This is something I would consider in a follow-up PR.
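For reference, the usual way to let each thread handle multiple rows while keeping loads coalesced is a grid-stride loop; the kernel below is a hypothetical stand-in, not cudf's AST kernel.

```cuda
// Sketch of the follow-up idea: each thread processes multiple rows, while
// adjacent threads still read adjacent rows, so global loads remain coalesced.
__global__ void evaluate_rows(int const* input, int* output, int num_rows)
{
  for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < num_rows;
       row += blockDim.x * gridDim.x) {
    output[row] = input[row] + 1;  // stand-in for the real per-row AST evaluation
  }
}
```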
> could we also benefit from using indices into function pointers instead of the switch statements for dynamic operator dispatch?
That idea has been raised before. I quote from a DM with @jrhemstad (which I don't think he'll mind):
The basic idea is instead of doing a bunch of nested switches in device code to route us to the code path we want, we could just create an array of function pointers ahead of time in host code and pass those to a kernel where we execute those functions in order without needing any switches in device code.
I had always thought this would be impossible because you can't take the address of a device function from host code.
Two key realizations:
- You can create a lookup table of device functions automatically as a `__device__` array.
- Instead of populating an array of function pointers in host code, you just populate an array of indices into that lookup table.
So for AST, instead of all the nasty nested switches, as part of the linearization we could just populate that array of indices into the function pointer lookup table and then the kernel would just be a matter of marching through those indices and executing their associated function pointer.
Very minimal demo:

```cuda
__device__ void foo() {}
__device__ void bar() {}
__device__ void baz() {}

using func_ptr = void (*)();

__device__ func_ptr ptrs[] = {foo, bar, baz};

__global__ void kernel(int i)
{
  auto f = ptrs[i];
  f();
}
```
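Extending that quoted demo into the linearization idea (again hypothetical, not cudf code): the host builds a buffer of indices into the `__device__` lookup table, and the kernel simply marches through it, invoking each function pointer in order.

```cuda
// Hypothetical extension of the demo above: a host-built "plan" of indices into
// the __device__ function pointer table, executed in order by the kernel.
#include <cuda_runtime.h>

__device__ void foo() {}
__device__ void bar() {}
__device__ void baz() {}

using func_ptr = void (*)();
__device__ func_ptr ptrs[] = {foo, bar, baz};

__global__ void run_plan(int const* plan, int num_ops)
{
  for (int i = 0; i < num_ops; ++i) { ptrs[plan[i]](); }
}

int main()
{
  int host_plan[] = {2, 0, 1};  // "linearized" operator sequence, built on the host
  int* device_plan = nullptr;
  cudaMalloc(&device_plan, sizeof(host_plan));
  cudaMemcpy(device_plan, host_plan, sizeof(host_plan), cudaMemcpyHostToDevice);
  run_plan<<<1, 1>>>(device_plan, 3);
  cudaDeviceSynchronize();
  cudaFree(device_plan);
  return 0;
}
```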
We wondered whether `__noinline__` could be used on the dispatcher (or operators) to emulate this behavior. Let's file an issue to document this idea for a follow-up PR.
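A hypothetical sketch of that `__noinline__` idea, for the follow-up issue: mark the per-operator device functions `__noinline__` so the compiler emits real calls (similar in spirit to the function-pointer table) while keeping the enum-based switch for dispatch.

```cuda
// Hypothetical sketch, not cudf code: per-operator device functions marked
// __noinline__, dispatched through the familiar enum-based switch.
enum class ast_operator { ADD, SUB };

__device__ __noinline__ int op_add(int a, int b) { return a + b; }
__device__ __noinline__ int op_sub(int a, int b) { return a - b; }

__global__ void evaluate(ast_operator op, int a, int b, int* out)
{
  switch (op) {
    case ast_operator::ADD: *out = op_add(a, b); break;
    case ast_operator::SUB: *out = op_sub(a, b); break;
  }
}
```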
I did run a micro-benchmark of this approach (using a simple sum kernel), but it was slower than enum-based dispatch. I suspect the compiler does more aggressive branch analysis with enums, and function pointers could inhibit it from optimizing the code further.
How much did the performance change? Can you share your micro-benchmark and the results?
The biggest challenges we face with the AST are compile time and kernel size -- if function-pointer dispatch helps with either of those, it could still be worth evaluating.
via: https://github.com/lamarrr/cudf/tree/static-branch-substitution
The performance is mostly the same.
We need benchmarks to see the performance impact of this before merging. Marking as do-not-merge for now. @lamarrr, would you be able to help benchmark this PR? Happy to have your feedback on the design as well.
Yeah, sure! I'll get on that now.
Here are the results, measured on an NVIDIA RTX A6000 (detailed per-benchmark and summary tables omitted here):
- ast_int32_imbalanced_unique
- ast_int32_imbalanced_reuse
- ast_double_imbalanced_unique
- ast_int32_imbalanced_unique_nulls
- ast_int32_imbalanced_reuse_nulls
- ast_double_imbalanced_unique_nulls
- ast_string_equal_logical_and
@lamarrr Looks like the benchmarks show either a slight improvement for large trees or no impact (edit: most are single-digit swings, but maybe not all are wins on second look). Can we determine the impact on binary size and compile time as well?
There's about a 3 MB reduction in binary size of libcudf.so (note that this is with lineinfo and in RelWithDebInfo mode).
Build Metrics Report: rapidsai:branch-25.02 vs. bdice:ast-precompute-arity (metrics table omitted)
Mixed join kernels take more like ~15 minutes to compile. Build times of 2-3 seconds are due to an sccache hit; we want to know the times when sccache misses. Try altering something trivial in one of the headers to generate an sccache miss, or try disabling sccache as a compiler launcher. That will give us better numbers.
I've edited the types.h header to generate an sccache miss; the results are the same as unsetting the launcher.
All-Arch Build Metrics Report: rapidsai:branch-25.02 vs. bdice:ast-precompute-arity (metrics table omitted)
@bdice it seems the all-arch build time increases, but the executable size decreases.
Looks good and ready to me
fixed ast program's buffer alignment and sizing
/merge
Description
This PR precomputes AST arity on the host to reduce the complexity of device-side arity lookups.
Checklist