
Precompute AST arity #17234

Open: wants to merge 21 commits into base: branch-25.02
Conversation

bdice (Contributor) commented Nov 1, 2024

Description

This PR precomputes AST arity on the host, to reduce the complexity in device-side arity lookup.
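As a rough illustration of the idea (the names below are simplified stand-ins, not the actual libcudf types): arity is a pure function of the operator, so it can be computed once on the host at expression-parse time and stored alongside each linearized expression node, letting device code read an integer instead of dispatching on the operator.

```cpp
#include <cassert>

// Illustrative stand-in for an AST operator enum (not cudf's).
enum class ast_operator { ADD, SUB, NOT_OP };

// Host-side: compute arity once, at expression-parse time.
constexpr int arity_of(ast_operator op)
{
  switch (op) {
    case ast_operator::ADD:
    case ast_operator::SUB: return 2;    // binary operators
    case ast_operator::NOT_OP: return 1; // unary operator
  }
  return -1;  // unreachable for valid input
}

// Each linearized expression node carries its precomputed arity, so
// device code performs no operator dispatch just to recover it.
struct expression_node {
  ast_operator op;
  int arity;  // precomputed on host
};
```

This is only a sketch of the host-side shape; the actual PR threads the precomputed arity through libcudf's expression linearizer.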

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@bdice bdice requested a review from a team as a code owner November 1, 2024 15:02
@bdice bdice marked this pull request as draft November 1, 2024 15:02
@github-actions bot added the label: libcudf (Affects libcudf (C++/CUDA) code) Nov 1, 2024
@github-actions bot added the label: CMake (CMake build issue) Nov 4, 2024
bdice (Contributor, Author) commented Nov 4, 2024

I'm realizing that by pre-computing the arity, we can move a huge amount of code from being CUDF_HOST_DEVICE in operator.hpp to only needing host versions in operator.cpp. I began moving that code in b69dfc2. I have more work to do here. cc: @lamarrr for awareness

I think this entire section can be made host-only. (Partly done in b69dfc2 but there's more to do.) https://github.com/rapidsai/cudf/blob/branch-24.12/cpp/include/cudf/ast/detail/operators.hpp#L958-L1230

vyasr (Contributor) commented Nov 4, 2024

Does that remove a whole level of dispatching as well? If so, that could be a big plus for runtime performance, binary size, and compile time.

@GregoryKimball GregoryKimball assigned lamarrr and unassigned bdice Nov 4, 2024
bdice (Contributor, Author) commented Nov 5, 2024

> Does that remove a whole level of dispatching as well? If so, that could be a big plus for runtime performance, binary size, and compile time.

The arity precomputation on host removes an operator dispatcher from the device side, but that dispatch was just a lookup (a single dispatch), not an operator dispatch nested inside a type dispatch (i.e., not a double dispatch). The other changes I noted above clean up our dispatching logic so we don't have to support the operator dispatch on device at all and can make it host-only code instead of host-device.

I am hopeful this may reduce compile time, binary size, and kernel complexity. I haven’t gotten far enough to measure that yet.

bdice (Contributor, Author) commented Nov 13, 2024

I think 5598f68 should build successfully. We'll need to verify if build times have changed at all with this PR. I shifted a fair bit of code into a narrower scope, in an anonymous namespace of a C++ file, so hopefully it's not worse.

However, I also did some refactoring of the AST machinery to make it possible for dispatched functors to return non-void rather than passing output parameters by reference (except in cases where some operator/type double dispatches are invalid and thus returning void is required to avoid incomplete type deduction). The goal was to make the operator dispatcher act more like the type dispatcher. I am unsure if that change will affect mixed join kernels at all.

```diff
 {
   switch (op) {
     case ast_operator::ADD:
-      f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
-      break;
+      return f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
```
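A minimal host-only sketch of that value-returning dispatch pattern (simplified stand-in names; the real code dispatches device-side functors across many more operators):

```cpp
#include <utility>

enum class ast_operator { ADD, MUL };

// Functor templated on the operator; returns a value instead of
// writing through an output reference parameter.
struct evaluator {
  template <ast_operator op>
  int operator()(int lhs, int rhs) const
  {
    if constexpr (op == ast_operator::ADD) {
      return lhs + rhs;
    } else {
      return lhs * rhs;
    }
  }
};

// The operator dispatcher forwards the functor's return value,
// mirroring how a type dispatcher behaves.
template <typename F, typename... Ts>
auto operator_dispatcher(ast_operator op, F&& f, Ts&&... args)
{
  switch (op) {
    case ast_operator::ADD:
      return f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
    default:
      return f.template operator()<ast_operator::MUL>(std::forward<Ts>(args)...);
  }
}
```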
A reviewer (Contributor) asked:

Is one thread per row a good utilization of the waves?

A reviewer (Contributor) asked:

Could we also benefit from using indices into function pointers instead of the switch statements for dynamic operator dispatch?

bdice (Contributor, Author) commented Nov 20, 2024

> Is one thread per row a good utilization of the waves?

Would you propose that each thread takes on multiple rows? My hope was that loads of adjacent rows would be coalesced across threads, but I don't know if that is occurring. We would have to read the kernel PTX/SASS to know. It would be worth trying just to see what we get. This is something I would consider in a follow-up PR.

> Could we also benefit from using indices into function pointers instead of the switch statements for dynamic operator dispatch?

That idea has been raised before. I quote from a DM with @jrhemstad (which I don't think he'll mind):

The basic idea is that instead of doing a bunch of nested switches in device code to route us to the code path we want, we could just create an array of function pointers ahead of time in host code and pass those to a kernel, where we execute those functions in order without needing any switches in device code.
I had always thought this would be impossible because you can't take the address of a device function from host code.
Two key realizations:

  • You can create a lookup table of device functions automatically as a __device__ array.
  • Instead of populating an array of function pointers in host code, you just populate an array of indices into that lookup table.

So for AST, instead of all the nasty nested switches, as part of the linearization we could just populate that array of indices into the function pointer lookup table, and then the kernel would just march through those indices and execute the associated function pointers.

A very minimal demo:
```cuda
__device__ void foo() {}
__device__ void bar() {}
__device__ void baz() {}

using func_ptr = void (*)();

__device__ func_ptr ptrs[] = {foo, bar, baz};

__global__ void kernel(int i)
{
  auto f = ptrs[i];
  f();
}
```

https://godbolt.org/z/Kn75jPaP3

We wondered whether __noinline__ could be used on the dispatcher (or operators) to emulate this behavior. Let's file an issue to document this idea for a follow-up PR.
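For illustration, a host-only C++ analogue of the index-into-lookup-table idea (plain function pointers stand in for the __device__ ones, and the operation names are made up): the "linearization" records indices, and execution just marches through them.

```cpp
#include <vector>

using func_ptr = int (*)(int, int);

int add_op(int a, int b) { return a + b; }
int sub_op(int a, int b) { return a - b; }
int mul_op(int a, int b) { return a * b; }

// Lookup table; in the CUDA version this would be a __device__ array,
// and host code would only ever ship indices into it.
func_ptr const op_table[] = {add_op, sub_op, mul_op};

// Host-side linearization produces indices; the "kernel" walks them
// and calls each associated function pointer in order.
int run_plan(std::vector<int> const& op_indices, int init, int operand)
{
  int acc = init;
  for (int idx : op_indices) { acc = op_table[idx](acc, operand); }
  return acc;
}
```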

A reviewer (Contributor) replied:

I did run a micro-benchmark of this approach (using a simple sum kernel), but it was slower than enums. I suspect the compiler does more aggressive branch analysis with enums, and function pointers may inhibit it from optimizing the code further.

bdice (Contributor, Author) replied:

How much did the performance change? Can you share your micro-benchmark and the results?

One of the biggest challenges we face with the AST is compile time and kernel size -- if the function-pointer dispatch helps with either of those, it could still be worth evaluating.


@bdice bdice changed the base branch from branch-24.12 to branch-25.02 November 20, 2024 15:48
@bdice bdice marked this pull request as ready for review November 21, 2024 23:20
@bdice bdice requested a review from a team as a code owner November 21, 2024 23:20
@bdice bdice added labels: improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change) Nov 21, 2024
bdice (Contributor, Author) commented Nov 21, 2024

We need benchmarks to see the performance impact of this before merging. Marking as do-not-merge for now. @lamarrr, would you be able to help benchmark this PR? Happy to have your feedback on the design as well.

@bdice bdice added the label: 5 - DO NOT MERGE (Hold off on merging; see PR for details) Nov 21, 2024
lamarrr (Contributor) commented Nov 22, 2024

Yeah, sure! I'll get on that now

lamarrr (Contributor) commented Nov 22, 2024

Here are the results:

ast_int32_imbalanced_unique

[0] NVIDIA RTX A6000

tree_levels num_rows Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
1 100000 15.177 us 97.46% 15.015 us 55.84% -0.162 us -1.07% PASS
5 100000 23.997 us 110.30% 22.560 us 42.51% -1.437 us -5.99% PASS
10 100000 31.107 us 13.79% 31.796 us 20.26% 0.690 us 2.22% PASS
1 1000000 30.429 us 21.79% 30.770 us 19.21% 0.341 us 1.12% PASS
5 1000000 72.765 us 10.39% 74.723 us 11.40% 1.958 us 2.69% PASS
10 1000000 127.444 us 7.44% 130.387 us 7.39% 2.943 us 2.31% PASS
1 10000000 184.867 us 3.47% 186.027 us 5.29% 1.160 us 0.63% PASS
5 10000000 597.433 us 2.05% 611.563 us 1.78% 14.130 us 2.37% FAIL
10 10000000 1.132 ms 0.66% 1.160 ms 1.10% 27.977 us 2.47% FAIL
1 100000000 1.740 ms 0.64% 1.750 ms 0.64% 10.123 us 0.58% PASS
5 100000000 5.929 ms 1.19% 6.085 ms 1.08% 155.313 us 2.62% FAIL
10 100000000 11.374 ms 1.07% 11.663 ms 1.14% 288.505 us 2.54% FAIL

ast_int32_imbalanced_reuse

[0] NVIDIA RTX A6000

tree_levels num_rows Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
1 100000 13.929 us 35.72% 14.389 us 36.29% 0.460 us 3.30% PASS
5 100000 18.648 us 33.13% 19.073 us 38.41% 0.425 us 2.28% PASS
10 100000 25.369 us 29.94% 24.949 us 26.60% -0.420 us -1.66% PASS
1 1000000 26.730 us 26.46% 26.949 us 12.31% 0.219 us 0.82% PASS
5 1000000 60.008 us 8.36% 58.455 us 11.88% -1.553 us -2.59% PASS
10 1000000 102.482 us 4.72% 98.432 us 9.50% -4.050 us -3.95% PASS
1 10000000 155.714 us 9.43% 158.175 us 4.17% 2.460 us 1.58% PASS
5 10000000 490.404 us 1.95% 473.640 us 2.00% -16.764 us -3.42% FAIL
10 10000000 909.730 us 0.95% 874.943 us 1.13% -34.787 us -3.82% FAIL
1 100000000 1.476 ms 0.60% 1.487 ms 0.50% 11.009 us 0.75% FAIL
5 100000000 4.867 ms 1.21% 4.681 ms 1.31% -185.519 us -3.81% FAIL
10 100000000 9.148 ms 1.30% 8.731 ms 1.21% -417.752 us -4.57% FAIL

ast_double_imbalanced_unique

[0] NVIDIA RTX A6000

tree_levels num_rows Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
1 100000 16.162 us 35.00% 16.527 us 30.68% 0.365 us 2.26% PASS
5 100000 23.942 us 25.39% 24.112 us 22.96% 0.169 us 0.71% PASS
10 100000 34.436 us 22.42% 34.384 us 21.06% -0.053 us -0.15% PASS
1 1000000 48.227 us 13.15% 48.245 us 14.88% 0.018 us 0.04% PASS
5 1000000 94.543 us 11.26% 95.067 us 12.56% 0.524 us 0.55% PASS
10 1000000 156.942 us 6.67% 156.277 us 6.29% -0.666 us -0.42% PASS
1 10000000 345.818 us 3.53% 345.830 us 3.71% 0.011 us 0.00% PASS
5 10000000 789.261 us 1.74% 787.628 us 1.72% -1.634 us -0.21% PASS
10 10000000 1.387 ms 0.81% 1.390 ms 0.83% 2.962 us 0.21% PASS
1 100000000 3.329 ms 1.39% 3.328 ms 1.43% -0.117 us -0.00% PASS
5 100000000 7.755 ms 1.12% 7.734 ms 1.02% -21.457 us -0.28% PASS
10 100000000 13.758 ms 1.09% 13.678 ms 0.95% -80.508 us -0.59% PASS

ast_int32_imbalanced_unique_nulls

[0] NVIDIA RTX A6000

tree_levels num_rows Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
1 100000 43.104 us 21.96% 44.137 us 20.12% 1.034 us 2.40% PASS
5 100000 51.437 us 18.71% 52.801 us 26.40% 1.364 us 2.65% PASS
10 100000 65.826 us 19.01% 67.030 us 23.61% 1.204 us 1.83% PASS
1 1000000 118.907 us 13.38% 119.349 us 15.00% 0.442 us 0.37% PASS
5 1000000 186.932 us 11.27% 190.496 us 11.41% 3.564 us 1.91% PASS
10 1000000 269.068 us 10.05% 273.007 us 7.51% 3.939 us 1.46% PASS
1 10000000 864.898 us 2.86% 865.000 us 3.06% 0.103 us 0.01% PASS
5 10000000 1.491 ms 2.21% 1.510 ms 2.02% 19.276 us 1.29% PASS
10 10000000 2.297 ms 2.28% 2.322 ms 2.10% 25.242 us 1.10% PASS
1 100000000 8.365 ms 1.11% 8.338 ms 0.50% -27.090 us -0.32% PASS
5 100000000 14.547 ms 0.32% 14.666 ms 0.32% 118.646 us 0.82% FAIL
10 100000000 22.620 ms 0.89% 22.912 ms 0.79% 292.253 us 1.29% FAIL

ast_int32_imbalanced_reuse_nulls

[0] NVIDIA RTX A6000

tree_levels num_rows Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
1 100000 41.942 us 27.71% 42.125 us 20.35% 0.184 us 0.44% PASS
5 100000 46.372 us 23.03% 47.295 us 20.58% 0.923 us 1.99% PASS
10 100000 55.577 us 20.71% 54.925 us 21.53% -0.652 us -1.17% PASS
1 1000000 111.559 us 17.30% 111.080 us 12.90% -0.480 us -0.43% PASS
5 1000000 149.289 us 12.79% 144.739 us 10.71% -4.550 us -3.05% PASS
10 1000000 202.369 us 9.99% 194.493 us 11.71% -7.875 us -3.89% PASS
1 10000000 811.055 us 3.72% 804.482 us 3.24% -6.573 us -0.81% PASS
5 10000000 1.173 ms 2.15% 1.126 ms 2.36% -46.689 us -3.98% FAIL
10 10000000 1.661 ms 2.81% 1.571 ms 2.17% -90.794 us -5.46% FAIL
1 100000000 7.766 ms 1.21% 7.715 ms 1.27% -51.383 us -0.66% PASS
5 100000000 11.347 ms 0.93% 10.941 ms 1.33% -406.722 us -3.58% FAIL
10 100000000 16.195 ms 0.99% 15.311 ms 1.10% -884.157 us -5.46% FAIL

ast_double_imbalanced_unique_nulls

[0] NVIDIA RTX A6000

tree_levels num_rows Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
1 100000 45.440 us 23.87% 45.786 us 23.08% 0.346 us 0.76% PASS
5 100000 53.445 us 28.02% 53.190 us 21.44% -0.255 us -0.48% PASS
10 100000 68.426 us 18.24% 68.853 us 21.21% 0.427 us 0.62% PASS
1 1000000 123.605 us 12.41% 123.134 us 12.46% -0.470 us -0.38% PASS
5 1000000 190.372 us 10.05% 190.849 us 11.79% 0.477 us 0.25% PASS
10 1000000 284.655 us 9.17% 286.273 us 10.26% 1.617 us 0.57% PASS
1 10000000 905.699 us 4.08% 900.406 us 2.63% -5.292 us -0.58% PASS
5 10000000 1.555 ms 2.00% 1.560 ms 2.22% 4.467 us 0.29% PASS
10 10000000 2.415 ms 2.27% 2.418 ms 2.18% 3.176 us 0.13% PASS
1 100000000 8.769 ms 1.81% 8.717 ms 1.09% -51.672 us -0.59% PASS
5 100000000 15.197 ms 0.23% 15.184 ms 0.33% -12.483 us -0.08% PASS
10 100000000 23.878 ms 0.85% 23.797 ms 0.13% -81.292 us -0.34% FAIL

ast_string_equal_logical_and

[0] NVIDIA RTX A6000

string_width num_rows tree_levels hit_rate Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
32 32768 1 50 20.761 us 27.28% 20.880 us 35.85% 0.119 us 0.57% PASS
64 32768 1 50 32.291 us 17.86% 32.272 us 17.59% -0.019 us -0.06% PASS
128 32768 1 50 67.945 us 10.18% 67.358 us 9.93% -0.587 us -0.86% PASS
256 32768 1 50 122.168 us 7.58% 119.795 us 9.46% -2.374 us -1.94% PASS
32 262144 1 50 46.635 us 14.64% 46.221 us 11.77% -0.414 us -0.89% PASS
64 262144 1 50 107.387 us 7.83% 107.180 us 8.82% -0.208 us -0.19% PASS
128 262144 1 50 2.192 ms 2.21% 2.200 ms 2.23% 7.995 us 0.36% PASS
256 262144 1 50 6.857 ms 1.77% 6.869 ms 1.85% 11.164 us 0.16% PASS
32 2097152 1 50 225.755 us 4.72% 223.288 us 5.67% -2.467 us -1.09% PASS
64 2097152 1 50 667.850 us 1.95% 664.990 us 1.68% -2.860 us -0.43% PASS
128 2097152 1 50 16.229 ms 2.00% 16.275 ms 2.08% 45.848 us 0.28% PASS
256 2097152 1 50 50.074 ms 5.92% 50.028 ms 4.75% -45.382 us -0.09% PASS
32 32768 2 50 28.089 us 30.82% 27.587 us 27.68% -0.502 us -1.79% PASS
64 32768 2 50 48.268 us 14.53% 47.903 us 19.13% -0.365 us -0.76% PASS
128 32768 2 50 261.278 us 4.96% 262.300 us 5.53% 1.022 us 0.39% PASS
256 32768 2 50 707.282 us 1.80% 708.874 us 1.82% 1.593 us 0.23% PASS
32 262144 2 50 82.936 us 7.50% 80.844 us 10.18% -2.092 us -2.52% PASS
64 262144 2 50 1.072 ms 20.03% 1.054 ms 22.45% -18.581 us -1.73% PASS
128 262144 2 50 4.784 ms 1.46% 4.788 ms 1.46% 4.607 us 0.10% PASS
256 262144 2 50 14.408 ms 0.80% 14.415 ms 0.94% 7.452 us 0.05% PASS
32 2097152 2 50 538.231 us 2.33% 522.847 us 3.55% -15.384 us -2.86% FAIL
64 2097152 2 50 8.702 ms 3.13% 8.804 ms 3.20% 102.208 us 1.17% PASS
128 2097152 2 50 38.978 ms 0.89% 39.048 ms 0.94% 69.933 us 0.18% PASS
256 2097152 2 50 122.582 ms 0.48% 122.466 ms 0.11% -115.344 us -0.09% PASS
32 32768 3 50 34.454 us 14.55% 34.354 us 26.88% -0.100 us -0.29% PASS
64 32768 3 50 64.146 us 6.80% 63.659 us 11.72% -0.487 us -0.76% PASS
128 32768 3 50 398.379 us 4.74% 399.936 us 4.16% 1.557 us 0.39% PASS
256 32768 3 50 1.126 ms 1.01% 1.128 ms 1.25% 1.686 us 0.15% PASS
32 262144 3 50 119.123 us 10.12% 116.759 us 7.62% -2.365 us -1.98% PASS
64 262144 3 50 1.705 ms 23.84% 1.574 ms 28.80% -130.880 us -7.68% PASS
128 262144 3 50 7.173 ms 1.56% 7.173 ms 1.20% -0.691 us -0.01% PASS
256 262144 3 50 21.807 ms 0.89% 21.827 ms 0.77% 20.710 us 0.09% PASS
32 2097152 3 50 834.016 us 1.29% 812.645 us 2.24% -21.370 us -2.56% FAIL
64 2097152 3 50 13.711 ms 3.09% 13.730 ms 3.17% 19.364 us 0.14% PASS
128 2097152 3 50 58.919 ms 0.74% 58.835 ms 0.49% -83.682 us -0.14% PASS
256 2097152 3 50 186.027 ms 0.33% 186.195 ms 0.32% 168.343 us 0.09% PASS
32 32768 4 50 41.428 us 17.85% 40.868 us 17.56% -0.560 us -1.35% PASS
64 32768 4 50 79.369 us 12.55% 78.958 us 12.24% -0.412 us -0.52% PASS
128 32768 4 50 551.605 us 2.92% 554.330 us 3.42% 2.724 us 0.49% PASS
256 32768 4 50 1.659 ms 1.34% 1.663 ms 1.37% 4.140 us 0.25% PASS
32 262144 4 50 158.178 us 7.64% 155.319 us 9.23% -2.859 us -1.81% PASS
64 262144 4 50 2.280 ms 24.75% 2.222 ms 24.35% -57.495 us -2.52% PASS
128 262144 4 50 9.568 ms 1.33% 9.578 ms 1.09% 10.492 us 0.11% PASS
256 262144 4 50 29.272 ms 0.72% 29.318 ms 0.78% 46.565 us 0.16% PASS
32 2097152 4 50 1.128 ms 2.00% 1.100 ms 2.14% -27.869 us -2.47% FAIL
64 2097152 4 50 18.691 ms 2.60% 19.086 ms 2.66% 395.011 us 2.11% PASS
128 2097152 4 50 78.861 ms 0.49% 79.000 ms 0.78% 139.065 us 0.18% PASS
256 2097152 4 50 250.303 ms 0.30% 250.352 ms 0.27% 49.465 us 0.02% PASS
32 32768 1 100 21.799 us 18.00% 21.944 us 36.36% 0.144 us 0.66% PASS
64 32768 1 100 36.086 us 18.22% 36.046 us 15.81% -0.040 us -0.11% PASS
128 32768 1 100 92.405 us 7.09% 92.043 us 7.59% -0.363 us -0.39% PASS
256 32768 1 100 166.217 us 3.31% 166.175 us 6.34% -0.042 us -0.03% PASS
32 262144 1 100 52.261 us 9.60% 52.150 us 12.48% -0.111 us -0.21% PASS
64 262144 1 100 164.659 us 6.58% 164.236 us 5.95% -0.423 us -0.26% PASS
128 262144 1 100 2.611 ms 1.49% 2.610 ms 1.37% -1.259 us -0.05% PASS
256 262144 1 100 8.480 ms 1.34% 8.491 ms 1.24% 11.622 us 0.14% PASS
32 2097152 1 100 274.135 us 4.72% 272.989 us 4.04% -1.145 us -0.42% PASS
64 2097152 1 100 957.997 us 1.35% 956.725 us 1.29% -1.272 us -0.13% PASS
128 2097152 1 100 20.679 ms 1.73% 20.766 ms 2.24% 86.797 us 0.42% PASS
256 2097152 1 100 67.829 ms 3.20% 68.371 ms 3.15% 541.952 us 0.80% PASS
32 32768 2 100 30.616 us 26.36% 30.430 us 22.11% -0.186 us -0.61% PASS
64 32768 2 100 64.740 us 17.85% 64.255 us 11.57% -0.485 us -0.75% PASS
128 32768 2 100 398.788 us 4.40% 397.924 us 5.04% -0.865 us -0.22% PASS
256 32768 2 100 1.114 ms 1.58% 1.117 ms 1.63% 2.897 us 0.26% PASS
32 262144 2 100 109.722 us 9.98% 108.079 us 5.96% -1.643 us -1.50% PASS
64 262144 2 100 1.552 ms 7.57% 1.580 ms 6.70% 28.360 us 1.83% PASS
128 262144 2 100 5.480 ms 1.06% 5.497 ms 1.55% 16.834 us 0.31% PASS
256 262144 2 100 16.948 ms 1.11% 16.976 ms 0.87% 27.304 us 0.16% PASS
32 2097152 2 100 660.413 us 2.12% 649.709 us 1.69% -10.704 us -1.62% PASS
64 2097152 2 100 11.867 ms 17.59% 11.397 ms 23.52% -470.161 us -3.96% PASS
128 2097152 2 100 45.869 ms 0.22% 45.837 ms 0.02% -31.223 us -0.07% FAIL
256 2097152 2 100 143.324 ms 0.43% 143.189 ms 0.05% -134.118 us -0.09% FAIL
32 32768 3 100 37.998 us 18.92% 37.817 us 18.54% -0.181 us -0.48% PASS
64 32768 3 100 93.221 us 12.52% 92.438 us 12.86% -0.783 us -0.84% PASS
128 32768 3 100 579.144 us 4.80% 580.545 us 5.01% 1.402 us 0.24% PASS
256 32768 3 100 1.802 ms 1.09% 1.803 ms 0.86% 1.223 us 0.07% PASS
32 262144 3 100 157.729 us 7.45% 154.741 us 8.92% -2.988 us -1.89% PASS
64 262144 3 100 2.311 ms 5.97% 2.345 ms 5.45% 33.181 us 1.44% PASS
128 262144 3 100 8.205 ms 1.13% 8.227 ms 1.14% 21.224 us 0.26% PASS
256 262144 3 100 25.703 ms 0.81% 25.774 ms 0.72% 70.655 us 0.27% PASS
32 2097152 3 100 988.391 us 2.11% 955.634 us 1.34% -32.756 us -3.31% FAIL
64 2097152 3 100 19.102 ms 14.04% 18.071 ms 22.45% -1030.878 us -5.40% PASS
128 2097152 3 100 68.718 ms 0.03% 68.721 ms 0.02% 3.143 us 0.00% PASS
256 2097152 3 100 217.484 ms 0.38% 217.459 ms 0.42% -25.203 us -0.01% PASS
32 32768 4 100 46.401 us 22.79% 45.750 us 14.41% -0.650 us -1.40% PASS
64 32768 4 100 116.467 us 7.70% 115.589 us 8.78% -0.877 us -0.75% PASS
128 32768 4 100 758.928 us 3.85% 763.480 us 3.78% 4.551 us 0.60% PASS
256 32768 4 100 2.500 ms 1.69% 2.505 ms 1.33% 4.055 us 0.16% PASS
32 262144 4 100 209.410 us 5.00% 203.088 us 4.73% -6.322 us -3.02% PASS
64 262144 4 100 3.059 ms 6.10% 3.103 ms 3.30% 43.562 us 1.42% PASS
128 262144 4 100 10.958 ms 1.06% 10.987 ms 1.04% 28.601 us 0.26% PASS
256 262144 4 100 34.085 ms 0.66% 34.189 ms 0.43% 103.608 us 0.30% PASS
32 2097152 4 100 1.345 ms 1.82% 1.272 ms 0.97% -73.382 us -5.46% FAIL
64 2097152 4 100 25.505 ms 14.87% 23.395 ms 24.86% -2109.875 us -8.27% PASS
128 2097152 4 100 91.736 ms 0.50% 91.646 ms 0.03% -89.549 us -0.10% FAIL
256 2097152 4 100 287.704 ms 0.23% 287.792 ms 0.35% 87.230 us 0.03% PASS

Summary

  • Total Matches: 168
    • Pass (diff <= min_noise): 144
    • Unknown (infinite noise): 0
    • Failure (diff > min_noise): 24

bdice (Contributor, Author) commented Nov 22, 2024

@lamarrr Looks like benchmarks are either a slight improvement for large trees, or no impact (edit: most are single-digit swings, but maybe not all are wins on second look). Can we determine the impact on binary size and compile time as well?

lamarrr (Contributor) commented Nov 22, 2024

> @lamarrr Looks like benchmarks are either a slight improvement for large trees, or no impact (edit: most are single-digit swings, but maybe not all are wins on second look). Can we determine the impact on binary size and compile time as well?

There's about a 2.3 MB reduction in the binary size of libcudf.so (note that this is with lineinfo and in RelWithDebInfo mode):

  • main: 1,520,188,704 bytes
  • ast-precompute-arity: 1,517,889,336 bytes

lamarrr (Contributor) commented Nov 26, 2024

Build Metrics Report:

rapidsai:branch-25.02

file build time binary size
src/ast/expression_parser.cpp.o 6.457 s 794.296 KB
src/join/mixed_join_size_kernel.cu.o 2:05 min 494.592 KB
src/join/mixed_join_semi.cu.o 28.462 s 1.528 MB
src/join/mixed_join_size_kernel_nulls.cu.o 4:44 min 539.512 KB
src/join/mixed_join_kernel.cu.o 113.738 s 484.872 KB
src/join/mixed_join.cu.o 20.055 s 1.498 MB
src/join/mixed_join_kernels_semi.cu.o 73.705 s 362.616 KB
src/join/mixed_join_kernel_nulls.cu.o 3:49 min 525.120 KB

bdice:ast-precompute-arity

file build time binary size
src/ast/operators.cpp.o 4:28 min 537.048 KB
src/ast/expression_parser.cpp.o 1.902 s 123.224 KB
src/join/mixed_join_size_kernel.cu.o 2:02 min 491.776 KB
src/join/mixed_join_semi.cu.o 23.890 s 1.529 MB
src/join/mixed_join_size_kernel_nulls.cu.o 4:28 min 537.048 KB
src/join/mixed_join_kernel.cu.o 108.065 s 481.608 KB
src/join/mixed_join.cu.o 15.857 s 1.500 MB
src/join/mixed_join_kernels_semi.cu.o 67.704 s 361.912 KB
src/join/mixed_join_kernel_nulls.cu.o 3:49 min 524.872 KB

bdice (Contributor, Author) commented Nov 26, 2024

Mixed join kernels take more like ~15 minutes to compile. Build times of a few seconds are due to sccache hits; we want to know the times when sccache misses.

Try altering something trivial in one of the headers to generate an sccache miss, or try disabling sccache as a compiler launcher. That will give us better numbers.

lamarrr (Contributor) commented Nov 27, 2024

I've edited the types.h header to generate an sccache miss; the results are the same as unsetting the launcher.

lamarrr (Contributor) commented Dec 5, 2024

All Arch Build Metrics Report:

rapidsai:branch-25.02

file build time binary size
src/ast/expression_parser.cpp.o 7.092 s 794.296 KB
src/join/mixed_join_size_kernel.cu.o 10:20 min 3.199 MB
src/join/mixed_join_semi.cu.o 104.804 s 2.356 MB
src/join/mixed_join_size_kernel_nulls.cu.o 21:27 min 3.653 MB
src/join/mixed_join_kernel.cu.o 8:47 min 3.213 MB
src/join/mixed_join.cu.o 57.511 s 1.831 MB
src/join/mixed_join_kernels_semi.cu.o 5:08 min 2.121 MB
src/join/mixed_join_kernel_nulls.cu.o 19:27 min 3.645 MB

bdice:ast-precompute-arity

file build time binary size
src/ast/operators.cpp.o 553 ms 255.992 KB
src/ast/expression_parser.cpp.o 596 ms 123.224 KB
src/join/mixed_join_size_kernel.cu.o 11:31 min 3.182 MB
src/join/mixed_join_semi.cu.o 109.986 s 2.357 MB
src/join/mixed_join_size_kernel_nulls.cu.o 25:40 min 3.636 MB
src/join/mixed_join_kernel.cu.o 9:29 min 3.202 MB
src/join/mixed_join.cu.o 50.106 s 1.833 MB
src/join/mixed_join_kernels_semi.cu.o 5:57 min 2.113 MB
src/join/mixed_join_kernel_nulls.cu.o 21:32 min 3.635 MB

lamarrr (Contributor) commented Dec 5, 2024

@bdice it seems the all-arch build time increases, but the binary size decreases.

@lamarrr lamarrr removed the label: 5 - DO NOT MERGE (Hold off on merging; see PR for details) Dec 9, 2024
@lamarrr lamarrr self-requested a review December 9, 2024 14:37
lamarrr (Contributor) left a review:

Looks good and ready to me.

lamarrr (Contributor) commented Jan 10, 2025

/merge

Labels
  • CMake: CMake build issue
  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code
  • non-breaking: Non-breaking change
Projects
Status: Burndown

3 participants