Precompute AST arity #17234
base: branch-25.02
Conversation
I'm realizing that by pre-computing the arity, we can move a huge amount of code from being host-device to host-only. I think this entire section can be made host-only. (Partly done in b69dfc2 but there's more to do.) https://github.com/rapidsai/cudf/blob/branch-24.12/cpp/include/cudf/ast/detail/operators.hpp#L958-L1230
Does that remove a whole level of dispatching as well? If so, that could be a big plus for runtime performance, binary size, and compile time.
The arity precomputation on host removes an operator dispatcher from the device side, but it was just a lookup (a single dispatch), not an operator dispatch nested before a type dispatch (a double dispatch). The other changes I noted above clean up our dispatching logic so we don't have to support the operator dispatch on device at all, and can make it host-only code instead of host-device. I am hopeful this may reduce compile time, binary size, and kernel complexity, but I haven't gotten far enough to measure that yet.
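To make the precomputation concrete, here is a minimal, self-contained sketch of the idea (the names and the tiny operator set are illustrative, not cudf's actual code): arity is resolved once on the host while linearizing the expression, and stored alongside each operator so device code only reads a value.

```cpp
// Illustrative sketch, not the actual cudf implementation: arity is looked up
// once on the host during linearization, so device code never performs an
// operator -> arity dispatch of its own.
#include <cstddef>
#include <vector>

enum class ast_operator { ADD, SUB, NOT };  // illustrative subset of operators

// Host-only lookup; this function never needs to be compiled for device.
constexpr std::size_t ast_operator_arity(ast_operator op)
{
  switch (op) {
    case ast_operator::ADD:
    case ast_operator::SUB: return 2;
    case ast_operator::NOT: return 1;
  }
  return 0;  // unreachable for valid operators
}

// One entry of the linearized "program" handed to the kernel.
struct device_operation {
  ast_operator op;
  std::size_t arity;  // precomputed on host, read directly on device
};

std::vector<device_operation> linearize(std::vector<ast_operator> const& ops)
{
  std::vector<device_operation> plan;
  plan.reserve(ops.size());
  for (auto op : ops) { plan.push_back({op, ast_operator_arity(op)}); }
  return plan;
}
```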
I think 5598f68 should build successfully. We'll need to verify whether build times have changed at all with this PR. I shifted a fair bit of code into a narrower scope, in an anonymous namespace of a C++ file, so hopefully it's not worse. However, I also did some refactoring of the AST machinery to make it possible for dispatched functors to return non-void values.
```diff
 {
   switch (op) {
     case ast_operator::ADD:
-      f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
-      break;
+      return f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
```
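The diff above replaces the call-then-break pattern with a return. Below is a self-contained sketch of what that refactor enables (illustrative names, not cudf's actual dispatcher): when the dispatcher's return type is deduced from the functor, callers can get a value back instead of relying on output parameters.

```cpp
// Sketch only: a dispatcher whose return type follows the functor, so each case
// can `return` the functor's result instead of calling it and then breaking.
#include <utility>

enum class ast_operator { ADD, SUB };

template <typename F, typename... Ts>
decltype(auto) dispatch_operator(ast_operator op, F&& f, Ts&&... args)
{
  switch (op) {
    case ast_operator::ADD:
      return f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
    case ast_operator::SUB:
      return f.template operator()<ast_operator::SUB>(std::forward<Ts>(args)...);
  }
  // All enumerators are handled above; this keeps compilers happy about fall-through.
  return f.template operator()<ast_operator::ADD>(std::forward<Ts>(args)...);
}

// Example functor that returns a value (here: the operator's arity in this toy example).
struct arity_functor {
  template <ast_operator Op>
  int operator()() const
  {
    return 2;  // both ADD and SUB are binary operators here
  }
};

int main() { return dispatch_operator(ast_operator::ADD, arity_functor{}) == 2 ? 0 : 1; }
```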
is one thread per row a good utilization of the waves?
could we also benefit from using indices into function pointers instead of the switch statements for dynamic operator dispatch?
> is one thread per row a good utilization of the waves?
Would you propose that each thread takes on multiple rows? My hope was that loads of adjacent rows would be coalesced across threads, but I don't know if that is occurring. We would have to read the kernel PTX/SASS to know. It would be worth trying just to see what we get. This is something I would consider in a follow-up PR.
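For reference, the usual way to let each thread handle multiple rows while keeping loads coalesced is a grid-stride loop; the kernel below is a hypothetical stand-in, not cudf's AST kernel.

```cuda
// Sketch of the follow-up idea: each thread processes multiple rows, while
// adjacent threads still read adjacent rows, so global loads remain coalesced.
__global__ void evaluate_rows(int const* input, int* output, int num_rows)
{
  for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < num_rows;
       row += blockDim.x * gridDim.x) {
    output[row] = input[row] + 1;  // stand-in for the real per-row AST evaluation
  }
}
```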
> could we also benefit from using indices into function pointers instead of the switch statements for dynamic operator dispatch?
That idea has been raised before. I quote from a DM with @jrhemstad (which I don't think he'll mind):
The basic idea is instead of doing a bunch of nested switches in device code to route us to the code path we want, we could just create an array of function pointers ahead of time in host code and pass those to a kernel where we execute those functions in order without needing any switches in device code.
I had always thought this would be impossible because you can't take the address of a device function from host code.
Two key realizations:
- You can create a lookup table of device functions automatically as a `__device__` array.
- Instead of populating an array of function pointers in host code, you just populate an array of indices into that lookup table.
So for AST, instead of all the nasty nested switches, as part of the linearization we could just populate that array of indices into the function pointer lookup table and then the kernel would just be a matter of marching through those indices and executing their associated function pointer.
Very minimal demo:

```cuda
__device__ void foo() {}
__device__ void bar() {}
__device__ void baz() {}

using func_ptr = void (*)();

__device__ func_ptr ptrs[] = {foo, bar, baz};

__global__ void kernel(int i)
{
  auto f = ptrs[i];
  f();
}
```
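Extending that quoted demo into the linearization idea (again hypothetical, not cudf code): the host builds a buffer of indices into the `__device__` lookup table, and the kernel simply marches through it, invoking each function pointer in order.

```cuda
// Hypothetical extension of the demo above: a host-built "plan" of indices into
// the __device__ function pointer table, executed in order by the kernel.
#include <cuda_runtime.h>

__device__ void foo() {}
__device__ void bar() {}
__device__ void baz() {}

using func_ptr = void (*)();
__device__ func_ptr ptrs[] = {foo, bar, baz};

__global__ void run_plan(int const* plan, int num_ops)
{
  for (int i = 0; i < num_ops; ++i) { ptrs[plan[i]](); }
}

int main()
{
  int host_plan[] = {2, 0, 1};  // "linearized" operator sequence, built on the host
  int* device_plan = nullptr;
  cudaMalloc(&device_plan, sizeof(host_plan));
  cudaMemcpy(device_plan, host_plan, sizeof(host_plan), cudaMemcpyHostToDevice);
  run_plan<<<1, 1>>>(device_plan, 3);
  cudaDeviceSynchronize();
  cudaFree(device_plan);
  return 0;
}
```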
We wondered whether `__noinline__` could be used on the dispatcher (or operators) to emulate this behavior. Let's file an issue to document this idea for a follow-up PR.
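A hypothetical sketch of that `__noinline__` idea, for the follow-up issue: mark the per-operator device functions `__noinline__` so the compiler emits real calls (similar in spirit to the function-pointer table) while keeping the enum-based switch for dispatch.

```cuda
// Hypothetical sketch, not cudf code: per-operator device functions marked
// __noinline__, dispatched through the familiar enum-based switch.
enum class ast_operator { ADD, SUB };

__device__ __noinline__ int op_add(int a, int b) { return a + b; }
__device__ __noinline__ int op_sub(int a, int b) { return a - b; }

__global__ void evaluate(ast_operator op, int a, int b, int* out)
{
  switch (op) {
    case ast_operator::ADD: *out = op_add(a, b); break;
    case ast_operator::SUB: *out = op_sub(a, b); break;
  }
}
```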
I did run a micro-benchmark of this approach (using a simple sum kernel), but it was slower than enum-based dispatch. I suspect the compiler does more aggressive branch analysis with enums, and function pointers could inhibit it from optimizing the code further.
How much did the performance change? Can you share your micro-benchmark and the results?
The biggest challenges we face with the AST are compile time and kernel size -- if function-pointer dispatch helps with either of those, it could still be worth evaluating.
via: https://github.com/lamarrr/cudf/tree/static-branch-substitution
The performance is mostly the same.
We need benchmarks to see the performance impact of this before merging. Marking as do-not-merge for now. @lamarrr, would you be able to help benchmark this PR? Happy to have your feedback on the design as well.
Yeah, sure! I'll get on that now.
Here are the results, measured on an NVIDIA RTX A6000 (detailed per-benchmark and summary tables omitted here):
- ast_int32_imbalanced_unique
- ast_int32_imbalanced_reuse
- ast_double_imbalanced_unique
- ast_int32_imbalanced_unique_nulls
- ast_int32_imbalanced_reuse_nulls
- ast_double_imbalanced_unique_nulls
- ast_string_equal_logical_and
@lamarrr Looks like the benchmarks show either a slight improvement for large trees or no impact (edit: most are single-digit swings, but maybe not all are wins on second look). Can we determine the impact on binary size and compile time as well?
There's about a 3 MB reduction in binary size of libcudf.so (note that this is with lineinfo and in RelWithDebInfo mode).
Build Metrics Report: rapidsai:branch-25.02 vs. bdice:ast-precompute-arity (metrics table omitted)
Mixed join kernels take more like ~15 minutes to compile. Build times of 2-3 seconds are due to an sccache hit; we want to know the times when sccache misses. Try altering something trivial in one of the headers to generate an sccache miss, or try disabling sccache as a compiler launcher. That will give us better numbers.
I've edited the types.h header to generate an sccache miss; the results are the same as unsetting the launcher.
All-Arch Build Metrics Report: rapidsai:branch-25.02 vs. bdice:ast-precompute-arity (metrics table omitted)
@bdice it seems the all-arch build time increases, but the executable size decreases.
Looks good and ready to me
fixed ast program's buffer alignment and sizing
/merge
Description
This PR precomputes AST arity on the host to reduce the complexity of device-side arity lookups.
Checklist