-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] JSON host tree algorithms #16545
[REVIEW] JSON host tree algorithms #16545
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has fixed all of the regressions I was seeing before. It is looking good.
From a performance perspective I am still seeing about 16.5 seconds to run my benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a few comments.
} | ||
|
||
// Pruning | ||
thrust::host_vector<uint8_t> is_pruned(num_columns, options.is_enabled_prune_columns()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it's still uint8_t, but assignments use true/false. We shold use the same type, whichever it is.
dev_ref parent_ref = std::ref(root); | ||
// creates children column | ||
std::function<void(NodeIndexT, dev_ref)> construct_tree; | ||
construct_tree = [&](NodeIndexT root, dev_ref ref) -> void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if some of the lambdas in this function should be functions. They're pretty large and overall build_tree
is definitely on the "too big" side. Not a blocking comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered making it independent functions too. But the number of captures by reference seemed very high.
is there any smarter way to do this (like that dispatch function suggestion)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a (functor?) class, so common parameters are stored as data members and we don't have to pass them in each call. We would need to pass them explicitly to create the object, but at least it's done only once.
/merge |
Description
Depends on #16836
This change adds a new host tree building algorithms for JSON reader and utf8 field name support.
This constructs the device_column_tree using an adjacency list created from parent information.
This adjacency list is pruned based on input schema, and also types are enforced as per schema.
mark_is_pruned
Tree is constructed from pruned adjacency list, (with mixed types handling).
construct_tree
utf8 field name support added: (spark requested)
utf8 decoding of field names during hashing of field nodes so that utf8 encoded field names also match to same column.
All unit tests passes, 1 unit test added where old algorithm fails.
This code is kept under experimental flag.
Checklist