Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] JSON host tree algorithms #16545

Merged
merged 56 commits into from
Sep 26, 2024

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Aug 13, 2024

Description

Depends on #16836
This change adds a new host tree building algorithms for JSON reader and utf8 field name support.

This constructs the device_column_tree using an adjacency list created from parent information.
This adjacency list is pruned based on input schema, and also types are enforced as per schema. mark_is_pruned
Tree is constructed from pruned adjacency list, (with mixed types handling). construct_tree

utf8 field name support added: (spark requested)
utf8 decoding of field names during hashing of field nodes so that utf8 encoded field names also match to same column.

All unit tests passes, 1 unit test added where old algorithm fails.
This code is kept under experimental flag.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Aug 13, 2024
@shrshi shrshi added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change improvement Improvement / enhancement to an existing function and removed libcudf Affects libcudf (C++/CUDA) code. labels Aug 13, 2024
@shrshi shrshi changed the title JSON host tree algorithms [WIP] JSON host tree algorithms Aug 14, 2024
@karthikeyann karthikeyann requested a review from a team as a code owner September 24, 2024 18:15
@github-actions github-actions bot added the Java Affects Java cuDF API. label Sep 24, 2024
Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has fixed all of the regressions I was seeing before. It is looking good.

From a performance perspective I am still seeing about 16.5 seconds to run my benchmark.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few comments.

cpp/src/io/json/json_tree.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_tree.cu Outdated Show resolved Hide resolved
cpp/src/io/json/json_tree.cu Show resolved Hide resolved
cpp/src/io/json/host_tree_algorithms.cu Outdated Show resolved Hide resolved
}

// Pruning
thrust::host_vector<uint8_t> is_pruned(num_columns, options.is_enabled_prune_columns());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it's still uint8_t, but assignments use true/false. We shold use the same type, whichever it is.

cpp/src/io/json/host_tree_algorithms.cu Outdated Show resolved Hide resolved
dev_ref parent_ref = std::ref(root);
// creates children column
std::function<void(NodeIndexT, dev_ref)> construct_tree;
construct_tree = [&](NodeIndexT root, dev_ref ref) -> void {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if some of the lambdas in this function should be functions. They're pretty large and overall build_tree is definitely on the "too big" side. Not a blocking comment.

Copy link
Contributor

@karthikeyann karthikeyann Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered making it independent functions too. But the number of captures by reference seemed very high.
is there any smarter way to do this (like that dispatch function suggestion)?

Copy link
Contributor

@vuule vuule Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a (functor?) class, so common parameters are stored as data members and we don't have to pass them in each call. We would need to pass them explicitly to create the object, but at least it's done only once.

@karthikeyann karthikeyann requested a review from vuule September 25, 2024 01:55
@karthikeyann karthikeyann self-requested a review September 25, 2024 20:41
@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Sep 25, 2024
@vuule
Copy link
Contributor

vuule commented Sep 26, 2024

/merge

@rapids-bot rapids-bot bot merged commit 12ee360 into rapidsai:branch-24.10 Sep 26, 2024
100 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants