
Long string optimization for string column parsing in JSON reader #13803

Merged
merged 63 commits into rapidsai:branch-23.10 on Sep 20, 2023

Conversation

karthikeyann
Contributor

@karthikeyann karthikeyann commented Aug 2, 2023

Description

closes #13724

Previously, one thread per string was allocated to parse a string column.
For long strings (>1024 characters), one-thread-per-string decoding takes too long even when only a few such strings are present.

With this change, one warp per string is used to parse strings of length <=1024, and one block per string for strings longer than 1024. If the maximum string length is <128, one thread per string is used as before.

Both kernels use 256 threads per block.
The one-warp-per-string and one-block-per-string code paths are nearly identical, differing only in the warp-wide versus block-wide primitives used for the reduction and scan operations; shared memory usage also differs slightly.
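For illustration, the dispatch rule described above could be sketched as a host-side selector. This is a hypothetical sketch using the thresholds from this description (128 and 1024), not libcudf's actual code; all names are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative strategy selector: one thread, one warp (32 threads), or one
// thread block per string, chosen from the longest string in the column.
enum class decode_strategy { thread_per_string, warp_per_string, block_per_string };

decode_strategy choose_strategy(std::size_t max_string_length)
{
  if (max_string_length < 128) return decode_strategy::thread_per_string;   // short strings
  if (max_string_length <= 1024) return decode_strategy::warp_per_string;   // medium strings
  return decode_strategy::block_per_string;                                 // long strings
}
```

The single cutover points mean the whole column uses one kernel configuration, chosen from the column-wide maximum string length.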

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@karthikeyann karthikeyann added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 2, 2023
@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond 4 - Needs cuIO Reviewer and removed 2 - In Progress Currently a work in progress labels Aug 17, 2023
@karthikeyann karthikeyann marked this pull request as ready for review August 17, 2023 19:30
@karthikeyann karthikeyann requested a review from a team as a code owner August 17, 2023 19:30
@karthikeyann karthikeyann requested review from harrism and vuule August 17, 2023 19:30
@karthikeyann karthikeyann changed the title Long string optimization for string column parsing in JSON reader Long string optimization for string column parsing in JSON reader Sep 12, 2023
Contributor

@vuule vuule left a comment


Just a few small comments, looks great otherwise.

@karthikeyann
Contributor Author

/ok to test

@wence-
Contributor

wence- commented Sep 18, 2023

/ok to test

Contributor

@wence- wence- left a comment


Some minor nitpicks, two of which I think want addressing before merge (even if only to convince me that there is "nothing to see here"). Specifically:

  • possible integer overflow in the grid stride loop
  • logic difference in the two places UTF16 surrogate codepoints are handled


// This is indeed a UTF16 surrogate pair
if (hex_val >= UTF16_HIGH_SURROGATE_BEGIN && hex_val < UTF16_HIGH_SURROGATE_END &&
hex_low_val >= UTF16_LOW_SURROGATE_BEGIN && hex_low_val < UTF16_LOW_SURROGATE_END) {
Contributor


question: Is it valid to have two successive UTF-16 codepoints where the first is contained in [HIGH_SURROGATE_BEGIN, HIGH_SURROGATE_END) and the second is not contained in [LOW_SURROGATE_BEGIN, LOW_SURROGATE_END)? If not, then I think we are missing a parse error here for this case.

Contributor Author


AFAIK, it's considered an error in UTF-16.
https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF
Since the ranges for the high surrogates (0xD800–0xDBFF), low surrogates (0xDC00–0xDFFF), and valid BMP characters (0x0000–0xD7FF, 0xE000–0xFFFF) are disjoint, it is not possible for a surrogate to match a BMP character, or for two adjacent code units to look like a legal surrogate pair. This simplifies searches a great deal.

https://en.wikipedia.org/wiki/UTF-16#U+D800_to_U+DFFF_(surrogates)

It is possible to unambiguously encode an unpaired surrogate (a high surrogate code point not followed by a low one, or a low one not preceded by a high one) in the format of UTF-16 by using a code unit equal to the code point. The result is not valid UTF-16, but the majority of UTF-16 encoder and decoder implementations do this when translating between encodings.
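To make the quoted ranges concrete, here is a minimal sketch (ordinary host C++, not libcudf code; the constant names are illustrative) of the standard UTF-16 surrogate checks and pair combination:

```cpp
#include <cassert>
#include <cstdint>

// UTF-16 surrogate ranges quoted above, expressed as half-open intervals.
constexpr uint32_t HIGH_BEGIN = 0xD800, HIGH_END = 0xDC00;
constexpr uint32_t LOW_BEGIN  = 0xDC00, LOW_END  = 0xE000;

bool is_high_surrogate(uint32_t u) { return u >= HIGH_BEGIN && u < HIGH_END; }
bool is_low_surrogate(uint32_t u)  { return u >= LOW_BEGIN && u < LOW_END; }

// Standard UTF-16 pair combination: each surrogate contributes 10 bits of
// the code point, offset by U+10000.
uint32_t combine_surrogates(uint32_t high, uint32_t low)
{
  return 0x10000u + ((high - HIGH_BEGIN) << 10) + (low - LOW_BEGIN);
}
```

Because the high-surrogate, low-surrogate, and BMP ranges are disjoint, these predicates can never both be true for the same code unit.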

Contributor


OK, so if I understand correctly, we want logic that is something like:

hex_val = parse_unicode_hex(stream);
if (is_high_surrogate(hex_val)) {
    auto hex_val_low = parse_unicode_hex(stream);
    if (is_low_surrogate(hex_val_low)) {
        // valid surrogate pair, decode
    } else {
        // invalid surrogate pair, parse error
        // I think this case is missing
        return {bytes, parse_error};
    }
} else {
   // non-surrogate pair, decode as normal
}

But I think the parse error in the case where the first code unit is a high surrogate and the second is not a low surrogate is not handled. Except that maybe this is deliberate, per:

The result is not valid UTF-16, but the majority of UTF-16 encoder and decoder implementations do this when translating between encodings.

So you're just "carrying on" and letting an invalid character appear in the output stream. Is that right?

If so, maybe

suggestion: Add a comment explaining this parsing behaviour.
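For illustration only (a sketch of the "carry on" behaviour being discussed, not libcudf's actual implementation; names are hypothetical): decode a valid surrogate pair, otherwise pass the lone code unit through rather than raising a parse error.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Returns the decoded code point and the number of UTF-16 code units consumed.
// An unpaired high surrogate is carried through unchanged instead of being
// treated as a parse error, mirroring the behaviour discussed above.
std::pair<uint32_t, int> decode_utf16(uint32_t first, uint32_t second)
{
  bool const first_is_high = first >= 0xD800 && first < 0xDC00;
  bool const second_is_low = second >= 0xDC00 && second < 0xE000;
  if (first_is_high && second_is_low) {
    // Valid surrogate pair: combine into a supplementary-plane code point.
    return {0x10000u + ((first - 0xD800u) << 10) + (second - 0xDC00u), 2};
  }
  // Non-surrogate, or unpaired surrogate: emit the code unit as-is.
  return {first, 1};
}
```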

Contributor Author

@karthikeyann karthikeyann Sep 19, 2023


That's right.
Added a comment at the place where this is skipped rather than thrown as an error.

@harrism harrism removed their request for review September 18, 2023 23:22
@wence-
Contributor

wence- commented Sep 19, 2023

/ok to test

@wence-
Contributor

wence- commented Sep 19, 2023

Thanks for your hard work here!

@karthikeyann
Contributor Author

/ok to test

@karthikeyann karthikeyann added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs Review Waiting for reviewer to review or respond 4 - Needs cuIO Reviewer 3 - Ready for Review Ready for review by team labels Sep 19, 2023
@karthikeyann
Contributor Author

karthikeyann commented Sep 19, 2023

Final benchmark:

~70-80% runtime reduction for long strings (overall range: 30%-89%).

json_read_string_column (single column) (branch-23.10 vs this PR)

single column with max length string, json file size 512MB

[0] Quadro GV100

| max_length | is_fixed_length | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 16 | 1 | 247.072 ms | 0.67% | 244.442 ms | 0.21% | -2629.677 us | -1.06% |
| 32 | 1 | 161.946 ms | 0.48% | 160.267 ms | 0.18% | -1678.671 us | -1.04% |
| 64 | 1 | 160.508 ms | 0.07% | 161.867 ms | 0.44% | 1.358 ms | 0.85% |
| 128 | 1 | 225.849 ms | 0.43% | 162.441 ms | 0.36% | -63408.243 us | -28.08% |
| 256 | 1 | 291.920 ms | 0.33% | 144.881 ms | 0.47% | -147039.083 us | -50.37% |
| 512 | 1 | 317.194 ms | 0.20% | 132.372 ms | 0.48% | -184822.043 us | -58.27% |
| 1024 | 1 | 353.383 ms | 0.23% | 125.344 ms | 0.37% | -228039.301 us | -64.53% |
| 4096 | 1 | 235.406 ms | 0.16% | 121.457 ms | 0.50% | -113949.635 us | -48.41% |
| 16384 | 1 | 123.327 ms | 0.03% | 135.223 ms | 0.49% | 11.896 ms | 9.65% |
| 65536 | 1 | 272.949 ms | 0.04% | 142.435 ms | 0.27% | -130513.632 us | -47.82% |
| 262144 | 1 | 951.804 ms | 0.03% | 150.038 ms | 0.41% | -801766.458 us | -84.24% |
| 1048576 | 1 | 3.243 s | 0.03% | 347.749 ms | 0.29% | -2895555.682 us | -89.28% |
| 10485760 | 1 | 15.121 s | inf% | 2.737 s | 0.08% | -12384781.445 us | -81.90% |
| 104857600 | 1 | 59.740 s | inf% | 18.052 s | inf% | -41687603.516 us | -69.78% |
| 16 | 0 | 341.065 ms | 0.20% | 338.887 ms | 0.18% | -2177.289 us | -0.64% |
| 32 | 0 | 203.926 ms | 0.41% | 202.162 ms | 0.26% | -1764.044 us | -0.87% |
| 64 | 0 | 148.662 ms | 0.06% | 148.158 ms | 0.46% | -504.291 us | -0.34% |
| 128 | 0 | 179.354 ms | 0.80% | 188.805 ms | 0.14% | 9.450 ms | 5.27% |
| 256 | 0 | 243.734 ms | 0.44% | 153.771 ms | 0.50% | -89962.469 us | -36.91% |
| 512 | 0 | 287.616 ms | 0.26% | 141.374 ms | 0.10% | -146241.812 us | -50.85% |
| 1024 | 0 | 315.399 ms | 0.07% | 132.822 ms | 0.49% | -182577.268 us | -57.89% |
| 4096 | 0 | 296.523 ms | 0.20% | 126.245 ms | 0.47% | -170277.805 us | -57.42% |
| 16384 | 0 | 114.834 ms | 0.12% | 128.630 ms | 0.30% | 13.796 ms | 12.01% |
| 65536 | 0 | 210.012 ms | 0.34% | 145.012 ms | 0.17% | -64999.411 us | -30.95% |
| 262144 | 0 | 697.045 ms | 0.04% | 145.651 ms | 0.10% | -551394.577 us | -79.10% |
| 1048576 | 0 | 2.579 s | 0.01% | 292.832 ms | 0.11% | -2285863.312 us | -88.64% |
| 10485760 | 0 | 11.416 s | inf% | 2.323 s | 0.02% | -9093176.855 us | -79.65% |
| 104857600 | 0 | 41.310 s | inf% | 10.164 s | inf% | -31145317.383 us | -75.39% |

json_read_string_column (multiple columns) (branch-23.10 vs this PR)

64 column in json, json file size 512MB

[0] Quadro GV100

| max_length | is_fixed_length | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 16 | 1 | 375.037 ms | 0.37% | 395.404 ms | 0.24% | 20.367 ms | 5.43% |
| 32 | 1 | 312.670 ms | 0.13% | 335.295 ms | 0.02% | 22.624 ms | 7.24% |
| 64 | 1 | 172.643 ms | 0.12% | 179.064 ms | 0.34% | 6.421 ms | 3.72% |
| 128 | 1 | 129.609 ms | 0.49% | 161.565 ms | 0.30% | 31.956 ms | 24.66% |
| 256 | 1 | 136.144 ms | 0.42% | 150.128 ms | 0.48% | 13.984 ms | 10.27% |
| 512 | 1 | 169.633 ms | 0.35% | 141.509 ms | 0.02% | -28123.434 us | -16.58% |
| 1024 | 1 | 267.141 ms | 0.02% | 144.759 ms | 0.49% | -122382.312 us | -45.81% |
| 4096 | 1 | 868.568 ms | 0.11% | 150.625 ms | 0.10% | -717943.390 us | -82.66% |
| 16384 | 1 | 3.227 s | 0.02% | 429.324 ms | 0.26% | -2797406.024 us | -86.69% |
| 65536 | 1 | 5.988 s | inf% | 1.177 s | 0.07% | -4811603.776 us | -80.35% |
| 262144 | 1 | 18.942 s | inf% | 4.006 s | inf% | -14936076.111 us | -78.85% |
| 1048576 | 1 | 35.072 s | inf% | 10.725 s | inf% | -24346885.742 us | -69.42% |
| 10485760 | 1 | 186.692 s | inf% | 24.209 s | inf% | -162482992.188 us | -87.03% |
| 16 | 0 | 446.509 ms | 0.30% | 463.299 ms | 0.30% | 16.790 ms | 3.76% |
| 32 | 0 | 310.329 ms | 0.26% | 325.206 ms | 0.29% | 14.878 ms | 4.79% |
| 64 | 0 | 229.938 ms | 0.19% | 244.593 ms | 0.41% | 14.655 ms | 6.37% |
| 128 | 0 | 141.235 ms | 0.34% | 164.888 ms | 0.46% | 23.653 ms | 16.75% |
| 256 | 0 | 125.721 ms | 0.24% | 162.090 ms | 0.19% | 36.369 ms | 28.93% |
| 512 | 0 | 143.359 ms | 0.39% | 154.406 ms | 0.40% | 11.047 ms | 7.71% |
| 1024 | 0 | 210.963 ms | 0.39% | 151.757 ms | 0.48% | -59205.129 us | -28.06% |
| 4096 | 0 | 640.442 ms | 0.04% | 179.280 ms | 0.22% | -461162.570 us | -72.01% |
| 16384 | 0 | 2.338 s | 0.04% | 279.270 ms | 0.07% | -2058996.429 us | -88.06% |
| 65536 | 0 | 4.671 s | inf% | 929.830 ms | 0.05% | -3741301.990 us | -80.09% |
| 262144 | 0 | 14.595 s | inf% | 3.134 s | 0.03% | -11460648.486 us | -78.53% |
| 1048576 | 0 | 32.698 s | inf% | 11.862 s | inf% | -20835319.336 us | -63.72% |
| 10485760 | 0 | 125.915 s | inf% | 16.261 s | inf% | -109654444.336 us | -87.09% |

@karthikeyann
Contributor Author

karthikeyann commented Sep 19, 2023

Thanks to all the reviewers for the great inputs! Many suggestions led to great performance improvements 🚀
@vuule @elstehle @wence- @robertmaynard

@vuule
Contributor

vuule commented Sep 19, 2023

/merge

@rapids-bot rapids-bot bot merged commit 97501d8 into rapidsai:branch-23.10 Sep 20, 2023
54 of 55 checks passed
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Improve JSON reader performance with "document" strings columns
6 participants