
Stateless Row Encoding / Don't Preserve Dictionaries in RowConverter (#4811) #4819

Merged (4 commits) Sep 17, 2023

Conversation

@tustvold (Contributor) commented Sep 15, 2023

Which issue does this PR close?

Closes #4811

Rationale for this change

Interning of DictionaryArray values is slightly faster for String dictionaries with a small number of distinct values, as the representation in Row format is more concise than copying the actual underlying value. However, as we have discovered several times in downstream projects (for example apache/datafusion#7200 from @JayjeetAtGithub), the interning has huge downsides once the number of distinct keys becomes "large". Specifically:

  1. Memory consumption is unbounded, since the RowConverter keeps a copy of every value it has ever seen
  2. Performance suffers, as the length of the interned representation is a function of the number of distinct values ever seen, which can easily exceed the original data size for high-cardinality values
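The unbounded growth in (1) can be sketched with a minimal, hypothetical interner (plain std Rust, not the actual arrow-rs code): every distinct value ever observed is retained, and encoding requires `&mut self`, in contrast to a stateless encoding that simply copies the value bytes.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of a stateful interner, like the removed
/// dictionary-preserving path: every distinct value it has ever seen
/// is kept alive in `values`, so memory grows without bound.
struct Interner {
    ids: HashMap<Vec<u8>, u32>,
    values: Vec<Vec<u8>>, // a copy of every distinct value ever observed
}

impl Interner {
    fn new() -> Self {
        Self { ids: HashMap::new(), values: Vec::new() }
    }

    /// Requires `&mut self`: encoding mutates shared state.
    fn encode(&mut self, value: &[u8]) -> u32 {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = self.values.len() as u32;
        self.values.push(value.to_vec());
        self.ids.insert(value.to_vec(), id);
        id
    }

    fn retained_bytes(&self) -> usize {
        self.values.iter().map(|v| v.len()).sum()
    }
}

/// The stateless alternative in this PR: copy the value bytes into the
/// row. No state is kept, so it needs no `&mut` access to anything shared.
fn encode_stateless(value: &[u8], row: &mut Vec<u8>) {
    row.extend_from_slice(value);
}

fn main() {
    let mut interner = Interner::new();
    // High-cardinality input: every value is distinct.
    for i in 0..10_000u32 {
        interner.encode(format!("value-{i}").as_bytes());
    }
    // The interner now retains a copy of all 10_000 values.
    assert_eq!(interner.values.len(), 10_000);
    println!("retained: {} bytes", interner.retained_bytes());

    let mut row = Vec::new();
    encode_stateless(b"value-0", &mut row);
    assert_eq!(row, b"value-0");
}
```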

I initially considered a more holistic rework of the API to remove RowConverter entirely, but it quickly became quite a large change. Instead, this PR starts small and simply removes the dictionary interning, allowing the RowConverter methods to take &self instead of &mut self. This will simplify sharing a RowConverter across multiple threads, e.g. by wrapping it in Arc.
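To illustrate why the move to &self matters, here is a minimal sketch using a hypothetical stand-in type (not the real RowConverter API): once conversion needs no mutable state, one converter can be shared across threads via Arc with no Mutex.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the now-stateless converter: it holds no
/// interning state, so conversion can take `&self`.
struct StatelessConverter;

impl StatelessConverter {
    /// Encodes a value by copying its bytes (variable-length framing elided).
    fn convert(&self, value: &str) -> Vec<u8> {
        value.as_bytes().to_vec()
    }
}

fn main() {
    // Because `convert` takes `&self`, a single converter can be shared
    // across threads by wrapping it in `Arc` -- no locking required.
    let converter = Arc::new(StatelessConverter);
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let c = Arc::clone(&converter);
            thread::spawn(move || c.convert(&format!("row-{i}")))
        })
        .collect();
    for h in handles {
        let row = h.join().unwrap();
        assert!(row.starts_with(b"row-"));
    }
}
```

With the previous &mut self API this pattern would have required a Mutex (or a converter per thread), since interning mutated shared state on every call.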

There is more discussion on #4811

What changes are included in this PR?

Removes the dictionary preservation logic from RowConverter.

It isn't exactly a fair comparison, but the performance improvements are very significant.

convert_columns 4096 string_dictionary(10, 0)
                        time:   [99.401 µs 99.421 µs 99.445 µs]
                        change: [-87.296% -87.276% -87.263%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

convert_columns_prepared 4096 string_dictionary(10, 0)
                        time:   [98.143 µs 98.166 µs 98.193 µs]
                        change: [-35.716% -35.547% -35.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

convert_rows 4096 string_dictionary(10, 0)
                        time:   [58.769 µs 58.784 µs 58.804 µs]
                        change: [-77.717% -77.660% -77.612%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

convert_columns 4096 string_dictionary(30, 0)
                        time:   [99.277 µs 99.292 µs 99.308 µs]
                        change: [-87.710% -87.698% -87.683%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

convert_columns_prepared 4096 string_dictionary(30, 0)
                        time:   [98.284 µs 98.314 µs 98.347 µs]
                        change: [-37.308% -37.253% -37.162%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

convert_rows 4096 string_dictionary(30, 0)
                        time:   [58.738 µs 58.750 µs 58.765 µs]
                        change: [-77.761% -77.710% -77.667%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

convert_columns 4096 string_dictionary(100, 0)
                        time:   [167.89 µs 167.94 µs 167.99 µs]
                        change: [-80.874% -80.832% -80.788%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

convert_columns_prepared 4096 string_dictionary(100, 0)
                        time:   [164.66 µs 164.70 µs 164.74 µs]
                        change: [-24.825% -24.662% -24.561%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

convert_rows 4096 string_dictionary(100, 0)
                        time:   [98.780 µs 98.801 µs 98.823 µs]
                        change: [-66.415% -66.391% -66.367%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

convert_columns 4096 string_dictionary(100, 0.5)
                        time:   [123.44 µs 123.49 µs 123.54 µs]
                        change: [-70.268% -70.242% -70.218%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

convert_columns_prepared 4096 string_dictionary(100, 0.5)
                        time:   [122.29 µs 122.33 µs 122.37 µs]
                        change: [-11.596% -11.533% -11.473%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

convert_rows 4096 string_dictionary(100, 0.5)
                        time:   [101.20 µs 101.23 µs 101.26 µs]
                        change: [-34.506% -34.347% -34.243%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

convert_columns 4096 4096 string_dictionary(20, 0.5), string_dictionary(30, 0), string_dictionary(10...
                        time:   [428.24 µs 428.55 µs 429.10 µs]
                        change: [-12.006% -11.937% -11.865%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

convert_columns_prepared 4096 4096 string_dictionary(20, 0.5), string_dictionary(30, 0), string_dict...
                        time:   [421.87 µs 422.00 µs 422.17 µs]
                        change: [-12.320% -12.276% -12.230%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

convert_rows 4096 4096 string_dictionary(20, 0.5), string_dictionary(30, 0), string_dictionary(100, ...
                        time:   [257.66 µs 257.74 µs 257.82 µs]
                        change: [-6.7605% -5.7590% -4.7880%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

Are there any user-facing changes?

This disables dictionary preservation for keys, leading to different output types and memory / performance characteristics.

#4813 tracks potentially preserving the dictionary on decode, but I'm not yet sure about the benefits/practicalities of doing this.

#4712 tracks potential further performance improvements to this dictionary encoding approach

@tustvold tustvold added the api-change Changes to the arrow API label Sep 15, 2023
@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Sep 15, 2023
@tustvold (Contributor, Author)

FYI @sunchao I recall you mentioning the dictionary encoding overheads were very high in your workloads

@alamb (Contributor) commented Sep 15, 2023

I also added some additional rationale to the top of this PR to explain the CPU vs memory tradeoffs and help better motivate this change

@alamb alamb changed the title Stateless Row Encoding / Don't Preserve Dictionaries (#4811) Stateless Row Encoding / Don't Preserve Dictionaries in RowConverter (#4811) Sep 15, 2023
@alamb (Contributor) left a comment


Given all our benchmarks go faster and how many problems we have had with dictionary interning I am highly supportive of this approach.

I had a few benchmark questions, but otherwise 🏆

I looked at the benchmark results -- do we have any coverage for converting "low" cardinality dictionaries? The results posted seem to show that for high-cardinality dictionaries (i.e. where each value is a random value) things get much faster, as expected

FYI @JayjeetAtGithub if we merge this I think apache/datafusion#7401 might not be needed

However, I still think getting some end-to-end performance numbers to find where preserving dictionaries improves performance (if at all) would be hugely valuable, to quantify the end-to-end effect of removing the dictionary interning from the row converter

@tustvold (Contributor, Author) commented Sep 15, 2023

do we have any coverage for convert "low" cardinality dictionaries

This is a very good point; I added some benchmarks with "low" cardinality dictionaries containing 10 unique values. The performance gain from this PR is reduced, but still sizeable, especially in the case where the values haven't been seen before.

convert_columns 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [38.948 µs 38.956 µs 38.966 µs]
                        change: [-7.6416% -7.3418% -6.9125%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [38.230 µs 38.244 µs 38.261 µs]
                        change: [-7.1958% -6.9091% -6.6087%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [61.243 µs 61.258 µs 61.278 µs]
                        change: [-42.531% -42.345% -42.158%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

convert_columns 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [39.168 µs 39.175 µs 39.183 µs]
                        change: [-11.201% -11.002% -10.889%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [38.485 µs 38.493 µs 38.500 µs]
                        change: [-7.6690% -7.4510% -7.3212%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [61.568 µs 61.584 µs 61.605 µs]
                        change: [-42.307% -42.220% -42.080%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

convert_columns 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [40.500 µs 40.509 µs 40.520 µs]
                        change: [-19.204% -19.017% -18.911%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [39.657 µs 39.665 µs 39.674 µs]
                        change: [-8.0164% -7.8251% -7.7172%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [61.436 µs 61.448 µs 61.466 µs]
                        change: [-43.411% -43.330% -43.170%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Edit: actually messed up, standby

@tustvold tustvold force-pushed the stateless-row-encoding branch from 47a8e43 to daa31df Compare September 15, 2023 15:36
@tustvold tustvold force-pushed the stateless-row-encoding branch from daa31df to b864c5e Compare September 15, 2023 15:39
@sunchao (Member) commented Sep 15, 2023

I recall you mentioning the dictionary encoding overheads were very high in your workloads

Thanks @tustvold! Yes, we observed that the columnar-to-row conversion for dictionary encoding incurs quite some overhead. Even in my current vectorized hash table POC, which is based on GroupValuesRows, the conversion takes about 30% of total hash table time, measured from running tpch_mem and clickbench_1.

@tustvold (Contributor, Author)

Now that I have the benchmarks working correctly, they make a lot more sense 🤦...

With a low-cardinality dictionary consisting of 10 distinct values, the method in this PR still yields a performance advantage for moderate-sized strings of fewer than 30 bytes. The relationship inverts once we get to strings of 100 bytes, with a non-trivial regression for strings of that size.

convert_columns 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [39.115 µs 39.128 µs 39.143 µs]
                        change: [-7.2399% -6.9400% -6.6487%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [37.562 µs 37.568 µs 37.574 µs]
                        change: [-8.4250% -8.1655% -7.8921%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [61.337 µs 61.348 µs 61.362 µs]
                        change: [-41.300% -41.123% -40.951%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

convert_columns 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [39.096 µs 39.104 µs 39.113 µs]
                        change: [-7.1129% -6.9915% -6.7990%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [37.734 µs 37.741 µs 37.749 µs]
                        change: [-11.792% -10.671% -9.5985%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [60.281 µs 60.290 µs 60.300 µs]
                        change: [-42.061% -41.997% -41.879%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

convert_columns 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [58.589 µs 58.609 µs 58.636 µs]
                        change: [+38.632% +39.044% +39.486%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [57.292 µs 57.302 µs 57.313 µs]
                        change: [+39.663% +39.880% +40.261%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  3 (3.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [91.827 µs 91.914 µs 92.020 µs]
                        change: [-11.573% -11.484% -11.406%] (p = 0.00 < 0.05)
                        Performance has improved.

This is in line with my expectations: in the ideal case of a small dictionary containing large strings, the interning logic does yield a benefit. I'm not sure how common this case is.
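That tradeoff can be sketched with some back-of-envelope arithmetic (the per-row sizes below are assumptions for illustration, not the exact arrow-rs row layout): an interned key costs a few bytes regardless of value length but grows with the number of distinct values seen, while the stateless encoding pays for the value bytes themselves.

```rust
/// Assumed cost of an interned key: a small constant plus a term that
/// grows with the number of distinct values ever seen (a stand-in for a
/// variable-length, order-preserving code). Hypothetical sizes.
fn interned_row_bytes(distinct_values: usize) -> usize {
    1 + distinct_values / 32
}

/// Assumed cost of the stateless encoding: the value itself plus a
/// 1-byte marker. Hypothetical sizes.
fn stateless_row_bytes(value_len: usize) -> usize {
    1 + value_len
}

fn main() {
    // Small dictionary (10 distinct values) of large 100-byte strings:
    // the interned key is far more compact per row, matching the
    // regression seen in the low-cardinality 100-byte benchmark.
    assert!(interned_row_bytes(10) < stateless_row_bytes(100));

    // High cardinality (100_000 distinct 10-byte values): the interned
    // key outgrows the value itself, so stateless encoding wins.
    assert!(interned_row_bytes(100_000) > stateless_row_bytes(10));
}
```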

Successfully merging this pull request may close these issues.

Stateless Row Conversion