
Stateless Row Encoding / Don't Preserve Dictionaries in RowConverter (#4811) #4819

Merged (4 commits) Sep 17, 2023

Conversation

@tustvold (Contributor) commented Sep 15, 2023

Which issue does this PR close?

Closes #4811

Rationale for this change

Interning of DictionaryArray values is slightly faster for String dictionaries with a small number of distinct values, as the representation in Row format is more concise than copying the actual underlying value. However, as we have discovered several times in downstream projects (for example apache/datafusion#7200 from @JayjeetAtGithub), the interning has huge downsides once the number of distinct keys becomes "large". Specifically:

  1. Memory consumption is unbounded, since the RowConverter keeps a copy of every value it has ever seen
  2. Performance suffers, as the length of the interned representation is a function of the number of distinct values ever seen, which can easily exceed the original data size for high-cardinality values
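The unbounded growth in (1) can be sketched with a minimal, hypothetical interner (plain std Rust, not the actual arrow-rs code): every distinct value ever observed is retained, and encoding requires `&mut self`, in contrast to a stateless encoding that simply copies the value bytes.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of a stateful interner, like the removed
/// dictionary-preserving path: every distinct value it has ever seen
/// is kept alive in `values`, so memory grows without bound.
struct Interner {
    ids: HashMap<Vec<u8>, u32>,
    values: Vec<Vec<u8>>, // a copy of every distinct value ever observed
}

impl Interner {
    fn new() -> Self {
        Self { ids: HashMap::new(), values: Vec::new() }
    }

    /// Requires `&mut self`: encoding mutates shared state.
    fn encode(&mut self, value: &[u8]) -> u32 {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = self.values.len() as u32;
        self.values.push(value.to_vec());
        self.ids.insert(value.to_vec(), id);
        id
    }

    fn retained_bytes(&self) -> usize {
        self.values.iter().map(|v| v.len()).sum()
    }
}

/// The stateless alternative in this PR: copy the value bytes into the
/// row. No state is kept, so it needs no `&mut` access to anything shared.
fn encode_stateless(value: &[u8], row: &mut Vec<u8>) {
    row.extend_from_slice(value);
}

fn main() {
    let mut interner = Interner::new();
    // High-cardinality input: every value is distinct.
    for i in 0..10_000u32 {
        interner.encode(format!("value-{i}").as_bytes());
    }
    // The interner now retains a copy of all 10_000 values.
    assert_eq!(interner.values.len(), 10_000);
    println!("retained: {} bytes", interner.retained_bytes());

    let mut row = Vec::new();
    encode_stateless(b"value-0", &mut row);
    assert_eq!(row, b"value-0");
}
```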

I initially considered a more holistic rework of the API to remove RowConverter entirely, but it quickly became quite a large change. Instead, this PR starts small and simply removes the dictionary interning, allowing the RowConverter methods to take &self instead of &mut self. This will simplify sharing a RowConverter across multiple threads, e.g. by wrapping it in Arc.
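To illustrate why the move to &self matters, here is a minimal sketch using a hypothetical stand-in type (not the real RowConverter API): once conversion needs no mutable state, one converter can be shared across threads via Arc with no Mutex.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the now-stateless converter: it holds no
/// interning state, so conversion can take `&self`.
struct StatelessConverter;

impl StatelessConverter {
    /// Encodes a value by copying its bytes (variable-length framing elided).
    fn convert(&self, value: &str) -> Vec<u8> {
        value.as_bytes().to_vec()
    }
}

fn main() {
    // Because `convert` takes `&self`, a single converter can be shared
    // across threads by wrapping it in `Arc` -- no locking required.
    let converter = Arc::new(StatelessConverter);
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let c = Arc::clone(&converter);
            thread::spawn(move || c.convert(&format!("row-{i}")))
        })
        .collect();
    for h in handles {
        let row = h.join().unwrap();
        assert!(row.starts_with(b"row-"));
    }
}
```

With the previous &mut self API this pattern would have required a Mutex (or a converter per thread), since interning mutated shared state on every call.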

There is more discussion on #4811

What changes are included in this PR?

Removes the dictionary preservation logic from RowConverter.

It isn't exactly a fair comparison, but the performance improvements are very significant.

convert_columns 4096 string_dictionary(10, 0)
                        time:   [99.401 µs 99.421 µs 99.445 µs]
                        change: [-87.296% -87.276% -87.263%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

convert_columns_prepared 4096 string_dictionary(10, 0)
                        time:   [98.143 µs 98.166 µs 98.193 µs]
                        change: [-35.716% -35.547% -35.409%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

convert_rows 4096 string_dictionary(10, 0)
                        time:   [58.769 µs 58.784 µs 58.804 µs]
                        change: [-77.717% -77.660% -77.612%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

convert_columns 4096 string_dictionary(30, 0)
                        time:   [99.277 µs 99.292 µs 99.308 µs]
                        change: [-87.710% -87.698% -87.683%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

convert_columns_prepared 4096 string_dictionary(30, 0)
                        time:   [98.284 µs 98.314 µs 98.347 µs]
                        change: [-37.308% -37.253% -37.162%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

convert_rows 4096 string_dictionary(30, 0)
                        time:   [58.738 µs 58.750 µs 58.765 µs]
                        change: [-77.761% -77.710% -77.667%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe

convert_columns 4096 string_dictionary(100, 0)
                        time:   [167.89 µs 167.94 µs 167.99 µs]
                        change: [-80.874% -80.832% -80.788%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

convert_columns_prepared 4096 string_dictionary(100, 0)
                        time:   [164.66 µs 164.70 µs 164.74 µs]
                        change: [-24.825% -24.662% -24.561%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

convert_rows 4096 string_dictionary(100, 0)
                        time:   [98.780 µs 98.801 µs 98.823 µs]
                        change: [-66.415% -66.391% -66.367%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

convert_columns 4096 string_dictionary(100, 0.5)
                        time:   [123.44 µs 123.49 µs 123.54 µs]
                        change: [-70.268% -70.242% -70.218%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

convert_columns_prepared 4096 string_dictionary(100, 0.5)
                        time:   [122.29 µs 122.33 µs 122.37 µs]
                        change: [-11.596% -11.533% -11.473%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe

convert_rows 4096 string_dictionary(100, 0.5)
                        time:   [101.20 µs 101.23 µs 101.26 µs]
                        change: [-34.506% -34.347% -34.243%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

convert_columns 4096 4096 string_dictionary(20, 0.5), string_dictionary(30, 0), string_dictionary(10...
                        time:   [428.24 µs 428.55 µs 429.10 µs]
                        change: [-12.006% -11.937% -11.865%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

convert_columns_prepared 4096 4096 string_dictionary(20, 0.5), string_dictionary(30, 0), string_dict...
                        time:   [421.87 µs 422.00 µs 422.17 µs]
                        change: [-12.320% -12.276% -12.230%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

convert_rows 4096 4096 string_dictionary(20, 0.5), string_dictionary(30, 0), string_dictionary(100, ...
                        time:   [257.66 µs 257.74 µs 257.82 µs]
                        change: [-6.7605% -5.7590% -4.7880%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

Are there any user-facing changes?

This disables dictionary preservation for keys, leading to different output types and memory / performance characteristics.

#4813 tracks potentially preserving the dictionary on decode, but I'm not yet sure about the benefits/practicalities of doing this.

#4712 tracks potential further performance improvements to this dictionary encoding approach

@tustvold tustvold added the api-change Changes to the arrow API label Sep 15, 2023
@github-actions github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Sep 15, 2023
@tustvold (Contributor, Author)

FYI @sunchao I recall you mentioning the dictionary encoding overheads were very high in your workloads

@alamb (Contributor) commented Sep 15, 2023

I also added some additional rationale to the top of this PR to explain the CPU vs memory tradeoffs and help better motivate this change

@alamb alamb changed the title Stateless Row Encoding / Don't Preserve Dictionaries (#4811) Stateless Row Encoding / Don't Preserve Dictionaries in RowConverter (#4811) Sep 15, 2023
@alamb (Contributor) left a comment


Given all our benchmarks go faster and how many problems we have had with dictionary interning I am highly supportive of this approach.

I had a few benchmark questions, but otherwise 🏆

I looked at the benchmark results -- do we have any coverage for converting "low" cardinality dictionaries? The results posted seem to show that for high-cardinality dictionaries (i.e. where each value is a random value) things get much faster, as expected

FYI @JayjeetAtGithub if we merge this I think apache/datafusion#7401 might not be needed

However, I still think getting some end-to-end performance numbers to find where preserving dictionaries improves performance (if at all) would be hugely valuable, to quantify the end-to-end effect of removing the dictionary interning from the row converter

@tustvold (Contributor, Author) commented Sep 15, 2023

do we have any coverage for convert "low" cardinality dictionaries

This is a very good point; I added some benchmarks with "low" cardinality dictionaries containing 10 unique values. The performance gain from this PR is reduced, but still sizeable, especially in the case where the values haven't been seen before.

convert_columns 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [38.948 µs 38.956 µs 38.966 µs]
                        change: [-7.6416% -7.3418% -6.9125%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [38.230 µs 38.244 µs 38.261 µs]
                        change: [-7.1958% -6.9091% -6.6087%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [61.243 µs 61.258 µs 61.278 µs]
                        change: [-42.531% -42.345% -42.158%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe

convert_columns 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [39.168 µs 39.175 µs 39.183 µs]
                        change: [-11.201% -11.002% -10.889%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [38.485 µs 38.493 µs 38.500 µs]
                        change: [-7.6690% -7.4510% -7.3212%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [61.568 µs 61.584 µs 61.605 µs]
                        change: [-42.307% -42.220% -42.080%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

convert_columns 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [40.500 µs 40.509 µs 40.520 µs]
                        change: [-19.204% -19.017% -18.911%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [39.657 µs 39.665 µs 39.674 µs]
                        change: [-8.0164% -7.8251% -7.7172%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [61.436 µs 61.448 µs 61.466 µs]
                        change: [-43.411% -43.330% -43.170%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

Edit: actually messed up, standby

@tustvold tustvold force-pushed the stateless-row-encoding branch from 47a8e43 to daa31df Compare September 15, 2023 15:36
@tustvold tustvold force-pushed the stateless-row-encoding branch from daa31df to b864c5e Compare September 15, 2023 15:39
@sunchao (Member) commented Sep 15, 2023

I recall you mentioning the dictionary encoding overheads were very high in your workloads

Thanks @tustvold! Yes, we observed that the columnar-to-row conversion for dictionary encoding incurs quite some overhead. Even in my current vectorized hash table POC, which is based on GroupValuesRows, the conversion takes about 30% of total hash table time, measured from running tpch_mem and clickbench_1.

@tustvold (Contributor, Author)

Now that I have the benchmarks working correctly, they make a lot more sense 🤦...

With a low-cardinality dictionary consisting of 10 distinct values, the method in this PR still yields a performance advantage for moderate-sized strings of fewer than 30 bytes. The relationship inverts once we get to strings of 100 bytes, with a non-trivial regression for strings of that size.

convert_columns 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [39.115 µs 39.128 µs 39.143 µs]
                        change: [-7.2399% -6.9400% -6.6487%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [37.562 µs 37.568 µs 37.574 µs]
                        change: [-8.4250% -8.1655% -7.8921%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(10, 0)
                        time:   [61.337 µs 61.348 µs 61.362 µs]
                        change: [-41.300% -41.123% -40.951%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

convert_columns 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [39.096 µs 39.104 µs 39.113 µs]
                        change: [-7.1129% -6.9915% -6.7990%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [37.734 µs 37.741 µs 37.749 µs]
                        change: [-11.792% -10.671% -9.5985%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(30, 0)
                        time:   [60.281 µs 60.290 µs 60.300 µs]
                        change: [-42.061% -41.997% -41.879%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

convert_columns 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [58.589 µs 58.609 µs 58.636 µs]
                        change: [+38.632% +39.044% +39.486%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe

convert_columns_prepared 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [57.292 µs 57.302 µs 57.313 µs]
                        change: [+39.663% +39.880% +40.261%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  3 (3.00%) high severe

convert_rows 4096 string_dictionary_low_cardinality(100, 0)
                        time:   [91.827 µs 91.914 µs 92.020 µs]
                        change: [-11.573% -11.484% -11.406%] (p = 0.00 < 0.05)
                        Performance has improved.

This is in line with my expectations: in the ideal case of a small dictionary containing large strings, the interning logic does yield a benefit. I'm not sure how common this case is.
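That tradeoff can be sketched with some back-of-envelope arithmetic (the per-row sizes below are assumptions for illustration, not the exact arrow-rs row layout): an interned key costs a few bytes regardless of value length but grows with the number of distinct values seen, while the stateless encoding pays for the value bytes themselves.

```rust
/// Assumed cost of an interned key: a small constant plus a term that
/// grows with the number of distinct values ever seen (a stand-in for a
/// variable-length, order-preserving code). Hypothetical sizes.
fn interned_row_bytes(distinct_values: usize) -> usize {
    1 + distinct_values / 32
}

/// Assumed cost of the stateless encoding: the value itself plus a
/// 1-byte marker. Hypothetical sizes.
fn stateless_row_bytes(value_len: usize) -> usize {
    1 + value_len
}

fn main() {
    // Small dictionary (10 distinct values) of large 100-byte strings:
    // the interned key is far more compact per row, matching the
    // regression seen in the low-cardinality 100-byte benchmark.
    assert!(interned_row_bytes(10) < stateless_row_bytes(100));

    // High cardinality (100_000 distinct 10-byte values): the interned
    // key outgrows the value itself, so stateless encoding wins.
    assert!(interned_row_bytes(100_000) > stateless_row_bytes(10));
}
```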

Successfully merging this pull request may close these issues.

Stateless Row Conversion