-
Notifications
You must be signed in to change notification settings - Fork 843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stateless Row Encoding / Don't Preserve Dictionaries in RowConverter
(#4811)
#4819
Conversation
FYI @sunchao I recall you mentioning the dictionary encoding overheads were very high in your workloads |
I also added some additional rationale to the top of this PR to explain the CPU vs memory tradeoffs to try and help better motivate this change |
RowConverter
(#4811)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given all our benchmarks go faster and how many problems we have had with dictionary interning I am highly supportive of this approach.
I had a few benchmark questions, but otherwise 🏆
I looked at the benchmark results -- do we have any coverage for convert "low" cardinality dictionaries? The results posted seem to show that for high cardinality dictionaries (aka where each value is a random value) things get much faster, as expected
FYI @JayjeetAtGithub if we merge this I think apache/datafusion#7401 might not be needed
However, I still think getting some end to end performance numbers to find where preserving dictionaries improves performance (if at all) would still be hugely valuable to quanity the end to end performance effect that removing the dictionary interning from the row converter would hanve
This is a very good point, added some benchmarks with "low" cardinality dictionaries containing 10 unique values. The performance gain from this PR is reduced, but is still sizeable, especially in the case where the values haven't been seen before.
Edit: actually messed up, standby |
47a8e43
to
daa31df
Compare
daa31df
to
b864c5e
Compare
Thanks @tustvold ! Yes, we observed the columnar-to-row conversion for dictionary encoding incurs quite some overhead. Even in my current vectorized hash table POC, which is based of |
Now that I got the benchmarks working correctly they make a lot more sense 🤦... With a low cardinality dictionary consisting of 10 distinct values, the method in this PR still yields a performance advantage with moderate-sized strings of less than 30 bytes. This relationship inverts once we get to strings consisting of 100 bytes, with a non-trivial regression for strings of this size.
This is inline with my expectations, in the ideal case of a small dictionary containing large strings, the interning logic does represent a benefit, I'm not sure how common this case is |
Which issue does this PR close?
Closes #4811
Rationale for this change
Interning of
DictionaryArray
values is slightly faster forString
dictionaries with a small number of distinct values, as the representation inRow
format is more concise than copying the actual underlying value. However, as we have discovered several times in downstream projects (for example apache/datafusion#7200 from @JayjeetAtGithub ) the interning has huge downsides once the number of distinct keys becomes "large". Specifically:I initially considered a more holistic rework of the API to entirely remove
RowConverter
, but it quickly became quite a large change. Instead this PR starts small and simply removes the dictionary interning, allowing the RowConverter methods to take&self
instead of&mut self
. This will simplify sharing aRowConverter
across multiple threads, e.g. by wrapping it inArc
.There is more discussion on #4811
What changes are included in this PR?
Removes the dictionary preservation logic from RowConverter.
It isn't exactly a fair comparison, but the performance improvements are very significant.
Are there any user-facing changes?
This disables dictionary preservation for keys, leading to different output types and memory / performance characteristics.
#4813 tracks potentially preserving the dictionary on decode, but I'm not yet sure about the benefits/practicalities of doing this.
#4712 tracks potential further performance improvements to this dictionary encoding approach