Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement specialized group values for single Uft8/LargeUtf8/Binary/LargeBinary column #8827

Merged
merged 5 commits into from
Feb 20, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 10, 2024

Note this looks like a large PR, but a substantial amount of it is moving code and tests and documentation

Which issue does this PR close?

closes #7064

Rationale for this change

This PR makes queries like this faster (grouping on strings):

SELECT url, count(*) FROM hits GROUP BY URL ORDER BY count(*) desc limit 10;

ClickBench seems slightly slower in 34.0.0 than in 33.0.0: #8789 (comment) so we need some more real wins

One thing that takes significant time in queries is grouping on a single string column, where the current GroupsAccumulator copies the strings several times:

  1. Copying into Rows,
  2. Copying OwnedRow ,
  3. Copying from Rows to form the final output Array

We can make such queries faster by avoiding the copies

What changes are included in this PR?

  1. Generalize the SSOStringSet that @jayzhan211 and I implemented in Optimize COUNT( DISTINCT ...) for strings (up to 9x faster) #8849 into ArrowBinaryMap which also handles values and both string and binary
  2. Refactor COUNT DISTINCT specialization to handle binary/large binary
  3. Use ArrowBinaryMap to implement GroupValuesByes, a GroupsValues implementation for String / LargeString / Binary / LargeBinary data

Are these changes tested?

Yes:

  1. Unit tests
  2. New fuzz tests (added in Add string aggregate grouping fuzz test, add MemTable::with_sort_exprs #9190)

Are there any user-facing changes?

Faster performance.

For Clickbench, several of the longest running queries got significantly faster. This is almost certainly the result of less string copying as they are grouping by "Url", a string column

│ QQuery 5     │  2756.24ms │                    1917.97ms │ +1.44x faster │
│ QQuery 12    │  2501.83ms │                    2003.72ms │ +1.25x faster │
│ QQuery 13    │  4875.29ms │                    4373.19ms │ +1.11x faster │
│ QQuery 28    │ 31656.41ms │                   26984.03ms │ +1.17x faster │
│ QQuery 33    │ 12469.57ms │                    9065.49ms │ +1.38x faster │
│ QQuery 36    │   475.49ms │                     350.97ms │ +1.35x faster │
│ QQuery 37    │   252.94ms │                     232.00ms │ +1.09x faster │

There are two that are reported to be slower, but I think given they run in 10s of ms this is likely a measurement error

│ QQuery 41    │    77.78ms │                      81.92ms │  1.05x slower │
│ QQuery 42    │    91.72ms │                      97.71ms │  1.07x slower │

For ClickBench extended things got faster:

│ QQuery 2     │ 3555.86ms │                    3025.06ms │ +1.18x faster │

For TPCH there were also several queries that got faster, but the time spent is so small (10s of ms) that I think it likely measurement error

Full Performance Results

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ alamb_specialized_group_keys ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.79ms │                       0.77ms │     no change │
│ QQuery 1     │    89.67ms │                      91.96ms │     no change │
│ QQuery 2     │   189.61ms │                     198.97ms │     no change │
│ QQuery 3     │   200.28ms │                     199.00ms │     no change │
│ QQuery 4     │  2142.31ms │                    2090.35ms │     no change │
│ QQuery 5     │  2756.24ms │                    1917.97ms │ +1.44x faster │
│ QQuery 6     │    81.90ms │                      81.71ms │     no change │
│ QQuery 7     │    92.59ms │                      95.39ms │     no change │
│ QQuery 8     │  3271.38ms │                    3112.39ms │     no change │
│ QQuery 9     │  2364.29ms │                    2344.73ms │     no change │
│ QQuery 10    │   801.35ms │                     828.64ms │     no change │
│ QQuery 11    │   874.21ms │                     902.34ms │     no change │
│ QQuery 12    │  2501.83ms │                    2003.72ms │ +1.25x faster │
│ QQuery 13    │  4875.29ms │                    4373.19ms │ +1.11x faster │
│ QQuery 14    │  2816.89ms │                    2707.51ms │     no change │
│ QQuery 15    │  2353.51ms │                    2331.28ms │     no change │
│ QQuery 16    │  5839.98ms │                    5631.23ms │     no change │
│ QQuery 17    │  5732.99ms │                    5509.90ms │     no change │
│ QQuery 18    │ 11679.49ms │                   11116.23ms │     no change │
│ QQuery 19    │   163.08ms │                     159.01ms │     no change │
│ QQuery 20    │  2657.83ms │                    2586.30ms │     no change │
│ QQuery 21    │  3382.28ms │                    3307.62ms │     no change │
│ QQuery 22    │  9120.73ms │                    9097.39ms │     no change │
│ QQuery 23    │ 21176.57ms │                   21659.87ms │     no change │
│ QQuery 24    │  1337.87ms │                    1351.44ms │     no change │
│ QQuery 25    │  1148.52ms │                    1171.45ms │     no change │
│ QQuery 26    │  1477.97ms │                    1475.16ms │     no change │
│ QQuery 27    │  3902.05ms │                    3815.77ms │     no change │
│ QQuery 28    │ 31656.41ms │                   26984.03ms │ +1.17x faster │
│ QQuery 29    │  1065.64ms │                    1058.78ms │     no change │
│ QQuery 30    │  2480.57ms │                    2385.47ms │     no change │
│ QQuery 31    │  3199.02ms │                    3095.34ms │     no change │
│ QQuery 32    │ 16151.55ms │                   16111.18ms │     no change │
│ QQuery 33    │ 12469.57ms │                    9065.49ms │ +1.38x faster │
│ QQuery 34    │ 13162.17ms │                   13481.65ms │     no change │
│ QQuery 35    │  3869.02ms │                    3907.86ms │     no change │
│ QQuery 36    │   475.49ms │                     350.97ms │ +1.35x faster │
│ QQuery 37    │   252.94ms │                     232.00ms │ +1.09x faster │
│ QQuery 38    │   194.91ms │                     189.39ms │     no change │
│ QQuery 39    │  1112.28ms │                    1148.26ms │     no change │
│ QQuery 40    │    89.82ms │                      92.75ms │     no change │
│ QQuery 41    │    77.78ms │                      81.92ms │  1.05x slower │
│ QQuery 42    │    91.72ms │                      97.71ms │  1.07x slower │
└──────────────┴────────────┴──────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main_base)                      │ 179380.43ms │
│ Total Time (alamb_specialized_group_keys)   │ 168444.07ms │
│ Average Time (main_base)                    │   4171.64ms │
│ Average Time (alamb_specialized_group_keys) │   3917.30ms │
│ Queries Faster                              │           7 │
│ Queries Slower                              │           2 │
│ Queries with No Change                      │          34 │
└─────────────────────────────────────────────┴─────────────┘

--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ alamb_specialized_group_keys ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │ 3761.83ms │                    3645.99ms │     no change │
│ QQuery 1     │ 1540.43ms │                    1523.35ms │     no change │
│ QQuery 2     │ 3555.86ms │                    3025.06ms │ +1.18x faster │
└──────────────┴───────────┴──────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                      │ 8858.12ms │
│ Total Time (alamb_specialized_group_keys)   │ 8194.40ms │
│ Average Time (main_base)                    │ 2952.71ms │
│ Average Time (alamb_specialized_group_keys) │ 2731.47ms │
│ Queries Faster                              │         1 │
│ Queries Slower                              │         0 │
│ Queries with No Change                      │         2 │
└─────────────────────────────────────────────┴───────────┘

--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ alamb_specialized_group_keys ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  221.65ms │                     211.74ms │     no change │
│ QQuery 2     │   45.09ms │                      44.14ms │     no change │
│ QQuery 3     │   75.12ms │                      75.49ms │     no change │
│ QQuery 4     │   74.56ms │                      73.28ms │     no change │
│ QQuery 5     │  121.22ms │                     121.11ms │     no change │
│ QQuery 6     │   16.32ms │                      16.63ms │     no change │
│ QQuery 7     │  317.89ms │                     310.41ms │     no change │
│ QQuery 8     │   77.76ms │                      75.12ms │     no change │
│ QQuery 9     │  127.65ms │                     128.59ms │     no change │
│ QQuery 10    │  147.28ms │                     152.27ms │     no change │
│ QQuery 11    │   33.24ms │                      33.89ms │     no change │
│ QQuery 12    │   68.39ms │                      70.30ms │     no change │
│ QQuery 13    │   86.20ms │                      85.18ms │     no change │
│ QQuery 14    │   25.62ms │                      23.91ms │ +1.07x faster │
│ QQuery 15    │   57.68ms │                      56.15ms │     no change │
│ QQuery 16    │   46.01ms │                      48.26ms │     no change │
│ QQuery 17    │  167.87ms │                     165.19ms │     no change │
│ QQuery 18    │  459.99ms │                     469.59ms │     no change │
│ QQuery 19    │   63.05ms │                      63.21ms │     no change │
│ QQuery 20    │  116.83ms │                     112.89ms │     no change │
│ QQuery 21    │  343.40ms │                     351.09ms │     no change │
│ QQuery 22    │   31.69ms │                      29.94ms │ +1.06x faster │
└──────────────┴───────────┴──────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main_base)                      │ 2724.50ms │
│ Total Time (alamb_specialized_group_keys)   │ 2718.37ms │
│ Average Time (main_base)                    │  123.84ms │
│ Average Time (alamb_specialized_group_keys) │  123.56ms │
│ Queries Faster                              │         2 │
│ Queries Slower                              │         0 │
│ Queries with No Change                      │        20 │
└─────────────────────────────────────────────┴───────────┘


TODOs

  • Write tests for storing values in the ArrowStringMap
  • Make it work for LargeStrings
  • Add fuzz test
  • Propose porting ArrowStringMap upstream to arrow-rs (could use the DictionaryArray builder)
  • Implement drain / retain + memory accounting
  • Try and ArrowStringMap generic so it works for Binary Types
  • Final Performance benchmarks

@alamb

This comment was marked as outdated.

@alamb alamb force-pushed the alamb/specialized_group_keys branch from 9224a38 to 28d9105 Compare January 27, 2024 10:37
@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Jan 27, 2024
@alamb

This comment was marked as outdated.

@alamb alamb force-pushed the alamb/specialized_group_keys branch from 7c23abb to 09b3c5f Compare February 2, 2024 22:56
@github-actions github-actions bot removed core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Feb 2, 2024
@alamb
Copy link
Contributor Author

alamb commented Feb 6, 2024

I think this is getting pretty close. I need to add some more fuzz tests and maybe try to make this code generic and work for BinaryArray but otherwise it is ready

@@ -92,403 +86,3 @@ impl<O: OffsetSizeTrait> Accumulator for StringDistinctCountAccumulator<O> {
std::mem::size_of_val(self) + self.0.size()
}
}

/// Maximum size of a string that can be inlined in the hash table
Copy link
Contributor Author

@alamb alamb Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code and tests were moved into bytes_map.rs and made general

@alamb alamb force-pushed the alamb/specialized_group_keys branch from 1009b85 to 6ee2e03 Compare February 6, 2024 13:14
@alamb alamb force-pushed the alamb/specialized_group_keys branch from 6ee2e03 to da68c45 Compare February 10, 2024 13:08
@github-actions github-actions bot added the core Core DataFusion crate label Feb 10, 2024
@alamb
Copy link
Contributor Author

alamb commented Feb 10, 2024

Update here is I have written a fuzz test (which actually caught several bugs 🐛 ). I plan to take a final pass through the code tomorrow morning and get it up for review (It is a pretty large PR so I expect it will take time to review, but it will be worth it I think)

Update:
I was thinking about this. My plan is to:

  1. Break out the fuzz test into its own PR: Add string aggregate grouping fuzz test, add MemTable::with_sort_exprs #9190
  2. Remove the streaming support (even though it works, it is complicated and I am not sure how much value it adds relative to its complexity we can do that as a follow on PR potentially) - Implement streaming for specialized group by string aggregator #9188
  3. Take one final pass

@alamb alamb force-pushed the alamb/specialized_group_keys branch from 46ba28c to 637908c Compare February 12, 2024 16:43
@github-actions github-actions bot added core Core DataFusion crate and removed core Core DataFusion crate labels Feb 12, 2024
@alamb alamb force-pushed the alamb/specialized_group_keys branch from d9fa058 to 637908c Compare February 12, 2024 21:52
@github-actions github-actions bot removed the core Core DataFusion crate label Feb 12, 2024
@alamb alamb force-pushed the alamb/specialized_group_keys branch from 637908c to dc55c1b Compare February 13, 2024 16:22
@alamb
Copy link
Contributor Author

alamb commented Feb 13, 2024

Update here is that the benchmark programs consistently shows this query slower, but I can't reproduce the difference. I am continuing to investigate

│ QQuery 28    │ 28361.56ms │                   30849.47ms │  1.09x slower │

@alamb alamb force-pushed the alamb/specialized_group_keys branch 2 times, most recently from 3de9042 to b2fe2c7 Compare February 16, 2024 12:38
@alamb alamb force-pushed the alamb/specialized_group_keys branch from b2fe2c7 to 53e7274 Compare February 16, 2024 12:40
});
// SAFETY: the offsets were constructed correctly in `insert_if_new` --
// monotonically increasing, overflows were checked.
let offsets = unsafe { OffsetBuffer::new_unchecked(ScalarBuffer::from(offsets)) };
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validating the offsets actually took significant time which took me a while to find and trackdown (without this unsafe call ClickBench Q28 slowed down by 10%)

@alamb alamb marked this pull request as ready for review February 16, 2024 13:46
@alamb
Copy link
Contributor Author

alamb commented Feb 16, 2024

FYI @thinkharderdev and @Dandandan as you have expressed interest in this type of optimization before, this PR is now ready for review

@alamb alamb added the performance Make DataFusion faster label Feb 17, 2024
Copy link
Contributor

@Dandandan Dandandan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@alamb
Copy link
Contributor Author

alamb commented Feb 19, 2024

Thank you for the review @Dandandan 🙏

@alamb alamb merged commit 8233e64 into apache:main Feb 20, 2024
23 checks passed
@alamb
Copy link
Contributor Author

alamb commented Feb 20, 2024

🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Make DataFusion faster physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve aggregate performance with specialized groups accumulator for single string group by
3 participants