Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Add new Extended ClickBench benchmark queries #8950

Merged
merged 1 commit into from
Jan 25, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
177 changes: 167 additions & 10 deletions benchmarks/queries/clickbench/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,23 +11,180 @@ ClickBench is focused on aggregation and filtering performance (though it has no
[ClickBench repository]: https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql

## "Extended" Queries
The "extended" queries are not part of the official ClickBench benchmark.
Instead they are used to test other DataFusion features that are not
covered by the standard benchmark

Each description below is for the corresponding line in `extended.sql` (line 1
is `Q0`, line 2 is `Q1`, etc.)
The "extended" queries are not part of the official ClickBench benchmark.
Instead they are used to test other DataFusion features that are not covered by
the standard benchmark Each description below is for the corresponding line in
`extended.sql` (line 1 is `Q0`, line 2 is `Q1`, etc.)

### Q0: Data Exploration

**Question**: "How many distinct searches, mobile phones, and mobile phone models are there in the dataset?"

**Important Query Properties**: multiple `COUNT DISTINCT`s, with low and high cardinality
distinct string columns.

```sql
SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel")
FROM hits;
```


### Q1: Data Exploration

**Question**: "How many distinct "hit color", "browser country" and "language" are there in the dataset?"

**Important Query Properties**: multiple `COUNT DISTINCT`s. All three are small strings (length either 1 or 2).

### Q0
Models initial Data exploration, to understand some statistics of data.
Import Query Properties: multiple `COUNT DISTINCT` on strings

```sql
SELECT
COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel")
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage")
FROM hits;
```

### Q2: Top 10 anaylsis

**Question**: "Find the top 10 "browser country" by number of distinct "social network"s,
including the distinct counts of "hit color", "browser language",
and "social action"."

**Important Query Properties**: GROUP BY short, string, multiple `COUNT DISTINCT`s. There are several small strings (length either 1 or 2).

```sql
SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction")
FROM hits
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
```


## Data Notes

Here are some interesting statistics about the data used in the queries
Max length of `"SearchPhrase"` is 1113 characters
```sql
❯ select min(length("SearchPhrase")) as "SearchPhrase_len_min", max(length("SearchPhrase")) "SearchPhrase_len_max" from 'hits.parquet' limit 10;
+----------------------+----------------------+
| SearchPhrase_len_min | SearchPhrase_len_max |
+----------------------+----------------------+
| 0 | 1113 |
+----------------------+----------------------+
```


Here is the schema of the data
```sql
❯ describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID | Int64 | NO |
| JavaEnable | Int16 | NO |
| Title | Utf8 | NO |
| GoodEvent | Int16 | NO |
| EventTime | Int64 | NO |
| EventDate | UInt16 | NO |
| CounterID | Int32 | NO |
| ClientIP | Int32 | NO |
| RegionID | Int32 | NO |
| UserID | Int64 | NO |
| CounterClass | Int16 | NO |
| OS | Int16 | NO |
| UserAgent | Int16 | NO |
| URL | Utf8 | NO |
| Referer | Utf8 | NO |
| IsRefresh | Int16 | NO |
| RefererCategoryID | Int16 | NO |
| RefererRegionID | Int32 | NO |
| URLCategoryID | Int16 | NO |
| URLRegionID | Int32 | NO |
| ResolutionWidth | Int16 | NO |
| ResolutionHeight | Int16 | NO |
| ResolutionDepth | Int16 | NO |
| FlashMajor | Int16 | NO |
| FlashMinor | Int16 | NO |
| FlashMinor2 | Utf8 | NO |
| NetMajor | Int16 | NO |
| NetMinor | Int16 | NO |
| UserAgentMajor | Int16 | NO |
| UserAgentMinor | Utf8 | NO |
| CookieEnable | Int16 | NO |
| JavascriptEnable | Int16 | NO |
| IsMobile | Int16 | NO |
| MobilePhone | Int16 | NO |
| MobilePhoneModel | Utf8 | NO |
| Params | Utf8 | NO |
| IPNetworkID | Int32 | NO |
| TraficSourceID | Int16 | NO |
| SearchEngineID | Int16 | NO |
| SearchPhrase | Utf8 | NO |
| AdvEngineID | Int16 | NO |
| IsArtifical | Int16 | NO |
| WindowClientWidth | Int16 | NO |
| WindowClientHeight | Int16 | NO |
| ClientTimeZone | Int16 | NO |
| ClientEventTime | Int64 | NO |
| SilverlightVersion1 | Int16 | NO |
| SilverlightVersion2 | Int16 | NO |
| SilverlightVersion3 | Int32 | NO |
| SilverlightVersion4 | Int16 | NO |
| PageCharset | Utf8 | NO |
| CodeVersion | Int32 | NO |
| IsLink | Int16 | NO |
| IsDownload | Int16 | NO |
| IsNotBounce | Int16 | NO |
| FUniqID | Int64 | NO |
| OriginalURL | Utf8 | NO |
| HID | Int32 | NO |
| IsOldCounter | Int16 | NO |
| IsEvent | Int16 | NO |
| IsParameter | Int16 | NO |
| DontCountHits | Int16 | NO |
| WithHash | Int16 | NO |
| HitColor | Utf8 | NO |
| LocalEventTime | Int64 | NO |
| Age | Int16 | NO |
| Sex | Int16 | NO |
| Income | Int16 | NO |
| Interests | Int16 | NO |
| Robotness | Int16 | NO |
| RemoteIP | Int32 | NO |
| WindowName | Int32 | NO |
| OpenerName | Int32 | NO |
| HistoryLength | Int16 | NO |
| BrowserLanguage | Utf8 | NO |
| BrowserCountry | Utf8 | NO |
| SocialNetwork | Utf8 | NO |
| SocialAction | Utf8 | NO |
| HTTPError | Int16 | NO |
| SendTiming | Int32 | NO |
| DNSTiming | Int32 | NO |
| ConnectTiming | Int32 | NO |
| ResponseStartTiming | Int32 | NO |
| ResponseEndTiming | Int32 | NO |
| FetchTiming | Int32 | NO |
| SocialSourceNetworkID | Int16 | NO |
| SocialSourcePage | Utf8 | NO |
| ParamPrice | Int64 | NO |
| ParamOrderID | Utf8 | NO |
| ParamCurrency | Utf8 | NO |
| ParamCurrencyID | Int16 | NO |
| OpenstatServiceName | Utf8 | NO |
| OpenstatCampaignID | Utf8 | NO |
| OpenstatAdID | Utf8 | NO |
| OpenstatSourceID | Utf8 | NO |
| UTMSource | Utf8 | NO |
| UTMMedium | Utf8 | NO |
| UTMCampaign | Utf8 | NO |
| UTMContent | Utf8 | NO |
| UTMTerm | Utf8 | NO |
| FromTag | Utf8 | NO |
| HasGCLID | Int16 | NO |
| RefererHash | Int64 | NO |
| URLHash | Int64 | NO |
| CLID | Int32 | NO |
+-----------------------+-----------+-------------+
105 rows in set. Query took 0.034 seconds.

```
4 changes: 3 additions & 1 deletion benchmarks/queries/clickbench/extended.sql
Original file line number Diff line number Diff line change
@@ -1 +1,3 @@
SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel") FROM hits;
SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel") FROM hits;
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage") FROM hits;
SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction") FROM hits GROUP BY 1 ORDER BY 2 DESC LIMIT 10;