Use parse_url kernel for PROTOCOL parsing #9481

Merged on Dec 12, 2023 (35 commits)
Changes from 20 commits
8235c95
WIP: Support parse_url
thirtiseven Jul 20, 2023
9f17539
Merge branch 'NVIDIA:branch-23.08' into prase_url
thirtiseven Jul 20, 2023
729fe35
fix build failures
thirtiseven Jul 20, 2023
85c3284
regex refactor
thirtiseven Aug 3, 2023
6819214
Merge branch 'NVIDIA:branch-23.08' into prase_url
thirtiseven Aug 3, 2023
4166362
Separate regexes and UTF-8 special characters support
thirtiseven Aug 3, 2023
43acceb
hostname validation
thirtiseven Aug 3, 2023
64d8373
hostname validation
thirtiseven Aug 3, 2023
e6a45d3
ipv4 validation
thirtiseven Aug 4, 2023
8c4dc7a
verify
thirtiseven Aug 4, 2023
fee5a3d
wip ipv6 and SPARK-44500
thirtiseven Aug 4, 2023
e81d8a3
optional protocol and ref validation
thirtiseven Aug 7, 2023
93a9342
IPV6 VALIDATION
thirtiseven Aug 8, 2023
1ad665f
clean up
thirtiseven Aug 8, 2023
3edb929
Fix ipv6 validation, it is still wip
thirtiseven Aug 9, 2023
daa61ea
Fix ipv6 validation and some clean up
thirtiseven Aug 9, 2023
70a5d88
Merge branch 'prase_url' into parse_url_protocol
thirtiseven Oct 19, 2023
b3abaf6
Use parse_url kernel for PROTOCOL parsing
thirtiseven Oct 19, 2023
592c642
verify
thirtiseven Oct 19, 2023
9db1b2a
edit compatibility and update IT
thirtiseven Oct 19, 2023
d09f06d
update integration tests
thirtiseven Oct 20, 2023
3b71c4d
address comments
thirtiseven Oct 24, 2023
46527f3
remove unnecessary error handling
thirtiseven Oct 24, 2023
6161fa4
clean up
thirtiseven Oct 24, 2023
e16fe1e
Merge branch 'parse_url_protocol' of https://github.com/thirtiseven/s…
thirtiseven Nov 16, 2023
8e7ed44
Merge branch 'thirtiseven-parse_url_protocol' into parse_url_protocol
thirtiseven Nov 16, 2023
f93b944
Merge branch 'NVIDIA:branch-23.12' into parse_url_protocol
thirtiseven Nov 16, 2023
8f4990c
Revert scala tests temporarily for easier testing
thirtiseven Nov 16, 2023
3376376
Fix two nits
thirtiseven Nov 16, 2023
4e98888
Updated results
thirtiseven Nov 22, 2023
6d916c4
clean up
thirtiseven Nov 22, 2023
1b36090
rename urlFunctions to GpuParseUrl
thirtiseven Nov 28, 2023
7eca922
Merge branch 'branch-23.12' into parse_url_protocol
thirtiseven Dec 1, 2023
e4fdf13
Merge branch 'NVIDIA:branch-24.02' into parse_url_protocol
thirtiseven Dec 4, 2023
3ace124
verify
thirtiseven Dec 6, 2023
1 change: 1 addition & 0 deletions docs/additional-functionality/advanced_configs.md
@@ -300,6 +300,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.NthValue"></a>spark.rapids.sql.expression.NthValue|`nth_value`|nth window operator|true|None|
<a name="sql.expression.OctetLength"></a>spark.rapids.sql.expression.OctetLength|`octet_length`|The byte length of string data|true|None|
<a name="sql.expression.Or"></a>spark.rapids.sql.expression.Or|`or`|Logical OR|true|None|
<a name="sql.expression.ParseUrl"></a>spark.rapids.sql.expression.ParseUrl|`parse_url`|Extracts a part from a URL|true|None|
hyperbolic2346 marked this conversation as resolved.
<a name="sql.expression.PercentRank"></a>spark.rapids.sql.expression.PercentRank|`percent_rank`|Window function that returns the percent rank value within the aggregation window|true|None|
<a name="sql.expression.Pmod"></a>spark.rapids.sql.expression.Pmod|`pmod`|Pmod|true|None|
<a name="sql.expression.PosExplode"></a>spark.rapids.sql.expression.PosExplode|`posexplode_outer`, `posexplode`|Given an input array produces a sequence of rows for each value in the array|true|None|
11 changes: 11 additions & 0 deletions docs/compatibility.md
@@ -451,6 +451,17 @@ Spark stores timestamps internally relative to the JVM time zone. Converting an
between time zones is not currently supported on the GPU. Therefore operations involving timestamps
will only be GPU-accelerated if the time zone used by the JVM is UTC.

## URL parsing

`parse_url` can produce different results on the GPU compared to the CPU.

Known issues for PROTOCOL parsing:
- If a URL contains UTF-8 special characters, the GPU returns null for PROTOCOL.
- If a URL contains an IPv6 host, the GPU returns null for PROTOCOL.
- The GPU still tries to parse the PROTOCOL instead of returning null for some invalid edge cases,
such as URLs containing multiple '#' in REF (http://##) or an empty authority component followed by
an empty path (http://).
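
The known issues above can be turned into a rough pre-check on input data. The following is a minimal sketch, not part of spark-rapids: the helper name `protocol_divergence_reasons` and the reason labels are hypothetical, "UTF-8 special characters" is assumed to mean any non-ASCII character, and the IPv6 check only looks for a bracketed host literal after the scheme. It does not reproduce either the CPU or the GPU parser; it only flags URLs that match one of the listed patterns.

```python
import re

# Hypothetical helper for illustration only; not part of spark-rapids.
# Assumption: a URL scheme is a letter followed by letters, digits, '+', '.', '-'.
_SCHEME = r"[A-Za-z][A-Za-z0-9+.\-]*"

# Assumption: an IPv6 host appears as a bracketed literal right after
# "scheme://" and an optional "userinfo@" prefix.
_IPV6_HOST = re.compile(r"^" + _SCHEME + r"://(?:[^/?#@]*@)?\[")

# Scheme followed by an empty authority and an empty path, e.g. "http://".
_EMPTY_AUTHORITY = re.compile(_SCHEME + r"://")


def protocol_divergence_reasons(url):
    """Return labels for the known PROTOCOL issues that this URL matches."""
    reasons = []
    # Assumption: "UTF-8 special characters" means any non-ASCII character.
    if any(ord(ch) > 127 for ch in url):
        reasons.append("non-ascii")
    if _IPV6_HOST.match(url):
        reasons.append("ipv6-host")
    # More than one '#' in the URL, as in "http://##".
    if url.count("#") > 1:
        reasons.append("multiple-hash")
    if _EMPTY_AUTHORITY.fullmatch(url):
        reasons.append("empty-authority-and-path")
    return reasons


if __name__ == "__main__":
    for sample in ["http://spark.apache.org/path?query=1", "http://[::1]/a",
                   "http://##", "http://"]:
        print(sample, protocol_divergence_reasons(sample))
```

If a dataset has many flagged URLs, one option may be to keep `parse_url` on the CPU by setting `spark.rapids.sql.expression.ParseUrl` (listed in advanced_configs.md with a default of true) to false. Whether that trade-off is worthwhile depends on the workload.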
hyperbolic2346 marked this conversation as resolved.

## Windowing

### Window Functions