-
Notifications
You must be signed in to change notification settings - Fork 240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use parse_url kernel for PROTOCOL parsing #9481
Conversation
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
tests/src/test/scala/com/nvidia/spark/rapids/UrlFunctionsSuite.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Haoyang Li <[email protected]>
), | ||
'ParseUrl') | ||
|
||
def test_parse_url_too_many_args(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a test of Spark itself rather than the plugin? IIUC this error is going to be thrown not because the plugin throws it but during the query analysis which occurs before the plugin runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are right! I removed this case and related unnecessary error handling code.
def test_parse_url_protocol(): | ||
assert_gpu_and_cpu_are_equal_collect( | ||
lambda spark : unary_op_df(spark, url_gen).selectExpr( | ||
"a", | ||
"parse_url(a, 'PROTOCOL')" | ||
)) | ||
|
||
def test_parse_url_protocol_edge_cases(): | ||
assert_gpu_and_cpu_are_equal_collect( | ||
lambda spark : unary_op_df(spark, edge_cases_gen).selectExpr( | ||
"a", | ||
"parse_url(a, 'PROTOCOL')" | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Parameterize this test on the data generation rather than duplicate it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@allow_non_gpu('ProjectExec', 'ParseUrl') | ||
def test_parse_url_host_fallback(): | ||
assert_gpu_fallback_collect( | ||
lambda spark : unary_op_df(spark, url_gen).selectExpr( | ||
"a", | ||
"parse_url(a, 'HOST')" | ||
), | ||
'ParseUrl') | ||
|
||
@allow_non_gpu('ProjectExec', 'ParseUrl') | ||
def test_parse_url_path_fallback(): | ||
assert_gpu_fallback_collect( | ||
lambda spark : unary_op_df(spark, url_gen).selectExpr( | ||
"a", | ||
"parse_url(a, 'PATH')" | ||
), | ||
'ParseUrl') | ||
|
||
@allow_non_gpu('ProjectExec', 'ParseUrl') | ||
def test_parse_url_query_fallback(): | ||
assert_gpu_fallback_collect( | ||
lambda spark : unary_op_df(spark, url_gen).selectExpr( | ||
"a", | ||
"parse_url(a, 'QUERY')" | ||
), | ||
'ParseUrl') | ||
|
||
@allow_non_gpu('ProjectExec', 'ParseUrl') | ||
def test_parse_url_ref_fallback(): | ||
assert_gpu_fallback_collect( | ||
lambda spark : unary_op_df(spark, url_gen).selectExpr( | ||
"a", | ||
"parse_url(a, 'REF')" | ||
), | ||
'ParseUrl') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of these should be refactored to a single test parameterized on the literal rather than duplicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
ParamCheck("partToExtract", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING)), | ||
// Should really be an OptionalParam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have a withPsNote
denoting that it only supports PROTOCOL so that limitation shows up in the documentation. Otherwise the documentation implies we fully support this as long as a literal is supplied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
"http://foo.com/blah_blah_(wikipedia)_(again)", | ||
"http://www.example.com/wpstyle/?p=364", | ||
"https://www.example.com/foo/?bar=baz&inga=42&quux", | ||
# "http://✪df.ws/123", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failing cases need to be uncommented and passing before this should be merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/urlFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/urlFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/urlFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/urlFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/urlFunctions.scala
Outdated
Show resolved
Hide resolved
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/urlFunctions.scala
Outdated
Show resolved
Hide resolved
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Depends on NVIDIA/spark-rapids-jni#1554 |
Signed-off-by: Haoyang Li <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the tests can be enabled now. Question about URL vs URI, but I'm fine either way.
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Thanks all for the review! I think all comments have been addressed and it's ready to merge once jni PR is merged. |
Please retarget to 24.02. |
Verified locally that all protocol tests passed after NVIDIA/spark-rapids-jni#1554 merged, both for IT and the kaggle dataset. |
build |
Signed-off-by: Haoyang Li <[email protected]>
build |
@hyperbolic2346 cold you also take a look as your change request is blocking this from going in. @thirtiseven I think @jlowe is out on vacation so you probably should not wait for him to come back. On a side note, this is crazy fast! I ran
both on the GPU and the CPU and the GPU ended up being 452x faster than 12 CPU cores. This is a crappy test that is not real world at all, but still. That is really fast. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this, but I am not a java/scala reviewer.
I get that, but I just don't like to use my admin merge powers to push a red merge button unless I have to. |
Contributes to #8963
This PR add plugin support for PROTOCOL parsing in
parse_url
.Related JNI PR: NVIDIA/spark-rapids-jni#1502
Most of the code is reused from the old PR #8761
Performance test:
against 651191 urls' dataset, repeated 100 times.
CPU: 10752 ms
GPU: 1032 ms
It's about 10x faster