Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use parse_url kernel for HOST parsing #9845

Merged
merged 44 commits into from
Dec 13, 2023

Conversation

thirtiseven
Copy link
Collaborator

Contributes to #8963

This PR add plugin support for HOST parsing in parse_url.

Depends on JNI PR: NVIDIA/spark-rapids-jni#1569

Depends on: #9481

thirtiseven and others added 30 commits July 20, 2023 13:11
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven thirtiseven self-assigned this Nov 23, 2023
integration_tests/src/main/python/url_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/url_test.py Outdated Show resolved Hide resolved
integration_tests/src/main/python/url_test.py Outdated Show resolved Hide resolved
parts match {
case partScalar: GpuScalar =>
GpuColumnVector.from(doColumnar(urls, partScalar), dataType)
case _ =>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious in which case we will hit this exception.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columnarEvalAny will return Any, so it is possible that there is something other than GpuScalar in the batch.

An example from GpuExpressions.scala:

override def columnarEval(batch: ColumnarBatch): GpuColumnVector = {
withResourceIfAllowed(left.columnarEvalAny(batch)) { lhs =>
withResourceIfAllowed(right.columnarEvalAny(batch)) { rhs =>
(lhs, rhs) match {
case (l: GpuColumnVector, r: GpuColumnVector) =>
GpuColumnVector.from(doColumnar(l, r), dataType)
case (l: GpuScalar, r: GpuColumnVector) =>
GpuColumnVector.from(doColumnar(l, r), dataType)
case (l: GpuColumnVector, r: GpuScalar) =>
GpuColumnVector.from(doColumnar(l, r), dataType)
case (l: GpuScalar, r: GpuScalar) =>
GpuColumnVector.from(doColumnar(batch.numRows(), l, r), dataType)
case (l, r) =>
throw new UnsupportedOperationException(s"Unsupported data '($l: " +
s"${l.getClass}, $r: ${r.getClass})' for GPU binary expression.")
}
}
}
}
}

In the real world, I think this case won't be hit if we do the right thing in GpuOverrides, the exception here is more for defensive purpose.

GpuColumnVector.from(doColumnar(urlCv, partScalar, keyScalar), dataType)
case _ =>
throw new
UnsupportedOperationException(s"Cannot columnar evaluate expression: $this")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

"http://www.example.com/wpstyle/?p=364",
"https://www.example.com/foo/?bar=baz&inga=42&quux",
"http://✪df.ws/123",
// "http://userid:[email protected]:8080",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove these if we don't want.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are failing tests now, will uncomment them after the kernel fixes these cases.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be working now.

@@ -0,0 +1,198 @@
/*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems duplicated to url_test.py. We can just keep url_test.py which can be run in parallel.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is just make it easy to test for now. Will remove it before it's ready for review.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@sameerz sameerz added the feature request New feature or request label Nov 28, 2023
@thirtiseven thirtiseven marked this pull request as ready for review December 6, 2023 08:08
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven
Copy link
Collaborator Author

build

// return a null columnvector
return ColumnVector.fromStrings(null, null)
}
throw new UnsupportedOperationException(s"$this only supports partToExtract = PROTOCOL")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message seems to be vague here. We say we are supporting PROTOCOL and HOST so here needs more explanation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done.

}
} else {
// 3-arg, i.e. QUERY with key
assert(children.size == 3)
Copy link
Collaborator

@ttnghia ttnghia Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about checking and throwing IllegalArgumentException on object constructor (object GpuParseUrl#apply())?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the assert here is unlikely to fail because there is an argument checking for it in GpuOverrides.

@thirtiseven
Copy link
Collaborator Author

dependent PROTOCOL PR is merged, so this one is ready to merge too.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@revans2
Copy link
Collaborator

revans2 commented Dec 13, 2023

build

@thirtiseven thirtiseven merged commit 8d590eb into NVIDIA:branch-24.02 Dec 13, 2023
37 of 38 checks passed
@thirtiseven thirtiseven deleted the parse_url_host branch December 19, 2023 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants