Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HybridParquetScan: Fix velox runtime error in hybrid scan when filter timestamp #12112

Merged

Conversation

thirtiseven
Copy link
Collaborator

Fix #12111

Timestamp is not fully supported in Hybrid Filter, so conditions with Timestamp should not pushed to Hybrid Scan in filter splitter.

This pr fixes it.

@thirtiseven thirtiseven requested a review from res-life February 12, 2025 10:27
@thirtiseven thirtiseven self-assigned this Feb 12, 2025
@thirtiseven thirtiseven changed the base branch from branch-25.04 to branch-25.02 February 12, 2025 10:28
revans2
revans2 previously approved these changes Feb 12, 2025
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that we are missing a lot of tests around predicate push down for various data types that we say we support. Can you please file a follow on issue to test predicate push down for all data types that we support?

byte, short, int, long, float, double, date, decimal_32, decimal_64, string, struct, array, ...

I also see that we support a large number of expressions.

val ansiOn = Seq(
classOf[Acos],
classOf[Acosh],
classOf[AddMonths],
classOf[Alias],
classOf[And],
classOf[ArrayAggregate],
classOf[ArrayContains],
classOf[ArrayDistinct],
classOf[ArrayExcept],
classOf[ArrayExists],
classOf[ArrayForAll],
classOf[ArrayIntersect],
classOf[ArrayJoin],
classOf[ArrayMax],
classOf[ArrayMin],
classOf[ArrayPosition],
classOf[ArrayRemove],
classOf[ArrayRepeat],
classOf[ArraySort],
classOf[ArraysZip],
classOf[Ascii],
classOf[Asin],
classOf[Asinh],
classOf[Atan],
classOf[Atan2],
classOf[Atanh],
classOf[AttributeReference],
classOf[BitLength],
classOf[BitwiseAnd],
classOf[BitwiseOr],
classOf[Cbrt],
classOf[Ceil],
classOf[Chr],
classOf[Concat],
classOf[Cos],
classOf[Cosh],
classOf[Crc32],
classOf[CreateNamedStruct],
classOf[DateAdd],
classOf[DateDiff],
classOf[DateFormatClass],
classOf[DateFromUnixDate],
classOf[DateSub],
classOf[DayOfMonth],
classOf[DayOfWeek],
classOf[DayOfYear],
classOf[ElementAt],
classOf[EqualNullSafe],
classOf[EqualTo],
classOf[Exp],
classOf[Expm1],
classOf[FindInSet],
classOf[Flatten],
classOf[Floor],
classOf[FromUTCTimestamp],
classOf[FromUnixTime],
classOf[GetJsonObject],
classOf[GetMapValue],
classOf[GreaterThan],
classOf[GreaterThanOrEqual],
classOf[Greatest],
classOf[Hex],
classOf[Hour],
classOf[If],
classOf[In],
classOf[IsNaN],
classOf[IsNotNull],
classOf[IsNull],
classOf[LastDay],
classOf[Least],
classOf[Left],
classOf[Length],
classOf[LengthOfJsonArray],
classOf[LessThan],
classOf[Levenshtein],
classOf[Like],
classOf[Literal],
classOf[Log],
classOf[Log10],
classOf[Log2],
classOf[Lower],
classOf[MapFromArrays],
classOf[MapKeys],
classOf[MapValues],
classOf[MapZipWith],
classOf[Md5],
classOf[MicrosToTimestamp],
classOf[MillisToTimestamp],
classOf[Minute],
classOf[MonotonicallyIncreasingID],
classOf[Month],
classOf[NaNvl],
classOf[NextDay],
classOf[Not],
classOf[Or],
classOf[Overlay],
classOf[Pi],
classOf[Pow],
classOf[Quarter],
classOf[Rand],
classOf[Remainder],
classOf[Reverse],
classOf[Rint],
classOf[Round],
classOf[Second],
classOf[Sha1],
classOf[Sha2],
classOf[ShiftLeft],
classOf[ShiftRight],
classOf[Shuffle],
classOf[Sin],
classOf[Size],
classOf[SortArray],
classOf[SoundEx],
classOf[SparkPartitionID],
classOf[Sqrt],
classOf[Stack],
classOf[StringInstr],
classOf[StringLPad],
classOf[StringRPad],
classOf[StringRepeat],
classOf[StringReplace],
classOf[StringToMap],
classOf[StringTrim],
classOf[StringTrimLeft],
classOf[StringTrimRight],
classOf[Substring],
classOf[SubstringIndex],
classOf[Tan],
classOf[Tanh],
classOf[ToDegrees],
classOf[ToRadians],
classOf[ToUnixTimestamp],
classOf[UnaryPositive],
classOf[Unhex],
classOf[UnixMicros],
classOf[UnixMillis],
classOf[UnixSeconds],
classOf[Upper],
classOf[Uuid],
classOf[WeekDay],
classOf[WeekOfYear],
classOf[WidthBucket],
classOf[Year],
classOf[ZipWith]
)
// Some functions are fully supported with ANSI mode off
lazy val ansiOff = Seq(
classOf[Remainder],
classOf[Multiply],
classOf[Add],
classOf[Subtract],
classOf[Divide],
classOf[Abs],
classOf[Pmod]
)

But I don't see hardly any test coverage for these expressions as a part of predicate push down.

I don't fee comfortable saying we support something with no tests for it. Could you file an issue to add in at least some basic tests for each of these expressions too? I am okay if it is the same issue as above.

@@ -227,3 +227,19 @@ def udf_fallback(s):
assert_gpu_and_cpu_are_equal_collect(
lambda spark: spark.read.parquet(data_path).filter("ascii(a) >= 50 and udf_fallback(a) = 'udf_100'"),
conf=filter_split_conf)

@allow_non_gpu(*non_utc_allow)
def test_hybrid_parquet_filter_pushdown_timestamp(spark_tmp_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be marked as a hybrid test?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch, thanks

Signed-off-by: Haoyang Li <[email protected]>
@thirtiseven
Copy link
Collaborator Author

build

@thirtiseven
Copy link
Collaborator Author

Follow-up issue filed #12124

@pxLi
Copy link
Collaborator

pxLi commented Feb 13, 2025

build

@pxLi
Copy link
Collaborator

pxLi commented Feb 13, 2025

rekick, part of the ci hung due to HW issue

@thirtiseven thirtiseven merged commit c696b7e into NVIDIA:branch-25.02 Feb 13, 2025
51 checks passed
@thirtiseven thirtiseven deleted the hybrid_timestamp_filter_quick_fix branch February 13, 2025 16:24
pxLi added a commit to pxLi/spark-rapids that referenced this pull request Feb 17, 2025
@sameerz sameerz added the bug Something isn't working label Feb 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Hit velox runtime error when filtering df with timestamp column inside when enabling hybrid
6 participants