Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KYUUBI #6912][LINEAGE] Properly handle empty attribute set on mergeRelationColumnLineage #6911

Closed

Conversation

xglv1985
Copy link

@xglv1985 xglv1985 commented Feb 8, 2025

Why are the changes needed?

Issue reference:

#6912

How to reproduce the issue?

The changes in this PR will avoid a wrong result when generating the instance of org.apache.kyuubi.plugin.lineage.Lineage, in the certain case as follows:
step 1: create a temporary view from a file
step 2: insert into a table by selecting from the temporary view in step 1
step 3: generate the lineage when executing the insert statement in step 2
In detail, please see the UT code submission in this patch.

The issue analysis

Let's see the current code when getting the Lineage object by resolving a LogicalPlan object:
image

According to the above logic, a None org.apache.kyuubi.plugin.lineage.Lineage object will be generated due to "try-catch" self-protection, in this certain case. This None object will lead to problems in the following 2 scenes:

Unit Test Environment

In Unit Test, when the code runs here a "None.get" exception will be raised:
image

Here's the runtime exception stack:

None.get
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.kyuubi.plugin.lineage.helper.SparkSQLLineageParserHelperSuite.extractLineageWithoutExecuting(SparkSQLLineageParserHelperSuite.scala:1485)
	at org.apache.kyuubi.plugin.lineage.helper.SparkSQLLineageParserHelperSuite.$anonfun$new$83(SparkSQLLineageParserHelperSuite.scala:1465)

Production Environment

This Lineage object cannot be used in the production environment because it has a None value which lacks some necessary lineage information. The right content of the Lineage instance in the above case should be:

inputTables(List())
outputTables(List(spark_catalog.test_db.test_table_from_dir))
columnLineage(List(ColumnLineage(spark_catalog.test_db.test_table_from_dir.a0,Set()), ColumnLineage(spark_catalog.test_db.test_table_from_dir.b0,Set())))

a newly added test case(test directory to table) passed after this issue is fixed.

How to fix the issue?

Add a "Empty judgment" logic. In detail, please see the code submission in this patch.

How was this patch tested?

  1. by adding a new test case in UT code and make sure it passes
  2. by submitting a Spark application including the SQL of this case in the production environment, and make sure a right Lineage instance is generated, instead of a None object

Was this patch authored or co-authored using generative AI tooling?

No

@xglv1985 xglv1985 changed the title [WIP]Fix a spark lineage runtime exception when referencing the Lineage object [WIP]Fix a "wrong value of a Spark lineage object" issue introduced by lacking "Empty Judgment" Feb 8, 2025
@xglv1985 xglv1985 changed the title [WIP]Fix a "wrong value of a Spark lineage object" issue introduced by lacking "Empty Judgment" [WIP]Fix a "wrong value of a Spark lineage object" issue introduced by lacking "the judgment to Empty" Feb 10, 2025
@xglv1985 xglv1985 changed the title [WIP]Fix a "wrong value of a Spark lineage object" issue introduced by lacking "the judgment to Empty" [KYUUBI #6912] Fix a "wrong value of a Spark lineage object" issue introduced by lacking "the judgment to Empty" Feb 10, 2025
(acc + (attr -> x.head._2), x.tail)
val head_2 = if (x.isEmpty) AttributeSet.empty else x.head._2
val tail = if (x.isEmpty) x.empty else x.tail
(acc + (attr -> head_2), tail)
Copy link
Member

@pan3793 pan3793 Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe can be written in

relationOutput.foldLeft((ListMap[Attribute, AttributeSet](), relationColumnLineage)) {
        case ((acc, x), attr) if x.isEmpty =>
          (acc + (attr -> AttributeSet.empty), ListMap.empty)
        case ((acc, x), attr) =>
          (acc + (attr -> x.head._2), x.tail)
      }._1

TBH, the code is not much readable ...

Comment on lines 1468 to 1474
s"""
|insert overwrite table test_db.test_table_from_dir
|SELECT
| `a`,
| `b`
|FROM temp_view
|""".stripMargin)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use UPPER CASE for SQL keywords, and format the SQL in a short readable way

Suggested change
s"""
|insert overwrite table test_db.test_table_from_dir
|SELECT
| `a`,
| `b`
|FROM temp_view
|""".stripMargin)
s"""
|INSERT OVERWRITE TABLE test_db.test_table_from_dir
|SELECT `a`, `b` FROM temp_view
|""".stripMargin)

Comment on lines 1455 to 1465
s"""
| CREATE OR REPLACE TEMPORARY VIEW temp_view
|(
| `a` STRING COMMENT '',
| `b` STRING COMMENT ''
|)
|USING csv OPTIONS(
| sep='\t',
| path='${sourceFile.path}'
|);
|""".stripMargin).collect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
s"""
| CREATE OR REPLACE TEMPORARY VIEW temp_view
|(
| `a` STRING COMMENT '',
| `b` STRING COMMENT ''
|)
|USING csv OPTIONS(
| sep='\t',
| path='${sourceFile.path}'
|);
|""".stripMargin).collect()
s"""
|CREATE OR REPLACE TEMPORARY VIEW temp_view (
| `a` STRING COMMENT '',
| `b` STRING COMMENT ''
|) USING csv OPTIONS(
| sep='\t',
| path='${sourceFile.path}'
|)
|""".stripMargin).collect()

@pan3793
Copy link
Member

pan3793 commented Feb 11, 2025

thanks for fixing the issue, leave some nits, BTW, you can run dev/reformat to fix the code style.

@xglv1985
Copy link
Author

thanks for fixing the issue, leave some nits, BTW, you can run dev/reformat to fix the code style.

thanks for reviewing. I will optimize the code by your advice.

@codecov-commenter
Copy link

codecov-commenter commented Feb 11, 2025

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 0.00%. Comparing base (dd9cc0e) to head (13a7107).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
...in/lineage/helper/SparkSQLLineageParseHelper.scala 0.00% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #6911   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         688     688           
  Lines       42590   42592    +2     
  Branches     5805    5806    +1     
======================================
- Misses      42590   42592    +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…e/kyuubi/plugin/lineage/helper/SparkSQLLineageParserHelperSuite.scala
…e/kyuubi/plugin/lineage/helper/SparkSQLLineageParserHelperSuite.scala
@pan3793 pan3793 changed the title [KYUUBI #6912] Fix a "wrong value of a Spark lineage object" issue introduced by lacking "the judgment to Empty" [KYUUBI #6912][LINEAGE] Properly handle empty attribute set on mergeRelationColumnLineage Feb 13, 2025
@pan3793
Copy link
Member

pan3793 commented Feb 13, 2025

@xglv1985 Thanks for the contribution, CI failure is irrelevant, I have sent a PR(#6915) to fix that, and will merge this PR once CI is recovered.

@pan3793 pan3793 closed this in 7c110b6 Feb 14, 2025
pan3793 added a commit that referenced this pull request Feb 14, 2025
…elationColumnLineage

# Why are the changes needed?
## Issue reference:
#6912

## How to reproduce the issue?
The changes in this PR will avoid a wrong result when generating the instance of org.apache.kyuubi.plugin.lineage.Lineage, in the certain case as follows:
step 1: create a temporary view from a file
step 2: insert into a table by selecting from the temporary view in step 1
step 3: generate the lineage when executing the insert statement in step 2
In detail, please see the UT code submission in this patch.

## The issue analysis
Let's see the current code when getting the Lineage object by resolving a LogicalPlan object:
<img width="694" alt="image" src="https://github.com/user-attachments/assets/65256a0d-320d-4271-968f-59eafb74de9f" />

According to the above logic, a None org.apache.kyuubi.plugin.lineage.Lineage object will be generated due to "try-catch" self-protection, in this certain case. This None object will lead to problems in the following 2 scenes:
### Unit Test Environment
In Unit Test, when the code runs here a "None.get" exception will be raised:
<img width="682" alt="image" src="https://github.com/user-attachments/assets/102dc9bd-294f-4b1e-b1c6-01b6fee50fed" />

Here's the runtime exception stack:
```
None.get
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.kyuubi.plugin.lineage.helper.SparkSQLLineageParserHelperSuite.extractLineageWithoutExecuting(SparkSQLLineageParserHelperSuite.scala:1485)
	at org.apache.kyuubi.plugin.lineage.helper.SparkSQLLineageParserHelperSuite.$anonfun$new$83(SparkSQLLineageParserHelperSuite.scala:1465)
```
### Production Environment
This Lineage object cannot be used in the production environment because it has a None value which lacks some necessary lineage information. The right content of the Lineage instance in the above case should be:
```
inputTables(List())
outputTables(List(spark_catalog.test_db.test_table_from_dir))
columnLineage(List(ColumnLineage(spark_catalog.test_db.test_table_from_dir.a0,Set()), ColumnLineage(spark_catalog.test_db.test_table_from_dir.b0,Set())))
```

a newly added test case(test directory to table) passed after this issue is fixed.

# How to fix the issue?
Add a "Empty judgment" logic. In detail, please see the code submission in this patch.

# How was this patch tested?
1. by adding a new test case in UT code and make sure it passes
2. by submitting a Spark application including the SQL of this case in the production environment, and make sure a right Lineage instance is generated, instead of a None object

# Was this patch authored or co-authored using generative AI tooling?
No

Closes #6911 from xglv1985/fix_spark_lineage_runtime_exception.

Closes #6912

13a7107 [Cheng Pan] Update extensions/spark/kyuubi-spark-lineage/src/test/scala/org/apache/kyuubi/plugin/lineage/helper/SparkSQLLineageParserHelperSuite.scala
4e89b95 [Cheng Pan] Update extensions/spark/kyuubi-spark-lineage/src/test/scala/org/apache/kyuubi/plugin/lineage/helper/SparkSQLLineageParserHelperSuite.scala
59b350b [xglv1985] fix a runtime exception when generate column lineage tuple--more readable code
52bc028 [xglv1985] fix a runtime exception when generate column lineage tuple--spotless sytle
fea6bbc [xglv1985] fix a runtime exception when generate column lineage tuple--remove tab from UT code
9018790 [xglv1985] fix a runtime exception when generate column lineage tuple--unit test
fbb4df8 [xglv1985] fix a runtime exception when generate column lineage tuple

Lead-authored-by: xglv1985 <[email protected]>
Co-authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit 7c110b6)
Signed-off-by: Cheng Pan <[email protected]>
pan3793 added a commit that referenced this pull request Feb 14, 2025
…elationColumnLineage

# Why are the changes needed?
## Issue reference:
#6912

## How to reproduce the issue?
The changes in this PR will avoid a wrong result when generating the instance of org.apache.kyuubi.plugin.lineage.Lineage, in the certain case as follows:
step 1: create a temporary view from a file
step 2: insert into a table by selecting from the temporary view in step 1
step 3: generate the lineage when executing the insert statement in step 2
In detail, please see the UT code submission in this patch.

## The issue analysis
Let's see the current code when getting the Lineage object by resolving a LogicalPlan object:
<img width="694" alt="image" src="https://github.com/user-attachments/assets/65256a0d-320d-4271-968f-59eafb74de9f" />

According to the above logic, a None org.apache.kyuubi.plugin.lineage.Lineage object will be generated due to "try-catch" self-protection, in this certain case. This None object will lead to problems in the following 2 scenes:
### Unit Test Environment
In Unit Test, when the code runs here a "None.get" exception will be raised:
<img width="682" alt="image" src="https://github.com/user-attachments/assets/102dc9bd-294f-4b1e-b1c6-01b6fee50fed" />

Here's the runtime exception stack:
```
None.get
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.kyuubi.plugin.lineage.helper.SparkSQLLineageParserHelperSuite.extractLineageWithoutExecuting(SparkSQLLineageParserHelperSuite.scala:1485)
	at org.apache.kyuubi.plugin.lineage.helper.SparkSQLLineageParserHelperSuite.$anonfun$new$83(SparkSQLLineageParserHelperSuite.scala:1465)
```
### Production Environment
This Lineage object cannot be used in the production environment because it has a None value which lacks some necessary lineage information. The right content of the Lineage instance in the above case should be:
```
inputTables(List())
outputTables(List(spark_catalog.test_db.test_table_from_dir))
columnLineage(List(ColumnLineage(spark_catalog.test_db.test_table_from_dir.a0,Set()), ColumnLineage(spark_catalog.test_db.test_table_from_dir.b0,Set())))
```

a newly added test case(test directory to table) passed after this issue is fixed.

# How to fix the issue?
Add a "Empty judgment" logic. In detail, please see the code submission in this patch.

# How was this patch tested?
1. by adding a new test case in UT code and make sure it passes
2. by submitting a Spark application including the SQL of this case in the production environment, and make sure a right Lineage instance is generated, instead of a None object

# Was this patch authored or co-authored using generative AI tooling?
No

Closes #6911 from xglv1985/fix_spark_lineage_runtime_exception.

Closes #6912

13a7107 [Cheng Pan] Update extensions/spark/kyuubi-spark-lineage/src/test/scala/org/apache/kyuubi/plugin/lineage/helper/SparkSQLLineageParserHelperSuite.scala
4e89b95 [Cheng Pan] Update extensions/spark/kyuubi-spark-lineage/src/test/scala/org/apache/kyuubi/plugin/lineage/helper/SparkSQLLineageParserHelperSuite.scala
59b350b [xglv1985] fix a runtime exception when generate column lineage tuple--more readable code
52bc028 [xglv1985] fix a runtime exception when generate column lineage tuple--spotless sytle
fea6bbc [xglv1985] fix a runtime exception when generate column lineage tuple--remove tab from UT code
9018790 [xglv1985] fix a runtime exception when generate column lineage tuple--unit test
fbb4df8 [xglv1985] fix a runtime exception when generate column lineage tuple

Lead-authored-by: xglv1985 <[email protected]>
Co-authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit 7c110b6)
Signed-off-by: Cheng Pan <[email protected]>
@pan3793
Copy link
Member

pan3793 commented Feb 14, 2025

Thanks, merged to master/1.10.2/1.9.4

@pan3793 pan3793 added this to the v1.9.4 milestone Feb 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants