Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/100 0 n columns #111

Merged
merged 12 commits into from
Jan 2, 2024
Merged

Feature/100 0 n columns #111

merged 12 commits into from
Jan 2, 2024

Conversation

salamonpavel
Copy link
Collaborator

@salamonpavel salamonpavel commented Nov 3, 2023

Closes #100
Introduces support for 0-n columns (depends on individual measure).

Copy link

github-actions bot commented Nov 3, 2023

JaCoCo agent module code coverage report - spark:2 - scala 2.12.12

File Coverage [85.5%] 🍏
MeasurementBuilder.scala 100% 🍏
AtumContext.scala 97.03% 🍏
Measure.scala 89.17% 🍏
MeasuresBuilder.scala 85.51% 🍏
Measurement.scala 67.56%
Total Project Coverage 81.07% 🍏

Copy link

github-actions bot commented Nov 3, 2023

JaCoCo server module code coverage report - scala 2.12.12

There is no coverage information present for the Files changed

Total Project Coverage 12.45%

@salamonpavel salamonpavel marked this pull request as ready for review November 3, 2023 14:20
measureName: String,
resultValueType: ResultValueType.ResultValueType
) extends Measure {
private val columnExpression = countDistinct(col(controlCols.head), controlCols.tail.map(col): _*)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure if this is what we want. I thought that it's about 'distinct -> count' operation, performed on a single column. Do you remember what was used in the old Atum & what can our users require @benedeki ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the suggested code it can be performed on single column or multiple columns. If this particular measure is expected to work differently please let me know. @benedeki

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the universality of supporting multi-column uniqueness 👍

@lsulak
Copy link
Collaborator

lsulak commented Nov 8, 2023

Code is based off of feature/99-custom-measure-2.

Why not to repoint the PR then? It would be easier to do the review

@salamonpavel
Copy link
Collaborator Author

Why not to repoint the PR then? It would be easier to do the review

Good point, will change the target branch to be the feature/99-custom-measure-2.

@salamonpavel salamonpavel changed the base branch from master to feature/99-custom-measure-2 November 8, 2023 16:05
# Conflicts:
#	agent/src/main/scala/za/co/absa/atum/agent/model/Measurement.scala
#	agent/src/main/scala/za/co/absa/atum/agent/model/MeasuresBuilder.scala
Copy link
Collaborator

@lsulak lsulak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finished the review, I like the changes. There is only one tiny minor thing I found, and 1 thing regarding distinct count functionality. Otherwise LGTM and I will be happy to approve once these are addressed.

Copy link
Contributor

@benedeki benedeki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To consider:
I like class hierarchies. What you think of having 3 abstract classes

  • MeasureWithoutColumn
  • MeasureOneColumn
  • MeasureMultipleColumns

measureName: String,
resultValueType: ResultValueType.ResultValueType
) extends Measure {
private val columnExpression = countDistinct(col(controlCols.head), controlCols.tail.map(col): _*)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the universality of supporting multi-column uniqueness 👍

def apply(controlCol: String): AbsSumOfValuesOfColumn = AbsSumOfValuesOfColumn(measureName, controlCol)
}

case class SumOfHashesOfColumn private (measureName: String, controlCol: String) extends AtumMeasure {

private val columnExpression: Column = sum(crc32(col(controlCol).cast("String")))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this function also support a list of columns, concatenated.
Extra bonus is * for all columns. 😉

Copy link
Collaborator

@miroslavpojer miroslavpojer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • pulled
  • built
  • test run
  • code coverage - required merge of conflicting files, there were changes in tests
    • in current test implementation I see not covered paths

I will continue in review after merge of conflict files.

@@ -28,15 +28,15 @@ class MeasurementTest extends AnyFlatSpec with Matchers with SparkTestBase { sel

"MeasurementProvided" should "be able to be converted to MeasurementProvided object when the result is Double" in {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test speak about Double, but works with Big Decimals

Copy link
Contributor

@benedeki benedeki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok.
Just decide uf you want to add the hash function on multiple columns to add here or separate ticket+PR.

Oh and there's Miroslav's comment on one of the tests.

Comment on lines +156 to 166
private def handleAggregationResult(dataType: DataType, result: Any): String = {
val aggregatedValue = dataType match {
case _: LongType =>
if (result == null) 0 else result
case _: StringType =>
val value =
if (collected == null) new java.math.BigDecimal(0)
else collected.asInstanceOf[java.math.BigDecimal]
if (result == null) new java.math.BigDecimal(0)
else result.asInstanceOf[java.math.BigDecimal]
value.stripTrailingZeros // removes trailing zeros (2001.500000 -> 2001.5, but can introduce scientific notation (600.000 -> 6E+2)
.toPlainString // converts to normal string (6E+2 -> "600")
case _ =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, this we might want to revisit and rethink. But not part of this PR. 😉

@salamonpavel
Copy link
Collaborator Author

I will re-point this PR to master instead of feature/99-custom-measure-2.

@salamonpavel salamonpavel changed the base branch from feature/99-custom-measure-2 to master December 28, 2023 09:44
Copy link

github-actions bot commented Dec 28, 2023

JaCoCo agent module code coverage report - spark:2 - scala 2.12.18

File Coverage [89.28%] 🍏
MeasurementBuilder.scala 100% 🍏
AtumContext.scala 97.07% 🍏
Measure.scala 89.39% 🍏
MeasuresBuilder.scala 84.73% 🍏
AtumAgent.scala 84.52% 🍏
Measurement.scala 82.18% 🍏
Total Project Coverage 77.08%

Copy link

JaCoCo server module code coverage report - scala 2.12.18

There is no coverage information present for the Files changed

Total Project Coverage 0%

@salamonpavel salamonpavel merged commit 4057443 into master Jan 2, 2024
8 of 10 checks passed
@salamonpavel salamonpavel deleted the feature/100-0-n-columns branch January 2, 2024 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Agent: Refactor the code, Measures requiring 0-n columns, not exactly 1
4 participants