[BUG] Unable to serialize Histogram with binningUdf when using them with useRepository #500

Open
psyking841 opened this issue Aug 17, 2023 · 0 comments
Labels
bug Something isn't working

psyking841 commented Aug 17, 2023

Describe the bug
When using the Histogram analyzer with a binning UDF together with the useRepository API, the run fails with the error "Unable to serialize Histogram with binningUdf!".

Stacktrace:

An error was encountered:
java.lang.IllegalArgumentException: Unable to serialize Histogram with binningUdf!
  at com.amazon.deequ.repository.AnalyzerSerializer$.serialize(AnalysisResultSerde.scala:314)
  at com.amazon.deequ.repository.AnalyzerSerializer$.serialize(AnalysisResultSerde.scala:221)
  at com.google.gson.internal.bind.TreeTypeAdapter.write(TreeTypeAdapter.java:81)
  at com.google.gson.Gson.toJson(Gson.java:704)
  at com.google.gson.Gson.toJsonTree(Gson.java:597)
  at com.google.gson.internal.bind.TreeTypeAdapter$GsonContextImpl.serialize(TreeTypeAdapter.java:158)
  at com.amazon.deequ.repository.AnalyzerContextSerializer$.$anonfun$serialize$2(AnalysisResultSerde.scala:182)
  at com.amazon.deequ.repository.AnalyzerContextSerializer$.$anonfun$serialize$2$adapted(AnalysisResultSerde.scala:179)
  at scala.collection.immutable.Map$Map2.foreach(Map.scala:273)
  at com.amazon.deequ.repository.AnalyzerContextSerializer$.serialize(AnalysisResultSerde.scala:179)
  at com.amazon.deequ.repository.AnalyzerContextSerializer$.serialize(AnalysisResultSerde.scala:170)
  at com.google.gson.internal.bind.TreeTypeAdapter.write(TreeTypeAdapter.java:81)
  at com.google.gson.Gson.toJson(Gson.java:704)
  at com.google.gson.Gson.toJsonTree(Gson.java:597)
  at com.google.gson.internal.bind.TreeTypeAdapter$GsonContextImpl.serialize(TreeTypeAdapter.java:158)
  at com.amazon.deequ.repository.AnalysisResultSerializer$.serialize(AnalysisResultSerde.scala:149)
  at com.amazon.deequ.repository.AnalysisResultSerializer$.serialize(AnalysisResultSerde.scala:139)
  at com.google.gson.internal.bind.TreeTypeAdapter.write(TreeTypeAdapter.java:81)
  at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:69)
  at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:97)
  at com.google.gson.internal.bind.CollectionTypeAdapterFactory$Adapter.write(CollectionTypeAdapterFactory.java:61)
  at com.google.gson.Gson.toJson(Gson.java:704)
  at com.google.gson.Gson.toJson(Gson.java:683)
  at com.google.gson.Gson.toJson(Gson.java:638)
  at com.amazon.deequ.repository.AnalysisResultSerde$.serialize(AnalysisResultSerde.scala:90)
  at com.amazon.deequ.repository.fs.FileSystemMetricsRepository.save(FileSystemMetricsRepository.scala:57)
  at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$saveOrAppendResultsIfNecessary$2(AnalysisRunner.scala:233)
  at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$saveOrAppendResultsIfNecessary$2$adapted(AnalysisRunner.scala:225)
  at scala.Option.foreach(Option.scala:407)
  at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$saveOrAppendResultsIfNecessary$1(AnalysisRunner.scala:225)
  at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$saveOrAppendResultsIfNecessary$1$adapted(AnalysisRunner.scala:224)
  at scala.Option.foreach(Option.scala:407)
  at com.amazon.deequ.analyzers.runners.AnalysisRunner$.saveOrAppendResultsIfNecessary(AnalysisRunner.scala:224)
  at com.amazon.deequ.analyzers.runners.AnalysisRunner$.doAnalysisRun(AnalysisRunner.scala:204)
  at com.amazon.deequ.analyzers.runners.AnalysisRunBuilder.run(AnalysisRunBuilder.scala:110)
  ... 61 elided

To Reproduce
Steps to reproduce the behavior:
Run the code below with any df that requires a binning UDF.

I ran this in a Jupyter notebook, with each code block below in its own notebook cell.

val analysisResult: AnalyzerContext = (AnalysisRunner
          .onData(df)
          .addAnalyzer(Size())
          .addAnalyzer(Histogram("score", Some(scoreBinningUdf)))
          .useRepository(FileSystemMetricsRepository(spark, "s3://path/to/metrics/file"))
          .saveOrAppendResult(resultKey)
          .run())
val analysisResults = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResults.show(100, truncate = false)

Here is my UDF (it has to be defined before the block above is run); you should be able to use it as is:

val scoreBinningUdf = udf((score: Double) => {
    if (score < 0.10) {
        "lower"
    } else if (score > 0.90) {
        "upper"
    } else {
        "mid"
    }
})
Running the analysis block then fails with the error shown in the stack trace above.

Expected behavior
The code above should run without error and save the metrics to the repository.


Additional context
There are two ways to make the code above work:

  1. Remove the following two lines:
.useRepository(FileSystemMetricsRepository(spark, "s3://path/to/metrics/file"))
.saveOrAppendResult(resultKey)

Or

  2. Remove the UDF from the Histogram analyzer, i.e. apply the UDF to the df to create a new column before running the analyzer, and run Histogram on that new column.

Therefore, I think the problem is an incompatibility between Histogram with a UDF and useRepository.
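
For anyone hitting the same error, the second workaround can be sketched roughly as follows. This is a minimal sketch, not a fix for the underlying issue: it assumes `spark` and `df` (with a Double column named "score") already exist in the notebook, and the names `binScore`, `binnedDf`, and "score_bin", as well as the `ResultKey` tags, are made up for illustration.

```scala
import com.amazon.deequ.analyzers.{Histogram, Size}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.repository.ResultKey
import com.amazon.deequ.repository.fs.FileSystemMetricsRepository
import org.apache.spark.sql.functions.{col, udf}

// Same binning logic as the UDF in this issue, kept as a plain function
// (hypothetical name) so it can be checked without Spark.
def binScore(score: Double): String =
  if (score < 0.10) "lower"
  else if (score > 0.90) "upper"
  else "mid"

val scoreBinningUdf = udf(binScore _)

// Assumption: `df` has a Double column "score". The bins are materialized as
// a regular column ("score_bin" is an illustrative name), so the Histogram
// analyzer no longer carries a UDF that the repository would have to serialize.
val binnedDf = df.withColumn("score_bin", scoreBinningUdf(col("score")))

// Assumption: any ResultKey works here; the tag values are placeholders.
val resultKey = ResultKey(System.currentTimeMillis(), Map("source" -> "example"))

val analysisResult: AnalyzerContext = AnalysisRunner
  .onData(binnedDf)
  .addAnalyzer(Size())
  .addAnalyzer(Histogram("score_bin")) // no binningUdf passed
  .useRepository(FileSystemMetricsRepository(spark, "s3://path/to/metrics/file"))
  .saveOrAppendResult(resultKey)
  .run()

AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show(100, truncate = false)
```

One side effect to be aware of: the histogram metrics are then reported for the column "score_bin" rather than "score", so anything that later reads the repository has to look up the new column name.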

psyking841 added the bug label on Aug 17, 2023.