[BUG] Isolation Forest java.lang.ClassCastException #2231

Open · 2 of 19 tasks
obause opened this issue Jun 7, 2024 · 0 comments

obause commented Jun 7, 2024

SynapseML version

1.0.4

System information

  • Language version: Python 3.8, Scala 2.12
  • Spark version: 3.5.0
  • Spark platform: on-premise

Describe the problem

I'm currently trying to train an Isolation Forest model.
However, when I run pipeline.fit(), the execution aborts after a few stages with an exception that gives me no clue about what is going wrong:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.BoundReference.accessor of type scala.Function2 in instance of org.apache.spark.sql.catalyst.expressions.BoundReference
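
A cast error like this, where a SerializedLambda cannot be assigned to a Catalyst BoundReference field, often points at a Spark/Scala version or classpath mismatch between the driver and the executors rather than at the IsolationForest code itself. A quick sanity check might look like the sketch below; it assumes an existing SparkSession named spark, and the parallelize/map round trip is only there to execute code on the executors:

# Sketch: compare the versions the driver sees with the pyspark version
# installed on the executors; a mismatch here would explain the cast error.
print("Driver Spark version:", spark.version)
print("Driver Scala version:",
      spark.sparkContext._jvm.scala.util.Properties.versionString())

executor_versions = (
    spark.sparkContext
    .parallelize(range(4), 4)
    .map(lambda _: __import__("pyspark").__version__)  # runs on the executors
    .distinct()
    .collect()
)
print("Executor pyspark versions:", executor_versions)

If the two sides report different versions, that mismatch would be the first thing to rule out.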

Has anyone else had similar issues?

Code to reproduce issue

I'm following the documentation examples:

import uuid

import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from synapse.ml.isolationforest import IsolationForest

# Isolation Forest parameters
contamination = 0.01
num_estimators = 1
max_samples = 1
max_features = 1.0

# MLflow experiment
artifact_path = "isolation_forest"
experiment_name = f"/opt/spark-data/iforest/isolation_forest_experiment{str(uuid.uuid1())}/"
model_name = "isolation-forest-model-v1"

# Isolation Forest model (the setter chain is wrapped in parentheses so it
# parses as a single expression)
isolationForest = (
    IsolationForest()
    .setNumEstimators(num_estimators)
    .setBootstrap(False)
    .setMaxSamples(max_samples)
    .setMaxFeatures(max_features)
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(contamination)
    .setContaminationError(0.01 * contamination)
    .setRandomSeed(1)
)

# inputCols and df_train come from my own dataset (not shown here)
mlflow.set_experiment(experiment_name)
with mlflow.start_run():
    va = VectorAssembler(inputCols=inputCols, outputCol="features")
    pipeline = Pipeline(stages=[va, isolationForest])
    model = pipeline.fit(df_train)
    mlflow.spark.log_model(
        model, artifact_path=artifact_path, registered_model_name=model_name
    )
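
For reference, a session that pulls SynapseML 1.0.4 in via Maven would typically be configured roughly like the sketch below; the coordinate is inferred from the jar name in the traceback (com.microsoft.azure_synapseml-core_2.12-1.0.4.jar), not copied from my actual cluster config:

from pyspark.sql import SparkSession

# Sketch of a session that resolves SynapseML 1.0.4 from Maven; every
# executor then has to load the same jars and the same Spark 3.5.0 build.
spark = (
    SparkSession.builder
    .appName("iforest-repro")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.4")
    .getOrCreate()
)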

Other info / logs

24/06/07 12:17:19 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 1037) (172.20.0.6 executor 1): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.BoundReference.accessor of type scala.Function2 in instance of org.apache.spark.sql.catalyst.expressions.BoundReference
	at java.base/java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(Unknown Source)
	at java.base/java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(Unknown Source)
	at java.base/java.io.ObjectStreamClass...
...
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[28], line 5
      3 va = VectorAssembler(inputCols=inputCols, outputCol="features")
      4 pipeline = Pipeline(stages=[va, isolationForest])
----> 5 model = pipeline.fit(df_train)
      6 mlflow.spark.log_model(
      7     model, artifact_path=artifact_path, registered_model_name=model_name
      8 )

File /usr/local/lib/python3.10/site-packages/pyspark/ml/base.py:205, in Estimator.fit(self, dataset, params)
    203         return self.copy(params)._fit(dataset)
    204     else:
--> 205         return self._fit(dataset)
    206 else:
    207     raise TypeError(
    208         "Params must be either a param map or a list/tuple of param maps, "
    209         "but got %s." % type(params)
    210     )

File /usr/local/lib/python3.10/site-packages/pyspark/ml/pipeline.py:134, in Pipeline._fit(self, dataset)
    132     dataset = stage.transform(dataset)
    133 else:  # must be an Estimator
--> 134     model = stage.fit(dataset)
    135     transformers.append(model)
    136     if i < indexOfLastEstimator:

File /usr/local/lib/python3.10/site-packages/pyspark/ml/base.py:205, in Estimator.fit(self, dataset, params)
    203         return self.copy(params)._fit(dataset)
    204     else:
--> 205         return self._fit(dataset)
    206 else:
    207     raise TypeError(
    208         "Params must be either a param map or a list/tuple of param maps, "
    209         "but got %s." % type(params)
    210     )

File /tmp/spark-d3e18495-1dca-4d82-af1b-2b8ad9c97eee/userFiles-ad8359f8-030f-45fb-b8ee-0cd05a246fe7/com.microsoft.azure_synapseml-core_2.12-1.0.4.jar/synapse/ml/isolationforest/IsolationForest.py:309, in IsolationForest._fit(self, dataset)
    308 def _fit(self, dataset):
--> 309     java_model = self._fit_java(dataset)
    310     return self._create_model(java_model)

File /usr/local/lib/python3.10/site-packages/pyspark/ml/wrapper.py:378, in JavaEstimator._fit_java(self, dataset)
    375 assert self._java_obj is not None
    377 self._transfer_params_to_java()
--> 378 return self._java_obj.fit(dataset._jdf)

File /usr/local/lib/python3.10/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File /usr/local/lib/python3.10/site-packages/pyspark/errors/exceptions/captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw)
    177 def deco(*a: Any, **kw: Any) -> Any:
    178     try:
--> 179         return f(*a, **kw)
    180     except Py4JJavaError as e:
    181         converted = convert_exception(e.java_exception)

File /usr/local/lib/python3.10/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o119.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 1040) (172.20.0.7 executor 0): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.BoundReference.accessor of type scala.Function2 in instance of org.apache.spark.sql.catalyst.expressions.BoundReference
...

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
obause added the bug label Jun 7, 2024
github-actions bot added the triage label Jun 7, 2024