[BUG] Isolation Forest java.lang.ClassCastException #2231

Open · 2 of 19 tasks
obause opened this issue Jun 7, 2024 · 0 comments

obause commented Jun 7, 2024

SynapseML version

1.0.4

System information

  • Language version: Python 3.8, Scala 2.12
  • Spark version: 3.5.0
  • Spark platform: on-premise

Describe the problem

I'm currently trying to train an Isolation Forest model.
However, when I run pipeline.fit(), the execution aborts after a few stages with an exception that gives me no clue about what is going wrong:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.BoundReference.accessor of type scala.Function2 in instance of org.apache.spark.sql.catalyst.expressions.BoundReference
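
A cast error like this, where a SerializedLambda cannot be assigned to a Catalyst BoundReference field, often points at a Spark/Scala version or classpath mismatch between the driver and the executors rather than at the IsolationForest code itself. A quick sanity check might look like the sketch below; it assumes an existing SparkSession named spark, and the parallelize/map round trip is only there to execute code on the executors:

# Sketch: compare the versions the driver sees with the pyspark version
# installed on the executors; a mismatch here would explain the cast error.
print("Driver Spark version:", spark.version)
print("Driver Scala version:",
      spark.sparkContext._jvm.scala.util.Properties.versionString())

executor_versions = (
    spark.sparkContext
    .parallelize(range(4), 4)
    .map(lambda _: __import__("pyspark").__version__)  # runs on the executors
    .distinct()
    .collect()
)
print("Executor pyspark versions:", executor_versions)

If the two sides report different versions, that mismatch would be the first thing to rule out.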

Has anyone else had similar issues?

Code to reproduce issue

I'm following the documentation examples:

import uuid

import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from synapse.ml.isolationforest import IsolationForest

# Isolation Forest parameters
contamination = 0.01
num_estimators = 1
max_samples = 1
max_features = 1.0

# MLflow experiment
artifact_path = "isolation_forest"
experiment_name = f"/opt/spark-data/iforest/isolation_forest_experiment{str(uuid.uuid1())}/"
model_name = "isolation-forest-model-v1"

# Isolation Forest model (the setter chain is wrapped in parentheses so it
# parses as a single expression)
isolationForest = (
    IsolationForest()
    .setNumEstimators(num_estimators)
    .setBootstrap(False)
    .setMaxSamples(max_samples)
    .setMaxFeatures(max_features)
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(contamination)
    .setContaminationError(0.01 * contamination)
    .setRandomSeed(1)
)

# inputCols and df_train come from my own dataset (not shown here)
mlflow.set_experiment(experiment_name)
with mlflow.start_run():
    va = VectorAssembler(inputCols=inputCols, outputCol="features")
    pipeline = Pipeline(stages=[va, isolationForest])
    model = pipeline.fit(df_train)
    mlflow.spark.log_model(
        model, artifact_path=artifact_path, registered_model_name=model_name
    )
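
For reference, a session that pulls SynapseML 1.0.4 in via Maven would typically be configured roughly like the sketch below; the coordinate is inferred from the jar name in the traceback (com.microsoft.azure_synapseml-core_2.12-1.0.4.jar), not copied from my actual cluster config:

from pyspark.sql import SparkSession

# Sketch of a session that resolves SynapseML 1.0.4 from Maven; every
# executor then has to load the same jars and the same Spark 3.5.0 build.
spark = (
    SparkSession.builder
    .appName("iforest-repro")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.4")
    .getOrCreate()
)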

Other info / logs

24/06/07 12:17:19 WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 1037) (172.20.0.6 executor 1): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.BoundReference.accessor of type scala.Function2 in instance of org.apache.spark.sql.catalyst.expressions.BoundReference
	at java.base/java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(Unknown Source)
	at java.base/java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(Unknown Source)
	at java.base/java.io.ObjectStreamClass...
...
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[28], line 5
      3 va = VectorAssembler(inputCols=inputCols, outputCol="features")
      4 pipeline = Pipeline(stages=[va, isolationForest])
----> 5 model = pipeline.fit(df_train)
      6 mlflow.spark.log_model(
      7     model, artifact_path=artifact_path, registered_model_name=model_name
      8 )

File /usr/local/lib/python3.10/site-packages/pyspark/ml/base.py:205, in Estimator.fit(self, dataset, params)
    203         return self.copy(params)._fit(dataset)
    204     else:
--> 205         return self._fit(dataset)
    206 else:
    207     raise TypeError(
    208         "Params must be either a param map or a list/tuple of param maps, "
    209         "but got %s." % type(params)
    210     )

File /usr/local/lib/python3.10/site-packages/pyspark/ml/pipeline.py:134, in Pipeline._fit(self, dataset)
    132     dataset = stage.transform(dataset)
    133 else:  # must be an Estimator
--> 134     model = stage.fit(dataset)
    135     transformers.append(model)
    136     if i < indexOfLastEstimator:

File /usr/local/lib/python3.10/site-packages/pyspark/ml/base.py:205, in Estimator.fit(self, dataset, params)
    203         return self.copy(params)._fit(dataset)
    204     else:
--> 205         return self._fit(dataset)
    206 else:
    207     raise TypeError(
    208         "Params must be either a param map or a list/tuple of param maps, "
    209         "but got %s." % type(params)
    210     )

File /tmp/spark-d3e18495-1dca-4d82-af1b-2b8ad9c97eee/userFiles-ad8359f8-030f-45fb-b8ee-0cd05a246fe7/com.microsoft.azure_synapseml-core_2.12-1.0.4.jar/synapse/ml/isolationforest/IsolationForest.py:309, in IsolationForest._fit(self, dataset)
    308 def _fit(self, dataset):
--> 309     java_model = self._fit_java(dataset)
    310     return self._create_model(java_model)

File /usr/local/lib/python3.10/site-packages/pyspark/ml/wrapper.py:378, in JavaEstimator._fit_java(self, dataset)
    375 assert self._java_obj is not None
    377 self._transfer_params_to_java()
--> 378 return self._java_obj.fit(dataset._jdf)

File /usr/local/lib/python3.10/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File /usr/local/lib/python3.10/site-packages/pyspark/errors/exceptions/captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw)
    177 def deco(*a: Any, **kw: Any) -> Any:
    178     try:
--> 179         return f(*a, **kw)
    180     except Py4JJavaError as e:
    181         converted = convert_exception(e.java_exception)

File /usr/local/lib/python3.10/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o119.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 1040) (172.20.0.7 executor 0): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.BoundReference.accessor of type scala.Function2 in instance of org.apache.spark.sql.catalyst.expressions.BoundReference
...

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
obause added the bug label Jun 7, 2024
github-actions bot added the triage label Jun 7, 2024