Possible deadlock downstream from WorkflowNodeExecutorActor #44

Open
metasim opened this issue Mar 27, 2018 · 1 comment
Comments

metasim commented Mar 27, 2018

Logging this in case it helps someone else. Cause not yet known...

We were running "K-Means" through "Fit + Transform" (in local mode), and when the "fit" stage finished, the Spark job would hang, and the workflow would never progress. Clicking "Abort" wouldn't kill the job either, but "Stop editing" did.
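
For reference, the fit stage of that node appears (from the trace below) to boil down to a plain Spark ML KMeans fit. Here's a minimal sketch of the roughly equivalent call; the feature column name and k are assumptions for illustration, not the node's actual settings:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame

// Rough, illustrative equivalent of the "Fit + Transform" node running K-Means.
// The feature column and k here are assumptions, not the node's real settings.
def fitPlusTransform(df: DataFrame): DataFrame = {
  val kmeans = new KMeans()
    .setK(2)                    // illustrative value
    .setFeaturesCol("features") // assumed feature column name
  val model = kmeans.fit(df)    // the fit call that hangs in the trace below
  model.transform(df)
}
```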

Through the Spark UI we were able to do a thread dump and get the backtrace (below) of the thread that was blocked.

Unfortunately, we haven't figured out the direct cause, but when we added the following to the "Custom settings", the job was able to complete.

--conf spark.ui.showConsoleProgress=false
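
For anyone configuring the session outside of Seahorse's "Custom settings", the same workaround can be applied programmatically. A sketch, assuming you build the SparkSession yourself and run in local mode as we do:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Programmatic equivalent of the --conf workaround above: disable the
// ConsoleProgressBar before the SparkContext is created.
val conf = new SparkConf()
  .set("spark.ui.showConsoleProgress", "false")

val spark = SparkSession.builder()
  .master("local[*]")  // we hit the hang in local mode
  .config(conf)
  .getOrCreate()
```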

I do wonder if this is a red herring (an accidental fix that just delays the real cause), and I could imagine this being related to #43 if the OutputInterceptorFactory isn't doing its job of keeping the output buffers cleared/processed.
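
To make that hunch concrete, here is a tiny, purely hypothetical sketch (not Seahorse code) of how output that nobody drains can wedge a writer: stderr is redirected into a pipe that is never read, so writes block once the pipe's internal buffer fills. If the thread holding the ConsoleProgressBar lock were stuck in a write like this, finishAll() in the trace below would wait on that lock forever.

```scala
import java.io.{PipedInputStream, PipedOutputStream, PrintStream}

// Hypothetical illustration only: stderr is redirected into a pipe that
// nothing ever reads, so once the pipe's internal buffer (1 KB by default)
// fills, every further write blocks indefinitely.
object UndrainedOutputDemo extends App {
  val sink   = new PipedOutputStream()
  val unread = new PipedInputStream(sink) // never read from, so the buffer fills
  System.setErr(new PrintStream(sink))

  var i = 0
  while (true) {
    System.err.println(s"progress update $i") // blocks after roughly 1 KB
    i += 1
  }
}
```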

Blocked Thread

131    default-node-executor-dispatcher-16    BLOCKED    
Blocked by Thread Some(35) Lock(org.apache.spark.ui.ConsoleProgressBar@302054359})
org.apache.spark.ui.ConsoleProgressBar.finishAll(ConsoleProgressBar.scala:122)
org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1926)
org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1926)
scala.Option.foreach(Option.scala:257)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1926)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1965)
org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
org.apache.spark.rdd.RDD.collect(RDD.scala:935)
org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:381)
org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:256)
org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:227)
org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:319)
org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:253)
ai.deepsense.deeplang.doperables.serialization.SerializableSparkEstimator.fitDF(SerializableSparkEstimator.scala:35)
ai.deepsense.deeplang.doperables.serialization.SerializableSparkEstimator.fitDF(SerializableSparkEstimator.scala:28)
ai.deepsense.sparkutils.ML$Estimator.fit(SparkUtils.scala:89)
org.apache.spark.ml.Estimator.fit(Estimator.scala:61)
ai.deepsense.deeplang.doperables.SparkEstimatorWrapper._fit(SparkEstimatorWrapper.scala:57)
ai.deepsense.deeplang.doperables.SparkEstimatorWrapper._fit(SparkEstimatorWrapper.scala:39)
ai.deepsense.deeplang.doperables.Estimator$$anon$1.apply(Estimator.scala:56)
ai.deepsense.deeplang.doperables.Estimator$$anon$1.apply(Estimator.scala:54)
ai.deepsense.deeplang.doperations.FitPlusTransform.execute(FitPlusTransform.scala:65)
ai.deepsense.deeplang.doperations.FitPlusTransform.execute(FitPlusTransform.scala:35)
ai.deepsense.deeplang.DOperation2To2.executeUntyped(DOperations.scala:483)
ai.deepsense.workflowexecutor.WorkflowNodeExecutorActor.executeOperation(WorkflowNodeExecutorActor.scala:126)
ai.deepsense.workflowexecutor.WorkflowNodeExecutorActor$$anonfun$receive$1$$anonfun$applyOrElse$1.apply$mcV$sp(WorkflowNodeExecutorActor.scala:55)
ai.deepsense.workflowexecutor.WorkflowNodeExecutorActor$$anonfun$receive$1$$anonfun$applyOrElse$1.apply(WorkflowNodeExecutorActor.scala:54)
ai.deepsense.workflowexecutor.WorkflowNodeExecutorActor$$anonfun$receive$1$$anonfun$applyOrElse$1.apply(WorkflowNodeExecutorActor.scala:54)
ai.deepsense.workflowexecutor.WorkflowNodeExecutorActor.ai$deepsense$workflowexecutor$WorkflowNodeExecutorActor$$asSparkJobGroup(WorkflowNodeExecutorActor.scala:145)
ai.deepsense.workflowexecutor.WorkflowNodeExecutorActor$$anonfun$receive$1.applyOrElse(WorkflowNodeExecutorActor.scala:54)
akka.actor.Actor$class.aroundReceive(Actor.scala:484)
ai.deepsense.workflowexecutor.WorkflowNodeExecutorActor.aroundReceive(WorkflowNodeExecutorActor.scala:36)
akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
akka.actor.ActorCell.invoke(ActorCell.scala:495)
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
akka.dispatch.Mailbox.run(Mailbox.scala:224)
akka.dispatch.Mailbox.exec(Mailbox.scala:234)
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

metasim commented May 21, 2018

@jaroslaw-osmanski Do you have any assessment of this? We keep having problems with the system becoming unstable when "Abort" is used, and I'm wondering if this ticket could be the cause. Here's a rough reproduction report from one of our data scientists (we can't submit the full thing here).


I was running df.count() and df.show() in a Python notebook and Seahorse seems to just hang. If I look at the Spark Resource Manager, the job shows as complete. This also happens when I try to run a k-means node twice in a row. The first time will run fine; the second time will hang.

Here's an example workflow where this happens with k-means, and steps to reproduce.

  1. Create a k-means workflow with default settings and a 1,000,000+ row input DataFrame, and direct the results to a Python Notebook
  2. Start editing workflow on yarn cluster.
  3. Select Python Notebook node and click run. This should take about 2-3 minutes.
  4. Select K-Means node and change k to 10.
  5. Select Python Notebook node and click run. This should take FOREVER.
  6. Wait 5 minutes and check Spark Resource Manager. This should say all tasks are complete, but Seahorse is showing the workflow is still running.

cc: @courtney-whalen
