
Debugging is difficult without access to the Spark UI #41

Open · metasim opened this issue Mar 25, 2018 · 9 comments

metasim (Contributor) commented Mar 25, 2018

Logging this here because no one had an answer in the forums.

When Spark jobs stall or take longer than expected, it's critical to be able to look at the Spark UI for debugging and viewing logs. I've attempted to expose ports 4040 and 4041 in the session manager, but have not been able to connect.

Could you provide steps on how to connect to the Spark UI when the driver is launched via Seahorse?

metasim (Contributor, Author) commented Mar 25, 2018

This is when running in "local" mode. I've tried adding the following to the "Custom Settings" area, but to no avail:

--conf spark.ui.enabled=true
--conf spark.ui.port=4040

@jaroslaw-osmanski

By default (docker-compose + Linux):
The Spark UI for the first started workflow can be found at http://localhost:4040. The second started workflow is on port 4041.

Local Spark runs inside the sessionmanager container. The sessionmanager uses the host network driver and exposes all ports on localhost.

Do you work on a Mac? The Spark UI might not be exposed on OS X.

metasim (Contributor, Author) commented Mar 26, 2018

Yes, running on a Mac. I'll try --net host and see if that helps. However, I note that sessionmanager/docker.sbt doesn't expose any ports, so trying to map them won't work. I'm going to try expose(4040) first.
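
For reference, the expose(4040) experiment I have in mind would look roughly like the sketch below. This assumes docker.sbt builds the image via the sbt-docker Dockerfile DSL; the base image and other instructions are placeholders, not the actual contents of sessionmanager/docker.sbt.

// Sketch only: hypothetical addition to sessionmanager/docker.sbt, assuming the
// sbt-docker Dockerfile DSL is in use; everything besides expose(...) is a placeholder.
dockerfile in docker := new Dockerfile {
  from("openjdk:8-jre")   // placeholder base image
  expose(4040)            // Spark UI for the first running workflow
  expose(4041)            // Spark UI for a second concurrent workflow
  // ...existing copy/entrypoint instructions would remain unchanged...
}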

metasim (Contributor, Author) commented Mar 26, 2018

Update: I rebuilt on a Linux server, confirmed that host networking is being used, and can see that 4040 is bound, but it's binding to the local loopback adapter rather than the network adapter:

~/seahorse$ netstat -l | grep 4040
tcp6       0      0 localhost:4040          [::]:*                  LISTEN  

I've tried setting spark.driver.bindAddress and spark.driver.host in the "Custom Settings" area, but to no avail.

What I don't understand is that the proxy service properly binds to all adapters:

~/seahorse$ netstat -l | grep 33321
tcp6       0      0 [::]:33321              [::]:*                  LISTEN    

While we could use SSH tunneling or something similar to get around this, it's an extra step I'd like to avoid. Furthermore, I'd like to continue to do development on the Mac, so I'm still interested in figuring out why regular port mapping isn't working with the session manager.

metasim (Contributor, Author) commented Mar 26, 2018

When running in bridged network mode, it appears that port mapping isn't working for a similar reason; ports are being bound to the loopback adapter only:

root@6728e5480a2c:/opt/docker# netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 localhost:4040          *:*                     LISTEN     
tcp        0      0 localhost:37613         *:*                     LISTEN     
tcp        0      0 localhost:44527         *:*                     LISTEN     
tcp        0      0 localhost:36531         *:*                     LISTEN     
tcp        0      0 localhost:43155         *:*                     LISTEN     
tcp        0      0 localhost:39449         *:*                     LISTEN     
tcp        0      0 *:9082                  *:*                     LISTEN     
tcp        0      0 localhost:35549         *:*                     LISTEN     
tcp        0      0 localhost:41153         *:*                     LISTEN     
tcp        0      0 127.0.0.11:42979        *:*                     LISTEN     
udp        0      0 127.0.0.11:40953        *:*                                
Active UNIX domain sockets (only servers)
Proto RefCnt Flags       Type       State         I-Node   Path

This could also be why standalone cluster mode doesn't work from the container.

metasim (Contributor, Author) commented Mar 26, 2018

This is what the session manager looks like when in "viewer mode":

root@13ac5873c797:/opt/docker# netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 *:9082                  *:*                     LISTEN     
tcp        0      0 127.0.0.11:40769        *:*                     LISTEN     
udp        0      0 127.0.0.11:56859        *:*          

metasim (Contributor, Author) commented Mar 26, 2018

I've had a breakthrough, but it's not pretty, as I believe it's due to a bug in Seahorse.

The issue is with an interaction between CommonEnv and LocalSparkLauncher.

The heart of the issue is that CommonEnv overrides the SPARK_LOCAL_IP environment variable.

The environment variable takes precedence over all other mechanisms for specifying the network address (e.g. spark.driver.bindAddress and spark.driver.host).

While not ideal (because this environment variable is the last resort we have for overriding all other setting mechanisms), this ends up working OK in the cluster modes. In local mode, however, the userIP configuration is never set anywhere, so what ends up being passed to Spark is SPARK_LOCAL_IP= (the empty string). When the variable is set but has no value, Spark/Java appears to pick "localhost".
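
To make the interaction concrete, here's an approximate sketch with simplified, self-contained types; this is not the actual CommonEnv or ClusterDetails code, just the shape of the problem:

// Approximate sketch, not the real Seahorse code: the env map handed to the
// launcher carries SPARK_LOCAL_IP taken straight from ClusterDetails.userIP.
case class ClusterDetails(userIP: String /* , ... */)

def env(clusterConfig: ClusterDetails): Map[String, String] =
  // In local mode userIP is never populated, so the child process sees
  // SPARK_LOCAL_IP= (set but empty) and Spark/Java falls back to localhost.
  Map("SPARK_LOCAL_IP" -> clusterConfig.userIP /* , other entries */)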

I figured this out by setting SPARK_LOCAL_IP=10.255.3.2 in the environment section of the docker-compose.yml file, running env inside the container to confirm that it was set, and then looking at /proc/<spark-submit pid>/environ to see how the process sees it. In that context it showed up as SPARK_LOCAL_IP=.

I'd suggest changing this in a couple of ways: first, prefer --conf spark.x.y.z=... options over environment variables; second, don't modify the driver network handling at all when in local mode.
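
As a rough illustration of the first suggestion, here's a sketch against SparkLauncher's public setConf API; the helper name and the choice of which properties to set are illustrative, not Seahorse's actual code:

import org.apache.spark.launcher.SparkLauncher

// Sketch only: pass the address as Spark confs instead of exporting SPARK_LOCAL_IP,
// and only when an address has actually been configured.
def applyDriverAddress(launcher: SparkLauncher, userIP: String): SparkLauncher =
  if (userIP.nonEmpty)
    launcher
      .setConf("spark.driver.bindAddress", userIP)
      .setConf("spark.driver.host", userIP)
  else
    launcher // local mode: leave Spark's own defaults alone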

metasim (Contributor, Author) commented Mar 26, 2018

This is my current fix; I'm not sure if it's generally applicable. @jaroslaw-osmanski, should a separate bug be created for this issue (that the Spark UI doesn't work outside of host networking and connections via localhost)?

diff --git a/sessionmanager/src/main/scala/ai/deepsense/sessionmanager/service/sessionspawner/sparklauncher/clusters/LocalSparkLauncher.scala b/sessionmanager/src/main/scala/ai/deepsense/sessionmanager/service/sessionspawner/sparklauncher/clusters/LocalSparkLauncher.scala
index 121315c2d..409b10b85 100644
--- a/sessionmanager/src/main/scala/ai/deepsense/sessionmanager/service/sessionspawner/sparklauncher/clusters/LocalSparkLauncher.scala
+++ b/sessionmanager/src/main/scala/ai/deepsense/sessionmanager/service/sessionspawner/sparklauncher/clusters/LocalSparkLauncher.scala
@@ -32,7 +32,7 @@ private [clusters] object LocalSparkLauncher {
             config: SparkLauncherConfig,
             clusterConfig: ClusterDetails,
             args: SparkOptionsMultiMap): SparkLauncher = {
-    new SparkLauncher(env(config, clusterConfig))
+    new SparkLauncher(env(config, clusterConfig.copy(userIP = "0.0.0.0")))
       .setSparkArgs(args)
       .setVerbose(true)
       .setMainClass(config.className)
-- 

metasim (Contributor, Author) commented Mar 29, 2018

A better option than the above would likely be to make ClusterDetails.userIP an Option[String] and only set spark.driver.host if it has a value. @jaroslaw-osmanski, if this sounds like a better approach, I'll pull together a PR for it.
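
A minimal sketch of that shape (field and helper names are illustrative, not the actual Seahorse definitions):

import org.apache.spark.launcher.SparkLauncher

// Sketch only: userIP becomes optional, and spark.driver.host is set only
// when a value is actually present.
case class ClusterDetails(userIP: Option[String] = None /* , ... */)

def withDriverHost(launcher: SparkLauncher, cluster: ClusterDetails): SparkLauncher =
  cluster.userIP.fold(launcher)(ip => launcher.setConf("spark.driver.host", ip))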
