
build error in one module aborts (successful) compilation of parallel module builds #1

Open
lspitzner opened this issue Jun 30, 2016 · 17 comments


@lspitzner

lspitzner commented Jun 30, 2016

In a setup with modules A, B, C, and D, where D depends on A, B, and C: if there is an error in C, then after a cabal clean, cabal build -wghc-shake does roughly the following:

GHC A.hs
GHC B.hs
GHC C.hs
[error]

and this repeats each time you try to recompile. That is, A and B never finish compiling, so CPU cycles are wasted on every build while the programmer works on fixing C.

@ndmitchell

In Shake, if one rule fails, all currently running rules are aborted immediately. Since failing is typically faster than succeeding, that's a common scenario. The way to avoid it is to enable staunch mode, which causes Shake to continue until there is nothing further it can do, but that does delay getting errors.

While the above setup does waste CPU cycles, it doesn't waste any programmer time until C successfully compiles, and even then it only "wastes" the time between C finishing and A finishing. I think that makes it a sensible trade-off.
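For reference, staunch mode is a one-field change in Shake's options. A minimal sketch, assuming the shake package is installed (the file names mirror the A/B/C/D example above; the rule itself is made up for illustration):

```haskell
import Development.Shake
import Development.Shake.FilePath

main :: IO ()
main = shakeArgs shakeOptions{shakeStaunch = True} $ do
    want ["A.o", "B.o", "C.o", "D.o"]
    -- With shakeStaunch = True, a failure compiling C.hs no longer
    -- aborts the in-flight builds of A.hs and B.hs; Shake keeps going
    -- until nothing further can be built, then reports the errors.
    "*.o" %> \out ->
        cmd "ghc -c" [out -<.> "hs"]
```

The default (shakeStaunch = False) gives the abort-on-first-failure behaviour described in this issue.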

@lspitzner
Author

Oh, but it does waste programmer time: failing on C alone is quicker than failing on (A|B|C), in the very basic testing I just did.

My rough results were:

  • 4.0s for a plain cabal build
  • 3.7s for cabal build with ghc-shake
  • 2.7s for invoking GHC directly on the failing module (i.e. without --make and with exactly one module, but otherwise the exact same flags as the cabal invocation)
  • ~5.9s for cabal build with ghc-shake in the presence of the above behaviour

@ezyang
Owner

ezyang commented Jun 30, 2016

To enable staunch mode with ghc-shake, pass -ffrontend-opt --keep-going (you can also edit the command-line stub).
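In the cabal workflow from the first comment, that would look something like this (the --ghc-options plumbing is my assumption; the comment above only specifies the -ffrontend-opt --keep-going pair):

```shell
# Assumed invocation: forward --keep-going to the ghc-shake frontend
# through cabal's --ghc-options (both tokens reach GHC unchanged).
cabal build -wghc-shake --ghc-options="-ffrontend-opt --keep-going"
```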

@lspitzner
Author

To be fair, my test case may be far too simple. This might add a small worst-case constant factor while greatly improving the linear factor in overall project size (or the size of the project's dependency graph).

@ndmitchell

I'm confused:

  • 3.7s for cabal build with ghc-shake
  • ~5.9s for cabal build with ghc-shake in presence of the above behaviour

Are you saying it's 3.7s if the error is absent, but 5.9s if the error is present? I would expect the time to report a build failure in C to be either T(C) (the time to compute the error), or X + T(C), where X is useful work done before starting C, and thus work that is not repeated next time.

@lspitzner
Author

I can work around the above by fixing C, compiling A, B, and C, then breaking C again. Now cabal build -wghc-shake results in

GHC C.hs
[error]

and takes 3.7s. After a clean (corresponding to the output in the first comment), the time is 5.9s.

@ndmitchell

Interesting! I don't think that should be the case... Getting some kind of trace is probably desirable, something like http://shakebuild.com/profiling#chrome-traces. However, I don't know how well ghc-shake meshes with the existing profiling, and the profiling wasn't designed for failed builds, so I don't really know what it does in those circumstances.

@ndmitchell

Looking at it visually, does C seem to start later in the slower trace, or does it seem that the time from C failing to the end is longer?

@lspitzner
Author

I would assume they started simultaneously, but I'd have to check again. I'll also look into those traces, but I have to run for now.

@ndmitchell

If they do start simultaneously, it suggests the action being performed to build a GHC file is not being interrupted properly. I just confirmed in a small test that Shake really does interrupt tasks as soon as one fails, so it's one for @ezyang to see what that particular task is doing.

ndmitchell added a commit to ndmitchell/shake that referenced this issue Jun 30, 2016
@ndmitchell

There's now a test in Shake that ensures a build with a fast failing computation and a slow succeeding computation aborts quickly.
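The behaviour that test pins down can be sketched with plain GHC threads. This is a toy model of "a fast failing rule kills a slow succeeding one", not Shake's actual implementation (all names here are mine):

```haskell
import Control.Concurrent
import Control.Exception

-- Toy model of Shake's default behaviour: when the fast "rule" fails,
-- the coordinator kills the slow, still-running "rule" rather than
-- waiting five seconds for it to finish.
simulate :: IO String
simulate = do
    result <- newEmptyMVar
    slow <- forkIO $
        (threadDelay 5000000 >> putMVar result "slow rule finished")
            `catch` \e -> putMVar result
                ("slow rule interrupted: " ++ show (e :: AsyncException))
    threadDelay 100000        -- the fast rule "fails" after 0.1s...
    killThread slow           -- ...and all in-flight rules are aborted
    takeMVar result

main :: IO ()
main = simulate >>= putStrLn   -- reports the interruption almost at once
```

Under staunch mode the coordinator would instead let the slow thread run to completion and only then report the failure.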

@lspitzner
Author

Tested; I even observed a case where the failing module came first (i.e. GHC C.hs; GHC A.hs; GHC B.hs; [error in C]). The timing discrepancy compared to a pure GHC C.hs; [error in C] was unaffected.

Using -ffrontend-opt --keep-going fixes the problem.

@ndmitchell but couldn't more work being done in parallel simply make each task take longer (if the tasks themselves use multiple cores)?
Chrome tracing does not work unless the build succeeds, so I can't get any useful information out of it.

@ndmitchell

Yes, more work could slow down each individual piece of work. It's possible, but I'd say it's pretty unlikely. Can you use some system task manager (e.g. Process Explorer) to check that your CPU load doesn't hit 100%? If you have plenty of headroom, it's unlikely to be that.

@ezyang
Owner

ezyang commented Jun 30, 2016

When I try @lspitzner's example in staunch mode, the error gets reported quite quickly. So it is unlikely that extra CPU use is causing the problem; rather, it is a combination of two things: (1) without staunch mode, Shake waits until it has killed everything before reporting the exception (that's partially my fault, as I pass --exception to Shake), and (2) for some reason, GHC is not being killed by async exceptions in a timely fashion. More investigation needed!

@ndmitchell

Shake waits until everything is killed because that's the most hygienic thing to do: otherwise you can't guarantee someone isn't going to access the database afterwards, which would no longer be valid. GHC not being killed by async exceptions is the thing to look at.

@lspitzner
Author

I do see high (a good 90%) CPU usage during the parallel A, B, C case.

For me, whenever I was in any kind of "error loop", I never observed any noticeable delay between the error message being printed and termination.

@ezyang
Owner

ezyang commented Jul 1, 2016

So the inability to kill things promptly is due to ndmitchell/shake#474, since ghc-shake is still pinned to an old, patched version of Shake which leaks masking information. I wonder whether GHC's --make is able to kill things promptly; it's hard to tell, because the output is interleaved too coarsely; see https://ghc.haskell.org/trac/ghc/ticket/12349
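The masking leak matters because a thread forked while async exceptions are masked uninterruptibly inherits that state, and then cannot be killed promptly. A toy reproduction with plain base (not ghc-shake's actual code; all names here are mine):

```haskell
import Control.Concurrent
import Control.Exception

-- Run a 0.3s "compile" in a forked thread, try to kill it after 50ms,
-- and report what the worker managed to do. `wrap` controls the
-- masking state the forked thread inherits.
runWorker :: (IO ThreadId -> IO ThreadId) -> IO String
runWorker wrap = do
    result <- newMVar "still running"
    t <- wrap $ forkIO $ do
        threadDelay 300000                  -- 0.3s of "work"
        modifyMVar_ result (const (return "ran to completion"))
    threadDelay 50000
    killThread t                            -- try to abort after 50ms
    threadDelay 400000                      -- give it time either way
    readMVar result

demo :: IO (String, String)
demo = do
    plain  <- runWorker id                   -- normal masking state:
                                             -- killThread interrupts it
    masked <- runWorker uninterruptibleMask_ -- "leaked" mask: the kill
                                             -- is deferred until the
                                             -- worker finishes
    return (plain, masked)

main :: IO ()
main = demo >>= print
```

The plain worker dies mid-delay and never updates its result, while the uninterruptibly masked worker runs to completion despite the killThread, which is exactly the "can't kill things promptly" symptom described above.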
