Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hard timeout abort #110

Closed
izgzhen opened this issue Oct 15, 2019 · 7 comments · Fixed by #119
Closed

Hard timeout abort #110

izgzhen opened this issue Oct 15, 2019 · 7 comments · Fixed by #119
Assignees

Comments

@izgzhen
Copy link
Collaborator

izgzhen commented Oct 15, 2019

Currently, Cozy will hang at the following state for a long time:

$ time cozy examples/select-flatmap.ds -t 10
Checking call legality...
Checking invariant preservation...
Done!
Adding query q...
STARTING IMPROVEMENT JOB q
updating with 1 new solutions
update order:
  --> q
considering update 1/1...
SOLUTION FOR q AT 0:00:00.037117 [size=74]
----------------------------------------
  _var131 : Bag<{ A : Int, B : String }> = Rs
  _var132 : Bag<{ B : String, C : Int }> = Ss
  _var133 : Bag<{ B : String, C : Int }> = Qs
  _var134 : Bag<{ B : String, C : Int }> = Ws
  return FlatMap {r -> FlatMap {s -> FlatMap {q -> Map {w -> ((r).A, (s).C, (q).B, (w).C)} (Filter {w -> (((q).B == (w).B) and ((r).A == 15))} (_var134))} (_var133)} (_var132)} (_var131)
----------------------------------------
Stopping jobs
requesting stop for ImproveQueryJob[q]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Waiting on 1 jobs...
  --> ImproveQueryJob[q] [pid=26128]
Stopping SafeQueue...
Done!
Generating IR...
Inlining calls...
Generating code for extension types...
Concretization functions:

_var79 : Bag<{ A : Int, B : String }> = Rs
_var80 : Bag<{ B : String, C : Int }> = Ss
_var81 : Bag<{ B : String, C : Int }> = Qs
_var82 : Bag<{ B : String, C : Int }> = Ws

SelectFlatmap:
  type R = { A : Int, B : String }
  type S = { B : String, C : Int }
  type Q = { B : String, C : Int }
  type W = { B : String, C : Int }
  state _var79 : Bag<{ A : Int, B : String }>
  state _var80 : Bag<{ B : String, C : Int }>
  state _var81 : Bag<{ B : String, C : Int }>
  state _var82 : Bag<{ B : String, C : Int }>

  query q():
    FlatMap {r -> FlatMap {s -> FlatMap {q -> Map {w -> ((r).A, (s).C, (q).B, (w).C)} (Filter {w -> (((q).B == (w).B) and ((r).A == 15))} (_var82))} (_var81)} (_var80)} (_var79)

Number of improvements done: 1
cozy examples/select-flatmap.ds -t 10  51.85s user 0.34s system 99% cpu 52.625 total

I guess Cozy is waiting for Z3 to return -- but this might far exceed the specified timeout (10s).

We should be able to let Cozy give up early, kill the Z3 job and output immediately after timeout.

NOTE: some hints on the semantic of timeout #52

@punndcoder28
Copy link

punndcoder28 commented Oct 16, 2019

Hey! Can I work on this? Can I know where file this is being produced from?

@izgzhen
Copy link
Collaborator Author

izgzhen commented Oct 16, 2019

@punndcoder28 Sure, I've added more detailed and assigned it to you. Thanks!

@Calvin-L
Copy link
Collaborator

Unfortunately, this issue may not be easy to solve. Cozy uses the Z3 library directly, rather than invoking Z3 as an external process. The Z3 library API offers timeouts, but as far as I know it does not offer a way to interrupt a running call. Since Cozy uses Python multiprocessing you might try to kill the multiprocessing jobs, but that also won't work: killing a multiprocess job can corrupt the communication channels between the main process and the multiprocessing job, resulting in unexpected crashes.

I believe that this issue is still worth working on, since it is a huge annoyance when it happens. I hope that I have missed some easy solution.

@punndcoder28
Copy link

OK. If I run into any problems I will leave a comment here.
Thanks.

@izgzhen
Copy link
Collaborator Author

izgzhen commented Oct 16, 2019

since it is a huge annoyance when it happens

My experience is that it will happen 50% of all time...and when it does happen and I can't wait, I simply send interrupt from Ctrl-C and kill it off. So, I think the most essential thing is to save the work on disk (as specified in --save) before killing everything off after receiving an interrupt signal from the users. In this vein, we can try to add an interrupt handler and let it access the best solution state so far. It doesn't matter if we need to use terminate API and corrupt the queue, as long as we don't corrupt the existing solution in memory.

To summarize: my recommendation is to save the work when interrupted by user/timer and brutally exit ASAP.

@Calvin-L
Copy link
Collaborator

Calvin-L commented Jul 1, 2020

Modern versions of Z3 catch SIGINT:
https://github.com/Z3Prover/z3/blob/d75ce380169617379fb9dff40ed6ff62b1955dc0/src/util/scoped_ctrl_c.cpp#L47

A SIGINT signal causes Z3 to return "unknown", as it would do for a timeout.

Here's my proposal for resolving this issue:

  • When the parent process needs to stop a worker, it first sets the "stop" flag (Job.request_stop) and then sends a SIGINT to the worker process.
  • If a solver call ever returns "unknown", the worker should check the "stop" flag. If it is set, then the "unknown" result can be ignored and the worker can shut down gracefully.
  • Workers should install a SIGINT handler that initiates a graceful shutdown in case the signal is delivered when the worker is not in a Z3 call.

@Calvin-L Calvin-L assigned Calvin-L and unassigned punndcoder28 Jul 1, 2020
@Calvin-L
Copy link
Collaborator

Calvin-L commented Jul 1, 2020

In the original example, it appears the job is actually busy converting the syntax tree into a Z3 query, so additionally:

  • The solver should check the stop flag periodically while converting ASTs to Z3 queries.

(And perhaps we should look into optimizing encoding bag equality a little...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants