Initial commit, moved from CakeML
xrchz committed Nov 8, 2017
1 parent 2f082c7 commit 92a875f
Showing 7 changed files with 1,391 additions and 0 deletions.
19 changes: 19 additions & 0 deletions README.md
Automated regression test infrastructure for CakeML.

[api.sml](api.sml):
Implements the server-side regression-test API as a CGI program.
The API is for workers to view and manipulate the job queues.

[design.txt](design.txt):
Notes on the design of the automated regression test infrastructure.

[poll.sml](poll.sml):
Implements automatic refreshing of the job queues.
If there are new jobs on GitHub, they will be added to the waiting queue.
If there are stale jobs, they will be removed.

[regressionLib.sml](regressionLib.sml):
Code shared between the pieces of the regression test suite.

[worker.sml](worker.sml):
Worker that claims and runs regression test jobs.
62 changes: 62 additions & 0 deletions TODO
TODO: Change (text) API endpoint to regression.cgi/api/

TODO: Write HTML interface for status output
Thus regression.cgi/{,active,waiting,stopped,etc.} will return a nicely
presented webpage (with links) for viewing the current status of the queues.

TODO: Write help/usage interface for worker
Make the program more self-documenting.

TODO: Make server send email to [email protected] when a job is stopped

TODO: Update GitHub commit status for jobs in the system
Probably this should be done by the server or poller.
When a job is claimed: mark its commit as being tested
When a job is stopped: mark its commit as passed/failed

TODO: Use GitHub webhooks or similar to avoid polling for changes
Thus the poller won't be run in polling mode, but instead will be the target
of a webhook that GitHub calls to update our queues.

TODO: Add interface for disowning jobs (1)
Currently, you can manually stop a job that is supposedly active and claimed
by a worker that has gone rogue or died. It might be nice to support this
programmatically as an option to the worker, so it can make the appropriate
API calls.

TODO: Make poller stop jobs that have been running too long

TODO: Sort the id lists returned by {active,waiting,stopped}?
Prioritise pull requests?
Pull requests could be assigned lower ids by default

TODO: Be smarter about assigning job ids?
Currently, we only assign new ids that are greater than all existing ones.
Would it be nice to fill in gaps instead? Maybe... When the system is running
well, every job will be either properly stopped or superseded, so the only
way for there to be a gap is when a waiting job gets superseded.

Gaps can be filled manually by renaming the highest waiting job into the
first gap. Indeed, that strategy could be automated (rather than assigning
job ids correctly in the first place). But assigning correctly is probably
just as easy.
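The "greater than all existing ids" policy above is implemented by next_job_id (used later in api.sml's retry). regressionLib.sml is not part of this excerpt, so the following is only a plausible sketch of it, assuming each queue is a thunk returning its id list as in api.sml:

```sml
(* Hypothetical sketch of next_job_id from regressionLib.sml (not shown in
   this diff): one more than the maximum id across all the given queues. *)
fun next_job_id (queues : (unit -> int list) list) : int =
  1 + List.foldl
        (fn (queue, acc) => List.foldl Int.max acc (queue ()))
        0 queues
```

With this definition, next_job_id [waiting,active,stopped] never re-uses an id, which is exactly why gaps can appear when a waiting job is superseded.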

TODO: Replace calls to the OS shell with SML Basis functionality
In regressionLib.sml there is a call to OS.Process.system that passes quite a
lot of work off to the shell: setting environment variables, executing a
subprocess, and redirecting standard streams. This could all be implemented
with the SML Basis Library instead, in particular using the Posix structure.
Doing so might be more reliable?
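A sketch of what that replacement might look like, using only Basis Library Posix primitives. The actual call site in regressionLib.sml is not shown in this diff, so the function name and its arguments here are assumptions:

```sml
(* Sketch only (hypothetical run_redirected, not from the repository):
   fork, redirect stdout/stderr to a log file, exec with a given
   environment, and wait -- avoiding the shell entirely. *)
fun run_redirected (prog, args, env, logfile) =
  case Posix.Process.fork () of
    NONE => (* child: send both output streams to the log, then exec *)
      let
        val fd = Posix.FileSys.creat (logfile,
                   Posix.FileSys.S.flags
                     [Posix.FileSys.S.irusr, Posix.FileSys.S.iwusr])
        val () = Posix.IO.dup2 {old = fd, new = Posix.FileSys.stdout}
        val () = Posix.IO.dup2 {old = fd, new = Posix.FileSys.stderr}
      in
        Posix.Process.exece (prog, prog :: args, env)
      end
  | SOME pid => (* parent: block until the child exits *)
      #2 (Posix.Process.waitpid (Posix.Process.W_CHILD pid, []))
```

Unlike OS.Process.system, this never consults a shell, so no quoting of arguments or environment values is needed.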

TODO: Add interface for disowning jobs (2)?
After "disowning" (and stopping) a job, another similar job will eventually
be added to the queue if necessary. But would it be nice to be able to
re-use the existing job number instead? I.e., do not stop the job after
disowning, but rather return it to the waiting queue? This would require a
new API action.

(*
server = api.sml
poller = poll.sml
worker = worker.sml
*)
162 changes: 162 additions & 0 deletions api.sml
(*
Implements the server-side regression-test API as a CGI program.
The API is for workers to view and manipulate the job queues.
*)
use "regressionLib.sml";

open regressionLib

val text_response_header = "Content-Type:text/plain\n\n"

fun text_response s =
let
val () = TextIO.output(TextIO.stdOut, text_response_header)
val () = TextIO.output(TextIO.stdOut, s)
in () end

fun cgi_die ls =
(List.app (fn s => TextIO.output(TextIO.stdOut, s))
(text_response_header::"Error:\n"::ls);
TextIO.output(TextIO.stdOut,"\n");
OS.Process.exit OS.Process.success;
raise (Fail "impossible"))

fun cgi_assert b ls = if b then () else cgi_die ls

val waiting = read_list cgi_die "waiting"
val active = read_list cgi_die "active"
val stopped = read_list cgi_die "stopped"

fun job id =
let
val f = Int.toString id
val q = queue_of_job cgi_die f
in
OS.Path.concat(q,f)
end

fun claim id name =
let
val f = Int.toString id
val old = OS.Path.concat("waiting",f)
val new = OS.Path.concat("active",f)
val () =
if OS.FileSys.access(old,[OS.FileSys.A_READ]) then
if OS.FileSys.access(new,[OS.FileSys.A_READ]) then
cgi_die ["job ",f, " is both waiting and active"]
else OS.FileSys.rename{old = old, new = new}
else cgi_die ["job ",f," is not waiting to be claimed"]
val out = TextIO.openAppend new
in
print_claimed out (name,Date.fromTimeUniv(Time.now())) before TextIO.closeOut out
end

fun append id line =
let
val f = Int.toString id
val p = OS.Path.concat("active",f)
val out = TextIO.openAppend p handle e as IO.Io _ => (cgi_die ["job ",f," is not active: cannot append"]; raise e)
in
print_log_entry out (Date.fromTimeUniv(Time.now()),line) before TextIO.closeOut out
end

fun log id data =
let
val f = Int.toString id
val p = OS.Path.concat("active",f)
val out = TextIO.openAppend p handle e as IO.Io _ => (cgi_die ["job ",f," is not active: cannot log"]; raise e)
in
TextIO.output(out,data) before TextIO.closeOut out
end

fun stop id =
let
val f = Int.toString id
val old = OS.Path.concat("active",f)
val new = OS.Path.concat("stopped",f)
val () =
if OS.FileSys.access(old,[OS.FileSys.A_READ]) then
if OS.FileSys.access(new,[OS.FileSys.A_READ]) then
cgi_die ["job ",f, " is both active and stopped"]
else OS.FileSys.rename{old = old, new = new}
else cgi_die ["job ",f," is not active: cannot stop"]
in
() (* TODO: send email *)
end

fun retry id =
let
val f = Int.toString id
val old = OS.Path.concat("stopped",f)
val () = cgi_assert (OS.FileSys.access(old,[OS.FileSys.A_READ])) ["job ",f," is not stopped: cannot retry"]
val id = next_job_id [waiting,active,stopped]
val new = OS.Path.concat("waiting",Int.toString id)
val inp = TextIO.openIn old
val out = TextIO.openOut new
fun loop last =
case TextIO.inputLine inp of NONE => cgi_die ["stopped job ",f," has invalid file format"]
| SOME line => (TextIO.output(out,line);
if last then () else
loop (String.isPrefix "HOL: " line))
val () = loop false
val () = TextIO.closeOut out
val () = TextIO.closeIn inp
in id end

datatype request_api = Get of api | Post of id * string

fun get_api () =
case (OS.Process.getEnv "PATH_INFO",
OS.Process.getEnv "REQUEST_METHOD") of
(SOME path_info, SOME "GET")
=> Option.map Get (api_from_string path_info (OS.Process.getEnv "QUERY_STRING"))
| (SOME path_info, SOME "POST")
=> (case String.tokens (equal #"/") path_info of
["log",n] =>
(Option.mapPartial
(fn len =>
Option.compose
((fn id => Post(id,TextIO.inputN(TextIO.stdIn,len))),
id_from_string) n)
(Option.composePartial(Int.fromString,OS.Process.getEnv) "CONTENT_LENGTH"))
| _ => NONE)
| _ => NONE

local
fun id_list ids = String.concatWith " " (List.map Int.toString ids)
in
fun dispatch api =
text_response (
case api of
Waiting => id_list (waiting())
| Active => id_list (active())
| Stopped => id_list (stopped())
| Job id => file_to_string (job id)
| Claim(id,name) => (claim id name; claim_response)
| Append(id,line) => (append id line; append_response)
| Stop id => (stop id; stop_response)
| Retry id => String.concat["retried as job ",Int.toString(retry id),"\n"]
) handle e => cgi_die [exnMessage e]
end

fun dispatch_log id data =
text_response (log id data; log_response)
handle e => cgi_die [exnMessage e]

fun dispatch_req (Get api) = dispatch api
| dispatch_req (Post (id,data)) = dispatch_log id data

fun main () =
let
val () = ensure_queue_dirs cgi_die
in
case get_api () of
NONE => cgi_die ["bad usage"]
| SOME req =>
let
val fd = acquire_lock ()
in
dispatch_req req before
Posix.IO.close fd
end
end
153 changes: 153 additions & 0 deletions design.txt
Notes on the design of the automated regression test infrastructure.

Server API:

waiting:
returns space-separated list of ids of waiting jobs

active:
returns space-separated list of ids of active jobs

stopped:
returns space-separated list of ids of stopped jobs

job id:
returns information on job <id>
including:
- commits (including pull request integration, if any)
- (worker) name and time started (if any)
- output so far

claim id name:
worker <name> claims job <id>.
<name> is in the query string.
returns "claimed"
fails if <id> is not currently waiting

append id line:
record <line> as additional output for job <id> with a timestamp.
<line> is in the query string.
returns "appended"
fails if <id> is not currently active

log id data:
append <data> as additional output for job <id>.
<data> is POST data.
returns "logged"
fails if <id> is not currently active

stop id:
mark job <id> as stopped
returns "stopped"
sends email with output
fails if <id> is not currently active

retry id:
create a new job with the same commits as job <id>
returns "retried as job <new id>"
fails if <id> is not currently stopped
fails if there is already an active or waiting job for the same commits

all failures return text starting with "Error:"

each job is on exactly one list: waiting, active, stopped
if a job changes list, it can only move to the right
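This "move only to the right" discipline can be pictured as a small state machine (an illustration, not code from the diff; retry creates a fresh job rather than moving a stopped one back):

```sml
(* Each job lives in exactly one queue; the only legal moves are
   waiting -> active (claim) and active -> stopped (stop). *)
datatype queue = Waiting | Active | Stopped

fun may_move (Waiting, Active) = true   (* claim id name *)
  | may_move (Active, Stopped) = true   (* stop id *)
  | may_move _ = false                  (* everything else is an error *)
```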

Poller automatic behaviours:

1. add a job to the waiting list:
- create a new job id
- the commits for the job satisfy the following:
- they are the current commits (on GitHub) for a particular target
- there are no other jobs with the same commits

2. remove a job from the waiting list:
- if it does not have the current commits (on GitHub) for any target, or
- if there are other active or waiting jobs for the same commits
(this should never happen -- not checking it currently)
(we don't count duplicate stopped jobs since
they might have been retried)
- the removed job's id number could be re-used (but is that worthwhile?)
- the race with workers trying to obtain this job is handled by the global
lock on queues. either the job is claimed before it can be removed, or
removed before it can be claimed.

3. move a job from active to stopped:
- if the time since it started is too long
- adds a note, "timed out", to the output
- does stop API actions

Targets:

CakeML:
- branch "master"
- each open, mergeable pull request
in this case, there are two commits:
- the pull request commit
- the master commit to merge into
but they move together (as captured by the state of the PR on GitHub)
HOL:
- branch "master"

---- OLD STUFF ----

CakeML Regression Test

Organised as two programs:

Queue Manager (Server)
- 1 instance, runs on cakeml.org
- Tracks commits on GitHub over the GitHub API
- Manages a queue of jobs
- Allocates jobs to workers as they are available
- Records which jobs have run or are running
- Stores all its state in the filesystem/database:
can be killed and restarted without problems.

Worker (Client)
- Runs regression test(s) as designated by the queue manager
- Can send the status (generated output) of any currently running job
- Interrupts the currently running job if requested
- May reuse a HOL build if the desired commit has already been built
- Assumes Poly/ML and git are installed
- Is almost stateless: either it's running a job, or it's ready for a job.
The state of HOL builds is kept in the filesystem.

Scheduling is based on
- "targets" identified by branch name or pull request number
in the CakeML repository.
- "snapshots" representing a specific combination of commits (in all
relevant repositories, e.g., CakeML and HOL).
- "jobs" representing a snapshot and a job state.
- "workers" representing instances of the worker program, each of
which has a worker state.
Each snapshot will correspond to some target.
A target may have many corresponding snapshots.
The possible job states are as follows:
- queued
- active (running, has a corresponding worker)
- finished
- dead (cancelled or superseded)
The possible worker states are as follows:
- testing (running)
- ready

Jobs can be queued, superseded, allocated (made active and assigned a
worker), or cancelled manually.

A job will be automatically queued if:
- The job's snapshot has the most recent commits (on GitHub) for its target, and
- No active or finished job has the same snapshot

A job will be automatically superseded if:
- Either of the conditions for queueing above is false, and
- The job was not queued manually

A job will be automatically allocated if:
- It is currently queued
- A worker is ready

The current policy for targets is defined as follows:
- each open pull request is a target
- the branch "master" is a target
- the branch "master" in HOL is assumed