Initial commit, moved from CakeML
xrchz committed Nov 8, 2017
1 parent 2f082c7 commit 92a875f
Showing 7 changed files with 1,391 additions and 0 deletions.
19 changes: 19 additions & 0 deletions README.md
Automated regression test infrastructure for CakeML.

[api.sml](api.sml):
Implements the server-side regression-test API as a CGI program.
The API is for workers to view and manipulate the job queues.

[design.txt](design.txt):
Notes on the design of the automated regression test infrastructure.

[poll.sml](poll.sml):
Implements automatic refreshing of the job queues.
If there are new jobs on GitHub, they will be added to the waiting queue.
If there are stale jobs, they will be removed.

[regressionLib.sml](regressionLib.sml):
Code shared between the pieces of the regression test suite.

[worker.sml](worker.sml):
Worker that claims and runs regression test jobs.
62 changes: 62 additions & 0 deletions TODO
TODO: Change (text) API endpoint to regression.cgi/api/

TODO: Write HTML interface for status output
Thus regression.cgi/{,active,waiting,stopped,etc.} will return a nicely
presented webpage (with links) for viewing the current status of the queues.

TODO: Write help/usage interface for worker
Make the program more self-documenting.

TODO: Make server send email to [email protected] when a job is stopped

TODO: Update GitHub commit status for jobs in the system
Probably this should be done by the server or poller.
When a job is claimed: mark its commit as being tested
When a job is stopped: mark its commit as passed/failed

TODO: Use GitHub webhooks or similar to avoid polling for changes
Thus the poller won't be run in polling mode, but instead will be the target
of a webhook that GitHub calls to update our queues.

TODO: Add interface for disowning jobs (1)
Currently, you can manually stop a job that is supposedly active and claimed
by a worker that has gone rogue or died. It might be nice to support this
programmatically as an option to the worker, so it can make the appropriate
API calls.

TODO: Make poller stop jobs that have been running too long

TODO: Sort the id lists returned by {active,waiting,stopped}?
Prioritise pull requests?
Pull requests could be assigned lower ids by default

TODO: Be smarter about assigning job ids?
Currently, we only assign new ids that are greater than all existing ones.
Would it be nice to fill in gaps instead? Maybe... When the system is running
well, every job will be either properly stopped or superseded, so the only
way for there to be a gap is when a waiting job gets superseded.

Gaps can be filled manually by renaming the highest waiting job into the
first gap. Indeed, that strategy could be automated (rather than assigning
job ids correctly in the first place). But assigning correctly is probably
just as easy.
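The "greater than all existing ids" policy above is implemented by next_job_id (used later in api.sml's retry). regressionLib.sml is not part of this excerpt, so the following is only a plausible sketch of it, assuming each queue is a thunk returning its id list as in api.sml:

```sml
(* Hypothetical sketch of next_job_id from regressionLib.sml (not shown in
   this diff): one more than the maximum id across all the given queues. *)
fun next_job_id (queues : (unit -> int list) list) : int =
  1 + List.foldl
        (fn (queue, acc) => List.foldl Int.max acc (queue ()))
        0 queues
```

With this definition, next_job_id [waiting,active,stopped] never re-uses an id, which is exactly why gaps can appear when a waiting job is superseded.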

TODO: Replace calls to the OS shell with SML Basis functionality
In regressionLib.sml there is a call to OS.Process.system that passes quite a
lot of work off to the shell: setting environment variables, executing a
subprocess, and redirecting standard streams. This could all be implemented
with the SML Basis Library instead, in particular using the Posix structure.
Doing so might be more reliable?
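A sketch of what that replacement might look like, using only Basis Library Posix primitives. The actual call site in regressionLib.sml is not shown in this diff, so the function name and its arguments here are assumptions:

```sml
(* Sketch only (hypothetical run_redirected, not from the repository):
   fork, redirect stdout/stderr to a log file, exec with a given
   environment, and wait -- avoiding the shell entirely. *)
fun run_redirected (prog, args, env, logfile) =
  case Posix.Process.fork () of
    NONE => (* child: send both output streams to the log, then exec *)
      let
        val fd = Posix.FileSys.creat (logfile,
                   Posix.FileSys.S.flags
                     [Posix.FileSys.S.irusr, Posix.FileSys.S.iwusr])
        val () = Posix.IO.dup2 {old = fd, new = Posix.FileSys.stdout}
        val () = Posix.IO.dup2 {old = fd, new = Posix.FileSys.stderr}
      in
        Posix.Process.exece (prog, prog :: args, env)
      end
  | SOME pid => (* parent: block until the child exits *)
      #2 (Posix.Process.waitpid (Posix.Process.W_CHILD pid, []))
```

Unlike OS.Process.system, this never consults a shell, so no quoting of arguments or environment values is needed.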

TODO: Add interface for disowning jobs (2)?
After "disowning" (and stopping) a job, another similar job will eventually
be added to the queue if necessary. But would it be nice to be able to
re-use the existing job number instead? I.e., do not stop the job after
disowning, but rather return it to the waiting queue? This would require a
new API action.

(*
server = api.sml
poller = poll.sml
worker = worker.sml
*)
162 changes: 162 additions & 0 deletions api.sml
(*
Implements the server-side regression-test API as a CGI program.
The API is for workers to view and manipulate the job queues.
*)
use "regressionLib.sml";

open regressionLib

val text_response_header = "Content-Type:text/plain\n\n"

fun text_response s =
let
val () = TextIO.output(TextIO.stdOut, text_response_header)
val () = TextIO.output(TextIO.stdOut, s)
in () end

fun cgi_die ls =
(List.app (fn s => TextIO.output(TextIO.stdOut, s))
(text_response_header::"Error:\n"::ls);
TextIO.output(TextIO.stdOut,"\n");
OS.Process.exit OS.Process.success;
raise (Fail "impossible"))

fun cgi_assert b ls = if b then () else cgi_die ls

val waiting = read_list cgi_die "waiting"
val active = read_list cgi_die "active"
val stopped = read_list cgi_die "stopped"

fun job id =
let
val f = Int.toString id
val q = queue_of_job cgi_die f
in
OS.Path.concat(q,f)
end

fun claim id name =
let
val f = Int.toString id
val old = OS.Path.concat("waiting",f)
val new = OS.Path.concat("active",f)
val () =
if OS.FileSys.access(old,[OS.FileSys.A_READ]) then
if OS.FileSys.access(new,[OS.FileSys.A_READ]) then
cgi_die ["job ",f, " is both waiting and active"]
else OS.FileSys.rename{old = old, new = new}
else cgi_die ["job ",f," is not waiting to be claimed"]
val out = TextIO.openAppend new
in
print_claimed out (name,Date.fromTimeUniv(Time.now())) before TextIO.closeOut out
end

fun append id line =
let
val f = Int.toString id
val p = OS.Path.concat("active",f)
val out = TextIO.openAppend p handle e as IO.Io _ => (cgi_die ["job ",f," is not active: cannot append"]; raise e)
in
print_log_entry out (Date.fromTimeUniv(Time.now()),line) before TextIO.closeOut out
end

fun log id data =
let
val f = Int.toString id
val p = OS.Path.concat("active",f)
val out = TextIO.openAppend p handle e as IO.Io _ => (cgi_die ["job ",f," is not active: cannot log"]; raise e)
in
TextIO.output(out,data) before TextIO.closeOut out
end

fun stop id =
let
val f = Int.toString id
val old = OS.Path.concat("active",f)
val new = OS.Path.concat("stopped",f)
val () =
if OS.FileSys.access(old,[OS.FileSys.A_READ]) then
if OS.FileSys.access(new,[OS.FileSys.A_READ]) then
cgi_die ["job ",f, " is both active and stopped"]
else OS.FileSys.rename{old = old, new = new}
else cgi_die ["job ",f," is not active: cannot stop"]
in
() (* TODO: send email *)
end

fun retry id =
let
val f = Int.toString id
val old = OS.Path.concat("stopped",f)
val () = cgi_assert (OS.FileSys.access(old,[OS.FileSys.A_READ])) ["job ",f," is not stopped: cannot retry"]
val id = next_job_id [waiting,active,stopped]
val new = OS.Path.concat("waiting",Int.toString id)
val inp = TextIO.openIn old
val out = TextIO.openOut new
fun loop last =
case TextIO.inputLine inp of NONE => cgi_die ["stopped job ",f," has invalid file format"]
| SOME line => (TextIO.output(out,line);
if last then () else
loop (String.isPrefix "HOL: " line))
val () = loop false
val () = TextIO.closeOut out
val () = TextIO.closeIn inp
in id end

datatype request_api = Get of api | Post of id * string

fun get_api () =
case (OS.Process.getEnv "PATH_INFO",
OS.Process.getEnv "REQUEST_METHOD") of
(SOME path_info, SOME "GET")
=> Option.map Get (api_from_string path_info (OS.Process.getEnv "QUERY_STRING"))
| (SOME path_info, SOME "POST")
=> (case String.tokens (equal #"/") path_info of
["log",n] =>
(Option.mapPartial
(fn len =>
Option.compose
((fn id => Post(id,TextIO.inputN(TextIO.stdIn,len))),
id_from_string) n)
(Option.composePartial(Int.fromString,OS.Process.getEnv) "CONTENT_LENGTH"))
| _ => NONE)
| _ => NONE

local
fun id_list ids = String.concatWith " " (List.map Int.toString ids)
in
fun dispatch api =
text_response (
case api of
Waiting => id_list (waiting())
| Active => id_list (active())
| Stopped => id_list (stopped())
| Job id => file_to_string (job id)
| Claim(id,name) => (claim id name; claim_response)
| Append(id,line) => (append id line; append_response)
| Stop id => (stop id; stop_response)
| Retry id => String.concat["retried as job ",Int.toString(retry id),"\n"]
) handle e => cgi_die [exnMessage e]
end

fun dispatch_log id data =
text_response (log id data; log_response)
handle e => cgi_die [exnMessage e]

fun dispatch_req (Get api) = dispatch api
| dispatch_req (Post (id,data)) = dispatch_log id data

fun main () =
let
val () = ensure_queue_dirs cgi_die
in
case get_api () of
NONE => cgi_die ["bad usage"]
| SOME req =>
let
val fd = acquire_lock ()
in
dispatch_req req before
Posix.IO.close fd
end
end
153 changes: 153 additions & 0 deletions design.txt
Notes on the design of the automated regression test infrastructure.

Server API:

waiting:
returns space-separated list of ids of waiting jobs

active:
returns space-separated list of ids of active jobs

stopped:
returns space-separated list of ids of stopped jobs

job id:
returns information on job <id>
including:
- commits (including pull request integration, if any)
- (worker) name and time started (if any)
- output so far

claim id name:
worker <name> claims job <id>.
<name> is in the query string.
returns "claimed"
fails if <id> is not currently waiting

append id line:
record <line> as additional output for job <id> with a timestamp.
<line> is in the query string.
returns "appended"
fails if <id> is not currently active

log id data:
append <data> as additional output for job <id>.
<data> is POST data.
returns "logged"
fails if <id> is not currently active

stop id:
mark job <id> as stopped
returns "stopped"
sends email with output
fails if <id> is not currently active

retry id:
create a new job with the same commits as job <id>
returns "retried as job <new id>"
fails if <id> is not currently stopped
fails if there is already an active or waiting job for the same commits

all failures return text starting with "Error:"

each job is on exactly one list: waiting, active, stopped
if a job changes list, it can only move to the right
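This "move only to the right" discipline can be pictured as a small state machine (an illustration, not code from the diff; retry creates a fresh job rather than moving a stopped one back):

```sml
(* Each job lives in exactly one queue; the only legal moves are
   waiting -> active (claim) and active -> stopped (stop). *)
datatype queue = Waiting | Active | Stopped

fun may_move (Waiting, Active) = true   (* claim id name *)
  | may_move (Active, Stopped) = true   (* stop id *)
  | may_move _ = false                  (* everything else is an error *)
```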

Poller automatic behaviours:

1. add a job to the waiting list:
- create a new job id
- the commits for the job satisfy the following:
- they are the current commits (on GitHub) for a particular target
- there are no other jobs with the same commits

2. remove a job from the waiting list:
- if it does not have the current commits (on GitHub) for any target, or
- if there are other active or waiting jobs for the same commits
(this should never happen -- not checking it currently)
(we don't count duplicate stopped jobs since
they might have been retried)
- the removed job's id number could be re-used (but is that worthwhile?)
- the race with workers trying to obtain this job is handled by the global
lock on queues. either the job is claimed before it can be removed, or
removed before it can be claimed.

3. move a job from active to stopped:
- if the time since it started is too long
- adds a note, "timed out", to the output
- does stop API actions

Targets:

CakeML:
- branch "master"
- each open, mergeable pull request
in this case, there are two commits:
- the pull request commit
- the master commit to merge into
but they move together (as captured by the state of the PR on GitHub)
HOL:
- branch "master"

---- OLD STUFF ----

CakeML Regression Test

Organised as two programs:

Queue Manager (Server)
- 1 instance, runs on cakeml.org
- Tracks commits on GitHub over the GitHub API
- Manages a queue of jobs
- Allocates jobs to workers as they are available
- Records which jobs have run or are running
- Stores all its state in the filesystem/database:
can be killed and restarted without problems.

Worker (Client)
- Runs regression test(s) as designated by the queue manager
- Can send the status (generated output) of any currently running job
- Interrupts the currently running job if requested
- May reuse a HOL build if the desired commit has already been built
- Assumes Poly/ML and git are installed
- Is almost stateless: either it's running a job, or it's ready for a job.
The state of HOL builds is kept in the filesystem.

Scheduling is based on
- "targets" identified by branch name or pull request number
in the CakeML repository.
- "snapshots" representing a specific combination of commits (in all
relevant repositories, e.g., CakeML and HOL).
- "jobs" representing a snapshot and a job state.
- "workers" representing instances of the worker program, each of
which has a worker state.
Each snapshot will correspond to some target.
A target may have many corresponding snapshots.
The possible job states are as follows:
- queued
- active (running, has a corresponding worker)
- finished
- dead (cancelled or superseded)
The possible worker states are as follows:
- testing (running)
- ready

Jobs can be queued, superseded, allocated (made active and assigned a
worker), or cancelled manually.

A job will be automatically queued if:
- The job's snapshot has the most recent commits (on GitHub) for its target, and
- No active or finished job has the same snapshot

A job will be automatically superseded if:
- Either of the conditions for queueing above is false, and
- The job was not queued manually

A job will be automatically allocated if:
- It is currently queued
- A worker is ready

The current policy for targets is defined as follows:
- each open pull request is a target
- the branch "master" is a target
- the branch "master" in HOL is assumed