When you start a "master" node and a "worker" node, both using the flow scheduler, e.g. the master node with:

build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2

and the worker node with:

build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --parent_uri tcp:firmament.masternode.com:8000 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2

and then submit a job on the master node (for example: python job_submit.py localhost 8080 /bin/sleep 60), the worker node crashes.
Talking to @ms705, it seems the root of the problem lies in the worker node's recalculation of the flow:
What *should* happen is that the subordinate coordinator ends up
running a flow scheduler itself that it can use to schedule in its more
restricted window of visibility into the cluster state; remotely-placed
tasks would have to be reflected in that flow scheduler's flow graph,
which they aren't (hence the error).
ms705 changed the title from "Starting Worker Nodes with Flow Based Scheduling can cause fatal exception" to "Starting subordinate coordinators with flow scheduler causes crash" on Feb 2, 2017.
This issue arises due to bugs in the code that handles delegated tasks: the flow scheduler does not correctly maintain its data structures w.r.t. tasks that were placed on local resources by a superior coordinator.
We didn't previously notice this because we used a single flow scheduler in the cluster -- the flow scheduling approach works best when the whole cluster state is visible to the scheduler -- and ran simple, queue-based schedulers with subordinate coordinators. (This is also the workaround for the bug: use --scheduler=simple for subordinate coordinators.)
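The workaround amounts to launching the subordinate coordinator with the simple queue-based scheduler instead of the flow scheduler. Assuming the same flags as in the reproduction steps above (minus the now-irrelevant cost model flag), that would look something like:

```shell
build/src/coordinator --scheduler simple --parent_uri tcp:firmament.masternode.com:8000 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2
```

With this setup, only the top-level coordinator runs the flow scheduler over the full cluster state, so the delegated-task bookkeeping bug is never hit.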
Fixing this will require correct handling of delegated tasks in FlowScheduler, i.e., overriding the relevant implementations in EventDrivenScheduler.