
Starting subordinate coordinators with flow scheduler causes crash #54

Open

Mythra opened this issue Feb 1, 2017 · 1 comment
Mythra commented Feb 1, 2017

When you start a "master" node and a "worker" node, both using the flow scheduler, e.g.:

Master node starts with:

build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2

and worker node starts with:

build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --parent_uri tcp:firmament.masternode.com:8000 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2

and then submit a job on the master node (for example: python job_submit.py localhost 8080 /bin/sleep 60), it leads to a crash inside the worker node.

Talking to @ms705, it seems the root of the problem lies in the worker node's recalculation of the flow:

What *should* happen is that the subordinate coordinator ends up
running a flow scheduler itself that it can use to schedule in its more
restricted window of visibility into the cluster state; remotely-placed
tasks would have to be reflected in that flow scheduler's flow graph,
which they aren't (hence the error). 
@ms705 changed the title from "Starting Worker Nodes with Flow Based Scheduling can cause fatal exception" to "Starting subordinate coordinators with flow scheduler causes crash" on Feb 2, 2017

ms705 commented Feb 2, 2017

This issue arises due to bugs in the code that handles delegated tasks: the flow scheduler does not correctly maintain its data structures w.r.t. tasks that were placed on local resources by a superior coordinator.

We didn't previously notice this because we used a single flow scheduler in the cluster -- the flow scheduling approach works best when the whole cluster state is visible to the scheduler -- and ran simple, queue-based schedulers with subordinate coordinators. (This is also the workaround for the bug: use --scheduler=simple for subordinate coordinators.)
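Concretely, applying that workaround to the reproduction above (and assuming the --flow_scheduling_cost_model flag is only meaningful with the flow scheduler, so it can be dropped), the subordinate coordinator would instead be started with something like:

build/src/coordinator --scheduler simple --parent_uri tcp:firmament.masternode.com:8000 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2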

Fixing this will require correct handling of delegated tasks in FlowScheduler, i.e., overriding the relevant implementations in EventDrivenScheduler.
