When you start a "master" node and a "worker" node, both using the flow scheduler, e.g. the master node with:

build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2

and the worker node with:

build/src/coordinator --scheduler flow --flow_scheduling_cost_model 6 --parent_uri tcp:firmament.masternode.com:8000 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2

and then submit a job on the master node (for example: python job_submit.py localhost 8080 /bin/sleep 60), the worker node crashes.
Talking to @ms705, it seems the root of the problem lies in the worker node's recalculation of the flow:
What *should* happen is that the subordinate coordinator ends up
running a flow scheduler itself that it can use to schedule in its more
restricted window of visibility into the cluster state; remotely-placed
tasks would have to be reflected in that flow scheduler's flow graph,
which they aren't (hence the error).
ms705 changed the title from "Starting Worker Nodes with Flow Based Scheduling can cause fatal exception" to "Starting subordinate coordinators with flow scheduler causes crash" on Feb 2, 2017.
This issue arises due to bugs in the code that handles delegated tasks: the flow scheduler does not correctly maintain its data structures w.r.t. tasks that were placed on local resources by a superior coordinator.
We didn't previously notice this because we used a single flow scheduler in the cluster -- the flow scheduling approach works best when the whole cluster state is visible to the scheduler -- and ran simple, queue-based schedulers with subordinate coordinators. (This is also the workaround for the bug: use --scheduler=simple for subordinate coordinators.)
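The workaround amounts to launching the subordinate coordinator with the simple queue-based scheduler instead of the flow scheduler. Assuming the same flags as in the reproduction steps above (minus the now-irrelevant cost model flag), that would look something like:

```shell
build/src/coordinator --scheduler simple --parent_uri tcp:firmament.masternode.com:8000 --listen_uri tcp:0.0.0.0:8000 --task_lib_dir=$(pwd)/build/src --v=2
```

With this setup, only the top-level coordinator runs the flow scheduler over the full cluster state, so the delegated-task bookkeeping bug is never hit.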
Fixing this will require correct handling of delegated tasks in FlowScheduler, i.e., overriding the relevant implementations in EventDrivenScheduler.