Test running on Baylor "case 2" dataset #548

Open · ryan-williams opened this issue Aug 15, 2016 · 2 comments

@ryan-williams (Member) commented:

@jstjohn discussed hitting some issues running on the "case 2" data here.

I'm downloading the data now to attempt to reproduce.

ryan-williams self-assigned this Aug 15, 2016

@ryan-williams (Member, Author) commented:

I was able to run the joint-caller on this data successfully three times today from a recent HEAD (08dcc6c), with varying numbers of executors: 10, 20, and dynamically allocated (52-359, depending on stage width).

Common configs:

  • Spark 1.6.1, Hadoop 2.6.0-cdh5.5.1
  • --master yarn --deploy-mode cluster
  • 15GB driver
  • 17GB, 6-core executors
    • standard size we use on our cluster
    • fits 3 on each 64GB, 24-core node, with room to spare
  • underlying normal/tumor files spanned 311 HDFS blocks
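
For concreteness, the spark-submit invocation corresponding to those configs looks roughly like the sketch below; the assembly-jar name and the trailing caller arguments are placeholders, not copied from the actual command:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 15g \
      --executor-memory 17g \
      --executor-cores 6 \
      --num-executors 10 \
      guacamole-assembly.jar <joint-caller args, e.g. normal/tumor BAM paths on HDFS>

(--num-executors was 20 for the second run below, and dropped in favor of dynamic allocation for the third.)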

See stats below, including times for the bottleneck Stage 6, which builds pileups and calls variants on them:

First run: 10 executors

  • 60 concurrent tasks, max.

  • Total time: 47 mins.

    • Stage 6: 27 mins.
  • Stages page:

    application_1465500354457_34726

Second run: 20 executors

  • 120 concurrent tasks, max.

  • Total time: 27 mins.

    • Stage 6: 13 mins.
  • Stages page:

    application_1465500354457_34727

Third run: dynamically-allocated executors

  • During the 311-wide stages: 52 executors (for up to 312 concurrent tasks).

  • During the 2153-wide stages: 359 executors (for up to 2154 concurrent tasks).

  • Total time: 14 mins.

    • Stage 6: 102s.
  • Stages page:

    application_1465500354457_34725
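
For reference, this kind of dynamic allocation is turned on with the standard Spark 1.6 settings sketched below (on YARN it also requires the external shuffle service); the min/max bounds are placeholders, since the exact values used for this run aren't recorded in this issue:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 15g \
      --executor-memory 17g \
      --executor-cores 6 \
      --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true \
      --conf spark.dynamicAllocation.minExecutors=<min> \
      --conf spark.dynamicAllocation.maxExecutors=<max> \
      guacamole-assembly.jar <joint-caller args>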

These runs portray a pretty good robustness story, and provide some promising scaling data points.

Going from 10 to 20 executors more than halved the bottleneck stage (27 mins → 13 mins), and the whole app ran in 58% of the time (27 mins vs. 47 mins). Put another way, the runs behaved as if they scaled perfectly linearly apart from just under 4 minutes of fixed cost (perfectly halving the 47-minute run would predict ~23.5 minutes, vs. the 27 observed). That's more than reasonable considering we lose about that much time doing loci-partitioning broadcasting between the end of stage 4 and the beginning of stage 5: gaps of 4:08, 3:51, and 4:20 in the 10-, 20-, and dynamic runs, respectively, during which the driver was the only node doing work.

This and other fixed time-costs weighed further on the linear-scaling null hypothesis when going from 20 executors to dynamic allocation (52-359 executors): the latter run finished in only about half the time (14 mins vs. 27 mins), well short of a linear speedup.

So that's more than 6 mins of fixed cost (solving 14 ≈ F + (27 − F) / 2.6, where 2.6× = 52/20 is the minimum increase in executors, gives F ≈ 6 minutes), outside of which the dynamic-allocation run was squarely in the ideal linear-scaling range for its 52-359 executors. Of course, the fixed costs matter, but this is still a good sanity check.

Local runs

In a couple of attempts to run this in "local" mode (--master local), reading the BAMs from a local NFS mount instead of HDFS, I saw a 20GB driver OOM. We've discussed in the past trying to make sure local runs are reasonably performant, and this seems like a good test case for ironing out kinks there, since these BAMs are a nice medium-small size that should be doable locally. In particular, it seems @jstjohn was attempting it this way when he was stymied on #386.
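
For reference, the kind of local-mode invocation being attempted presumably looks something like this sketch; the jar name, caller arguments, and BAM locations are placeholders:

    spark-submit \
      --master local \
      --driver-memory 20g \
      guacamole-assembly.jar <joint-caller args, reading normal/tumor BAMs from the NFS mount>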

I'll follow up on this and see if I can get it working.

@hammer (Member) commented Dec 17, 2016:

@ryan-williams is this task still in progress?
