threw exception: ./scripts/guacamole: line 39: exec: time: not found #572
Looks like your shell doesn't have the `time` command. Try just deleting it from the `exec` line in `scripts/guacamole`: replace

`exec time java -Xmx4g -XX:MaxPermSize=512m "-Dspark.master=local[1]" -cp …`

with

`exec java -Xmx4g -XX:MaxPermSize=512m "-Dspark.master=local[1]" -cp …`

We should probably do this permanently so others don't hit this issue.
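(A minimal sketch of applying that edit in place, assuming the invocation starts with `exec time java` on its own line in `scripts/guacamole`, per the error message:

```bash
# Drop the 'time' word from the exec line; keeps a .bak backup (GNU/BSD sed)
sed -i.bak 's/^exec time java /exec java /' scripts/guacamole
```
)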
Interesting. What shell are you using, @car2008? Seems like we probably implicitly make stronger assumptions elsewhere about the shell environment we're running in, so it might be worth trying to understand what that space looks like as well.
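(For anyone else checking this: `time` is a reserved word in bash and zsh, but minimal POSIX shells like dash don't provide it. A quick sketch for identifying your shell:

```bash
echo $SHELL     # your login shell
ls -l /bin/sh   # on Debian/Ubuntu this is often a symlink to dash
type time       # bash prints "time is a shell keyword"; dash reports "time: not found"
```
)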
Hi @timodonnell and @ryan-williams, thank you very much. The shell information for the machine I'm using is:
And after I deleted `time` it got further, printing `Contigs-spanned-per-partition stats:` and then failing with:

`2016-09-12 08:03:10 ERROR TaskSetManager:74 - Task 0 in stage 6.0 failed 1 times; aborting job`

followed by a driver stacktrace.
Hi @car2008 The error is actually `ContigNotFound: Contig 19`. I believe the files you are using are aligned to b37, where the chromosome names don't have the `chr` prefix.
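(One way to confirm a naming mismatch like this, assuming `samtools` is available and using `sample.bam` as a stand-in for your BAM:

```bash
# b37 names contigs "1", "2", ...; UCSC hg19 names them "chr1", "chr2", ...
samtools view -H sample.bam | grep '^@SQ' | head -3   # contigs the BAM was aligned to
cut -f1 ucsc.hg19.fasta.fai | head -3                 # contigs in the reference .fai index
```
)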
Thank you @arahuja, it runs well when I use the b37 fasta file.
That is actually not something we have tested; we typically run on YARN. In that stacktrace at least, it is recommending different parameters for the run.
Does one of those solutions resolve the issue?
Hi @arahuja, thank you. I have tried to run the command in Spark local mode first:
Are you able to share that BAM with us, @car2008?
Hi @ryan-williams, I'm sorry, I can't share the BAM because it belongs to one of my customers, but I can tell you something about it: it was generated by aligning against hg19.fasta. Let me know any other information you need. Might the BAM have some problems?
@car2008 Is the BAM very large? In local mode, by default, we read in the whole BAM, which can use a lot of memory. This issue is also being addressed in #569, which should simplify this.
He mentioned above that it's a 7GB BAM; since that's compressed, you could get arbitrary blow-up when reading the whole thing in, so it's not terribly surprising that it would blow up in local mode, I guess!

Aside: I'll give #569 a review today @arahuja, sorry for the delay. Does it let us stream over a local BAM in spark-local-mode? Do we not have a way to do that pre-#569?
Hi @arahuja and @ryan-williams, thank you. Now I want to run the command in local mode:
Is there not the flag …?
You got it, …
Ok @ryan-williams, understood; I have tried the command:
It looks to have failed due to a Kryo exception:
I'm not sure off the top of my head what would cause that, but I have a couple of ideas. One possibility is an issue with Java versions, since as of #569 we require Java 8! I don't think it has to do with your memory settings/limits. I'll try to run some jobs myself and see if I can figure out what may be happening.
What version of Spark are you running on? The "managed memory leak detected" message is also worth noting. I have a couple of ideas for printing more debugging info we can try as well; stand by.
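(For reference, both versions can be checked from the command line, assuming `spark-submit` is on your PATH:

```bash
spark-submit --version   # prints the Spark version banner
java -version            # Guacamole requires Java 8 as of #569
```
)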
Hi @ryan-williams, thanks for your advice. The Spark version is 1.6.1 and the Java version is 1.8.0_101. I have now run the command successfully on the 7G BAM, generating a 74M VCF file, in Spark 1.6.1 local mode, after switching to another server with 100g of memory and 48 cores:
Sorry for another question: I want to reduce the processing time in local mode; could you give me some advice?
Thanks for that info! Interesting that throwing more memory+cores at it worked (edit: it looks like you only used more memory, not more cores). One thing that might shed a little more light on the crashing app is to enable INFO-level logging.
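(In Spark 1.6 that's controlled by log4j; a minimal sketch, assuming the stock `$SPARK_HOME/conf` layout:

```bash
# Start from Spark's template and make sure the root logger is at INFO
cp "$SPARK_HOME/conf/log4j.properties.template" "$SPARK_HOME/conf/log4j.properties"
sed -i.bak 's/^log4j.rootCategory=.*/log4j.rootCategory=INFO, console/' \
  "$SPARK_HOME/conf/log4j.properties"
```
)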
It's been really helpful to have all the output here for each case so far, thanks for including that! It might reduce clutter here if you put your output into a gist, especially if you start including all the INFO logs, per above.

**Crashing app**

Your Kryo crash above is occurring during the coverage-depth-computation stage; the memory used there should be proportional to:
In general, the number of reads in / size of the BAM shouldn't affect memory usage, so the bigger BAM should only fail if it also includes particularly high-depth regions. If you can narrow down the extent to which that may be happening / what depths you think you may be dealing with, that may also help us figure out what's going on.

As an aside, I was able to reproduce the "Managed memory leak" error msg before a crash by taking a working run similar to the command line you gave and reducing the driver memory until it crashed. Notable error msgs:
So I guess that message, as well as the confusing/nonsensical errors that follow it, are just symptoms of the driver running out of memory.

**Optimizing working app**

Glad to hear that your run succeeded on a 100GB, 48-core server; what's really interesting about it is that despite the machine having those resources, your run still used default configurations: 1 core, 10GB driver:
Did the other machine, from the crashing runs, actually have 10GB of RAM available?

The shortest answer for making it run faster is to use all the cores on the more powerful machine! To do that, you need to tell Guacamole to use different spark-config files than the defaults:

```bash
# Put some new Spark configs in files in the conf/ directory
echo "spark.master local[*]" > conf/local-multi
echo "spark.driver.memory 16g" > conf/driver-16g

# Set GUAC_SPARK_CONFS to the files you want to use for a given run.
export GUAC_SPARK_CONFS=conf/local-multi,conf/driver-16g

# Run Guacamole!
scripts/guacamole somatic-joint …
```

My apologies if using the conf files this way is confusing. In particular, I'm interested to see how your runtime changes using `local[*]`.
Looking back at this, I noticed that, despite the machines being used having 10GB and 100GB of available memory (if I understand you correctly), your apps have been running with only the default driver settings. That makes reasonable sense as the failure mode, since running over a whole genome with a default-sized driver is likely to run out of memory.
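(If it's unclear what a machine actually has available, a quick inventory with standard Linux tools:

```bash
nproc                                    # logical cores
lscpu | grep -E '^(Socket|Core|Thread)'  # sockets, cores per socket, threads per core
free -g                                  # total/available memory in GB
```
)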
Hi @ryan-williams, thanks for your advice; I'm testing your suggestions now. I will show you some results of the testing:
So I have rechecked the server: it has 48 logical CPUs, 24 physical cores, and 122g of available memory. I also see from the Spark info that spilling occurred throughout the run. So I think the bottleneck for optimizing is the combination of physical CPUs and memory; do you agree? As for the crashing app:
That's interesting, thanks for all of the info. Here are a few observations and recommendations:

**Spills**

As a rule, any time you see "spill" messages, you need more memory. Usually you'd want more memory per task, and a common remedy would be to increase the number of Spark partitions in the spilling stages; in this case, though, the amounts of memory you're running with are below the levels where a lot of Spark infrastructure makes sense, so the main recommendation is "use more memory!", which I will say in a few different ways below 😄.

**More Executors/Cores Require More Memory**

The more cores you run with, the more memory you will need! There's a fixed memory cost to each logical "executor"/core that you use in your app. For example, Guacamole moves the reference around by "broadcast"ing it (contig by contig, as needed) to each executor, which could easily mean that the first several GB of each executor are used just for storing that!

So, if you go from 24 executors to 48 on 16GB of total memory, you have 24 new "executors" that need some or all of the reference, so it's not too surprising that the app blows up when making that jump! I have done some work toward doing more efficient things with the reference that should mitigate this a bit, but for the time being that's how it works.

Noting this cores-vs-memory-required relationship in your data above:
Those are some pretty good data points about this exact tradeoff, so thanks for providing that info!

**Use More Memory**

If you have 122GB available, I would use a fair amount more of it in order to avoid spills and OOMs! For some perspective, a typical (small-ish) Guacamole run for me might operate on a couple of 20GB BAMs (spread across perhaps ~500 128MB HDFS partitions), accordingly using 500-2000 cores spread across 125-500 4-core, 12GB executors, meaning ~3GB/core and 1-6TB total; I'd expect this to finish in 10-20 minutes with no spilling!

The reason I might use up to 4x as many cores (2000 above) as the number of initial HDFS partitions (and hence the number of partitions in the initial RDDs) is that some of the intermediate representations of the reads take up more space in memory than the compressed BAM-format reads on disk that we start with; Guacamole has some heuristics for letting the number of partitions in those intermediate stages expand elastically based on the number of reads it encounters.

So, I wouldn't really expect to get very far on a maximum-10GB machine/VM, though characterizing the exact kinds of failures you encounter there, and the (relatively trivial) sizes of BAMs you can run on in such an environment, is fairly interesting to me as a proxy/impetus for optimizing Guacamole as a whole.

On your 122GB machine, using my heuristics above (~3GB/core), I'd think you could run with 40 cores without any memory problems if you used all 122GB. Of course, empirically you've been able to get away with even <1GB/core (16GB/24 cores, though that was probably spilling), so you could explore the space a little bit; it'd be interesting to hear what you find. For example, if I were you I'd try a 96GB/48-core run, which I'd expect to have a reasonable probability of being reasonably performant and spill-free. Judging by your 16GB/24-core success, with 48 cores you could work your way down to 48GB or fewer and see where you start spilling / degrading performance.

Thanks for the reports from the field, and let me know if any of the above doesn't make sense or you get more data on what works and doesn't!
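(Concretely, the suggested 96GB/48-core experiment might look like this using the conf-file mechanism shown earlier; the file names are arbitrary:

```bash
echo "spark.master local[48]"  > conf/local-48
echo "spark.driver.memory 96g" > conf/driver-96g
export GUAC_SPARK_CONFS=conf/local-48,conf/driver-96g
scripts/guacamole somatic-joint …
```
)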
Hi @ryan-williams, the runtime is:
Hi @ryan-williams, I have tested the different configs for running Guacamole. The results are:
Your last stack trace is hard to interpret from the snippet alone. Please put the entire stdout of these runs, and/or event-log JSON files, somewhere (e.g. gist.github.com, not here) where I can see them if you want me to help you debug them.

It's interesting to note that you seem to have gotten a 2x speed-up with 4 cores solely by using more memory, presumably due to eliminating spilling (and reducing GC?). That's consistent with what I'd expect.
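(If it helps, Spark's event-log JSON can be enabled with two standard properties; a sketch, with an arbitrary directory that must exist before the run:

```bash
mkdir -p /tmp/spark-events
printf '%s\n' \
  "spark.eventLog.enabled true" \
  "spark.eventLog.dir file:///tmp/spark-events" > conf/event-log
export GUAC_SPARK_CONFS=$GUAC_SPARK_CONFS,conf/event-log
```
)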
Hi @ryan-williams, the results I have tested are listed below. Firstly: in the past I always thought physical CPUs would run more efficiently than logical CPUs, ignoring the difference in core counts. But now it seems that logical cores can perform as well as physical ones.
Thanks for those results, sorry for the delay in responding.
Cool! I don't have much insight into that so that's good to know.
That's pretty strange; it is publicly accessible on the internet afaik… what happens when you go to that URL?
I just mean all the stdout that you see when running a Spark app, starting from the bash command that kicks off the run and ending with the statistics output at the end of the run.
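(E.g., captured in one file suitable for pasting into a gist; a sketch:

```bash
# tee shows output on screen while also writing it to guac-run.log
scripts/guacamole somatic-joint … 2>&1 | tee guac-run.log
```
)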
I'm running some benchmarks now with far fewer resources than that, so that we can compare things more directly here (and in general I'm working on bringing all of these numbers down!). I have two BAMs I've been running the somatic-standard caller on. I ran on them with a 10GB driver and different numbers of 12GB, 4-core executors, working my way down to reasonable sizes and timings that you should be able to emulate:
The command I used for all of them was of the form:

```bash
GUAC_SPARK_CONFS=… scripts/guacamole somatic-standard --normal-reads $n --tumor-reads $t --reference $hg19 --max-reads-per-partition 500000 --out $out.vcf
```

and here are the Spark configs from the conf files I used (output at the beginning of the Guacamole run).

So I think this paints a pretty clear picture that the total amount of resources you have available (48 cores, 128GB), when spread around in cluster mode, is enough to get good performance; the above is also all run on ~5x as much input-BAM data (40GB) as the one 7GB BAM you mentioned that you were running on, IIRC? So we'll just have to figure out what might be causing your local runs to degrade relative to the cluster runs above. In theory, the cluster runs should be the ones with higher overhead and fixed costs, so I'm confident we can improve your timings with a little more digging.

One major thing I realized you should probably try is switching over to the somatic-standard caller, as in the command above.
I'm not super familiar with the internals of Spark local mode, but I think the concept of an executor isn't really meaningful there; this SO answer corroborates that. I think you basically have one memory knob to turn, and that's `spark.driver.memory`.

I hope that helps! I'm confident we can still bring your runtimes down a lot, so let me know how this all sits with you. Also, are you able/interested to share your example BAMs with me privately? I don't think it will be necessary, but I'm just curious; we're always interested in collecting data for internal benchmarking.
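(One way to sanity-check that, assuming a JDK's `jps` is on your PATH: in local mode there is a single JVM, and its `-Xmx` flag should reflect `spark.driver.memory`.

```bash
jps -lvm | grep -i spark   # lists running JVMs with their args; check the -Xmx value
```
)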
Hi @ryan-williams, thank you very much for your advice, and I'm very sorry for the long delay in responding.
Hi, my command is:

```bash
./scripts/guacamole somatic-joint /home/zhipeng/soft/guacamole/src/test/resources/synth1.normal.100k-200k.withmd.bam /home/zhipeng/soft/guacamole/src/test/resources/synth1.tumor.100k-200k.withmd.bam --reference-fasta /home/zhipengcheng/file/ucsc.hg19.fasta --out /home/zhipengcheng/file/result/out.vcf
```

but it threw the exception `./scripts/guacamole: line 39: exec: time: not found`. I don't know how to resolve it.