Spark can not serialize object larger than 2G #6
We need to remove some of the reads aligned to repeat regions. Reads should be filtered or removed based on certain rules, e.g. alignment quality. Check SamFilter in BLASR for inspiration.
We are now using 'MapQV' to filter reads (commit edb4d0e).
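A minimal sketch of what a mapping-quality filter could look like, assuming the aligned reads are available as BAM/SAM (e.g. converted from cmp.h5) and pysam is installed; the function name and the MIN_MAPQ cutoff are illustrative, not the values used in commit edb4d0e.

```python
# Hypothetical sketch: drop low-quality alignments before they can pile up
# into a single oversized partition. Assumes a BAM input readable by pysam;
# the cutoff below is only an example.
import pysam

MIN_MAPQ = 30  # illustrative cutoff, not the project's actual threshold

def filter_by_mapq(in_bam, out_bam, min_mapq=MIN_MAPQ):
    """Copy only mapped reads whose mapping quality meets the cutoff."""
    with pysam.AlignmentFile(in_bam, "rb") as src, \
         pysam.AlignmentFile(out_bam, "wb", template=src) as dst:
        for read in src:
            if not read.is_unmapped and read.mapping_quality >= min_mapq:
                dst.write(read)
```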
This error still occurs when a single chunk of the reference has a large number of aligned reads. For now I haven't figured out a better way, so I'm going to abandon 'MapQV', use 'random' only, and rewrite this part of the code.
Have you solved this problem?
@1830191044, I didn't figure out how to completely avoid the issue; however, I used random partitioning to lower the chance.
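A minimal sketch of the random-partitioning idea in PySpark, assuming the reads are already loaded into an RDD; the input path and the number of buckets are placeholders, not taken from the project code.

```python
# Instead of grouping every read that hits one reference chunk into a single
# record, assign each read a random bucket so no single serialized value
# approaches the 2 GB framing limit.
import random
from pyspark import SparkContext

NUM_BUCKETS = 64  # illustrative; pick so each bucket stays well under 2 GB

sc = SparkContext(appName="random-partition-sketch")
reads = sc.textFile("aligned_reads.txt")  # placeholder input

bucketed = (reads
            .map(lambda read: (random.randrange(NUM_BUCKETS), read))
            .partitionBy(NUM_BUCKETS))  # spreads reads evenly across partitions
```

This lowers the chance of hitting the limit but does not eliminate it, which matches the caveat above.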
Spark 2.1.0
A large cmp.h5 file may be created for a repeat region of a reference after aligning with blasr. The mean coverage of such a repeat region can be 10K or more, so the cmp.h5 file can be about 1 GB on disk and over 2 GB after unpacking.
Thus, the problem is that when Spark needs to serialize the object (here, the cmp.h5 data in memory), it throws an exception at line 273 and line 562 of pyspark/serializers.py.
It seems that Spark uses 'i' (int) rather than 'q' (long) with struct.pack when writing the frame length, and therefore has the limitation "can not serialize object larger than 2G".
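A minimal reproduction of the framing limit, independent of Spark: struct.pack with the 4-byte 'i' format cannot encode a length above 2**31 - 1, which is why a frame length for anything over 2 GB is rejected, while 'q' (64-bit) would accept it.

```python
# Illustration of the 2 GB framing limit: a signed 32-bit int ('i') cannot
# hold a payload length of 2**31 bytes or more, whereas a signed 64-bit
# long ('q') can. Only the length values are used here; no 2 GB buffer
# is actually allocated.
import struct

too_big = 1 << 31            # 2 GB, one byte past the signed 32-bit maximum
struct.pack("!q", too_big)   # fine: 'q' is a 64-bit long
try:
    struct.pack("!i", too_big)
except struct.error as exc:  # raised because 'i' is only 32 bits
    print("cannot frame length with 'i':", exc)
```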