
Spark can not serialize object larger than 2G #6

Open
PengNi opened this issue Oct 17, 2017 · 5 comments

Comments

PengNi commented Oct 17, 2017

Spark 2.1.0

After aligning with blasr, a large cmp.h5 file may be created for a repeat region of a reference. The mean coverage of such a repeat region can be 10K or more, so the cmp.h5 file can reach 1G, and over 2G after unpacking.

The problem is that when Spark needs to serialize the object (here, the cmp.h5 data in memory), it throws an exception:

error: 'i' format requires -2147483648 <= number <= 2147483647

at line 273 and line 562 of pyspark/serializers.py.

It seems that Spark uses the 'i' (int) format rather than 'q' (long) with struct.pack, and therefore has the limitation that it cannot serialize an object larger than 2G.
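
Below is a minimal standalone sketch (plain struct, not PySpark itself) of why that limit exists: the 4-byte 'i' format only covers values up to 2**31 - 1, while the 8-byte 'q' format encodes the same length without trouble.

```python
import struct

# Length of a hypothetical ~3 GB serialized object.
size = 3 * 1024 ** 3

try:
    # pyspark/serializers.py writes frame lengths with the 4-byte 'i' format,
    # which only covers -2**31 .. 2**31 - 1 (about 2 GB).
    struct.pack("!i", size)
except struct.error as err:
    print("struct.error:", err)

# The 8-byte 'q' (long long) format encodes the same length fine.
print(len(struct.pack("!q", size)))  # 8
```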

PengNi commented Oct 18, 2017

Need to remove part of the reads aligned to repeat regions.

Reads should be filtered/removed based on certain rules, e.g. alignment quality.

Check SamFilter in BLASR for some inspiration.

PengNi commented Nov 4, 2017

We are now using 'MapQV' to filter reads (commit edb4d0e).
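
A hypothetical sketch of that idea (not the actual code in commit edb4d0e), assuming the aligned reads are available as an RDD of records carrying a 'MapQV' field; the record layout and the threshold are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Illustrative records only; the real pipeline gets them from the cmp.h5 data.
reads = sc.parallelize([
    {"name": "read/1", "MapQV": 254},
    {"name": "read/2", "MapQV": 10},    # likely a multi-mapping read from a repeat
    {"name": "read/3", "MapQV": 120},
])

MAPQV_THRESHOLD = 30  # illustrative cutoff, not the value used in the commit

# Keep only reads whose mapping quality reaches the threshold, so fewer
# ambiguous repeat-region reads end up in any single chunk.
filtered = reads.filter(lambda r: r["MapQV"] >= MAPQV_THRESHOLD)
print(filtered.count())  # 2
```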

PengNi commented Nov 22, 2017

This error still occurs when a single chunk of the reference has a large number of aligned reads.

For now I haven't figured out a better way, so I'm going to abandon 'MapQV', use 'random' only, and rewrite this part of the code.
See commit 559c60.
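
A hedged sketch of the 'random only' idea: cap the number of reads kept per reference chunk by sampling randomly instead of ranking by MapQV. The helper name and the cap value are illustrative, not the identifiers from commit 559c60.

```python
import random

MAX_READS_PER_CHUNK = 5000  # illustrative cap


def downsample(reads, cap=MAX_READS_PER_CHUNK):
    """Randomly keep at most `cap` reads for one reference chunk."""
    reads = list(reads)
    if len(reads) <= cap:
        return reads
    return random.sample(reads, cap)


print(len(downsample(range(20000))))  # 5000

# In the Spark pipeline this would be applied per reference chunk, e.g.
# (assuming a pair RDD keyed by chunk):  reads_by_chunk.mapValues(downsample)
```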

@1830191044

Have you solved this problem?

PengNi commented Jan 19, 2021

@1830191044 , I didn't figure out how to completely avoid the issue, but I used random partitioning to lower the chance of hitting it.
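
For reference, a minimal PySpark sketch of what random partitioning can look like: shuffle the records across more partitions so that no single partition's serialized payload approaches the ~2 GB frame limit. The partition count is illustrative.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
records = sc.parallelize(range(1000000))  # stand-in for the aligned-read records

# repartition() does a full shuffle that spreads records roughly evenly across
# partitions, so each partition's serialized payload stays well below the
# ~2 GB 'i'-format limit. 512 partitions is an illustrative count.
spread = records.repartition(512)
print(spread.getNumPartitions())  # 512
```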
