Spark can not serialize object larger than 2G #6
We need to remove some of the reads aligned to repeat regions. Reads should be filtered or removed based on certain rules, e.g. alignment quality. Check SamFilter in BLASR for inspiration.
We are now using 'MapQV' to filter reads (commit edb4d0e).
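A minimal sketch of what a mapping-quality filter could look like, assuming the aligned reads are available as BAM/SAM (e.g. converted from cmp.h5) and pysam is installed; the function name and the MIN_MAPQ cutoff are illustrative, not the values used in commit edb4d0e.

```python
# Hypothetical sketch: drop low-quality alignments before they can pile up
# into a single oversized partition. Assumes a BAM input readable by pysam;
# the cutoff below is only an example.
import pysam

MIN_MAPQ = 30  # illustrative cutoff, not the project's actual threshold

def filter_by_mapq(in_bam, out_bam, min_mapq=MIN_MAPQ):
    """Copy only mapped reads whose mapping quality meets the cutoff."""
    with pysam.AlignmentFile(in_bam, "rb") as src, \
         pysam.AlignmentFile(out_bam, "wb", template=src) as dst:
        for read in src:
            if not read.is_unmapped and read.mapping_quality >= min_mapq:
                dst.write(read)
```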
This error still occurs when a single chunk of the reference has a large number of aligned reads. For now I haven't figured out a better way, so I'm going to abandon 'MapQV', use 'random' only, and rewrite this part of the code.
Have you solved this problem?
@1830191044, I didn't figure out how to completely avoid the issue; however, I used random partitioning to lower the chance.
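A minimal sketch of the random-partitioning idea in PySpark, assuming the reads are already loaded into an RDD; the input path and the number of buckets are placeholders, not taken from the project code.

```python
# Instead of grouping every read that hits one reference chunk into a single
# record, assign each read a random bucket so no single serialized value
# approaches the 2 GB framing limit.
import random
from pyspark import SparkContext

NUM_BUCKETS = 64  # illustrative; pick so each bucket stays well under 2 GB

sc = SparkContext(appName="random-partition-sketch")
reads = sc.textFile("aligned_reads.txt")  # placeholder input

bucketed = (reads
            .map(lambda read: (random.randrange(NUM_BUCKETS), read))
            .partitionBy(NUM_BUCKETS))  # spreads reads evenly across partitions
```

This lowers the chance of hitting the limit but does not eliminate it, which matches the caveat above.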
Spark 2.1.0
A large cmp.h5 file may be created for a repeat region of a reference after aligning with blasr. The mean coverage of such a repeat region can be 10K or more, so the cmp.h5 file can be about 1 GB on disk and over 2 GB after unpacking.
Thus, the problem is that when Spark needs to serialize the object (here, the cmp.h5 data in memory), it throws an exception at line 273 and line 562 of pyspark/serializers.py.
It seems that Spark uses 'i' (int) rather than 'q' (long) with struct.pack when writing the frame length, and therefore has the limitation "can not serialize object larger than 2G".
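A minimal reproduction of the framing limit, independent of Spark: struct.pack with the 4-byte 'i' format cannot encode a length above 2**31 - 1, which is why a frame length for anything over 2 GB is rejected, while 'q' (64-bit) would accept it.

```python
# Illustration of the 2 GB framing limit: a signed 32-bit int ('i') cannot
# hold a payload length of 2**31 bytes or more, whereas a signed 64-bit
# long ('q') can. Only the length values are used here; no 2 GB buffer
# is actually allocated.
import struct

too_big = 1 << 31            # 2 GB, one byte past the signed 32-bit maximum
struct.pack("!q", too_big)   # fine: 'q' is a 64-bit long
try:
    struct.pack("!i", too_big)
except struct.error as exc:  # raised because 'i' is only 32 bits
    print("cannot frame length with 'i':", exc)
```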