Sample.idError with seqVCF2GDS #87

alexisregelson · 2023-11-06T22:26:04Z

Hello, I am trying to use seqVCF2GDS and am getting the following error:

library(SeqArray)
library(data.table)

seqVCF2GDS(high_mod_vcf, "r4_chr1_high_mod.gds", parallel=6L)
Mon Nov 6 16:09:06 2023
Variant Call Format (VCF) Import:
file(s):
r4_PASS_chr1_updated_varID_dups_drop_updated_IDs_nhw_hwe6_noNHWrelateds_high_mod_impact.vcf (198.8M)
file format: VCFv4.2
the number of sets of chromosomes (ploidy): 2
the number of samples: 14,306
genotype storage: bit2
compression method: LZMA_RA
# of samples: 14306
calculating the total number of variants ...
the total number of variants for import: 3,632
Writing to 6 files:
r4_chr1_high_mod_tmp01_ad336f56fc72 [1..606]
r4_chr1_high_mod_tmp02_ad3315e862b7 [607..1,212]
r4_chr1_high_mod_tmp03_ad33613818b1 [1,213..1,818]
r4_chr1_high_mod_tmp04_ad33473817c6 [1,819..2,424]
r4_chr1_high_mod_tmp05_ad334e0fea8c [2,425..3,030]
r4_chr1_high_mod_tmp06_ad33607634f8 [3,031..3,632]
Done (Mon Nov 6 16:09:10 2023).
Output:
r4_chr1_high_mod.gds
Merging:
opening 'r4_chr1_high_mod_tmp01_ad336f56fc72' ... [done]
opening 'r4_chr1_high_mod_tmp02_ad3315e862b7' ... [done]
opening 'r4_chr1_high_mod_tmp03_ad33613818b1' ... [done]
opening 'r4_chr1_high_mod_tmp04_ad33473817c6' ... [done]
opening 'r4_chr1_high_mod_tmp05_ad334e0fea8c' ... [done]
opening 'r4_chr1_high_mod_tmp06_ad33607634f8' ... [done]
Digests:
sample.idError: segfault from C stack overflow

Do the sample IDs need to be in a particular format? I created my VCF with plink using the --double-id option. The IDs are in the format A-[Cohort]-[A#####]. A .gds file is written, but I don't know whether it is incorrect because of the segfault.

gds <- seqOpen("r4_chr1_high_mod.gds")
gds
Object of class "SeqVarGDSClass"
File: r4_chr1_high_mod.gds (294.4K)

+ [ ] *
|--+ description [ ] *
|--+ sample.id { Str8 14306 LZMA_ra(2.94%), 12.6K }
|--+ variant.id { Int32 3632 LZMA_ra(12.7%), 1.8K }
|--+ position { Int32 3632 LZMA_ra(62.3%), 8.8K }
|--+ chromosome { Str8 3632 LZMA_ra(1.62%), 125B }
|--+ allele { Str8 3632 LZMA_ra(24.4%), 4.0K }
|--+ genotype [ ] *
| |--+ data { Bit2 2x14306x3632 LZMA_ra(0.95%), 242.2K }
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| \--+ extra { Int16 0 LZMA_ra, 18B }
|--+ phase [ ]
| |--+ data { Bit1 14306x3632 LZMA_ra(0.02%), 1.3K }
| |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
| \--+ extra { Bit1 0 LZMA_ra, 18B }
|--+ annotation [ ]
| |--+ id { Str8 3632 LZMA_ra(28.1%), 16.0K }
| |--+ qual { Float32 3632 LZMA_ra(0.92%), 141B }
| |--+ filter { Int32 3632 LZMA_ra(0.92%), 141B }
| |--+ info [ ]
| | \--+ PR { Bit1 3632 LZMA_ra(18.9%), 93B } *
| \--+ format [ ]
\--+ sample.annotation [ ]
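
One way to check whether the written file is usable is to compare the sample IDs stored in the GDS file with those in the VCF header. The sketch below is only an illustration: `vcf_sample_ids` is a hypothetical base-R helper (not part of SeqArray), it assumes a plain-text VCF with a standard tab-separated `#CHROM` line, and the demo IDs are made up.

```r
# Hypothetical helper (not part of SeqArray): read the sample IDs from the
# #CHROM header line of a plain-text VCF, using base R only.
vcf_sample_ids <- function(path) {
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    line <- readLines(con, n = 1L)
    if (length(line) == 0L) stop("no #CHROM header line found in ", path)
    if (startsWith(line, "#CHROM")) break
  }
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  fields[-seq_len(9L)]  # the first nine columns are fixed; samples follow
}

# Tiny demo file with made-up IDs, just to show the helper running.
demo_vcf <- tempfile(fileext = ".vcf")
writeLines(c(
  "##fileformat=VCFv4.2",
  paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
          "FORMAT", "S1_S1", "S2_S2"), collapse = "\t"),
  paste(c("1", "100", "rs1", "A", "G", ".", "PASS", ".", "GT", "0/0", "0/1"),
        collapse = "\t")
), demo_vcf)
ids <- vcf_sample_ids(demo_vcf)
print(ids)

# With SeqArray installed, the IDs in the GDS file could then be compared:
#   gds <- seqOpen("r4_chr1_high_mod.gds")
#   identical(seqGetData(gds, "sample.id"), vcf_sample_ids(high_mod_vcf))
#   seqClose(gds)
```

Matching IDs would only show that the `sample.id` node was written in full; it would not rule out other damage from the crash.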

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /cvmfs/priv.accre.vanderbilt.edu/mirror/optimized/sandy_bridge/easybuild/software/MPI/intel/2019.1.144/impi/2018.4.274/R/3.6.0/lib64/R/lib/libR.so
LAPACK: /cvmfs/priv.accre.vanderbilt.edu/mirror/optimized/sandy_bridge/easybuild/software/MPI/intel/2019.1.144/impi/2018.4.274/R/3.6.0/lib64/R/modules/lapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.14.8 SeqArray_1.26.2 gdsfmt_1.22.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.32.0 compiler_3.6.0 IRanges_2.20.2
[4] XVector_0.26.0 parallel_3.6.0 GenomicRanges_1.38.0
[7] GenomeInfoDbData_1.2.2 RCurl_1.95-4.12 Biostrings_2.54.0
[10] S4Vectors_0.24.4 BiocGenerics_0.32.0 GenomeInfoDb_1.22.1
[13] bitops_1.0-6 stats4_3.6.0

Thank you,
Alexis

zhengxwen · 2023-11-07T06:42:11Z

See:
the total number of variants for import: 3,632
This number is too small for parallel=6L to help at all.
My guess is that parallel=6L triggers a bug when merging the data files if the number of variants is too small.

seqVCF2GDS(high_mod_vcf, "r4_chr1_high_mod.gds", parallel=1)

It might solve your problem.
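
As a rough sketch of this suggestion, the number of workers could be reduced automatically when each one would receive only a few variants. `pick_workers` and its `min_per_worker` threshold are hypothetical and chosen only for illustration; SeqArray does not document any such limit.

```r
# Hypothetical helper (not part of SeqArray): choose how many worker
# processes to request for a VCF import.
# min_per_worker is an arbitrary illustrative threshold, not a documented limit.
pick_workers <- function(n_variants, requested = 6L, min_per_worker = 100000L) {
  stopifnot(n_variants >= 0, requested >= 1, min_per_worker >= 1)
  affordable <- n_variants %/% min_per_worker  # workers that get a full share
  as.integer(max(1L, min(requested, affordable)))
}

pick_workers(3632)     # the import in this thread: returns 1
pick_workers(1200000)  # a much larger import: returns 6

# The result could then be passed on, for example:
#   seqVCF2GDS(high_mod_vcf, "r4_chr1_high_mod.gds",
#              parallel = pick_workers(3632))
```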

alexisregelson · 2024-01-08T15:21:20Z

Hello,

I've now tried this with a VCF with 200k+ variants. I have successfully converted this VCF to a GDS file using SNPRelate. However, I am using other software that specifically needs the GDS file in SeqArray format, not SNPRelate format, and I am still getting the same error: sample.idError: segfault from C stack overflow.

Alexis

zhengxwen · 2024-01-08T18:14:20Z

Your R and gdsfmt versions are old.
The recent updates were made with a focus on R (>= v4.0).
I suggest using SeqArray GDS format instead of SNPRelate GDS.
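
A quick way to see where a session stands relative to this advice is sketched below. `meets_r4` is a hypothetical helper; only the R (>= 4.0) cutoff comes from the reply above, and no minimum gdsfmt or SeqArray version is assumed.

```r
# Hypothetical helper: TRUE when an R version string is at least 4.0.0.
meets_r4 <- function(version = as.character(getRversion())) {
  utils::compareVersion(version, "4.0.0") >= 0
}

meets_r4("3.6.0")  # the version shown in the sessionInfo() output above
meets_r4()         # the running session

# Report the installed versions of the two packages, if present.
for (pkg in c("gdsfmt", "SeqArray")) {
  if (nzchar(system.file(package = pkg))) {
    cat(pkg, as.character(utils::packageVersion(pkg)), "\n")
  } else {
    cat(pkg, "is not installed\n")
  }
}

# After moving to a newer R, both packages can be reinstalled from
# Bioconductor, for example:
#   BiocManager::install(c("gdsfmt", "SeqArray"))
```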

zhengxwen self-assigned this Nov 7, 2023

zhengxwen added the bug label Nov 7, 2023
