issue with creating database #57

Open
atimms opened this issue May 2, 2019 · 9 comments
Comments

@atimms

atimms commented May 2, 2019

Hello...

I used gemini and vcf2db previously with great success, but I'm having issues with a new set of VCFs I've just received.

I annotated with SnpEff in my usual way but received the following error message:

Traceback (most recent call last):
File "/home/atimms/programs/vcf2db/vcf2db.py", line 923, in <module>
impacts_extras=a.impacts_field, aok=a.a_ok)
File "/home/atimms/programs/vcf2db/vcf2db.py", line 233, in __init__
self.load()
File "/home/atimms/programs/vcf2db/vcf2db.py", line 318, in load
i = self._load(self.cache, create=True, start=1)
File "/home/atimms/programs/vcf2db/vcf2db.py", line 311, in _load
self.insert(variants, expanded, keys, i, create=create)
File "/home/atimms/programs/vcf2db/vcf2db.py", line 373, in insert
vilengths, variant_impacts)
File "/home/atimms/programs/vcf2db/vcf2db.py", line 401, in _insert
self.__insert(v_objs, self.metadata.tables['variants'].insert())
File "/home/atimms/programs/vcf2db/vcf2db.py", line 443, in __insert
trans.execute(stmt, o)
File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 980, in execute
return meth(self, multiparams, params)
File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 273, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1099, in _execute_clauseelement
distilled_params,
File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1240, in _execute_context
e, statement, parameters, cursor, context
File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1458, in _handle_dbapi_exception
util.raise_from_cause(sqlalchemy_exception, exc_info)
File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 296, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1236, in _execute_context
cursor, statement, parameters, context
File "/home/atimms/miniconda2/envs/hg38_genomes/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 536, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.InterfaceError: (sqlite3.InterfaceError) Error binding parameter 48 - probably unsupported type. [SQL: u'INSERT INTO variants (variant_id, chrom, start, "end", vcf_id, ref, alt, qual, filter, type, sub_type, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, gene, ensembl_gene_id, transcript, is_exonic, is_coding, is_lof, is_splicing, is_canonical, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, an, baseqranksum, clippingranksum, db, dp, ds, excesshet, fs, mq, mqranksum, negative_train_site, pg, positive_train_site, qd, raw_mq, readposranksum, sor, vqslod, culprit, loconfdenovo, old_multiallelic, old_variant, lof, consequence, symbol, feature_type, feature, intron, hgvsc, hgvsp, cdna_position, cds_position, protein_position, amino_acids, codons, existing_variation, distance, strand, flags, variant_class, symbol_source, hgnc_id, canonical, sift, hgvs_offset, hgvsg, amino_acid_change, transcript_biotype, gene_coding, transcript_id, exon_rank, genotype, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_quals, gt_alt_freqs) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: (1, u'chr1', 10143, 10150, None, u'TAACCCC', u'T', 120.08000183105469, None, 'indel', 'del', 1.0, 1, 2, 0, 0, 0.3333333333333333, u'DDX11L1', None, u'ENST00000456328', 0, 0, 0, 0, 0, u'', u'1724', u'', None, u'processed_transcript', 'upstream_gene_variant', 'upstream_gene_variant', 'LOW', None, None, None, None, 6, -0.550000011920929, -0.550000011920929, 0, 75, 0, 3.9793999195098877, 0.0, 22.270000457763672, 0.9369999766349792, 0, (0, 0, 0), 0, 17.149999618530273, 17356.0, 0.9369999766349792, 
0.36800000071525574, 3.0899999141693115, u'FS', None, None, u'None', u'None', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', u'', u'processed_transcript', u'NON_CODING', u'ENST00000456328', u'', u'T', <read-only buffer for 0x7fffdfe884f8, size -1, offset 0 at 0x7fffdef27270>, <read-only buffer for 0x7fffdfeed7a0, size -1, offset 0 at 0x7fffdef272b0>, <read-only buffer for 0x7fffdfe91120, size -1, offset 0 at 0x7fffdef272f0>, <read-only buffer for 0x7fffdfeed7d8, size -1, offset 0 at 0x7fffdef27330>, <read-only buffer for 0x7fffdfeed810, size -1, offset 0 at 0x7fffdef27370>, <read-only buffer for 0x7fffdfeed848, size -1, offset 0 at 0x7fffdef273b0>, <read-only buffer for 0x7fffdfeed880, size -1, offset 0 at 0x7fffdef273f0>, <read-only buffer for 0x7fffdfef8b30, size -1, offset 0 at 0x7fffdef27430>)] (Background on this error at: http://sqlalche.me/e/rvf5)

The VCF I received does have some unusual fields in the genotypes (generated by GATK); here's an example line:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT F03-00008 F03-00006 F03-00007
chr1 10144 . TAACCCC T 120.08 PASS AC=2;AF=0.333;AN=6;BaseQRankSum=-0.55;ClippingRankSum=-0.55;DP=75;ExcessHet=3.9794;FS=0;MLEAC=1;MLEAF=0.25;MQ=22.27;MQRankSum=0.937;PG=0,0,0;QD=17.15;RAW_MQ=17356;ReadPosRankSum=0.937;SOR=0.368;VQSLOD=3.09;culprit=FS;EFF=MOTIFMA0341.1:Egr1,MOTIFMA0366.1:Egr1,UPSTREAM(MODIFIER||1724|||DDX11L1|processed_transcript|NON_CODING|ENST00000456328||T),UPSTREAM(MODIFIER||1865|||DDX11L1|transcribed_unprocessed_pseudogene|NON_CODING|ENST00000450305||T),DOWNSTREAM(MODIFIER||4259|||WASH7P|unprocessed_pseudogene|NON_CODING|ENST00000488147||T),INTERGENIC(MODIFIER||||||||||T) GT:AD:DP:FT:GQ:JL:JP:PL:PP 0/1:40,0:40:lowGQ:2:-1:-1:0,0,545:2,0,547 0/1:3,4:7:PASS:50:-1:-1:126,0,46:127,0,50 0/0:23,0:23:lowGQ:0:.:.:0,0,0:0,0,0

Any help would be greatly appreciated.

Andrew

@brentp
Member

brentp commented May 2, 2019

Looks like PG is the offending field, with the value 0,0,0. What does the VCF header show for PG? You can also send that field to the --black-list so it does not try to load.
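
To illustrate the diagnosis above, here is a minimal sketch (not vcf2db's own code) of how Python's sqlite3 driver reacts when one parameter is a tuple such as (0, 0, 0). The helper names `find_unbindable` and `try_insert` and the three-column table are hypothetical, made up for this example.

```python
import sqlite3

# Types the sqlite3 driver can bind without a custom adapter.
BINDABLE = (type(None), int, float, str, bytes, bytearray, memoryview)


def find_unbindable(params):
    """Return (index, value) pairs for parameters sqlite3 cannot bind."""
    return [(i, p) for i, p in enumerate(params) if not isinstance(p, BINDABLE)]


def try_insert(params):
    """Insert one row into a throwaway in-memory table; return the error text or None."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE variants (chrom TEXT, start INTEGER, pg INTEGER)")
        conn.execute("INSERT INTO variants VALUES (?, ?, ?)", params)
        return None
    except sqlite3.Error as exc:
        # The exact exception class and wording differ between Python versions.
        return str(exc)
    finally:
        conn.close()


row = ("chr1", 10143, (0, 0, 0))  # PG arrives as a tuple, as in the traceback
print(find_unbindable(row))
print(try_insert(row))
```

Running `find_unbindable` over the parameter list printed in an error message is one way to map the reported parameter index back to a column name.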

@atimms
Author

atimms commented May 2, 2019

In the VCF header it is described as:
##INFO=<ID=PG,Number=G,Type=Integer,Description="Genotype Likelihood Prior">
and when I passed -e PG to vcf2db, the gemini database was created successfully.
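
A rough way to spot other fields like PG before loading is to scan the header for INFO declarations whose Number is neither 0 nor 1. The sketch below does only that. The function name `multi_valued_info` is hypothetical, and the DP line is an illustrative header entry, not taken from the VCF in this issue. Fields declared Number=A usually end up with one value after decomposition, so the result is a list of candidates to inspect, not a list of fields that will certainly fail.

```python
import re

# Matches the ID and Number attributes at the start of an ##INFO header line.
INFO_RE = re.compile(r'##INFO=<ID=([^,]+),Number=([^,]+),')


def multi_valued_info(header_lines):
    """Return IDs of INFO fields whose declared Number is not 0 or 1."""
    ids = []
    for line in header_lines:
        m = INFO_RE.match(line)
        if m and m.group(2) not in ("0", "1"):
            ids.append(m.group(1))
    return ids


header = [
    '##INFO=<ID=PG,Number=G,Type=Integer,Description="Genotype Likelihood Prior">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth">',  # illustrative
]
print(multi_valued_info(header))
```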

Thanks for getting back to me so quickly and resolving my issue..

Andrew

@brentp
Member

brentp commented May 2, 2019

hmm. did you run vt decompose -s ? or bcftools norm? I wonder why that field did not get normalized.

@atimms
Author

atimms commented May 2, 2019

I ran vt decompose -s on the VCF before loading.

The only difference with this VCF is that it had been put through the GATK genotype refinement workflow, i.e. https://gatkforums.broadinstitute.org/gatk/discussion/4723/genotype-refinement-workflow. I wonder if that affected something?

Andrew

@mmoisse

mmoisse commented Sep 9, 2019

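I also had that issue; now I always do:

filter=""
# Only strip the problematic INFO fields if the header declares GC or PG.
if [ "$(bcftools view --header input.vcf.gz | grep -Ec '##INFO=<ID=GC,|##INFO=<ID=PG,')" -gt 0 ]
then
   filter="-x INFO/GC,INFO/PG"
fi

# $filter is left unquoted on purpose so that an empty value adds no argument.
vcf2db.py <(bcftools annotate $filter input.vcf.gz | bcftools +fixploidy) input.ped gemini.db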

@huangk3

huangk3 commented May 7, 2020

I ran into the same issue as well:

Traceback (most recent call last):
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 923, in <module>
    impacts_extras=a.impacts_field, aok=a.a_ok)
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 233, in __init__
    self.load()
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 321, in load
    self._load(self.vcf, create=False, start=i+1)
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 305, in _load
    self.insert(variants, expanded, keys, i)
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 373, in insert
    vilengths, variant_impacts)
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 401, in _insert
    self.__insert(v_objs, self.metadata.tables['variants'].insert())
  File "/sysapps/cluster/software/Anaconda2/2019.10/envs/vcf2dbenv/bin/vcf2db.py", line 435, in __insert
    raise e
sqlalchemy.exc.InterfaceError: (sqlite3.InterfaceError) Error binding parameter 170 - probably unsupported type.
[SQL: INSERT INTO variants (variant_id, chrom, start, "end", vcf_id, ref, alt, qual, filter, type, sub_type, call_rate, num_hom_ref, num_het, num_hom_alt, num_unknown, aaf, gene, ensembl_gene_id, transcript, is_exonic, is_coding, is_lof, is_splicing, is_canonical, exon, codon_change, aa_change, aa_length, biotype, impact, impact_so, impact_severity, polyphen_pred, polyphen_score, sift_pred, sift_score, ac, af, an, baseqranksum, clippingranksum, db, dp, ds, exome_chip, excesshet, fs, inbreedingcoeff, lcr, mleac, mleaf, mq, mqranksum, negative_train_site, old_multiallelic, old_variant, positive_train_site, qd, rvis, rvis_pct, rvis_pred, readposranksum, sor, vqslod, aaf_1kg_afr_float, aaf_1kg_all_float, aaf_1kg_amr_float, aaf_1kg_eas_float, aaf_1kg_eur_float, aaf_1kg_sas_float, aaf_esp_aa, aaf_esp_all, aaf_esp_ea, aaf_pid_711, ac_exac_afr, ac_exac_all, ac_exac_amr, ac_exac_eas, ac_exac_fin, ac_exac_nfe, ac_exac_oth, ac_exac_sas, acetyl_enh_33_cell_count, acetyl_enh_33_cell_list, acetyl_enh_all_127_tiss_count, active_enh_33_cell_count, active_enh_33_cell_list, active_enh_all_127_tiss_count, af_exac_afr, af_exac_all, af_exac_amr, af_exac_eas, af_exac_nfe, af_exac_oth, af_exac_sas, an_exac_afr, an_exac_all, an_exac_amr, an_exac_eas, an_exac_fin, an_exac_nfe, an_exac_oth, an_exac_sas, clinvar_disease_name, clinvar_pathogenic, common_pathogenic, cse_hiseq, culprit, dann_score, dbsnp_id, dpsi_max_tissue, dpsi_zscore, eigen_pc_phred, eigen_phred, fitcons, fuzzy_hgmd_class, fuzzy_hgmd_dna, fuzzy_hgmd_id, fuzzy_hgmd_orig_dna, fuzzy_hgmd_orig_prot, fuzzy_hgmd_pheno, fuzzy_hgmd_prot, gerp_elements, gno_exome_ac_afr, gno_exome_ac_all, gno_exome_ac_amr, gno_exome_ac_asj, gno_exome_ac_eas, gno_exome_ac_fin, gno_exome_ac_nfe, gno_exome_ac_oth, gno_exome_ac_sas, gno_exome_af_afr, gno_exome_af_all, gno_exome_af_amr, gno_exome_af_asj, gno_exome_af_eas, gno_exome_af_fin, gno_exome_af_nfe, gno_exome_af_oth, gno_exome_af_sas, gno_exome_an_afr, gno_exome_an_all, gno_exome_an_amr, 
gno_exome_an_asj, gno_exome_an_eas, gno_exome_an_fin, gno_exome_an_nfe, gno_exome_an_oth, gno_exome_an_sas, gno_exome_filter, gno_exome_id, gno_genome_ac_afr, gno_genome_ac_all, gno_genome_ac_amr, gno_genome_ac_asj, gno_genome_ac_eas, gno_genome_ac_fin, gno_genome_ac_nfe, gno_genome_ac_oth, gno_genome_af_afr, gno_genome_af_all, gno_genome_af_amr, gno_genome_af_asj, gno_genome_af_eas, gno_genome_af_fin, gno_genome_af_nfe, gno_genome_af_oth, gno_genome_af_sas, gno_genome_an_afr, gno_genome_an_all, gno_genome_an_amr, gno_genome_an_asj, gno_genome_an_eas, gno_genome_an_fin, gno_genome_an_nfe, gno_genome_an_oth, gno_genome_filter, gno_genome_id, gtex_gene_tissue_eqtl, hetaltab, hgmd_class, hgmd_dna, hgmd_indel_class, hgmd_indel_orig_dna, hgmd_indel_orig_prot, hgmd_indel_pheno, hgmd_overlap_indel_coords, hgmd_overlap_indel_id, hgmd_pheno, hgmd_prot, in_1kg, in_esp, in_exac, linsight_score, max_exac_aaf_all, max_gno_exome_aaf_all, max_gno_genome_aaf_all, mmind_cdna, mmind_id, mmind_prot, rap_score, rmsk, subgerp, subrvis, subrvis_pct, subrvis_pred, trap_cds_syn_splice_pred, trap_nc_splice_pred, weak_enh_33_cell_count, weak_enh_33_cell_list, weak_enh_all_127_tiss_count, allele, feature_type, intron, hgvsc, hgvsp, cdna_position, cds_position, existing_variation, distance, strand, flags, symbol_source, hgnc_id, ccds, hgvs_offset, appris, aloft_confidence, aloft_fraction_transcripts_affected, aloft_pred, aloft_prob_dominant, aloft_prob_recessive, aloft_prob_tolerant, ancestral_allele, cadd_phred, cadd_raw, deogen2_pred, deogen2_score, fathmm_pred, fathmm_score, genocanyon_score, interpro_domain, lrt_pred, lrt_score, m_cap_pred, m_cap_score, mpc_score, mvp_score, metalr_pred, metalr_score, metasvm_pred, metasvm_score, mutpred_aachange, mutpred_top5features, mutpred_protid, mutpred_score, mutationassessor_pred, mutationassessor_score, mutationtaster_pred, mutationtaster_score, provean_pred, provean_score, primateai_pred, primateai_score, revel_rankscore, revel_score, 
reliability_index, vest4_score, clinvar_clnsig, clinvar_review, clinvar_trait, lof, lof_filter, lof_flags, lof_info, mes_ncss_downstream_acceptor, mes_ncss_downstream_acceptor_seq, mes_ncss_downstream_donor, mes_ncss_downstream_donor_seq, mes_ncss_upstream_acceptor, mes_ncss_upstream_acceptor_seq, mes_ncss_upstream_donor, mes_ncss_upstream_donor_seq, mes_swa_acceptor_alt, mes_swa_acceptor_alt_context, mes_swa_acceptor_alt_frame, mes_swa_acceptor_alt_seq, mes_swa_acceptor_diff, mes_swa_acceptor_ref, mes_swa_acceptor_ref_comp, mes_swa_acceptor_ref_comp_seq, mes_swa_acceptor_ref_context, mes_swa_acceptor_ref_frame, mes_swa_acceptor_ref_seq, mes_swa_donor_alt, mes_swa_donor_alt_context, mes_swa_donor_alt_frame, mes_swa_donor_alt_seq, mes_swa_donor_diff, mes_swa_donor_ref, mes_swa_donor_ref_comp, mes_swa_donor_ref_comp_seq, mes_swa_donor_ref_context, mes_swa_donor_ref_frame, mes_swa_donor_ref_seq, maxentscan_alt, maxentscan_alt_seq, maxentscan_diff, maxentscan_ref, maxentscan_ref_seq, ada_score, rf_score, gts, gt_types, gt_phases, gt_depths, gt_ref_depths, gt_alt_depths, gt_quals, gt_alt_freqs) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 
?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]
[parameters: (314700, u'21', 41554522, 41554750, u'rs114481025;rs34163425', u'GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC', u'G', 220.89999389648438, u'VQSRTrancheINDEL99.70to99.80', 'indel', 'del', 1.0, 128, 4, 0, 0, 0.015151515151515152, u'DSCAM', u'ENSG00000171587', u'ENST00000400454', 0, 0, 0, 0, 1, '', '', '', u'', u'protein_coding', u'intron_variant', u'intron_variant', 'LOW', u'', None, u'', None, 4, 0.0, 264, 0.6340000033378601, -0.22699999809265137, 0, 6786, 0, 0, 5.850200176239014, 2.736999988555908, -0.07159999758005142, 0, 5, 0.01899999938905239, 51.58000183105469, 4.010000228881836, 1, None, 'None', 0, 1.3300000429153442, None, None, None, -0.07100000232458115, 0.8799999952316284, -2.7079999446868896, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, None, None, 0, 0, u'MQRankSum', None, None, None, None, None, None, 0.06549999862909317, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, None, None, None, None, None, None, None, None, None, None, None, 1, 4, 0, 0, 1, 0, 2, 0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, (5824, 1920), (25160, 9670), (708, 318), (262, 122), (1396, 554), (3008, 1536), (13152, 4898), (810, 322), None, u'rs114481025,rs34163425', 'None', 0.5601999759674072, None, None, None, None, None, None, None, None, None, None, 0, 0, 0, None, -1.0, -1.0, 0.0, None, None, None, None, u'trf,trf,trf', None, None, None, None, None, None, None, None, None, u'-', u'Transcript', u'14/32', u'ENST00000400454.1:c.2780-3729_2780-3503del', u'', u'', u'', u'', u'', u'-1', u'', 
u'HGNC', u'3039', u'CCDS42929.1', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'', u'-4.728', u'TGAAAAACAAAACCCAAAAGACT', u'10.655', u'CAGGTACGT', u'5.047', u'TCTTTCTGTTGATGGCACAGAGC', u'10.858', u'CAGGTAAGT', u'2.672', u'CGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT', u'6', u'TGTCCACGGTTTCTCTGCTGGCC', u'3.220', u'5.892', u'5.892', u'GTCCACGGTTTCTCTGGTAGCCC', u'CGTTCTGTCCACGGTTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGTGGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT', u'92', u'GTCCACGGTTTCTCTGGTAGCCC', u'-11.010', u'TTTCTCTGCTGGCCCC', u'6', u'CTGCTGGCC', u'15.518', u'4.508', u'4.508', u'CTGGTAACC', u'TTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGTGGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCC', u'6', u'CTGGTAACC', u'', u'', u'', u'', u'', u'', u'', <read-only buffer for 0x55555a1f2790, size -1, offset 0 at 0x2aaaf86baa70>, <read-only buffer for 0x2aaae10013f0, size -1, offset 0 at 0x2aaaf86baab0>, <read-only buffer for 0x2aaaea50d5a8, size -1, offset 0 at 0x2aaaf86baaf0>, <read-only buffer for 0x2aaaea2a5ab0, size -1, offset 0 at 0x2aaaf86bab30>, <read-only buffer for 0x2aaae9bfdc70, size -1, offset 0 at 0x2aaaf86bab70>, <read-only buffer for 0x2aaaea503d98, size -1, offset 0 at 0x2aaaf86babb0>, <read-only buffer for 0x2aaaea1e17b0, size -1, offset 0 at 0x2aaaf86babf0>, <read-only buffer for 0x2aaaea447eb0, size -1, offset 0 at 0x2aaaf86bac30>)]
(Background on this error at: http://sqlalche.me/e/rvf5)

The VCF had been processed by 'vt decompose'

21      41554523        rs114481025;rs34163425  GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAA
CCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC    G       220.9   VQSRTrancheINDEL99.70to99.80    AC=4;AF=0;AN=264
;BaseQRankSum=0.634;ClippingRankSum=-0.227;DP=6786;ExcessHet=5.8502;FS=2.737;InbreedingCoeff=-0.0716;MLEAC=5;MLEAF=0.019;MQ=51.58;MQRankSum=4.01;NEGATIVE_TRAIN_SITE;QD=1.33
;ReadPosRankSum=-0.071;SOR=0.88;VQSLOD=-2.708;culprit=MQRankSum;hetAltAB=0.5602;CSQ=-|intron_variant|MODIFIER|DSCAM|ENSG00000171587|Transcript|ENST00000400454|protein_codin
g||14/32|ENST00000400454.1:c.2780-3729_2780-3503del|||||||||-1||HGNC|3039|YES|CCDS42929.1|||||||||||||||||||||||||||||||||||||||||||||||||||||-4.728|TGAAAAACAAAACCCAAAAGACT|10.655|CAGGTACGT|5.047|TCTTTCTGTTGATGGCACAGAGC|10.858|CAGGTAAGT|2.672|CGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT|6|TGTCCACGGTTTCTCTGCTGGCC|3.220|5.892|5.892|GTCCACGGTTTCTCTGGTAGCCC|CGTTCTGTCCACGGTTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGTGGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT|92|GTCCACGGTTTCTCTGGTAGCCC|-11.010|TTTCTCTGCTGGCCCC|6|CTGCTGGCC|15.518|4.508|4.508|CTGGTAACC|TTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGTGGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCC|6|CTGGTAACC|||||||,-|intron_variant|MODIFIER|DSCAM|ENSG00000171587|Transcript|ENST00000404019|protein_coding||10/28|ENST00000404019.2:c.2036-3729_2036-3503del|||||||||-1|cds_start_NF|HGNC|3039|||||||||||||||||||||||||||||||||||||||||||||||||||||||-4.728|TGAAAAACAAAACCCAAAAGACT|10.655|CAGGTACGT|5.047|TCTTTCTGTTGATGGCACAGAGC|10.858|CAGGTAAGT|2.672|CGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT|6|TGTCCACGGTTTCTCTGCTGGCC|3.220|5.892|5.892|GTCCACGGTTTCTCTGGTAGCCC|CGTTCTGTCCACGGTTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGTGGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCCTCCTGTCCACAGTT|92|GTCCACGGTTTCTCTGGTAGCCC|-11.010|TTTCTCTGCTGGCCCC|6|CTGCTGGCC|15.518|4.508|4.508|CTGGTAACC|TTTCTCTGGTAACCCCCCCCGTCCACGGTTTCTCTGGTACCCCCCTCCTGTCCACGGTTTCTCTGGTACCCCCCCCCGTCCACGGTTTCTCTGGTAGCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCCGTCCACGGTTTCTCTGGTAACCCCCCCCTGTCCACGGTTTCTCTGGTAACCCCCTCCTGTCCACGGTTTCTCTGGT
GGCCCGTTCTGTCCACGGTTTCTCTGCTGGCCCC|6|CTGGTAACC|||||||;fitcons=0.0655;rmsk=trf,trf,trf;gno_genome_ac_all=4;gno_genome_an_all=25160,9670;gno_genome_ac_afr=1;gno_genome_an_afr=5824,1920;gno_genome_ac_amr=0;gno_genome_an_amr=708,318;gno_genome_ac_asj=0;gno_genome_an_asj=262,122;gno_genome_ac_eas=1;gno_genome_an_eas=1396,554;gno_genome_ac_fin=0;gno_genome_an_fin=3008,1536;gno_genome_ac_nfe=2;gno_genome_an_nfe=13152,4898;gno_genome_ac_oth=0;gno_genome_an_oth=810,322;gno_genome_id=rs114481025,rs34163425;gno_genome_af_all=0;gno_genome_af_afr=0;gno_genome_af_amr=0;gno_genome_af_asj=0;gno_genome_af_eas=0;gno_genome_af_fin=0;gno_genome_af_nfe=0;gno_genome_af_oth=0;max_gno_genome_aaf_all=0  GT:AD:DP:GQ:PGT:PID:PL  0/0:56,0:56:99:.:.:0,99,1507    0/0:102,0:102:73:.:.:0,73,2644  0/0:44,0:44:0:.:.:0,0,1166      0/0:50,0:50:51:.:.:0,51,1297    0/0:37,0:37:99:.:.:0,99,1485    0/0:38,0:38:91:.:.:0,91,1176    0/0:78,0:78:44:.:.:0,44,2164    0/0:35,0:35:0:.:.:0,0,826       0/0:42,0:42:99:.:.:0,99,1131    0/1:18,21:39:17:0|1:41554523_GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC_G:17,0,767    0/0:43,0:43:99:.:.:0,105,1244   0/0:36,0:36:99:.:.:0,104,1149   0/0:35,0:35:40:.:.:0,40,940     0/0:54,0:54:99:.:.:0,119,1800   0/0:43,0:43:99:.:.:0,106,1415   0/0:109,0:109:0:.:.:0,0,2409    0/0:53,0:53:91:.:.:0,91,1565    0/0:49,0:49:26:.:.:0,26,1100    0/0:31,0:31:0:.:.:0,0,673       0/0:27,0:27:12:.:.:0,12,729     0/0:27,0:27:12:.:.:0,12,668     0/0:55,0:55:0:.:.:0,0,1345      0/0:29,0:29:52:.:.:0,52,937     0/0:41,0:41:90:.:.:0,90,1165    0/0:102,0:102:0:.:.:0,0,2518    0/0:71,0:71:0:.:.:0,0,1740      0/0:29,0:29:25:.:.:0,25,826     0/0:36,0:36:99:.:.:0,99,1485    0/0:43,0:43:99:.:.:0,105,1537   0/0:36,0:36:99:.:.:0,102,1530   0/0:39,0:39:69:.:.:0,69,1085    0/0:46,0:46:99:.:.:0,103,1302   
0/0:28,0:28:0:.:.:0,0,499       0/0:41,0:41:91:.:.:0,91,1194    0/0:38,0:38:79:.:.:0,79,1110    0/0:46,0:46:99:.:.:0,101,1273   0/0:52,0:52:0:.:.:0,0,857       0/0:39,0:39:0:.:.:0,0,1070      0/0:39,0:39:99:.:.:0,105,1575   0/0:79,0:79:0:.:.:0,0,1968      0/0:43,0:43:99:.:.:0,108,1151   0/0:62,0:62:49:.:.:0,49,1683    0/0:46,0:46:99:.:.:0,100,1324   0/0:63,0:63:99:.:.:0,120,1800   0/0:71,0:71:99:.:.:0,120,1800   0/0:55,0:55:99:.:.:0,105,1573   0/0:31,0:31:0:.:.:0,0,646       0/0:62,0:62:1:.:.:0,1,1578      0/0:26,0:26:0:.:.:0,0,614       0/0:61,0:61:99:.:.:0,120,1800   0/0:51,0:51:0:.:.:0,0,1177      0/0:52,0:52:0:.:.:0,0,1282      0/0:67,0:67:94:.:.:0,94,1900    0/0:69,0:69:99:.:.:0,100,1800   0/0:49,0:49:77:.:.:0,77,1396    0/0:64,0:64:0:.:.:0,0,1407      0/0:69,0:69:0:.:.:0,0,1612      0/0:80,0:80:80:.:.:0,80,2032    0/1:17,19:36:68:0|1:41554523_GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC_G:68,0,803    0/0:36,0:36:0:.:.:0,0,876       0/0:47,0:47:48:.:.:0,48,1287    0/0:56,0:56:99:.:.:0,101,1678   0/0:45,0:45:0:.:.:0,0,1075      0/0:113,0:113:99:.:.:0,120,1800 0/0:55,0:55:0:.:.:0,0,1316      0/1:15,17:32:99:.:.:105,0,470   0/0:82,0:82:0:.:.:0,0,2146
      0/0:39,0:39:48:.:.:0,48,1064    0/0:32,0:32:87:.:.:0,87,1305    0/0:48,0:48:0:.:.:0,0,1030      0/0:48,0:48:0:.:.:0,0,1129      0/0:51,0:51:0:.:.:0,0,1398      0/0:40,0:40:36:.:.:0,36,964     0/0:30,0:30:60:.:.:0,60,973     0/0:77,0:77:99:.:.:0,107,1800   0/0:43,0:43:93:.:.:0,93,1242    0/0:42,0:42:61:.:.:0,61,1104    0/0:37,0:37:9:.:.:0,9,1021      0/0:44,0:44:0:.:.:0,0,760       0/0:47,0:47:99:.:.:0,117,1755   0/0:72,0:72:48:.:.:0,48,2014    0/0:54,0:54:49:.:.:0,49,1406    0/0:48,0:48:99:.:.:0,120,1474
   0/0:45,0:45:69:.:.:0,69,1251    0/0:37,0:37:42:.:.:0,42,1046    0/0:41,0:41:72:.:.:0,72,1109    0/0:164,0:164:99:.:.:0,120,1800 0/0:28,0:28:72:.:.:0,72,1080    0/0:16,0:16:13:.:.:0,13,504     0/0:42,0:42:22:.:.:0,22,1057    0/0:76,0:76:99:.:.:0,100,1800   0/0:76,0:76:99:.:.:0,120,1800   0/0:54,0:54:99:.:.:0,104,1670   0/0:62,0:62:0:.:.:0,0,1316      0/0:29,0:29:52:.:.:0,52,875     0/0:38,0:38:0:.:.:0,0,893       0/0:86,0:86:0:.:.:0,0,1970      0/0:35,0:35:83:.:.:0,83,992     0/0:89,0:89:90:.:.:0,90,2590    0/0:24,0:24:35:.:.:0,35,780     0/0:34,0:34:79:.:.:0,79,951     0/0:58,0:58:75:.:.:0,75,1602    0/0:78,0:78:54:.:.:0,54,2048    0/0:40,0:40:24:.:.:0,24,1023    0/0:95,0:95:0:.:.:0,0,2543      0/0:44,0:44:11:.:.:0,11,1179    0/0:123,0:123:57:.:.:0,57,3377  0/0:43,0:43:99:.:.:0,113,1451   0/0:56,0:56:99:.:.:0,120,1800   0/0:28,0:28:48:.:.:0,48,841     0/0:65,0:65:0:.:.:0,0,1420      0/0:53,0:53:0:.:.:0,0,924       0/0:56,0:56:99:.:.:0,108,1499   0/1:23,36:59:99:0|1:41554523_GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC_G:125,0,636   0/0:27,0:27:81:.:.:0,81,851     0/0:45,0:45:56:.:.:0,56,1189    0/0:50,0:50:78:.:.:0,78,1451    0/0:37,0:37:75:.:.:0,75,1068    0/0:54,0:54:0:.:.:0,0,1340      0/0:37,0:37:24:.:.:0,24,915     0/0:74,0:74:85:.:.:0,85,2078    0/0:31,0:31:90:.:.:0,90,1350    0/0:48,0:48:94:.:.:0,94,1258    0/0:42,0:42:99:.:.:0,105,1194   0/0:35,0:35:99:.:.:0,99,1485    0/0:35,0:35:77:.:.:0,77,1033    0/0:37,0:37:80:.:.:0,80,1076    0/0:68,0:68:99:.:.:0,104,1800   0/0:42,0:42:99:.:.:0,107,1404   0/0:57,0:57:99:.:.:0,118,1588   0/0:33,0:33:99:.:.:0,99,970     0/0:51,0:51:99:.:.:0,103,1289

@mmoisse

mmoisse commented May 8, 2020

I believe the problem is the gno_genome_an_ fields:
e.g. gno_genome_an_afr has multiple values (5824, 1920), and vcf2db.py cannot handle that.
You could solve it by removing the field or one of the values; see my previous post.
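
To find such fields in a record without reading the whole line by eye, something like the following sketch could help. It flags INFO keys whose value is a comma-separated list of numbers. It assumes, based only on the parameters printed above, that numeric lists are the ones that end up as tuples, while string lists such as rmsk=trf,trf,trf are stored as plain text. The function names are hypothetical.

```python
def is_number(text):
    """True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def numeric_multi_values(info):
    """Map INFO key -> list of values for keys that hold several numeric values."""
    hits = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if not sep:  # a flag such as NEGATIVE_TRAIN_SITE has no value
            continue
        parts = value.split(",")
        if len(parts) > 1 and all(is_number(p) for p in parts):
            hits[key] = parts
    return hits


# A shortened INFO string built from values in the record above.
info = "AC=4;NEGATIVE_TRAIN_SITE;rmsk=trf,trf,trf;gno_genome_ac_afr=1;gno_genome_an_afr=5824,1920"
print(numeric_multi_values(info))
```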

@huangk3

huangk3 commented May 8, 2020

Thanks @mmoisse. The error is fixed by removing the second value of these fields. The root of the problem is that there are duplicate records in the gnomAD genome VCF:

21     41554523      **rs114481025**     GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC       G
21     41554523      rs114481025   G      C
21     41554523      rs114481025   GCAGAGAAACCGTGGACAGAACGGGCCAC      G
21     41554523      rs114481025   G      GCAGAGAAACCGTGGACAGAACGGGCCAC
21     41554523      **rs34163425**       GCAGAGAAACCGTGGACAGAACGGGCCACCAGAGAAACCGTGGACAGGAGGGGGTTACCAGAGAAACCGTGGACAGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGGGTTACCAGAGAAACCGTGGACGGGGGGGCTACCAGAGAAACCGTGGACGGGGGGGGGTACCAGAGAAACCGTGGACAGGAGGGGGGTACCAGAGAAACCGTGGACGGGGGGGGTTAC       G

vcfanno concatenated the allele numbers (ANs) from rs114481025 and rs34163425 because the annotation used ops=["self"]. The error went away after I set ops=["max"].
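
The difference between the "self" and "max" operations mentioned above can be pictured with a small sketch. This is only an illustration of the idea (keep every value versus collapse to one number), not vcfanno's implementation. `reduce_values` is a hypothetical name.

```python
def reduce_values(values, op):
    """Combine the values collected from overlapping annotation records."""
    if op == "self":
        # Keep everything: two source records yield two values.
        return ",".join(str(v) for v in values)
    if op == "max":
        # Collapse to a single number.
        return str(max(values))
    raise ValueError("unsupported op: %s" % op)


an_afr = [5824, 1920]  # AN values from the two overlapping gnomAD records
print(reduce_values(an_afr, "self"))  # 5824,1920
print(reduce_values(an_afr, "max"))   # 5824
```

With a single value per field, the INSERT receives a plain number instead of a tuple.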

@erinijapranckeviciene

erinijapranckeviciene commented Feb 16, 2021

Hello,

I am experiencing a similar issue. We have multiple exomes annotated with VEP, from which we create a multisample VCF using bcftools merge. After the merge, this multisample VCF is decomposed with vt decompose -s and passed to vcf2db.py to create a GEMINI db. During this process, some sites that were previously multiallelic trigger the error discussed in this issue.

I can't figure out what is wrong. Your help is very much appreciated. I attach the vcf.gz, vcf.gz.tbi and ped of 4 samples, containing only the two lines that prevent loading, all in one zip file.

If it is possible to identify which field impairs the loading, and how, we would take care of it before using vcf2db.py.

Many thanks in advance!

n.vcf.gz.zip.zip

Update:
While asking for help, I figured it out myself :). In my case, for multiallelic variants from the merge, the INFO AC field gets two values, but it is defined as Number=1. Changing it to Number=. allows vcf2db.py to load my multisample VCF into the GEMINI db.
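
For anyone wanting to script this header change, a minimal sketch follows. It rewrites the Number attribute of one ##INFO line and leaves every other line untouched. The function name `relax_info_number` is hypothetical, and the Description text in the example line is illustrative.

```python
import re


def relax_info_number(header_line, field_id, new_number="."):
    """Rewrite Number= in the ##INFO line for field_id; other lines pass through."""
    pattern = r'^(##INFO=<ID=%s,Number=)[^,]+' % re.escape(field_id)
    # A function replacement avoids any escaping issues in new_number.
    return re.sub(pattern, lambda m: m.group(1) + new_number, header_line)


line = '##INFO=<ID=AC,Number=1,Type=Integer,Description="Allele count">'
print(relax_info_number(line, "AC"))
```

One possible workflow, stated here as an assumption and not as what was done in this thread, is to dump the header with bcftools view -h, pass each line through this function, and write the result back with bcftools reheader.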
