This repository has been archived by the owner on May 15, 2019. It is now read-only.

IndexError: string index out of range #4

Open
bio-bench opened this issue Jan 31, 2018 · 15 comments

Comments

@bio-bench

I get this error when using human_g1k_v37.fasta. Is there a log file I can check to find the cause of this error?

@endrebak
Member

endrebak commented Feb 1, 2018

Thanks for the report.

Can you please post the output of

head -1 <your fasta file>
head <your bim file>

and the output snpflip posts to stdout including the error message?

Thanks,

Endre

@juliedwhite

Hi, I'm getting the same error as @M-Saleem
Specifically:

snpflip --fasta-genome ~/work/ReferenceDatasets/1000G_hg19_fasta/human_g1k_v37.fasta \
  --bim-file ADAPT_2784ppl_567K_hg19_UpdateChrCodeXYM.bim \
  --output-prefix ADAPT_2784ppl_567K_hg19

Chromosome MT in .bim, but not in fasta file.
There were 1 'N' nucleotides in chromosome 1.
Traceback (most recent call last):
  File "/storage/home/jdw345/software/bin/snpflip", line 46, in <module>
    snp_table = create_snp_table(args["--bim-file"], args["--fasta-genome"])
  File "/storage/home/jdw345/software/lib/python3.6/site-packages/snp_flip/table.py", line 11, in create_snp_table
    reference_genome_data = get_reference_genome_data(bim_table, fa_file)
  File "/storage/home/jdw345/software/lib/python3.6/site-packages/snp_flip/reference_genome.py", line 37, in get_reference_genome_data
    snp_nucleotides = [snp.upper() for snp in _get_snps(str(nucleotides))]
IndexError: string index out of range

Here is the output from head -1 <fasta file>

head -1 ~/work/ReferenceDatasets/1000G_hg19_fasta/human_g1k_v37.fasta
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1

Here is the output from head <bim file>

head ADAPT_2784ppl_567K_hg19.bim
1       rs4477212       0       82154   0       A
1       rs12564807      0       734462  0       A
1       rs3094315       0       752566  G       A
1       rs3131972       0       752721  A       G
1       rs148828841     0       760998  A       C
1       rs12562034      0       768448  0       G
1       rs12124819      0       776546  G       A
1       rs115093905     0       787173  T       G
1       rs11240777      0       798959  A       G
1       rs6681049       0       800007  0       C

In case it is also helpful, here are all of the headings in my fasta file:

grep ">" ~/work/ReferenceDatasets/1000G_hg19_fasta/human_g1k_v37.fasta
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1
>2 dna:chromosome chromosome:GRCh37:2:1:243199373:1
>3 dna:chromosome chromosome:GRCh37:3:1:198022430:1
>4 dna:chromosome chromosome:GRCh37:4:1:191154276:1
>5 dna:chromosome chromosome:GRCh37:5:1:180915260:1
>6 dna:chromosome chromosome:GRCh37:6:1:171115067:1
>7 dna:chromosome chromosome:GRCh37:7:1:159138663:1
>8 dna:chromosome chromosome:GRCh37:8:1:146364022:1
>9 dna:chromosome chromosome:GRCh37:9:1:141213431:1
>10 dna:chromosome chromosome:GRCh37:10:1:135534747:1
>11 dna:chromosome chromosome:GRCh37:11:1:135006516:1
>12 dna:chromosome chromosome:GRCh37:12:1:133851895:1
>13 dna:chromosome chromosome:GRCh37:13:1:115169878:1
>14 dna:chromosome chromosome:GRCh37:14:1:107349540:1
>15 dna:chromosome chromosome:GRCh37:15:1:102531392:1
>16 dna:chromosome chromosome:GRCh37:16:1:90354753:1
>17 dna:chromosome chromosome:GRCh37:17:1:81195210:1
>18 dna:chromosome chromosome:GRCh37:18:1:78077248:1
>19 dna:chromosome chromosome:GRCh37:19:1:59128983:1
>20 dna:chromosome chromosome:GRCh37:20:1:63025520:1
>21 dna:chromosome chromosome:GRCh37:21:1:48129895:1
>22 dna:chromosome chromosome:GRCh37:22:1:51304566:1
>X dna:chromosome chromosome:GRCh37:X:1:155270560:1
>Y dna:chromosome chromosome:GRCh37:Y:2649521:59034049:1
>MT gi|251831106|ref|NC_012920.1| Homo sapiens mitochondrion, complete genome
>GL000207.1 dna:supercontig supercontig::GL000207.1:1:4262:1
>GL000226.1 dna:supercontig supercontig::GL000226.1:1:15008:1
>GL000229.1 dna:supercontig supercontig::GL000229.1:1:19913:1
>GL000231.1 dna:supercontig supercontig::GL000231.1:1:27386:1
>GL000210.1 dna:supercontig supercontig::GL000210.1:1:27682:1
>GL000239.1 dna:supercontig supercontig::GL000239.1:1:33824:1
>GL000235.1 dna:supercontig supercontig::GL000235.1:1:34474:1
>GL000201.1 dna:supercontig supercontig::GL000201.1:1:36148:1
>GL000247.1 dna:supercontig supercontig::GL000247.1:1:36422:1
>GL000245.1 dna:supercontig supercontig::GL000245.1:1:36651:1
>GL000197.1 dna:supercontig supercontig::GL000197.1:1:37175:1
>GL000203.1 dna:supercontig supercontig::GL000203.1:1:37498:1
>GL000246.1 dna:supercontig supercontig::GL000246.1:1:38154:1
>GL000249.1 dna:supercontig supercontig::GL000249.1:1:38502:1
>GL000196.1 dna:supercontig supercontig::GL000196.1:1:38914:1
>GL000248.1 dna:supercontig supercontig::GL000248.1:1:39786:1
>GL000244.1 dna:supercontig supercontig::GL000244.1:1:39929:1
>GL000238.1 dna:supercontig supercontig::GL000238.1:1:39939:1
>GL000202.1 dna:supercontig supercontig::GL000202.1:1:40103:1
>GL000234.1 dna:supercontig supercontig::GL000234.1:1:40531:1
>GL000232.1 dna:supercontig supercontig::GL000232.1:1:40652:1
>GL000206.1 dna:supercontig supercontig::GL000206.1:1:41001:1
>GL000240.1 dna:supercontig supercontig::GL000240.1:1:41933:1
>GL000236.1 dna:supercontig supercontig::GL000236.1:1:41934:1
>GL000241.1 dna:supercontig supercontig::GL000241.1:1:42152:1
>GL000243.1 dna:supercontig supercontig::GL000243.1:1:43341:1
>GL000242.1 dna:supercontig supercontig::GL000242.1:1:43523:1
>GL000230.1 dna:supercontig supercontig::GL000230.1:1:43691:1
>GL000237.1 dna:supercontig supercontig::GL000237.1:1:45867:1
>GL000233.1 dna:supercontig supercontig::GL000233.1:1:45941:1
>GL000204.1 dna:supercontig supercontig::GL000204.1:1:81310:1
>GL000198.1 dna:supercontig supercontig::GL000198.1:1:90085:1
>GL000208.1 dna:supercontig supercontig::GL000208.1:1:92689:1
>GL000191.1 dna:supercontig supercontig::GL000191.1:1:106433:1
>GL000227.1 dna:supercontig supercontig::GL000227.1:1:128374:1
>GL000228.1 dna:supercontig supercontig::GL000228.1:1:129120:1
>GL000214.1 dna:supercontig supercontig::GL000214.1:1:137718:1
>GL000221.1 dna:supercontig supercontig::GL000221.1:1:155397:1
>GL000209.1 dna:supercontig supercontig::GL000209.1:1:159169:1
>GL000218.1 dna:supercontig supercontig::GL000218.1:1:161147:1
>GL000220.1 dna:supercontig supercontig::GL000220.1:1:161802:1
>GL000213.1 dna:supercontig supercontig::GL000213.1:1:164239:1
>GL000211.1 dna:supercontig supercontig::GL000211.1:1:166566:1
>GL000199.1 dna:supercontig supercontig::GL000199.1:1:169874:1
>GL000217.1 dna:supercontig supercontig::GL000217.1:1:172149:1
>GL000216.1 dna:supercontig supercontig::GL000216.1:1:172294:1
>GL000215.1 dna:supercontig supercontig::GL000215.1:1:172545:1
>GL000205.1 dna:supercontig supercontig::GL000205.1:1:174588:1
>GL000219.1 dna:supercontig supercontig::GL000219.1:1:179198:1
>GL000224.1 dna:supercontig supercontig::GL000224.1:1:179693:1
>GL000223.1 dna:supercontig supercontig::GL000223.1:1:180455:1
>GL000195.1 dna:supercontig supercontig::GL000195.1:1:182896:1
>GL000212.1 dna:supercontig supercontig::GL000212.1:1:186858:1
>GL000222.1 dna:supercontig supercontig::GL000222.1:1:186861:1
>GL000200.1 dna:supercontig supercontig::GL000200.1:1:187035:1
>GL000193.1 dna:supercontig supercontig::GL000193.1:1:189789:1
>GL000194.1 dna:supercontig supercontig::GL000194.1:1:191469:1
>GL000225.1 dna:supercontig supercontig::GL000225.1:1:211173:1
>GL000192.1 dna:supercontig supercontig::GL000192.1:1:547496:1
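
Since snpflip reports "Chromosome MT in .bim, but not in fasta file", it may help to compare the chromosome names in the two files directly. The following is a minimal sketch, not part of snpflip. It assumes the sequence name is the first whitespace-delimited token after `>`, which may differ from how snpflip parses headers. The helper names and the two extra .bim rows (`rs_example_x`, `rs_example_mt`) are made up for illustration.

```python
def fasta_names(lines):
    """Return the sequence names found in FASTA header lines.

    Assumption: the name is the first whitespace-delimited token after '>'.
    """
    names = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def bim_chromosomes(lines):
    """Return the set of chromosome codes (first column) in .bim lines."""
    return {line.split()[0] for line in lines if line.strip()}


# Headers copied from the listing above; the last two .bim rows are invented.
headers = [
    ">1 dna:chromosome chromosome:GRCh37:1:1:249250621:1",
    ">X dna:chromosome chromosome:GRCh37:X:1:155270560:1",
    ">MT gi|251831106|ref|NC_012920.1| Homo sapiens mitochondrion, complete genome",
]
bim = [
    "1\trs4477212\t0\t82154\t0\tA",
    "23\trs_example_x\t0\t100\tA\tG",
    "MT\trs_example_mt\t0\t5\tA\tG",
]

# Chromosome codes used in the .bim that have no matching FASTA name.
missing = sorted(bim_chromosomes(bim) - fasta_names(headers))
print(missing)  # ['23']
```

For real files, open file handles can be passed in place of the lists, for example `fasta_names(open("genome.fasta"))`, where the file name is a placeholder.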

@endrebak
Member

Might it be that the MT chromosome is called something else in your fasta? If you could send me the bim, I could debug it :)

@endrebak
Member

Thanks for reporting this btw!

@juliedwhite

Hi Endre! Thanks for getting back to me. I think my problem was more related to missing data - I upped my missing data filter and everything was solved.

Regarding the chromosome naming - is there any way to tell the program that 23 = X, 24 = Y, and 26 = MT?

@endrebak
Member

I should have a flag for such a map-file. A bit busy now, but will keep this issue open as a reminder. Glad to hear it worked for you :)
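
Until such a flag exists, the chromosome codes can be rewritten in the .bim before running snpflip. This is a rough sketch, not a snpflip feature. The mapping and the function name are hypothetical, and the target names should match whatever names your fasta file uses (here `MT`, as in human_g1k_v37.fasta). It assumes a tab-separated .bim and replaces the first field only on an exact match.

```python
# Hypothetical mapping from numeric chromosome codes to FASTA names.
CODE_TO_NAME = {"23": "X", "24": "Y", "26": "MT"}


def recode_bim_line(line, mapping=CODE_TO_NAME):
    """Replace the chromosome code in one tab-separated .bim line.

    Only an exact match on the whole first field is replaced, so other
    columns (SNP names, positions) are never touched.
    """
    fields = line.rstrip("\n").split("\t")
    fields[0] = mapping.get(fields[0], fields[0])
    return "\t".join(fields)


# Made-up rows for illustration.
print(recode_bim_line("26\trs_example\t0\t2326\tA\tG"))
print(recode_bim_line("1\trs4477212\t0\t82154\t0\tA"))
```

Applying `recode_bim_line` to every line of the input file and writing the results to a new file gives a recoded .bim without modifying the original.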

@juliedwhite

juliedwhite commented Mar 25, 2018

Hi Endre, just wanted to provide an update. I was working further with the dataset and found the same problem, even after removing the missing data. Based on this data, the index out of range error might have something to do with the treatment of MT data. As an illustration:

#We know this won't compare 23, 24, and 26, but let's do it anyway
$ snpflip --fasta-genome ~/work/ReferenceDatasets/1000G_hg19_fasta/human_g1k_v37.fasta --bim-file ADAPT_2784ppl_567K_hg19_geno0.1_mind0.1.bim --output-prefix ADAPT_2784ppl_567K_hg19_geno0.1_mind0.1
Chromosome 23 in .bim, but not in fasta file.
Chromosome 24 in .bim, but not in fasta file.
Chromosome 26 in .bim, but not in fasta file.
#Produces all the output files as usual

#Change 23=X, 24=Y, 26=M as per the snpflip --help instructions
$ awk -F'\t' -vOFS='\t' '{ gsub("23", "X", $1) ; gsub("24", "Y", $1) ; gsub ("26", "M", $1) ; print }' ADAPT_2784ppl_567K_hg19_geno0.1_mind0.1.bim > ADAPT_2784ppl_567K_hg19_geno0.1_mind0.1_XYM.bim

$ snpflip --fasta-genome ~/work/ReferenceDatasets/1000G_hg19_fasta/human_g1k_v37.fasta --bim-file ADAPT_2784ppl_567K_hg19_geno0.1_mind0.1_XYM.bim --output-prefix ADAPT_2784ppl_567K_hg19_geno0.1_mind0.1_XYM
Traceback (most recent call last):
  File "/storage/home/jdw345/software/bin/snpflip", line 46, in <module>
    snp_table = create_snp_table(args["--bim-file"], args["--fasta-genome"])
  File "/storage/home/jdw345/software/lib/python3.6/site-packages/snp_flip/table.py", line 11, in create_snp_table
    reference_genome_data = get_reference_genome_data(bim_table, fa_file)
  File "/storage/home/jdw345/software/lib/python3.6/site-packages/snp_flip/reference_genome.py", line 37, in get_reference_genome_data
    snp_nucleotides = [snp.upper() for snp in _get_snps(str(nucleotides))]
IndexError: string index out of range
#Womp

#What if we change 26 to MT, since that is what is represented in the fasta file? 
$ awk -F'\t' -vOFS='\t' '{ gsub("23", "X", $1) ; gsub("24", "Y", $1) ; gsub ("26", "MT", $1) ; print }' ADAPT_2784ppl_567K_hg19_geno0.1_mind0.1.bim > ADAPT_2784ppl_567K_hg19_geno0.1_mind0.1_XYMT.bim

$ snpflip --fasta-genome ~/work/ReferenceDatasets/1000G_hg19_fasta/human_g1k_v37.fasta --bim-file ADAPT_2784ppl_567K_hg19_geno0.1_mind0.1_XYMT.bim --output-prefix ADAPT_2784ppl_567K_hg19_geno0.1_mind0.1_XYMT
Chromosome MT in .bim, but not in fasta file.
#Produces all the files as expected, but this time without comparing the MT SNPs

This isn't actually a problem for me, as we're not analyzing the MT genome and I can just as easily remove it before running snpflip. But, I thought I'd post this for clarification.

Thanks for the great program!

@endrebak
Member

Great work. Thanks. I'll give snpflip an update after I finish my PhD :)

@rcanovas

Hi there I have the same error but I have not been able to fix it by following the solutions suggested.

snpflip -b FNLITUK_b37hqis.bim -f hs37d5_v2.fa -o snpflip_output
Traceback (most recent call last):
  File "/home/rcanovas/.local/share/virtualenvs/testing_area-AP9TWkm_/bin/snpflip", line 46, in <module>
    snp_table = create_snp_table(args["--bim-file"], args["--fasta-genome"])
  File "/home/rcanovas/.local/share/virtualenvs/testing_area-AP9TWkm_/lib/python3.5/site-packages/snp_flip/table.py", line 11, in create_snp_table
    reference_genome_data = get_reference_genome_data(bim_table, fa_file)
  File "/home/rcanovas/.local/share/virtualenvs/testing_area-AP9TWkm_/lib/python3.5/site-packages/snp_flip/reference_genome.py", line 37, in get_reference_genome_data
    snp_nucleotides = [snp.upper() for snp in _get_snps(str(nucleotides))]
IndexError: string index out of range

Here are the chromosomes in my .bim file:

cut -f 1 FNLITUK_b37hqis.bim | sort -u
1
10
11
12
13
14
15
16
17
18
19
2
20
21
22
3
4
5
6
7
8
9

and the headings of my .fa file

grep ">" ~/assembly_builds/hs37d5_v2.fa

>1
>2
>3
>4
>5
>6
>7
>8
>9
>10
>11
>12
>13
>14
>15
>16
>17
>18
>19
>20
>21
>22

@endrebak
Member

Would love to try to help. Could you send me the bim and fa or are they confidential?

@rcanovas

rcanovas commented Sep 11, 2018 via email

@endrebak
Member

endrebak commented Sep 11, 2018 via email

@nadavrap

I had the same issue, but I finally found out that I used a different genome reference. Once I switched to the right one it worked properly.
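
A reference mismatch of this kind can be checked before running snpflip by comparing each SNP position in the .bim with the length of the corresponding fasta sequence. The sketch below is an assumption about how the error can arise, not a description of snpflip's internals: in Python, indexing a string past its end raises `IndexError: string index out of range`. The function names and the toy data are hypothetical, and sequence names are taken as the first token after `>`.

```python
def fasta_lengths(lines):
    """Map each sequence name to its length in bases."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            tokens = line[1:].split()
            name = tokens[0] if tokens else None
            if name is not None:
                lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def positions_out_of_range(bim_lines, lengths):
    """Yield (chromosome, snp_name, position) for SNPs past the sequence end.

    .bim columns: chromosome, SNP name, genetic distance, 1-based position,
    allele 1, allele 2. Chromosomes absent from the fasta are skipped.
    """
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        chrom, snp, pos = fields[0], fields[1], int(fields[3])
        if chrom in lengths and pos > lengths[chrom]:
            yield chrom, snp, pos


# Toy data: a 6-base "chromosome" and two made-up SNPs.
fasta = [">1 toy sequence", "ACGT", "AC"]
bim = ["1\tsnp_a\t0\t3\tA\tG", "1\tsnp_b\t0\t7\tA\tG"]

lengths = fasta_lengths(fasta)
bad = list(positions_out_of_range(bim, lengths))
print(lengths)  # {'1': 6}
print(bad)      # [('1', 'snp_b', 7)]
```

If this reports any SNPs, the .bim and the fasta most likely come from different genome builds, or the fasta is truncated.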

@Victor0122

Victor0122 commented Nov 24, 2018

Hi there I have the same error but I have not been able to fix it by following the solutions suggested.
Traceback (most recent call last):
  File "/home/victor/.local/bin/snpflip", line 46, in <module>
    snp_table = create_snp_table(args["--bim-file"], args["--fasta-genome"])
  File "/home/victor/.local/lib/python2.7/site-packages/snp_flip/table.py", line 11, in create_snp_table
    reference_genome_data = get_reference_genome_data(bim_table, fa_file)
  File "/home/victor/.local/lib/python2.7/site-packages/snp_flip/reference_genome.py", line 37, in get_reference_genome_data
    snp_nucleotides = [snp.upper() for snp in _get_snps(str(nucleotides))]
IndexError: string index out of range

My .fa file:
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
tatgtgagaagatagctgaacgccttgtccacatcatcttactgctgaga
gttgagctcaccctcagtccctcacagttccacactgcctgcagagtgag
tttcccatgtcttcaccagagacttttgccagaggcttctgagacgcaag
ttaacaatgcagacctggagggtatctccaggtgcagtagagtggtaatc
tcggaacctcctgactcagaatactgctaccttcacactgtcataagaat
gcagcgagttgagagctggcttctaggcatgcttccttttgagagctgag
gacaggacagaaccctcccgcatcctgcctgactgtagacgtacctgcta

@Victor0122

I have another issue.
I tried running the SNP check one chromosome at a time.
I found that my output file contains wrong positions.
chromosome 0_idx_position snp_name genetic_distance allele_1 allele_2 reference reference_rev strand
1 249579 1.24958 0 G A A T forward
1 251204 1.251205 0 G C G C ambiguous
1 266522 1.266523 0 C A G C reverse
1 273486 1.273487 0 A G A T forward
1 307562 1.307563 0 A C C G forward
1 320054 1.320055 0 G A A T forward
1 343358 1.343359 0 G A T A reverse
1 348209 1.3482100000000001 0 G A T A reverse
1 363617 1.363618 0 G A A T forward
1 373663 1.373664 0 A T T A ambiguous
1 398891 1.398892 0 A G C G reverse
1 412572 1.4125729999999999 0 A C A T forward
1 420035 1.420036 0 G A G C forward

For example, 1.3482100000000001 should be 1.348209.
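
The values in the snp_name column above look consistent with SNP names of a "chromosome.position" form being parsed as floating-point numbers somewhere along the way. That is a guess from the output, not something confirmed in this thread. The small sketch below shows the effect with a hypothetical name: a trailing zero is lost once the text goes through a float, whereas keeping the value as a string preserves it.

```python
def round_trip_through_float(identifier):
    """Show what an identifier looks like after being parsed as a float."""
    return repr(float(identifier))


# Hypothetical SNP name in a "chromosome.position" style; not taken from the thread.
name = "1.249580"

print(round_trip_through_float(name))  # 1.24958
print(name)                            # 1.249580
```

If the .bim is read with pandas, passing `dtype=str` for that column to `read_csv` is one way to keep such names as text.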


6 participants