Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using SWAMPy for other pathogens #7

Open
mariaelf97 opened this issue Feb 13, 2024 · 10 comments
Open

Using SWAMPy for other pathogens #7

mariaelf97 opened this issue Feb 13, 2024 · 10 comments

Comments

@mariaelf97
Copy link

Dear authors!

Good day to you.
I was wondering how I can use SWAMPy for other pathogens? Which parts of the code should I modify?
I have different reference genomes and primer sets. Is there any parts of the code that uses Covid reference genome as default?

Regards,
Maryam

@rabiafidan
Copy link
Collaborator

Dear Maryam,

Thank you for your interest in SWAMPy! Yes, there are parts of the code that use SARS-CoV-2 genome as default. Which organism do you have in your mind? Genome size is an important consideration for this kind of modification. Also, the code is not structured to accommodate multiple chromosomes, if that is applicable. If your organism has a single chromosome and a relatively small genome size, it should be possible and we can give you pointers on how to do it.

All the best,
Rabia

@mariaelf97
Copy link
Author

Dear @rabiafidan
Thank you for your response. The pathogen of interest is a bacteria with 1 chromosome. The length is approximately 4 mbps.

Regards,
Maryam

@rabiafidan
Copy link
Collaborator

Dear Maryam,

Sorry for the late response. Sounds like you can go ahead and try it! Just remember, SWAMPy is intended to simulate whole genome amplicon sequencing with overlapping amplicons on an Illumina machine. If you have a different construct, it is also a consideration.

The step that uses SARS-CoV-2 genome as default is adding the PCR error.

You will see add_PCR_errors function (from PCR_error.py) takes WUHAN_REF argument. This argument is defined in the main script simulate_metagenome.py line 38.

  1. Firstly, you would need to add the reference files into the ref directory. You should add the FASTA file of your bacterium and the bowtie2 indices (the ones that have bt2 extension).
  2. You will need to modify line 38 at simulate_metagenome.py. Change "MN908947.3" with your FASTA file name, without the .fasta or .fa extension.
  3. Change the chromosome name "MN908947.3" at line 285 of PCR_error.py with the chromosome name of your bacterium.
  4. Then follow the steps of issue Adding custom amplicon panels #2 to add your suitable amplicon sets.

Hopefully, it should work. Let me know if you are unsure about how to do any of the steps and how it goes!

@rabiafidan
Copy link
Collaborator

I have just noticed a point that might not have been clear. The reference genome you want to add to the ref folder is the ancestral strain (equivalent to the Wuhan variant of SAS-CoV-2), not necessarily the strain you want to simulate. This is to ensure PCR errors are consistent across different variants of the bacterium in your sample.

Hope it makes sense!

@mariaelf97
Copy link
Author

mariaelf97 commented Feb 21, 2024

Hi, Thank you for the information.
I proceeded as you instructed and I am getting the following error:

python simulate_metagenome.py --genomes_file all_seqs.fasta --temp_folder temp --genome_abundances .abundances.tsv --primer_set a1 --output_folder output --read_length 400 --seed 10 --no_pcr_error```

The first error I got was:

import pandas as pd
2024-02-21 10:48:25,444 [INFO] Primer set: Artic v1
2024-02-21 10:48:25,444 [INFO] Number of reads: 100000
2024-02-21 10:48:25,445 [INFO] Random seed: 10
2024-02-21 10:48:25,445 [INFO] Amplicon pseudocounts/ i.e. quality parameter: 200
2024-02-21 10:48:25,488 [INFO] Spaces in the genomes and abundances files are processed as '_' characters if exist
2024-02-21 10:48:25,489 [INFO] Total of relative abundance values is 100.0, not 1.
2024-02-21 10:48:25,489 [INFO] Continuing, normalising total of genome abundances to 1.
2024-02-21 10:48:25,616 [INFO] Working on genome 1 of 4
2024-02-21 10:48:25,617 [INFO] Using bowtie2 to align primers to genome MTB-Oman-321769-L1
Building a SMALL index
Renaming temp/indices/1.3.bt2.tmp to temp/indices/L1.3.bt2
Renaming temp/indices/L1.4.bt2.tmp to temp/indices/L1.4.bt2
Renaming temp/indices/L1.1.bt2.tmp to temp/indices/L1.1.bt2
Renaming temp/indices/L1.2.bt2.tmp to temp/indices/L1.2.bt2
Renaming temp/indices/L1.rev.1.bt2.tmp to temp/indices/L1.rev.1.bt2
Renaming temp/indices/L1.rev.2.bt2.tmp to temp/indices/L1.rev.2.bt2
Traceback (most recent call last):
File "simulate_metagenome.py", line 344, in
df = align_primers(genome_path, genome_filename_short, INDICES_FOLDER, PRIMERS_FILE, False)
File "SWAMPy/src/create_amplicons.py", line 26, in align_primers
df = pd.read_csv(alignment, sep="\t", skiprows=[0, 1, 2], header=None, names=[i for i in range(19)])
File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
return _read(filepath_or_buffer, kwds)
File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 624, in _read
return parser.read(nrows)
File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1921, in read
) = self._engine.read( # type: ignore[attr-defined]
File "envs/swampy/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows
File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 114, saw 20

I fixed it by changing line 26 in create_amplicons 
`df = pd.read_csv(alignment, sep="\t", skiprows=[0, 1, 2], header=None, names=[i for i in range(20)])`

Then I got another error

Traceback (most recent call last):
File "SWAMPy/src/simulate_metagenome.py", line 365, in
df_amplicons["hyperparameter"] = df_amplicons.apply(amplicon_hyperparameter_sampler, axis=1)
File "envs/swampy/lib/python3.10/site-packages/pandas/core/frame.py", line 10347, in apply
return op.apply().finalize(self, method="apply")
File "envs/swampy/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
return self.apply_standard()
File "envs/swampy/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
results, res_index = self.apply_series_generator()
File "envs/swampy/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
results[i] = self.func(v, *self.args, **self.kwargs)
File "SWAMPy/src/read_model.py", line 45, in hyperparam_sampler
return hyperparams[dataframe_row.amplicon_number]
KeyError: 0


Is this because I did not provide the amplicon_distribution file?
It seems that the amplicon fasta files were created in the temp folder.

@mariaelf97
Copy link
Author

Is this normal to have empty amplicon.fasta files?

@mariaelf97
Copy link
Author

all_seqs.fasta has 4 contigs, each size of about 4 MBP
genome_abundances look like this
L1 30
L2 40
L3 20
L4 10
primer file is the same as the format you provided in fastq and bed format

@rabiafidan
Copy link
Collaborator

Dear Maryam,

Not only the amplicon distribution file but also other steps are necessary to add your custom amplicon set. Please carefully follow #2 Empty amlicon files is not normal. I see that you ran your command with --primer_set a1, which is artic v1 primer set. That wouldn't align to your organism, hence no amplicons. You should use your custom amplicon set for your organism.

@mariaelf97
Copy link
Author

I did follow the instruction and replaced the files with my amplicons

@rabiafidan
Copy link
Collaborator

rabiafidan commented Mar 4, 2024

Then I don't know what the problem is, sorry! I can only suggest the obvious, debugging :) You got
pandas.errors.ParserError: Error tokenizing data. C error: Expected 19 fields in line 114, saw 20 so you can start by looking at line 114, how is it different from line 113, why is it 20 fields...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants