Metagenomic mode Help with read_analysis.py #240

Open
florenmartino opened this issue Nov 21, 2024 · 1 comment

florenmartino commented Nov 21, 2024

Hi there! I'm new to this tool, so sorry in advance if I'm asking something silly.

I'm trying to use the read_analysis.py script for metagenome analysis, but I'm encountering issues with the genome_list input. Here's what I’ve done so far:

My genome_list.tsv file is structured like this:

Identifier FilePath
AB008394_Torque_teno_virus_1 References.split/AB008394.fasta
AB017613_Torque_teno_virus_16 References.split/AB017613.fasta
AB025946_Torque_teno_virus_19 References.split/AB025946.fasta
AB026929_Torque_teno_mini_virus_6 References.split/AB026929.fasta
AB026931_Torque_teno_mini_virus_1 References.split/AB026931.fasta
AB028668_Torque_teno_virus_15 References.split/AB028668.fasta
AB037926_Torque_teno_virus_14 References.split/AB037926.fasta
AB038621_Torque_teno_virus_29 References.split/AB038621.fasta
AB038627_Torque_teno_mini_virus_7 References.split/AB038627.fasta
AB038629_Torque_teno_mini_virus_2 References.split/AB038629.fasta
AB038630_Torque_teno_mini_virus_3 References.split/AB038630.fasta
AB038631_Torque_teno_mini_virus_9 References.split/AB038631.fasta
AB041957_Torque_teno_virus_4 References.split/AB041957.fasta
AB041958_Torque_teno_virus_26 References.split/AB041958.fasta
AB041959_Torque_teno_virus_25 References.split/AB041959.fasta
AB041960_Torque_teno_tamarin_virus References.split/AB041960.fasta
...

  • Each identifier corresponds to a reference genome.
  • File paths point to valid .fasta files in the specified directory.
  • I'm using NanoSim in a conda environment, with the latest version.

I ran the command:

read_analysis.py metagenome -i /path/to/myfile.fastq.gz -gl genome_list.tsv --no_model_fit -o nanosim_output -t 16

The script failed with the following error:

(nanosim) [fmarti34@login02 NANOSIM-TEST]$ read_analysis.py metagenome -i /home/fmarti34/data_sclipma1/Anellome_outputs_hash/AS1_12_mo./AS1_12_mo..fastq.gz -gl genome_list.tsv --no_model_fit -o nanosim_1_test -t 16

Running the code with following parameters:

infile /home/fmarti34/data_sclipma1/Anellome_outputs_hash/AS1_12_mo./AS1_12_mo..fastq.gz
genome_list genome_list.tsv
g_alnm
prefix nanosim_1_test
num_threads 16
model_fit False
chimeric False
homopolymer False
fastq False
quantification False
2024-11-21 10:29:43: Read pre-process
2024-11-21 10:31:32: Processing reference genome
Traceback (most recent call last):
  File "/home/fmarti34/.conda/envs/nanosim/bin/read_analysis.py", line 879, in <module>
    main()
  File "/home/fmarti34/.conda/envs/nanosim/bin/read_analysis.py", line 675, in main
    metagenome_list[species] = {'path': info[1]}

Questions:

  1. Could you clarify what the correct format for the genome_list file should be?
  2. Is additional information, like abundance, required in this file? Would output from tools like Bracken be appropriate as input?

Thank you in advance!

Best regards,

Flor

lcoombe (Member) commented Nov 21, 2024

Hi @florenmartino,

For your genome list file, here's the description from the help page:

  -gl GENOME_LIST, --genome_list GENOME_LIST
                        Reference metagenome list, tsv file, the first column
                        is species/strain name, the second column is the
                        reference genome fasta/fastq file directory, the third
                        column is optional, if provided, it contains the
                        expected abundance (sum up to 100)

From the formatting you pasted above, it looks like each line may be space-separated rather than tab-separated. That's worth double-checking first. At minimum, you need the first two columns described in the help text.
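
If it helps, here is a minimal Python sketch for checking and fixing the separator. It is not part of NanoSim: the function names `to_tab_separated` and `convert_genome_list` are made up for illustration, and it assumes that species names and file paths contain no spaces. The traceback above stops at `info[1]`, which would be consistent with a line that yields only one field when split on tabs, but the exception type is not shown, so treat that as a guess about how the script parses the file. The help text also does not say whether a header row is accepted, so you may want to drop one if your file has it.

```python
import re


def to_tab_separated(line):
    """Return `line` with its fields joined by single tabs, or None if it is blank."""
    stripped = line.strip()
    if not stripped:
        return None
    if "\t" in stripped:
        # Already tab-separated: keep the fields, trimming stray spaces around them.
        fields = [field.strip() for field in stripped.split("\t")]
    else:
        # No tab found: split on runs of whitespace instead.
        # Assumption: names and paths contain no spaces.
        fields = re.split(r"\s+", stripped)
    if len(fields) not in (2, 3):
        raise ValueError(f"expected 2 or 3 columns, got {len(fields)}: {line!r}")
    return "\t".join(fields)


def convert_genome_list(src_path, dst_path):
    """Write a tab-separated copy of the genome list at src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            fixed = to_tab_separated(line)
            if fixed is not None:
                dst.write(fixed + "\n")


if __name__ == "__main__":
    sample = "AB008394_Torque_teno_virus_1 References.split/AB008394.fasta"
    # Split on tabs, a space-separated line is one field, so index 1 does not exist.
    print(len(sample.split("\t")))
    print(repr(to_tab_separated(sample)))
```

Calling `convert_genome_list("genome_list.tsv", "genome_list.fixed.tsv")` would write a cleaned copy next to the original (the output file name is only an example), which can then be passed to `-gl`.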
Abundance is optional for the characterization stage. If you want to quantify, you can use the read_analysis.py quantify mode, which is described in more detail in the README.md.
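
If you do add the optional third column, the help text says the values should sum up to 100. As a rough sketch (again not NanoSim code; `with_abundance` and the example weights are hypothetical), relative weights from any source could be rescaled like this. The four-decimal number format is an assumption, and rounding can leave the total very slightly off 100.

```python
def with_abundance(entries):
    """Build genome list lines with a third column rescaled to sum to 100.

    `entries` is a list of (name, fasta_path, weight) tuples, where the
    weights are any non-negative relative abundances.
    """
    total = sum(weight for _, _, weight in entries)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    lines = []
    for name, path, weight in entries:
        abundance = 100.0 * weight / total  # rescale so the column sums to 100
        lines.append(f"{name}\t{path}\t{abundance:.4f}")
    return lines


if __name__ == "__main__":
    # Hypothetical weights, for illustration only.
    example = [
        ("AB008394_Torque_teno_virus_1", "References.split/AB008394.fasta", 1),
        ("AB017613_Torque_teno_virus_16", "References.split/AB017613.fasta", 3),
    ]
    for line in with_abundance(example):
        print(line)
```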

Hope that helps - thank you for your interest in NanoSim!
Lauren
