Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use of paired end illumina data for trio bining with meryl #122

Closed
ap1438 opened this issue Jan 20, 2023 · 16 comments
Closed

Use of paired end illumina data for trio bining with meryl #122

ap1438 opened this issue Jan 20, 2023 · 16 comments

Comments

@ap1438
Copy link

ap1438 commented Jan 20, 2023

Hi,
i wanted to use the tool for de novo assembly using trio binning approach.

I have paired end parental reads from illumina , F1 hifi reads and F1 nanopore reads .
I was wondering how to use the below commands mentioned in the manual of verkko.

meryl count compress k=30 threads=XX memory=YY maternal.*fastq.gz output maternal_compress.k30.meryl
meryl count compress k=30 threads=XX memory=YY paternal.*fastq.gz output paternal_compress.k30.meryl

So should i just cat the R1 and R2 of paired end reads to genetare a single file for both the parents and then run the command.
like cat R1 R2 > Maternal.fastq.gz
cat R1 R2 > Paternal.fastq.gz
and the use the above command.

Or should i use count forward and count reverse and the union command to generate the maternal_compress.k30.meryl and
Paternal_compress.k30.meryl.

@brianwalenz
Copy link
Member

You can run meryl with wildcards to specify filenames. It will process each file and write the combined set of kmers to the output.

Note there is a slight mistake in the child count command - the output should be child_compress.k30.meryl.

@ap1438
Copy link
Author

ap1438 commented Jan 22, 2023

Thank you for your quick reply.
So, the command to use will be this

meryl count compress k=30 threads=XX memory=YY maternal.*fastq.gz output   maternal_compress.k30.meryl
meryl count compress k=30 threads=XX memory=YY paternal.*fastq.gz output    paternal_compress.k30.meryl
meryl count compress k=30 threads=XX memory=YY    child.*fastq.gz output      child_compress.k30.meryl

And I don't have illumina for F1.
So in this case where illumina data of F1 is not available for meryl child. Should we use hifi reads/nanopore F1 reads as child in meryl or we can move forward without the child data.

I was confused because as per the manual below

 `$MERQURY/trio/hapmers.sh \
  maternal_compress.k30.meryl \
  paternal_compress.k30.meryl \
     child_compress.k30.meryl

verkko -d asm \
  --hifi hifi/*.fastq.gz \
  --nano  ont/*.fastq.gz \
  --hap-kmers maternal_compress.k30.hapmer.meryl \
              paternal_compress.k30.hapmer.meryl \
              trio`

Child data was generated using meryl but was not used in verkko trio mode.

So is it necessary to generate child k-mer count file for trio.
If yes can i use hifi/nanopre reads to do so in absence of illumina F1 reads.

@skoren
Copy link
Member

skoren commented Jan 22, 2023

As it says in the readme, Child data is optional, in this case use maternal_compress.k30.only.meryl and paternal_compress.k30.only.meryl in the verkko command above. So you can leave out the child data from the merqury command and use the resulting only.meryl databases instead.

You could also use the HiFi data as the child, definitely not the simplex ONT data.

@ap1438
Copy link
Author

ap1438 commented Jan 31, 2023

I ran the command as mentioned in the read me. But the input for verkko is missing .
/vol/cluster-data/aparida/anaconda/envs/merqury/share/merqury/trio/hapmers.sh /prj/pflaphy-trr341/Final_files/maternal_compress.k30.meryl /prj/pflaphy-trr341/Final_files/paternal_compress.k30.meryl.

But two files were not produced maternal_compress.k30.hapmer.meryl and paternal_compress.k30.hapmer.meryl.

It produced
maternal_compress.k30.meryl \ paternal_compress.k30.meryl \ maternal_compress.k30_not_paternal_compress.k30.meryl \ maternal_compress.k30_not_paternal_compress.k30.hist \ maternal_compress.k30_not_paternal_compress.k30.hist.ploidy \ maternal_compress.k30_not_paternal_compress.k30.gt4.meryl \ maternal_compress.k30_not_paternal_compress.k30.filt \ maternal_compress.k30.only.meryl \ paternal_compress.k30_not_maternal_compress.k30.meryl \ paternal_compress.k30_not_maternal_compress.k30.hist \ paternal_compress.k30_not_maternal_compress.k30.hist.ploidy \ paternal_compress.k30_not_maternal_compress.k30.gt5.meryl \ paternal_compress.k30_not_maternal_compress.k30.filt \ paternal_compress.k30.only.meryl \ maternal_compress.k30_and_paternal_compress.k30.meryl \ shrd.meryl \

@skoren
Copy link
Member

skoren commented Jan 31, 2023

Yes, that is normal, as I said before, the file names are only.meryl in the case of no child input.

@ap1438
Copy link
Author

ap1438 commented Jan 31, 2023

Thank you for the quick reply and support.
I tried to run the command
/vol/cluster-data/aparida/anaconda/envs/vereko/bin/verkko -d asm --hifi hifi/PB_F1.fastq.gz --nano ont/ont_F1.fastq.gz --hap-kmers maternal_compress.k30.only.meryl paternal_compress.k30.only.meryl trio

Is the command wrong .Because when i run the command it shows the help page with
Meryl database '/prj/pflaphy-trr341/Final_files/maternal_compress.k30.only.meryl' appears to be built using non-homopolymer compressed kmers. Meryl database '/prj/pflaphy-trr341/Final_files/paternal_compress.k30.only.meryl' appears to be built using non-homopolymer compressed kmers.

@skoren
Copy link
Member

skoren commented Jan 31, 2023

The command is fine but it appears you didn't build the initial maternal/paternal databases with the compressed option. This is done before you run hapmers.sh. How did you build the initial meryl databases? Try running meryl print /prj/pflaphy-trr341/Final_files/maternal_compress.k30.meryl |head and posting the output here.

@ap1438
Copy link
Author

ap1438 commented Jan 31, 2023

I used suggested command to generate the file

/vol/cluster-data/aparida/anaconda/envs/merqury/bin/meryl count compress k=30 threads=16 memory=320 /prj/pflaphy-trr341/Final_files/D111_merged_L2_4.*fastq.gz output /prj/pflaphy-trr341/Final_files/maternal_compress.k30.meryl
/vol/cluster-data/aparida/anaconda/envs/merqury/bin/meryl count compress k=30 threads=16 memory=320 /prj/pflaphy-trr341/Final_files/D654_merged_L4.*fastq.gz output /prj/pflaphy-trr341/Final_files/paternal_compress.k30.meryl
Found 1 command tree.

PROCESSING TREE #1 using 28 threads.
  opLessThan
    /prj/pflaphy-trr341/Final_files/maternal_compress.k30.meryl
    print to (stdout)
CTGAAAAAAAAAAAAAAAAAAAAAAAAAAA	234
CTGAAAAAAAAAAAAAAAAAAAAAAAAAAC	4
CTGAAAAAAAAAAAAAAAAAAAAAAAAACA	3
CTGAAAAAAAAAAAAAAAAAAAAAAAAACC	2
CTGAAAAAAAAAAAAAAAAAAAAAAAAATA	6
CTGAAAAAAAAAAAAAAAAAAAAAAAAAGA	3
CTGAAAAAAAAAAAAAAAAAAAAAAAACAA	5
CTGAAAAAAAAAAAAAAAAAAAAAAAACAC	1
CTGAAAAAAAAAAAAAAAAAAAAAAAACAG	1
CTGAAAAAAAAAAAAAAAAAAAAAAAATAA	9

@skoren
Copy link
Member

skoren commented Jan 31, 2023

Yeah those are definitely not compressed databases, there shouldn't be any homopolymers runs.

What's the version of meryl? Did you first remove all previous output meryl databases before running the commands? Try removing all your meryl databases and running the command for just one of the parents and post the full log it generates here.

@ap1438
Copy link
Author

ap1438 commented Jan 31, 2023

log.txt
This time i deleted all the previous files.
how to find the meryl version .I installed it using conda recently .
Can you also tell me how a compressed database file should look like.

@skoren
Copy link
Member

skoren commented Feb 3, 2023

Unfortunately it seems the meryl log doesn't report whether it's doing compressed or uncompressed counts. Older versions of meryl that didn't support compression would complain that the parameter is invalid. Newer versions all worked correctly for me as long as count compress or compress count was specified. What is your exact meryl command? You're running two instances to count each maternal and paternal databases?

That is:

/vol/cluster-data/aparida/anaconda/envs/merqury/bin/meryl count compress k=30 threads=16 memory=320 /prj/pflaphy-trr341/Final_files/D111_merged_L2_4.*fastq.gz output /prj/pflaphy-trr341/Final_files/maternal_compress.k30.meryl

and

/vol/cluster-data/aparida/anaconda/envs/merqury/bin/meryl count compress k=30 threads=16 memory=320 /prj/pflaphy-trr341/Final_files/D654_merged_L4.*fastq.gz output /prj/pflaphy-trr341/Final_files/paternal_compress.k30.meryl

as two separate commands.

meryl --version will report the version. The compressed DB should look the same just not have any homopolymers tracts, e.g.:

ACGACACACACACACACACACACACACACG  1
ACGACACACACACACACACACACACACATA  7
ACGACACACACACACACACACACACACAGA  4
ACGACACACACACACACACACACACACAGC  1
ACGACACACACACACACACACACACACTCT  2
ACGACACACACACACACACACACACACTGA  1

@ap1438
Copy link
Author

ap1438 commented Feb 19, 2023

I traced the problem to be with the tool verkko. It has nothing to do with the meryl .
Reason-
I ran it successfully after using the F1 hifi reads as child data and creating meryl db with it. It worked fine.
It seems that the tool cannot work without the child data in the verkko command.
Please check it and if needed update it in the read me file.

@ap1438 ap1438 closed this as completed Feb 19, 2023
@skoren
Copy link
Member

skoren commented Feb 20, 2023

Your conclusion is incorrect, verkko can work with our without any child data. Since the input to it is just two meryl databases, it has no way to know if they are constructed from just the parents, the parents + children, or even reads binned using HiC or Strand-seq data as we did in the manuscript.

The initial databases you had were not homopolymer compressed which verkko definitely cannot work with. It is likely your second run was properly compressed and thus the run worked. We did recently see a bug with meryl when empty files are passed that databases are not compressed: marbl/meryl#31, perhaps this caused your issue building DBs.

@coolmaksat
Copy link

coolmaksat commented Aug 31, 2023

Hello, I'm also having the same issue with meryl-1.4. Running meryl count compress k=30 threads=64 memory=32 P01_*.fastq.gz output P01_compress.k30.meryl is not resulting in homopolymer compressed databases


Found 1 command tree.

PROCESSING TREE #1 using 40 threads.
  opLessThan
    P01_compress.k30.meryl
    print to (stdout)
CCGAAAAAAAAAAAAAAAAAAAAAAAAAAA  2461
CCGAAAAAAAAAAAAAAAAAAAAAAAAAAC  73
CCGAAAAAAAAAAAAAAAAAAAAAAAAAAG  100
CCGAAAAAAAAAAAAAAAAAAAAAAAAACA  70
CCGAAAAAAAAAAAAAAAAAAAAAAAAACC  11
CCGAAAAAAAAAAAAAAAAAAAAAAAAACG  4
CCGAAAAAAAAAAAAAAAAAAAAAAAAATA  41
CCGAAAAAAAAAAAAAAAAAAAAAAAAATC  3
CCGAAAAAAAAAAAAAAAAAAAAAAAAAGA  67
CCGAAAAAAAAAAAAAAAAAAAAAAAAAGC  10

I tried removing everything and rebuilding them, but it did not help

@coolmaksat
Copy link

Compiling meryl from the sources helped to solve this issue, before I was using the released binaries.

@CCyeah
Copy link

CCyeah commented Dec 12, 2024

Compiling meryl from the sources helped to solve this issue, before I was using the released binaries.
Can you share your compiled version? I have the same problem with meryl-1.4. Running is not resulting in homopolymer compressed databasesmeryl count compress k=20 threads=20 memory=300GB 1.clean.fq.gz 2.clean.fq.gz output output maternal_compress.k20.meryl
Found 1 command tree.

PROCESSING TREE #1 using 128 threads.
opLessThan
./child_compress.k20.meryl/
print to (stdout)
GAGACACACACACACACAGC 4977
GAGACACACACACACACTAC 2129
GAGACACACACACACACTGC 1979
GAGACACACACACACACGAC 992
GAGACACACACACACATAGC 990
GAGACACACACACACATCAC 1805
GAGACACACACACACATCTC 1166
GAGACACACACACACATCGC 165
GAGACACACACACACATGAC 998
GAGACACACACACACAGAGC 1752

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants