Two question #86

123chenshixin · 2025-01-15T08:34:02Z

Thanks for developing Strainy!
I'm interested in Strainy. After running it on my own test data, I have two questions. First, is there an intermediate file in the output indicating which reads belong to which unitig? Second, what does the "ex" tag at the end of the "L" lines in the output GFA file mean?

katerinakazantseva · 2025-01-15T11:15:32Z

Hello, thank you for using Strainy!
You can map read names to unitig names using the intermediate/clusters/.csv files. For example, for the provided toy data you can open intermediate/clusters/clusters_edge_188_1000_0.2 and find the cluster number (N) for each read. The corresponding unitig name in 20_extended_haplotypes.gfa will then be edge188_N.
Please note that the final graph 30_links_simplification.gfa contains merged unitigs. Normally their names should include the names of all the unitigs they are composed of, but some outputs, e.g. FASTA files, may have truncated names due to a format limitation.
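The read-to-unitig lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names `ReadName` and `Cluster`, the presence of a header row, and the `reads_by_unitig` helper itself are hypothetical, so check the actual layout of your cluster file and adjust the parameters. The edge prefix is passed in by the caller rather than derived from the file name.

```python
import csv
import io
from collections import defaultdict


def reads_by_unitig(csv_text, edge_prefix, read_col="ReadName", cluster_col="Cluster"):
    """Group read names under unitig names of the form "<edge_prefix>_<cluster>".

    Assumption: the cluster file is comma-separated with a header row that
    contains `read_col` and `cluster_col`. These names are illustrative only.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        unitig = f"{edge_prefix}_{row[cluster_col]}"
        groups[unitig].append(row[read_col])
    return dict(groups)


# Made-up example content standing in for one cluster file.
example = "ReadName,Cluster\nread_a,1\nread_b,1\nread_c,2\n"
mapping = reads_by_unitig(example, "edge188")
for unitig, reads in mapping.items():
    print(unitig, reads)
```

For a real file, read its contents with `open(path).read()` and pass the text in. Keep in mind that names in 30_links_simplification.gfa refer to merged unitigs, so this direct lookup applies to 20_extended_haplotypes.gfa.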

katerinakazantseva · 2025-01-15T12:11:10Z

As for the second question, the ex:i tag records the number of reads supporting a link between clusters of different unitigs (except for special values like 555, 666, 888). It is a test information field and is not used in any part of the algorithm.
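If you want to inspect these values, a minimal sketch like the following pulls the ex:i tag out of each L line. It assumes the standard GFA1 link layout (tab-separated, optional tags after the overlap field). The `link_support` helper and the example records are made up for illustration, and the set of special values contains only the three mentioned above, so it may be incomplete.

```python
# Special values named in the reply above; the list may not be exhaustive.
SPECIAL_VALUES = {555, 666, 888}


def link_support(gfa_text):
    """Return (from_name, to_name, ex_value, is_special) for each L line with an ex:i tag."""
    links = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "L":
            continue
        # GFA1 link: L, from, from_orient, to, to_orient, overlap, then optional tags.
        for tag in fields[6:]:
            if tag.startswith("ex:i:"):
                value = int(tag[len("ex:i:"):])
                links.append((fields[1], fields[3], value, value in SPECIAL_VALUES))
    return links


# Made-up GFA fragment for illustration.
example_gfa = (
    "H\tVN:Z:1.0\n"
    "S\tedge188_1\t*\n"
    "L\tedge188_1\t+\tedge190_2\t+\t0M\tex:i:12\n"
    "L\tedge188_2\t+\tedge190_1\t-\t0M\tex:i:555\n"
)
links = link_support(example_gfa)
for link in links:
    print(link)
```

Links flagged as special should not be interpreted as read counts.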

123chenshixin · 2025-01-17T07:10:01Z

Thanks for your reply. I also observed that the file intermediate/10_fine_clusters.gfa corresponds to the final output file strain_unitigs.gfa, and that intermediate/20_extended_haplotypes.gfa corresponds to the final output file strain_contigs.gfa. Is that correct?
Meanwhile, when I check the clustering results for some strain-collapsed contigs, I find many clusters, each of which corresponds to one unitig. However, only a few unitigs appear in intermediate/10_fine_clusters.gfa, which matches the article's statement: "In addition, short strain unitigs that do not belong to any walk from source to sink are removed." Doing so greatly reduces read utilization and makes the depth estimation inaccurate. Have you considered comparing these removed unitigs with the unitigs in the walks and merging similar ones together?

katerinakazantseva · 2025-01-27T11:49:31Z

Hello. If only a single cluster corresponds to a unitig, it means that the unitig has not been phased. This can happen for various reasons, such as low unitig coverage or a region of low heterozygosity. Can you send me the folder and the unitig name as an example so I can take a closer look?
