Two question #86

Open
123chenshixin opened this issue Jan 15, 2025 · 4 comments
Comments

@123chenshixin

Thanks for developing Strainy!
I'm interested in Strainy and, after running my own test, I have two questions. First, is there an intermediate file in the output indicating which reads belong to which unitig? Second, what does the "ex" at the end of the "L" lines mean in the output GFA file?

@katerinakazantseva
Owner

Hello, thank you for using Strainy!
You can map read names to unitig names using the intermediate/clusters/.csv files. For example, for the provided toy data you can open intermediate/clusters/clusters_edge_188_1000_0.2 and find the cluster number (N) for each read. The corresponding unitig name in 20_extended_haplotypes.gfa will then be edge188_N.
Please note that the final graph 30_links_simplification.gfa contains merged unitigs. Normally their names should include the names of all the unitigs they are composed of, but some outputs (e.g. FASTA files) may have truncated names due to a format limitation.
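
The read-to-unitig lookup described above can be sketched in Python. This is only an illustration under stated assumptions: the column names `ReadName` and `Cluster` are hypothetical and may not match the actual layout of the cluster files, so check the header of your own file first. The `edge188` label follows the naming given in this comment.

```python
import csv
import io
from collections import defaultdict


def reads_by_strain_unitig(csv_text, edge_label,
                           read_col="ReadName", cluster_col="Cluster"):
    """Group read names under the strain unitig name '<edge_label>_<cluster>'.

    read_col and cluster_col are ASSUMED column names, not confirmed
    against Strainy's real output; adjust them to your file's header.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        unitig_name = f"{edge_label}_{row[cluster_col]}"
        groups[unitig_name].append(row[read_col])
    return dict(groups)


# Hypothetical file content, made up for illustration only.
example = "ReadName,Cluster\nread_a,3\nread_b,3\nread_c,7\n"
mapping = reads_by_strain_unitig(example, "edge188")
for unitig_name, reads in mapping.items():
    print(unitig_name, reads)
```

For a real run, read the text of the cluster file from disk and pass it in place of `example`.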

@katerinakazantseva
Owner

As for the second question, the ex:i tag marks the number of supporting reads between clusters of different unitigs (except for special values like 555, 666, 888). It is a test information field and is not used in any part of the algorithm.
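
If you want to inspect these values yourself, a minimal sketch for pulling the ex:i tag out of GFA "L" lines could look like the following. It relies only on the standard GFA1 layout (six required tab-separated fields, then optional TAG:TYPE:VALUE fields). The function names, the example line, and the choice to report the special values as "no count" are assumptions made for illustration, not part of Strainy.

```python
# Special values mentioned in this thread; they are not read counts.
SPECIAL_EX_VALUES = {555, 666, 888}


def link_tags(line):
    """Return the optional tags of a GFA1 'L' line as a dict, or None
    if the line is not a link line."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "L":
        return None
    tags = {}
    # Fields 0-5 are: L, from, from-orient, to, to-orient, overlap.
    for field in fields[6:]:
        name, value_type, value = field.split(":", 2)
        tags[name] = int(value) if value_type == "i" else value
    return tags


def supporting_reads(line):
    """Return the ex:i value as a read count, or None when the line has
    no ex tag or carries one of the special marker values."""
    tags = link_tags(line)
    if tags is None or "ex" not in tags:
        return None
    ex_value = tags["ex"]
    return None if ex_value in SPECIAL_EX_VALUES else ex_value


# Hypothetical link line, made up for illustration only.
example_line = "L\tedge_1_2\t+\tedge_3_4\t-\t0M\tex:i:12"
print(supporting_reads(example_line))
```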

@123chenshixin
Author

Thanks for your reply. I also observed that the file intermediate/10_fine_clusters.gfa corresponds to the final output file strain_unitigs.gfa, and the file intermediate/20_extended_haplotypes.gfa corresponds to the final output file strain_contigs.gfa. Is that correct?
Meanwhile, when I check the clustering results for some strain-collapsed contigs, I find many clusters, each of which corresponds to one unitig. However, only a few of these unitigs appear in intermediate/10_fine_clusters.gfa, as described in the article: "In addition, short strain unitigs that do not belong to any walk from source to sink are removed." This greatly reduces read utilization and makes the depth estimation inaccurate. Have you considered comparing the removed unitigs with the unitigs in the walks and merging similar ones together?

@katerinakazantseva
Owner

Hello. If only a single cluster corresponds to a unitig, it means that the unitig has not been phased. This can happen for various reasons, such as low unitig coverage or a low-heterozygosity region. Can you send me the folder and the unitig name as an example so I can take a closer look?


2 participants