Too many sequences below identity #3

Open
ksahlin opened this issue Mar 23, 2018 · 5 comments

ksahlin commented Mar 23, 2018

Hi again,

I tried running MeShClust on 500 sequences that I simulated, all of length ~900 nucleotides, with most of the sequences highly similar (edit distances of 1-20 bp). A small portion of these sequences might have a high error rate, roughly PacBio's error rate of 10-15%. This is supposed to mimic PacBio Iso-Seq data. Any idea how I should run MeShClust on such a dataset? Is it suitable for such sequences?

Thanks for your help!

[ksahlin@desmond bin]$ ./meshclust /nfs/brubeck.bx.psu.edu/scratch6/ksahlin/IsoCon_paper_n_10000/pacbio_reads/MEMBER_EXPERIMENT/TSPY13P_8_exponential_0.0001_500_1.fa --output ~/tmp/MESHCLUST/TSPY.clstr
avg length: 915
Recommended K: 4
Reading in sequences [=================================================] 100 %
Using 8 bit histograms
Counting 4-mers [======================================================] 100 %
Splitting data
Point pairs: 38
Sorting data [=========================================================] 100 %
Warning: Alignment may be too large for sampling
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:10;exons:1,2,3,4,5,6:copy14_read_170_error_rate_0.010857763300760043_total_errors_10
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:10;exons:1,2,3,4,5,6:copy5_read_242_error_rate_0.001092896174863388_total_errors_1
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:8;exons:1,2,3,4,5,6:copy34_read_418_error_rate_0.003278688524590164_total_errors_3
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:8;exons:1,2,3,4,5,6:copy65_read_267_error_rate_0.002185792349726776_total_errors_2
Alignment [============================================================] 100 %
positive=0 negative=986
Identity value does not match sampled data: Too many sequences below identity

benjamin-james (Member) commented

Since the data set is very small (500 sequences), an easy workaround for now may be to provide alignment scores via --align instead of using the classification, which doesn't seem to work for your case.

It seems highly peculiar that all the sequences with very high similarity are showing up as negatives. If you don't mind, can I have a sample of the data?

ksahlin commented Mar 23, 2018

OK, thanks! Attached is the full simulated dataset of 500 sequences. I'll try a larger simulation and let you know.
TSPY_simulated_500.txt

ksahlin commented Mar 23, 2018

I get the same error message for datasets with the same simulation parameters but with 2500 and 12500 sequences as well.

When I try with the parameter --align I get the following message:

avg length: 915
Recommended K: 4
Reading in sequences [=================================================] 100 %
Using 8 bit histograms
Counting 4-mers [======================================================] 100 %
Adding combo 1
new single feature 1
error: list not sorted                                                 ] 0 %
terminate called after throwing an instance of 'int'
Aborted

benjamin-james (Member) commented

OK, I had success: I manually set the identity parameter to a value above 0.9 and it worked. The alignment error message is a known bug, and I have been working on it.
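
For anyone choosing an identity threshold for similar simulated data, a rough sanity check is to read the error rates embedded in the read names shown in the log and see what fraction of reads would sit at or above a candidate threshold. The sketch below is not part of MeShClust; the function names are hypothetical, and it assumes that a read's identity to its error-free template is approximately 1 - error_rate. Pairwise identity between two erroneous reads can be lower than this estimate.

```python
import re

# Pattern for the "error_rate_<float>_total_errors_<int>" suffix seen in the
# simulated read names in the log above.
HEADER_RE = re.compile(r"error_rate_([0-9.eE+-]+)_total_errors_(\d+)")


def parse_error_rate(header):
    """Return (error_rate, total_errors) from a read header, or None if absent."""
    m = HEADER_RE.search(header)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2))


def fraction_at_or_above(headers, identity):
    """Fraction of parsable reads whose approximate identity to the template
    (assumed here to be 1 - error_rate) is >= identity."""
    parsed = [p for p in (parse_error_rate(h) for h in headers) if p is not None]
    if not parsed:
        return 0.0
    hits = sum(1 for rate, _ in parsed if 1.0 - rate >= identity)
    return hits / len(parsed)


# The four distinct read names that appear in the "Before Pair" lines.
headers = [
    ">TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0",
    ">TSPY13P;member:10;exons:1,2,3,4,5,6:copy14_read_170_error_rate_0.010857763300760043_total_errors_10",
    ">TSPY13P;member:10;exons:1,2,3,4,5,6:copy5_read_242_error_rate_0.001092896174863388_total_errors_1",
    ">TSPY13P;member:8;exons:1,2,3,4,5,6:copy34_read_418_error_rate_0.003278688524590164_total_errors_3",
]

print(fraction_at_or_above(headers, 0.90))  # 1.0
print(fraction_at_or_above(headers, 0.99))  # 0.75
```

All four reads from the log are above 0.9 under this approximation, which fits with a threshold in that range working here, while reads with a 10-15% error rate would fall near or below it.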

ksahlin commented Mar 23, 2018

Great, that works, thanks!
