Too many sequences below identity #3

ksahlin · 2018-03-23T00:22:01Z

Hi again,

I tried running MeShClust on 500 simulated sequences, all ~900 nucleotides long and most of them highly similar (edit distances of 1-20 bp). A small portion of these sequences might have a high error rate, roughly PacBio's error rate of 10-15%. This is supposed to mimic PacBio Iso-Seq data. Any idea how I should run MeShClust on such a dataset? Is it suitable for such sequences?

Thanks for your help!

[ksahlin@desmond bin]$ ./meshclust /nfs/brubeck.bx.psu.edu/scratch6/ksahlin/IsoCon_paper_n_10000/pacbio_reads/MEMBER_EXPERIMENT/TSPY13P_8_exponential_0.0001_500_1.fa --output ~/tmp/MESHCLUST/TSPY.clstr
avg length: 915
Recommended K: 4
Reading in sequences [=================================================] 100 %
Using 8 bit histograms
Counting 4-mers [======================================================] 100 %
Splitting data
Point pairs: 38
Sorting data [=========================================================] 100 %
Warning: Alignment may be too large for sampling
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:10;exons:1,2,3,4,5,6:copy14_read_170_error_rate_0.010857763300760043_total_errors_10
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:10;exons:1,2,3,4,5,6:copy5_read_242_error_rate_0.001092896174863388_total_errors_1
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:8;exons:1,2,3,4,5,6:copy34_read_418_error_rate_0.003278688524590164_total_errors_3
Before Pair: >TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0, >TSPY13P;member:8;exons:1,2,3,4,5,6:copy65_read_267_error_rate_0.002185792349726776_total_errors_2
Alignment [============================================================] 100 %
positive=0 negative=986
Identity value does not match sampled data: Too many sequences below identity


benjamin-james · 2018-03-23T18:04:56Z

Since the data set is very small (500 sequences), an easy workaround for now may be to provide alignment scores via --align instead of using the classification, which doesn't seem to work in your case.

It seems highly peculiar that all the sequences with very high similarity are showing up as negatives. If you don't mind, can I have a sample of the data?

ksahlin · 2018-03-23T20:41:16Z

OK, thanks! Attached is the full simulated dataset of 500 sequences. I'll try a larger simulation and let you know.
TSPY_simulated_500.txt

ksahlin · 2018-03-23T21:05:12Z

I get the same error message for datasets with the same simulation parameters but with 2500 and 12500 sequences as well.

When I try with the --align parameter, I get the following message:

avg length: 915
Recommended K: 4
Reading in sequences [=================================================] 100 %
Using 8 bit histograms
Counting 4-mers [======================================================] 100 %
Adding combo 1
new single feature 1
error: list not sorted                                                 ] 0 %
terminate called after throwing an instance of 'int'
Aborted

benjamin-james · 2018-03-23T21:16:16Z

OK, I had success: I manually set the identity parameter to values above 0.9 and it worked. The alignment error message is a known bug that I have been working on.
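
For anyone picking an identity value for similar simulated data, here is a rough sketch of how one might check what fraction of reads sit above a candidate threshold. It assumes the read headers carry an `error_rate_<float>_total_errors_<n>` field, as in the log earlier in this thread. The function names are hypothetical and the script does not call MeShClust. It also uses `1 - error_rate` as a stand-in for identity to the error-free source sequence, so pairwise identity between two noisy reads will generally be lower.

```python
import re

# Matches the "error_rate_<float>" field seen in the read headers in the log above.
# Assumption: every simulated header follows this naming scheme.
ERROR_RATE = re.compile(r"error_rate_([0-9.eE+-]+?)_total_errors")


def read_identities(fasta_lines):
    """Yield 1 - error_rate for each FASTA header that carries an error_rate field."""
    for line in fasta_lines:
        if line.startswith(">"):
            match = ERROR_RATE.search(line)
            if match:
                yield 1.0 - float(match.group(1))


def fraction_at_or_above(identities, threshold):
    """Return the fraction of identities that are >= threshold (0.0 if there are none)."""
    values = list(identities)
    if not values:
        return 0.0
    return sum(v >= threshold for v in values) / len(values)


if __name__ == "__main__":
    # Two headers copied from the log above, each followed by a placeholder sequence line.
    example = [
        ">TSPY13P;member:10;exons:1,2,3,4,5,6:copy10_read_36_error_rate_0.0_total_errors_0",
        "ACGT",
        ">TSPY13P;member:10;exons:1,2,3,4,5,6:copy14_read_170_error_rate_0.010857763300760043_total_errors_10",
        "ACGT",
    ]
    for threshold in (0.9, 0.99):
        share = fraction_at_or_above(read_identities(example), threshold)
        print(f"share of reads with identity >= {threshold}: {share}")
```

To run it on a real file, pass an open file handle (for example `open("reads.fa")`, a placeholder name) to `read_identities` instead of the example list.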

ksahlin · 2018-03-23T21:24:37Z

Great that works, thanks!
