Incorrect E. coli sequences being represented by PanGraph (large dataset) #68
Comments
Hi Harsh,
Hi Marco, Thanks for the quick response. We used the version We understand that this dataset is very big, and we can try looking for issues in other datasets. Before that, however, can you confirm whether we have identified the issue with the PanGraph correctly? It is possible that we are misinterpreting some part of the JSON file. Is it possible that in the PanGraph JSON file that we have provided, you compute the length of the sequence Thanks,
Hi Harsh, I checked the full sequence reconstruction for all isolates in the graph. It looks like 64/1000 isolates have minor problems in their sequence. In particular I agree that isolate Here is a full list of the sequences containing small inconsistencies
It looks like in these cases there are a few tens of bp of mismatches. I'll be investigating this further, but it might take some time, since these inconsistencies seem to appear in complicated edge cases that only occur when graphs are big and complex enough. We're working on a more robust re-implementation of some of the core functions of pangraph that will hopefully remove all of these inconsistencies once and for all. I'll keep you posted. In the meantime, thanks again for your feedback! Marco
Hi Marco, Thanks for confirming this issue. We will check other datasets for mismatches and let you know if we find any, in case that helps with debugging. Thanks,
In case it can be useful for this we added the command line option Thank you for all of the feedback! Marco
Hi there,
We want to report an issue with a PanGraph that we generated on a dataset representing 1000 E. coli sequences. We believe that 64 of these sequences are not represented correctly by the PanGraph.
Since we think the sequence lengths are also wrong, we were able to verify the issue manually by computing the length of one of the mismatching sequences. We did this by summing the lengths of the consensus sequences of the blocks on its path, adding the lengths of the insertions in the sequence, and subtracting the lengths of the deletions on the path.
We find that the PanGraph gives the length of the sequence ‘NZ_AP019856.1’ as 4800017 bases. However, its true length is 4800098 bases, a difference of 81 bases.
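For reference, the length check described above can be sketched roughly as follows. This is only a minimal illustration: the `BlockVisit` structure and its field names are hypothetical simplifications and do not correspond to the actual PanGraph JSON schema, so the per-block values would need to be extracted from the JSON file first.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlockVisit:
    """One block on a sequence's path (hypothetical, simplified structure)."""
    consensus_length: int  # length of the block's consensus sequence
    insertion_lengths: List[int] = field(default_factory=list)  # bases inserted in this sequence
    deletion_lengths: List[int] = field(default_factory=list)   # bases deleted in this sequence


def path_length(path: List[BlockVisit]) -> int:
    """Sum consensus lengths, add insertions, subtract deletions along the path."""
    total = 0
    for visit in path:
        total += visit.consensus_length
        total += sum(visit.insertion_lengths)
        total -= sum(visit.deletion_lengths)
    return total


def length_mismatch(path: List[BlockVisit], true_length: int) -> int:
    """Return true length minus the length implied by the graph (0 means consistent)."""
    return true_length - path_length(path)


if __name__ == "__main__":
    # Toy path with made-up numbers, not taken from the E. coli dataset.
    toy_path = [
        BlockVisit(100, insertion_lengths=[3, 2], deletion_lengths=[10]),
        BlockVisit(50, deletion_lengths=[5]),
    ]
    print(path_length(toy_path))           # length implied by the toy path
    print(length_mismatch(toy_path, 150))  # nonzero value flags an inconsistency
```

Applied to ‘NZ_AP019856.1’, a check of this kind would report a mismatch of 4800098 − 4800017 = 81 bases. Note that a matching length does not guarantee that the reconstructed sequence itself is correct; it is only a quick first test.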
We have uploaded the three relevant files to the following folder: https://drive.google.com/drive/folders/1JAliSaWokYX2i5KaUjQiOPnCdL_uyZqG?usp=sharing
We believe the mismatching sequences are:
NZ_AP019856.1
NZ_CP054407.1
NZ_CP010219.1
NZ_CP036202.1
NZ_CP014583.1
NZ_CP027587.1
NZ_CP027325.1
NZ_CP013029.1
NZ_CP027459.1
NZ_CP050865.1
NZ_CP050862.1
NZ_CP027534.1
NZ_CP014316.1
NZ_CP015085.1
NZ_CP018970.1
NZ_CP023826.1
NZ_CP032201.1
NZ_CP023844.1
NZ_CP015138.1
NZ_CP018983.1
NZ_CP018991.1
NZ_CP049077.2
NZ_CP010876.1
NZ_CP036245.1
NZ_CP049085.2
NZ_CP035476.1
NZ_CP035477.1
NZ_CP014522.1
NZ_CP014495.1
NZ_CP024720.1
NZ_CP024717.1
NZ_CP021207.1
NZ_CP019008.1
NZ_CP019020.1
NZ_CP035498.1
NZ_CP053245.1
NZ_CP037449.1
NZ_CP048304.1
NZ_CP048920.1
NZ_CP040456.1
NZ_CP024886.1
NZ_CP051700.1
NZ_CP030111.1
NZ_AP022650.1
NZ_CP053251.2
NZ_CP051688.1
NZ_CP033762.1
NZ_CP019273.1
NZ_AP017610.1
NZ_CP033850.1
NZ_CP019029.1
NZ_CP015834.1
NZ_CP009859.1
NZ_CP040919.1
NZ_CP023366.1
NZ_CP041300.1
NZ_CP033605.1
NZ_CP041452.1
NZ_CP041448.1
NZ_CP028166.1
NZ_AP021896.1
NZ_CP031833.1
Thanks,
Harsh