Filtering Dataset #1 (Group 0) #6

Open
loodvn opened this issue Jul 19, 2023 · 1 comment

loodvn commented Jul 19, 2023

Hi there!

I'm trying to filter the data to get the Group 0 set, but I'm getting a slightly different number of sequences than the paper reports.

Q1: I assume that Processed_K50_dG_datasets/Tsuboyama2023_Dataset2_Dataset3_20230416.csv is the same as the "Dataset 1 and Dataset 2" file referenced in the paper, and that it comes from K50_dG_Dataset1_Dataset2. Is that right?

  • I ask because this processed file has 776,299 lines, whereas Tsuboyama2023_Dataset1_20230416.csv has 1,841,286 lines.

Q2: How do I filter this file to get the Group 0 variants? I'm trying to reproduce the sequence counts from Table S1 (586,938 total sequences: 434,556 singles and 152,382 doubles).

I tried using the Single list CSV file, keeping only rows where DMS_group == G0 and filtering out low-confidence values from the ddG_ML_float column.
With that filtering I get:

  • 607,839 total instead of 586,938 in Table S1
  • 159,051 doubles instead of 152,382 in Table S1
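For reference, the filtering I describe above can be sketched roughly as follows. This is a minimal sketch, not the authors' pipeline. The column names DMS_group and ddG_ML_float are the ones mentioned above; everything else is an assumption made for illustration: that low-confidence values appear as non-numeric entries in ddG_ML_float, that a column named `mut_type` exists, and that double mutants are written there as two substitutions joined by ":". Under these assumptions any wild-type rows would be counted as singles.

```python
import csv
import io
import math


def count_group0(rows, group_col="DMS_group", ddg_col="ddG_ML_float",
                 mut_col="mut_type"):
    """Count Group 0 rows that have a usable ddG value, split into singles/doubles.

    `rows` is any iterable of dicts, e.g. a csv.DictReader.
    `mut_col` and the ":" separator are hypothetical, not confirmed by the dataset.
    """
    counts = {"total": 0, "singles": 0, "doubles": 0}
    for row in rows:
        # Keep only Group 0 variants.
        if row.get(group_col) != "G0":
            continue
        # Assumption: low-confidence values are stored as non-numeric text,
        # so anything that does not parse as a finite float is dropped.
        try:
            ddg = float(row.get(ddg_col) or "")
        except ValueError:
            continue
        if not math.isfinite(ddg):
            continue
        counts["total"] += 1
        # Assumption: a ":" in the mutation string marks a double mutant.
        if ":" in (row.get(mut_col) or ""):
            counts["doubles"] += 1
        else:
            counts["singles"] += 1
    return counts


# Tiny synthetic example (made-up rows, not real data from the dataset).
sample = io.StringIO(
    "mut_type,DMS_group,ddG_ML_float\n"
    "A1G,G0,1.2\n"
    "A1G,G0,-\n"
    "A1G:C2D,G0,0.5\n"
    "A1G,G1,0.3\n"
)
demo_counts = count_group0(csv.DictReader(sample))
print(demo_counts)
```

To run this on the real file, the `io.StringIO` sample would be replaced by an `open(...)` call on the CSV, with `mut_col` set to whatever column actually holds the mutation string.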

Could you please let me know what I've missed? I assume there's another step of filtering I haven't done.

loodvn commented Jul 20, 2023

Actually, I see these numbers match the new Nature manuscript's Extended Data Table 1, so all good!

Could you perhaps update the README (under Zenodo/Processed_K50_dG_datasets.zip) to avoid future confusion about the dataset numbering? Thanks!
