-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filter: support filtering of ambiguous dates #602
Comments
Hello. May I ask what constitutes ambiguous dates? Are these just dates with missing values or just wrongly formatted dates? |
👍 I'd be in favor of exposing arguments here, e.g. |
@jtboing What we normally mean with ambiguous dates are dates like the following (if a 'correct' date is like this:
These are all format we accept, and will 100% work in the pipeline (we'll try to infer such dates). For many organisms we really need to include them, as we'd have almost no sequences without them! But for some analyses (like SARS-CoV-2, we do not want to include them. To expand on what @jameshadfield says above a full range of options for I do not believe we handle incorrectly formatted dates - not 100% sure what happens but I think we do expect them to be in YYYY-MM-DD format. However, it might be worth checking if we support the 'alternative' to our 'XX' format:
|
@huddlej @emmahodcroft can I take this issue? |
Yes! I don't believe there's been any progress on it but it would be great to address! |
Thank you for volunteering to work on this issue, @saikirankv! We have not discussed this further as a team since @emmahodcroft's last comment, but the solution she describes would be a great one to implement. This would also be a great opportunity to do some test-driven development by adding appropriate tests to You may want to check out our development docs sections on writing tests and running tests. Running just the tests you're interested in with the |
@emmahodcroft @huddlej how can I get sample data files? |
nvm, I found the test file, started working on it. |
@emmahodcroft @huddlej can you review the PR, I'll add tests for them. I made changes to get_numerical_dates util method. |
Context
The augur filter command currently supports filtering by min and max date, but it does so by converting any potentially ambiguous dates to reasonable date ranges before applying the filter. A common use case with augur is to exclude these ambiguous dates completely from an analysis (c.f., ncov and seasonal flu workflows).
Ambiguous dates can cause more harm than good if the missing date information inferred by TreeTime is unrealistic (far in the future from other closely related strains). Additionally, maintaining a list of dates to exclude by "XX" per month (as in the ncov workflow) is laborious and error-prone.
Description
I propose a new
--exclude-ambiguous-dates
flag for augur filter that will exclude records with ambiguous dates before the min/max date filters are applied. The statistics output from augur filter should clearly indicate how many records were excluded for this reason.The text was updated successfully, but these errors were encountered: