Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Additional date format detection, DD/MM/YYYY #275

Closed
EricSoroos opened this issue Apr 28, 2022 · 5 comments · Fixed by #329
Closed

Additional date format detection, DD/MM/YYYY #275

EricSoroos opened this issue Apr 28, 2022 · 5 comments · Fixed by #329
Labels
enhancement New feature or request. Once marked with this label, its in the backlog. timeseries time series related

Comments

@EricSoroos
Copy link
Contributor

In #174 and #177, there were some additional date formats and styles parsed.

We would like to disambiguate between the mm/dd/yyyy and dd/mm/yyyy formats. This is possible on a large enough dataset (at least, one that contains dates after the 13th of the month.

The current dateformats provided by dateparser don't include any of the ambiguous date formats, and the current ones are US centric.

As a start, we'd like to be able to detect this date format, but ultimately I think we'd end up wanting to rewrite this into the ISO standard date format.

@jqnatividad
Copy link
Collaborator

Hi @EricSoroos , inferring date formats is non-trivial, that's why I made --infer-dates an option for stats.

schema uses the stats engine with --infer-dates on by default, so as you pointed out, the jsonschema it generates is US-centric.

As for converting dates to the standard ISO format, that is the behavior of the apply datefmt operation. You may want to check it out.

qsv's date-parsing engine is powered by https://crates.io/crates/qsv-dateparser, which is aa qsv-optimized fork of https://crates.io/crates/dateparser. I removed some obscure date formats from there to make qsv more performant as I have --infer-dates on by default with Datapusher+ when scanning for data types.

FYI, I really want to further optimize qsv's date parser but holding it off until a critical dependency - chrono is updated. It has an unresolved security advisory which doesn't really affect qsv, but I don't want to spend more time on the date parser, until all the upstream fixes are released.

Regardless, if you want to submit a PR to https://crates.io/crates/qsv-dateparser to make its date fornat inferences smarter and support dd/mm/yyyy, please do, as I anticipate I'll abstract away the chrono dependency there if chrono needs to be replaced.

@jqnatividad jqnatividad added the enhancement New feature or request. Once marked with this label, its in the backlog. label Apr 29, 2022
@EricSoroos
Copy link
Contributor Author

I'm aware that it's non-trivial, but at the very least recognizing the non-us DD/MM/YYYY format is important for my use in the EU. So long as the dataset is consistent and large enough, it's possible to disambiguate between that and the MM/DD/YYYY format.

The most non-trivial part of it might be maintaining compatibility with the standardized schema types, because that is a finer distinction than the schema seems to account for (it being concerned with date/datetime/time level metadata)

I was able to hack this into messytables/datapusher at least well enough that we'd generally detect the right date format.

@jqnatividad
Copy link
Collaborator

jqnatividad commented Apr 29, 2022

There is an identical open issue on the upstream crate

waltzofpearls/dateparser#21

I'll be sure to incorporate it into the qsv-dateparser fork once its implemented upstream.

And per @waltzofpearls, we'll probably need a date preference setting given the challenge of disambiguating between dd/mm/yyyy and mm/dd/yyyy, that qsv can in turn expose as an environment variable.

@jqnatividad
Copy link
Collaborator

jqnatividad commented May 19, 2022

@EricSoroos , would adding an option and a related env var to prefer DD/MM/YYYY be sufficient?

Insofar as Datapusher+ is concerned, we can have a DP+ env var for the default preferred date format, that can be overridden on a resource-by-resource basis "DMY format" resource extra.

I'm expediting implementing this as DMY is more prevalent than I thought
https://en.wikipedia.org/wiki/Date_format_by_country

@EricSoroos
Copy link
Contributor Author

I think that combination is ok, but it would absolutely have to be coordinated with postgres if you're using it's native dateparsing in copy.

If there was a way to do date detection where the entire column would need to be one order or the other, that would be ideal (and that's basically what I did to messytables). But in order to do that, you'd need to be able to assert that all of the rows of a column were the same/compatible format, and that's not how QSV does things.

The worst thing is an adaptive solution (which was native postgres for a while) that would try one, and if it failed try the other.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request. Once marked with this label, its in the backlog. timeseries time series related
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants