-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Additional date format detection, DD/MM/YYYY #275
Comments
Hi @EricSoroos , inferring date formats is non-trivial, that's why I made
As for converting dates to the standard ISO format, that is the behavior of the qsv's date-parsing engine is powered by https://crates.io/crates/qsv-dateparser, which is aa qsv-optimized fork of https://crates.io/crates/dateparser. I removed some obscure date formats from there to make qsv more performant as I have FYI, I really want to further optimize qsv's date parser but holding it off until a critical dependency - chrono is updated. It has an unresolved security advisory which doesn't really affect qsv, but I don't want to spend more time on the date parser, until all the upstream fixes are released. Regardless, if you want to submit a PR to https://crates.io/crates/qsv-dateparser to make its date fornat inferences smarter and support dd/mm/yyyy, please do, as I anticipate I'll abstract away the chrono dependency there if chrono needs to be replaced. |
I'm aware that it's non-trivial, but at the very least recognizing the non-us DD/MM/YYYY format is important for my use in the EU. So long as the dataset is consistent and large enough, it's possible to disambiguate between that and the MM/DD/YYYY format. The most non-trivial part of it might be maintaining compatibility with the standardized schema types, because that is a finer distinction than the schema seems to account for (it being concerned with date/datetime/time level metadata) I was able to hack this into messytables/datapusher at least well enough that we'd generally detect the right date format. |
There is an identical open issue on the upstream crate I'll be sure to incorporate it into the qsv-dateparser fork once its implemented upstream. And per @waltzofpearls, we'll probably need a date preference setting given the challenge of disambiguating between dd/mm/yyyy and mm/dd/yyyy, that qsv can in turn expose as an environment variable. |
@EricSoroos , would adding an option and a related env var to prefer DD/MM/YYYY be sufficient? Insofar as Datapusher+ is concerned, we can have a DP+ env var for the default preferred date format, that can be overridden on a resource-by-resource basis "DMY format" resource extra. I'm expediting implementing this as DMY is more prevalent than I thought |
I think that combination is ok, but it would absolutely have to be coordinated with postgres if you're using it's native dateparsing in copy. If there was a way to do date detection where the entire column would need to be one order or the other, that would be ideal (and that's basically what I did to messytables). But in order to do that, you'd need to be able to assert that all of the rows of a column were the same/compatible format, and that's not how QSV does things. The worst thing is an adaptive solution (which was native postgres for a while) that would try one, and if it failed try the other. |
In #174 and #177, there were some additional date formats and styles parsed.
We would like to disambiguate between the mm/dd/yyyy and dd/mm/yyyy formats. This is possible on a large enough dataset (at least, one that contains dates after the 13th of the month.
The current dateformats provided by dateparser don't include any of the ambiguous date formats, and the current ones are US centric.
As a start, we'd like to be able to detect this date format, but ultimately I think we'd end up wanting to rewrite this into the ISO standard date format.
The text was updated successfully, but these errors were encountered: