Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default date is not configurable #39

Open
haydenflinner opened this issue Jun 21, 2023 · 6 comments
Open

Default date is not configurable #39

haydenflinner opened this issue Jun 21, 2023 · 6 comments

Comments

@haydenflinner
Copy link

Would a patch be welcomed that's similar to the below, but for setting the default date to be used? Right now there are a lot of places that default to utc::now(), which doesn't work for my use-case. I need to set some context of what day I expect dateless timestamps to be.

826b8c4

@waltzofpearls
Copy link
Owner

Hey @haydenflinner, pull requests are always welcome. I had a look at datetime.rs, and found 2 Parse methods that didn't use default_time, one in hms_z and another in month_md_hms. If I made them use default_time, would that meet your requirement?

@haydenflinner
Copy link
Author

Unfortunately not, and I don't think this lib will be made to fit my use-case as easily as I thought it would during prototyping. I really need to know what was found, whether it was a time without a date, or a date without a time, or a real timestamp. Once I know that, falling back to a default date or time is easy, and I can use the context of what time or date was parsed to decide what my fallback value is. With the API as-is, I need to know my default ahead of time, and at return-time I can't tell if what I got back was my default or just a value which happened to match it. For example, suppose I'm parsing a list of timestamps, and they suddenly change from 23:59:00 to 00:01. Probably, it is a new day. Detecting that with default_date set requires doing some math like 'did my timestamps suddenly jump back >20 hours).

For now, I copy-pasted the whole thing into my wd repo along with some other tweaks that I needed. It is an awesome resource already, thanks for this!

Btw, I added support for fractional seconds in fn hms, had to change the regex and add %.f to the parse_from_str call. Copied that from another time parsing thing later on in that file which had support for fractionals.
haydenflinner/wd@4af850b#diff-e587ba841624ffc4321d69ca6ae1830b3bd221f539df0388668c9d8f59b53969R332

@waltzofpearls
Copy link
Owner

Thanks for sharing this information! I'm planning for a new version of this rust crate, and it might involve a complete rewrite to improve parsing speed and provide more flexible configuration options for the parser. I will take your use case into consideration.

When I first wrote this parser, it was part of a cli tool, and I made some opinionated decisions on the library's api design. As time went by, I've noticed a growing demand for increased flexibility. Addressing these requests would require some degree of redesign. And I've always wanted to change the way how the date time string is parsed, transitioning away from regex towards a character by character parsing approach.

@haydenflinner
Copy link
Author

That's cool news, thanks! On the regex note, it would be neat if Rust had compile-time regexes rather than always requiring the runtime, but I suspect the reason it doesn't is that regexing is rarely the bottleneck. Make sure you've profiled (cargo flamegraph) before throwing out the easy to use approach 😄

From a performance perspective, there are two main things that I noticed. One is that most users who care about speed are likely parsing large sequences of the same data, so if we were to guess the right format the first time we'd be far faster than always iterating through the whole list (if their format is towards the bottom). To that end just remembering which format was last parsed in this parser would be 95+% of the way there, if you really wanted to overkill you could keep some sort of data structure to sort them by observed frequency.

The other thing that might help is joining all of the regexes into one big one (using |) and using capturing groups to figure out which one matched. Once in this singular big regex, common prefixes might be more obvious to the dev, but they might also get compiled out by the regex engine, don't know for sure. Either way it more closely matches what we're trying to do, find which of a bunch of expressions matches. Make sure you set up some benches before trying this though because it is always possible that this is slower, and the regex support for many matches was more-so intended for searching within large strings. Users of the lib should probably be responsible for ensuring they give you close to the minimal string that would contain the timestamp, as they have the domain knowledge (for example I am taking the first 3 "words" from a potentially multi-kb line, by space-splitting). But if you are interested in supporting arbitrary search within large strings, the singular regex would most likely be faster, because you're visiting all of the data only once, and this is what regex engines are optimized for, using SIMD and all sorts of other tricks to do it faster and more expressively than a naive char-by-char approach.

@waltzofpearls
Copy link
Owner

Agreed on what you suggested. Thanks, Hayden 🙏

I will first try statically declare regex, and replace the use of lazy_static macro. Profiling is another good idea. And yes, regex is easier to understand and maintain.

For overall speed, I just ran make bench again, and the results on Mac with M1 Pro is reasonable, compare to a few years ago when I did that on Intel i9. Now the medians for different date formats range from 35 ns to 1140 ns. What I really hope is to even out those numbers, and nothing should be over 500 ns.

@haydenflinner
Copy link
Author

Awesome, gonna be hard to beat that! Have you given any thought to how the API redesign will look? Maybe returning a struct containing Option Option? This is not yet at the top of my priorities either, I can always just make my users type in the exact date/time they're looking for 😄 but it is a little weird in my log-viewer when someone types "2023/04/10" and the viewer seeks to "current time" on that date haha.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants