-
Notifications
You must be signed in to change notification settings - Fork 823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Abstract regex and add fancy_regex
backend
#908
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi since this is your first contribution, and even if I don't approve of just using a different regex engine, I did review it so you can have in mind what this projects concerns are.
Cheers !
SplitPattern::String(s) => Regex::new(®ex::escape(s))?, | ||
SplitPattern::Regex(r) => Regex::new(r)?, | ||
SplitPattern::String(s) => Regex::new(regex::escape(s)), | ||
SplitPattern::Regex(r) => Regex::new(r.to_owned()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to refrain making everything owned when the compiler asks you too.
It's the easy way out, but the incorrect way out most likely.
Enjoy the ride of understanding lifetimes :) . In my experience, it should be roughly trivial to add lifetimes (when you know how that works), and when it's not, usually the design has some flaw actually (which the compiler warns me about actually)
@@ -3,6 +3,7 @@ | |||
|
|||
.vim | |||
.env | |||
.venv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. That's a user thing it does not belong here.
Please use a global .gitignore
for you (then you also only define it once on your machine).
@@ -59,11 +58,18 @@ cached-path = { version = "0.5", optional = true } | |||
aho-corasick = "0.7" | |||
paste = "1.0.6" | |||
proc_macros = { path = "./src/utils/proc_macros" } | |||
once_cell = "1.8" | |||
cfg-if = "1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's needed actually.
You can probably write this as pure rust compilation #cfg
.
impl Regex { | ||
/// Create a new regex from the pattern string. | ||
/// | ||
/// Note that the regex compilation happens on first use, which is why this method does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think this library wants to compile the regex on load, not on-use, that changes this to a Result.
It also probably enables your to pass &str
since the Regex compilation will create the object without making the copy and doesn't need to keep a reference to the string, (so no lifetime issues here it seems).
@@ -59,11 +58,18 @@ cached-path = { version = "0.5", optional = true } | |||
aho-corasick = "0.7" | |||
paste = "1.0.6" | |||
proc_macros = { path = "./src/utils/proc_macros" } | |||
once_cell = "1.8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need that either.
Even keeping the idea of fancy-regex
, a simple enum is OK.
And even then, we don't want an enum. We want something that uses the correct regex engine ( onig
or onig-rs
) directly at compilation time.
@@ -0,0 +1,497 @@ | |||
// All code and comments below this first line are copied from here (with some edits): https://github.com/bminixhofer/nlprule/blob/main/nlprule/src/utils/regex.rs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, overall refrain from doing things this way.
Copy pasting code from another repo is likely to be the wrong solution.
They have a different context, different goals, you can't just reuse and hope it fits the current projects needs.
Taking an inspiration from other sources is GREAT, and you should definitely continuing looking at other repos for ideas on how to handle problems.
But you cannot blindly apply things, you need to understand the current project scopes, philosophy, concerns.
Rule of thumb:
- Less changes is better
- Imitate the style of previous code is better
This is the first part of one potential approach to getting Wasm and pure rust builds of tokenizers.
Relevant issue: #63
Here's the command to test the
fancy_regex
backend:It's a bit strange that there's not a nicer way to do that (IIUC), but it seems like this is a known problem.
You can also test both backends at the same time (return values checked against one another for consistency):
The abstraction layer code was created by the nlprule and syntect project devs, with only minor edits from myself: