-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add URLMatcher.match_all(). #9
Conversation
Codecov Report
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #9 +/- ##
==========================================
+ Coverage 98.99% 99.00% +0.01%
==========================================
Files 6 6
Lines 299 303 +4
==========================================
+ Hits 296 300 +4
Misses 3 3
|
Could you please check pre-commit failure? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Any documentation changes?
for matcher in chain(self.matchers_by_domain.get(domain) or [], self.matchers_by_domain.get("") or []): | ||
if matcher.match(url): | ||
return matcher.identifier | ||
return None | ||
if matcher.identifier not in unique: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something but doesn't URL Matcher guarantee uniqueness in its identifier because of how the .add_or_update()
method currently works? wherein if an existing identifier is provided, it would simply update the old URL Pattern with the new one.
Perhaps it will be more clear if we have a test for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only test that fails without this change is test_matcher_single_rule[Match product lists in books.toscrape.com]
, match_all()
returns the identifier 3 times there. That's because there are 3 include patterns in that test case. Is this the expected behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, I get it now. So if we have more than 1 URL Pattern when adding it to the URLMatcher, like this:
from url_matcher import Patterns, URLMatcher
matcher = URLMatcher()
matcher.add_or_update(23, Patterns(['example.com/products', 'example.com/category']))
matcher.matchers_by_domain
It'd result in a repeated PatternsMatcher
instance:
{
'example.com': [
PatternsMatcher(
identifier=23,
patterns=Patterns(
include=('example.com/products', 'example.com/category'),
exclude=(),
priority=500
),
include_matchers=[
<url_matcher.patterns.PatternMatcher object at 0x103e57c10>,
<url_matcher.patterns.PatternMatcher object at 0x103e57d50>],
exclude_matchers=[]
),
PatternsMatcher(
identifier=23,
patterns=Patterns(
include=('example.com/products', 'example.com/category'),
exclude=(),
priority=500
),
include_matchers=[
<url_matcher.patterns.PatternMatcher object at 0x103e57c10>,
<url_matcher.patterns.PatternMatcher object at 0x103e57d50>],
exclude_matchers=[]
)
]
}
Not sure why this is the case but it doesn't look efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #10 to track this.
I wasn't sure where to document this, as there is no explicit documentation for |
for matcher in chain(self.matchers_by_domain.get(domain) or [], self.matchers_by_domain.get("") or []): | ||
if matcher.match(url): | ||
return matcher.identifier | ||
return None | ||
if matcher.identifier not in unique: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #10 to track this.
I'm fine with proceeding without a docs update. |
No description provided.