Add URLMatcher.match_all(). #9

Merged
merged 2 commits into from
Apr 1, 2024

wRAR
Member

@wRAR wRAR commented Mar 29, 2024

No description provided.

@codecov-commenter

codecov-commenter commented Mar 29, 2024

Codecov Report

Merging #9 (aec50b4) into main (414671d) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head aec50b4 differs from pull request most recent head 7298c5a. Consider uploading reports for the commit 7298c5a to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main       #9      +/-   ##
==========================================
+ Coverage   98.99%   99.00%   +0.01%     
==========================================
  Files           6        6              
  Lines         299      303       +4     
==========================================
+ Hits          296      300       +4     
  Misses          3        3              
Files                     Coverage Δ
url_matcher/example.py    98.18% <100.00%> (ø)
url_matcher/matcher.py    99.09% <100.00%> (+0.03%) ⬆️

@kmike
Contributor

kmike commented Mar 31, 2024

Could you please check pre-commit failure?

Contributor

@kmike kmike left a comment

Looks good!
Any documentation changes?

for matcher in chain(self.matchers_by_domain.get(domain) or [], self.matchers_by_domain.get("") or []):
    if matcher.match(url):
        return matcher.identifier
return None
if matcher.identifier not in unique:
Contributor

I might be missing something, but doesn't URLMatcher already guarantee unique identifiers because of how the .add_or_update() method currently works? If an existing identifier is provided, it simply replaces the old URL patterns with the new ones.

Perhaps it would be clearer if we had a test for this.

Member Author

The only test that fails without this change is test_matcher_single_rule[Match product lists in books.toscrape.com]: there, match_all() returns the identifier 3 times, because that test case has 3 include patterns. Is this the expected behavior?
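
For readers following along, here is a minimal, self-contained sketch of the order-preserving deduplication under discussion. `ToyMatcher` and `match_all_unique` are hypothetical stand-ins, not the url-matcher implementation; the sketch only assumes that the same identifier can be reached several times while iterating over candidate matchers, as in the failing test.

```python
from dataclasses import dataclass
from typing import Hashable, Iterable, Iterator


@dataclass(frozen=True)
class ToyMatcher:
    """Hypothetical stand-in for a matcher: an identifier plus a URL prefix."""

    identifier: Hashable
    prefix: str

    def match(self, url: str) -> bool:
        return url.startswith(self.prefix)


def match_all_unique(matchers: Iterable[ToyMatcher], url: str) -> Iterator[Hashable]:
    """Yield each matching identifier once, in first-seen order."""
    unique = set()
    for matcher in matchers:
        if matcher.match(url) and matcher.identifier not in unique:
            unique.add(matcher.identifier)
            yield matcher.identifier


# The same rule (identifier 1) is listed three times, mimicking a rule with
# three include patterns; identifier 2 is a separate, broader rule.
rule = ToyMatcher(1, "https://books.toscrape.com/catalogue/")
other = ToyMatcher(2, "https://books.toscrape.com/")
candidates = [rule, rule, rule, other]

url = "https://books.toscrape.com/catalogue/page-2.html"
naive = [m.identifier for m in candidates if m.match(url)]
deduplicated = list(match_all_unique(candidates, url))
```

Without the `unique` check the identifier is repeated once per listing (`naive`); with it, each identifier appears once (`deduplicated`).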

Contributor

Aha, I get it now. So if we pass more than one URL pattern when adding to the URLMatcher, like this:

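from url_matcher import Patterns, URLMatcher

matcher = URLMatcher()
# One identifier, two include patterns on the same domain.
matcher.add_or_update(23, Patterns(['example.com/products', 'example.com/category']))
# Inspect the internal per-domain index (e.g. in a REPL).
matcher.matchers_by_domain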

It results in the same PatternsMatcher instance being registered twice for the domain:

{
  'example.com': [
    PatternsMatcher(
      identifier=23,
      patterns=Patterns(
        include=('example.com/products', 'example.com/category'),
        exclude=(),
        priority=500
      ),
      include_matchers=[
        <url_matcher.patterns.PatternMatcher object at 0x103e57c10>,
        <url_matcher.patterns.PatternMatcher object at 0x103e57d50>],
      exclude_matchers=[]
    ),
    PatternsMatcher(
      identifier=23,
      patterns=Patterns(
        include=('example.com/products', 'example.com/category'),
        exclude=(),
        priority=500
      ),
      include_matchers=[
        <url_matcher.patterns.PatternMatcher object at 0x103e57c10>,
        <url_matcher.patterns.PatternMatcher object at 0x103e57d50>],
      exclude_matchers=[]
    )
  ]
}

I'm not sure why this is the case, but it doesn't look efficient.
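
One possible explanation, assumed here rather than confirmed against the library source, is that the matcher is appended to the per-domain index once per include pattern, so two patterns on the same domain register the same object twice. The sketch below models that with a hypothetical `ToyIndex` class and `pattern_domain` helper (neither is url-matcher's API) and shows how collecting the distinct domains first would avoid the repeat.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Entry = Tuple[Hashable, Tuple[str, ...]]  # (identifier, include patterns)


def pattern_domain(pattern: str) -> str:
    """Hypothetical helper: treat everything before the first '/' as the domain."""
    return pattern.split("/", 1)[0]


class ToyIndex:
    """Hypothetical per-domain index, loosely modelled on matchers_by_domain."""

    def __init__(self, dedupe_domains: bool) -> None:
        self.dedupe_domains = dedupe_domains
        self.by_domain: Dict[str, List[Entry]] = defaultdict(list)

    def add(self, identifier: Hashable, include: Sequence[str]) -> None:
        entry: Entry = (identifier, tuple(include))
        domains = [pattern_domain(p) for p in include]
        if self.dedupe_domains:
            # dict.fromkeys keeps first-seen order while dropping repeats.
            domains = list(dict.fromkeys(domains))
        for domain in domains:
            self.by_domain[domain].append(entry)


patterns = ["example.com/products", "example.com/category"]

naive = ToyIndex(dedupe_domains=False)
naive.add(23, patterns)

deduped = ToyIndex(dedupe_domains=True)
deduped.add(23, patterns)
```

With the naive indexing, `naive.by_domain["example.com"]` holds the same entry twice; with deduplicated domains, `deduped.by_domain["example.com"]` holds it once.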

Contributor

@BurnzZ BurnzZ Apr 1, 2024

Created #10 to track this.

@wRAR
Member Author

wRAR commented Apr 1, 2024

> Any documentation changes?

I wasn't sure where to document this, as there is no explicit documentation for match() either, just usage examples: https://url-matcher.readthedocs.io/en/stable/intro.html

@kmike
Contributor

kmike commented Apr 1, 2024

I'm fine with proceeding without a docs update.

@Gallaecio Gallaecio merged commit d7970ac into main Apr 1, 2024
8 checks passed
@wRAR wRAR deleted the match_all branch April 2, 2024 06:44
5 participants