Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add possibility to use None as no proxy in tiered proxies #760

Merged
merged 6 commits into from
Dec 10, 2024

Conversation

Pijukatel
Copy link
Contributor

@Pijukatel Pijukatel commented Nov 28, 2024

Using None in input lists for tiered proxy configuration will be interpreted as no proxy. This allows to have for example lowest tier without proxy.

Add tests for it.
Update docs.

Add tests for it.
Update docs.

Unrelated:
Also delete ignore that was no longer needed.
@Pijukatel Pijukatel added enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. labels Nov 28, 2024
@github-actions github-actions bot added this to the 103rd sprint - Tooling team milestone Nov 28, 2024
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Nov 28, 2024
@@ -85,7 +85,7 @@ def __init__(
the proxy selection mechanism.
"""
self._next_custom_url_index = 0
self._used_proxy_urls = dict[str, URL]()
self._used_proxy_urls = dict[str, Union[URL, None]]()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mypy was very aggressive when I wanted to use | on this specific line. I got bullied into using Union

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, AFAIK in generics the pipe is still not possible

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Pull Request Tookit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!



async def test_rotates_proxies_uniformly_with_no_request_no_proxy() -> None:
"""Not sure how real is this use case, but it is supported."""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I keep this test? I am not even sure if this usecase makes much sense.

Remove unrelated pyproject change as it was done in different PR.
@Pijukatel Pijukatel marked this pull request as ready for review November 29, 2024 10:42
@Pijukatel Pijukatel assigned vdusek and unassigned vdusek Nov 29, 2024
@Pijukatel Pijukatel requested review from vdusek and barjin November 29, 2024 10:43
@barjin
Copy link
Contributor

barjin commented Nov 29, 2024

Looking good to me 👏🏽 (I'm glad to see this really was as simple as in the JS version). Have you tested this locally with a crawler (i.e. won't other parts of Crawlee mind the None instead of a proxy URL?) Do we already support this case with new_url_function returning None?

@Pijukatel
Copy link
Contributor Author

Pijukatel commented Nov 29, 2024

Looking good to me 👏🏽 (I'm glad to see this really was as simple as in the JS version). Have you tested this locally with a crawler (i.e. won't other parts of Crawlee mind the None instead of a proxy URL?) Do we already support this case with new_url_function returning None?

I tested running

...
crawler = BeautifulSoupCrawler(
            proxy_configuration=ProxyConfiguration(tiered_proxy_urls=[[None, None], [None]]),
        )
await crawler.run(["https://crawlee.dev"])
...

I am not sure what you mean by "Do we already support this case with new_url_function returning None?". Is there a usecase example?

@Pijukatel
Copy link
Contributor Author

Looking good to me 👏🏽 (I'm glad to see this really was as simple as in the JS version). Have you tested this locally with a crawler (i.e. won't other parts of Crawlee mind the None instead of a proxy URL?) Do we already support this case with new_url_function returning None?

I tested running

...
crawler = BeautifulSoupCrawler(
            proxy_configuration=ProxyConfiguration(tiered_proxy_urls=[[None, None], [None]]),
        )
await crawler.run(["https://crawlee.dev"])
...

I am not sure what you mean by "Do we already support this case with new_url_function returning None?". Is there a usecase example?

Ok I think I get it. I looked only at tiered proxies. Currently this will still not work:

crawler = BeautifulSoupCrawler( proxy_configuration=ProxyConfiguration(proxy_urls=[None, None]),)
await crawler.run(["https://crawlee.dev"])

I will make it work consistently for all possible ways of setting the proxies to None

@Pijukatel Pijukatel marked this pull request as draft November 29, 2024 14:26
@barjin
Copy link
Contributor

barjin commented Nov 29, 2024

Right, sorry, I might have been confusing - in JS version, we support the newUrlFunction (alternative of new_url) returning null to turn the proxy off. I'm actually not sure if you can pass null in proxyUrls (although it would make sense).

obrazek

I will make it work consistently for all possible ways of setting the proxies to None

Sounds good! (and hopefully not too much work 🤞🏽 )

@Pijukatel Pijukatel marked this pull request as ready for review December 3, 2024 13:56
@vdusek vdusek changed the title feat: Add possibility to use None as no proxy in tiered proxies. feat: Add possibility to use None as no proxy in tiered proxies Dec 5, 2024
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one note about the test documentation; otherwise, it looks good. Thanks.

tests/unit/proxy_configuration/test_tiers.py Show resolved Hide resolved
tests/unit/proxy_configuration/test_tiers.py Outdated Show resolved Hide resolved
@Pijukatel Pijukatel merged commit 0fbd017 into master Dec 10, 2024
23 checks passed
@Pijukatel Pijukatel deleted the none-tiered-proxy branch December 10, 2024 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tieredProxyUrls accept null for switching the proxy off
3 participants