feat!: Enable additional status codes arguments to PlaywrightCrawler #959

Merged
Pijukatel merged 7 commits into master from additional-status-codes
Feb 19, 2025

Conversation

Pijukatel
Contributor

@Pijukatel Pijukatel commented Feb 5, 2025

Description

Add additional_http_error_status_codes and ignore_http_error_status_codes to PlaywrightCrawler.
Since they exist now on all crawlers, move them to BasicCrawler level.
Do not read the additional status code settings from _http_client attributes.

Breaking: Remove HttpCrawlerOptions, since it no longer has any options beyond those in BasicCrawlerOptions.
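
A minimal sketch of how the two arguments might interact when a crawler classifies a response status. The helper name `is_error_status` and the default rule (only 5xx counts as an error) are assumptions made for illustration, not code from this PR.

```python
from collections.abc import Iterable


def is_error_status(
    status_code: int,
    additional_http_error_status_codes: Iterable[int] = (),
    ignore_http_error_status_codes: Iterable[int] = (),
) -> bool:
    """Hypothetical helper: decide whether a status code should count as an error."""
    additional = set(additional_http_error_status_codes)
    ignored = set(ignore_http_error_status_codes)

    # Explicitly added codes are always treated as errors.
    if status_code in additional:
        return True
    # Assumed default: server errors (5xx) are errors unless explicitly ignored.
    return 500 <= status_code <= 599 and status_code not in ignored
```

Under these assumptions, passing `additional_http_error_status_codes=[403]` would make a 403 response fail, while `ignore_http_error_status_codes=[500]` would let a 500 response through.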

Issues

Closes: #953
@github-actions github-actions bot added this to the 107th sprint - Tooling team milestone Feb 5, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics. labels Feb 5, 2025
@Pijukatel Pijukatel added the enhancement New feature or request. label Feb 6, 2025
@Pijukatel Pijukatel marked this pull request as ready for review February 6, 2025 12:15
@Pijukatel Pijukatel requested review from vdusek, Mantisus and janbuchar and removed request for vdusek and Mantisus February 6, 2025 12:15

if self._http_client.additional_blocked_status_codes != self._additional_http_error_status_codes:
    raise ValueError(
        'Used `additional_blocked_status_codes` argument does not match with with '
Collaborator

double with

Sorry, can't commit with the quick fix due to limited permissions.

Comment on lines +292 to +313
self._additional_http_error_status_codes = (
    set(additional_http_error_status_codes) if additional_http_error_status_codes else set()
)
self._ignore_http_error_status_codes = (
    set(ignore_http_error_status_codes) if ignore_http_error_status_codes else set()
)

self._http_client = http_client or HttpxHttpClient(
    additional_http_error_status_codes=self._additional_http_error_status_codes,
    ignore_http_error_status_codes=self._ignore_http_error_status_codes,
)

if self._http_client.additional_blocked_status_codes != self._additional_http_error_status_codes:
    raise ValueError(
        'Used `additional_blocked_status_codes` argument does not match with '
        f'{self._http_client.additional_blocked_status_codes=}. They have to be the same.'
    )
if self._http_client.ignore_http_error_status_codes != self._ignore_http_error_status_codes:
    raise ValueError(
        'Used `ignore_http_error_status_codes` argument does not match with '
        f'{self._http_client.ignore_http_error_status_codes=}. They have to be the same.'
    )
Collaborator

Couldn't we just keep them only in the http_client instance? (PW Crawler has HTTP client as well)

Contributor Author

@Pijukatel Pijukatel Feb 11, 2025

I was considering that option, but it felt like misuse to me, especially when it comes to PlaywrightCrawler. PlaywrightCrawler does not use the HTTP client for page.navigate, so it would be really strange if it used an attribute of this unrelated component to decide whether the response status code of page.navigate is OK or not.
(Mentioned in #953 (comment))

But I see it looks like unnecessary code duplication, so I am not 100% happy with this either.
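
The trade-off described here can be sketched roughly as follows: the crawler keeps its own normalized sets and rejects an HTTP client that was configured differently. `StubHttpClient` and `SketchCrawler` are hypothetical stand-ins written for this illustration, not crawlee classes, and the attribute names on the stub are assumptions.

```python
from __future__ import annotations

from collections.abc import Iterable


class StubHttpClient:
    """Hypothetical stand-in for an HTTP client that carries its own status code sets."""

    def __init__(self, additional: Iterable[int] = (), ignore: Iterable[int] = ()) -> None:
        self.additional_http_error_status_codes = set(additional)
        self.ignore_http_error_status_codes = set(ignore)


class SketchCrawler:
    """Hypothetical crawler that owns the sets and checks that the client agrees."""

    def __init__(
        self,
        http_client: StubHttpClient | None = None,
        additional_http_error_status_codes: Iterable[int] | None = None,
        ignore_http_error_status_codes: Iterable[int] | None = None,
    ) -> None:
        # Normalize the optional iterables into sets owned by the crawler.
        self._additional = set(additional_http_error_status_codes or ())
        self._ignore = set(ignore_http_error_status_codes or ())
        # A default client is built from the crawler's own values, so it always matches.
        self._http_client = http_client or StubHttpClient(self._additional, self._ignore)

        # A user-supplied client must be configured with the same sets.
        if self._http_client.additional_http_error_status_codes != self._additional:
            raise ValueError('additional_http_error_status_codes differs between crawler and client.')
        if self._http_client.ignore_http_error_status_codes != self._ignore:
            raise ValueError('ignore_http_error_status_codes differs between crawler and client.')
```

The cost of this design is visible in the sketch: the same two values live in two objects and have to be kept in sync by an explicit check.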

Collaborator

@vdusek vdusek Feb 17, 2025

Yeah, I got it... However, having it duplicated seems like a worse option to me.

@janbuchar Your opinion please?

Collaborator

I think we could consider taking this logic out of the HTTP client and putting it in the BasicCrawler. Then it would work uniformly for any crawler, and we would avoid code duplication.
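
One way this suggestion might look, as a rough sketch only: the base class owns the status code rules and every subclass calls the same check, so no HTTP client attribute is consulted. `SketchBasicCrawler`, `SketchBrowserCrawler`, `_check_status` and `handle_navigation` are hypothetical names, and the 5xx default is an assumption.

```python
from __future__ import annotations

from collections.abc import Iterable


class SketchBasicCrawler:
    """Hypothetical base class that owns the status code rules for every crawler."""

    def __init__(
        self,
        additional_http_error_status_codes: Iterable[int] | None = None,
        ignore_http_error_status_codes: Iterable[int] | None = None,
    ) -> None:
        self._additional = frozenset(additional_http_error_status_codes or ())
        self._ignore = frozenset(ignore_http_error_status_codes or ())

    def _check_status(self, status_code: int) -> None:
        # Assumed rule: added codes always fail; 5xx fails unless ignored.
        is_server_error = 500 <= status_code <= 599
        if status_code in self._additional or (is_server_error and status_code not in self._ignore):
            raise RuntimeError(f'Error status code returned: {status_code}')


class SketchBrowserCrawler(SketchBasicCrawler):
    """Hypothetical browser-based crawler; no HTTP client is involved in the check."""

    def handle_navigation(self, status_code: int) -> str:
        # The status of a browser navigation is validated by the shared base logic.
        self._check_status(status_code)
        return 'ok'
```

With the rules stored in one place, the consistency check between crawler and client would no longer be needed.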

Collaborator

@janbuchar janbuchar Feb 19, 2025

I agree - in the long run, we want to have this logic factored out of the http client. I believe there was an issue to track that, but I only found #830.

It's probably fine to duplicate now and make an issue for refactoring this later.

Collaborator

In that case, I don't see a problem if we keep the duplication of code at this point. It will be solved during refactoring.

Contributor Author

Yes, maybe it will be solved at the same time as #830, but if not, here is the issue: #998

Collaborator

@Mantisus Mantisus left a comment

LGTM

Collaborator

@vdusek vdusek left a comment

LGTM

@Pijukatel Pijukatel merged commit 87cf446 into master Feb 19, 2025
23 checks passed
@Pijukatel Pijukatel deleted the additional-status-codes branch February 19, 2025 14:26
Mantisus pushed a commit to Mantisus/crawlee-python that referenced this pull request Feb 19, 2025
…pify#959)

Add `additional_http_error_status_codes` and
`ignore_http_error_status_codes` to PlaywrightCrawler.
Since they exist now on all crawlers, move them to `BasicCrawler` level.
Do not use `_http_client` attributes for getting additional status codes
related variables.

**Breaking:** Remove `HttpCrawlerOptions` -> No unique options compared
to `BasicCrawlerOptions` anymore.

- Closes: apify#953
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics.
Development

Successfully merging this pull request may close these issues.

Add ignore_http_error_status_codes and additional_http_error_status_codes arguments to PlaywrightCrawler
4 participants