Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Integrate browserforge fingerprints #829

Open
wants to merge 30 commits into
base: master
Choose a base branch
from

Conversation

Pijukatel
Copy link
Contributor

@Pijukatel Pijukatel commented Dec 18, 2024

Description

Integrate browserforge package and use it's fingerprint and header generation capabilities to enable fingerprint generation in PlaywrightCrawler

How can users use it?
Closely following JS implementation would lead to:

crawler = PlaywrightCrawler(
    headless=True,
    browser_pool_options={
        'use_fingerprints': True,
        'fingerprint_generator_options': {
            'browsers': 'firefox',
            'os': 'android',
            'screen': Screen(
                min_width=min_width, max_width=max_width, min_height=min_height, max_height=max_height
            ),
        },
    },
)

I suggest this instead:

from crawlee.fingerprint_suite import DefaultFingerprintGenerator, HeaderGeneratorOptions, ScreenOptions

fingerprint_generator = DefaultFingerprintGenerator(
    header_options=HeaderGeneratorOptions(browsers=['firefox'], operating_systems=['android']),
    screen=ScreenOptions(min_width=min_width, max_width=max_width, min_height=min_height, max_height=max_height),
)

crawler = PlaywrightCrawler(
    headless=True,
    fingerprint_generator=fingerprint_generator
)

Reasoning:

  • FingerprintGeneratorOptions are too complex to be represented by nested dicts - not very readable.
  • avoid having arguments use_fingerprints and fingerprint_generator_options. Instead have argument fingerprint_generator that is implicit for use_fingerprints, it contains all the fingerprint_generator_options and it also allows custom implementations of generator to be passed.
  • DefaultFingerprintGenerator is very convenient for us, as we can set it to browserforge adapter for now and change it later.

Currently we do not have our own injector, so we have to use browser forge injector. That means that temporarily we have to accept Fingerprint definition from browserforge as output that is expected from AbstractFingerprintGenerator.generate().
It is explicitly mentioned there as something that is subject to change.

Why doing it like this?
This change is already usable as it is and brings most value now. Common user can use some sort of fingerprint generation offered by browserforge.
Adding our own type Fingerprint and our own injector is some additional effort and has almost no value for external users and can be part of future refactoring. This refactoring would affect only very advanced users that created their own custom fingerprint generator (quite advanced for experimental feature that is not yet even available. So purely theoretical.)

Issues

TODO: Tests, JS page function for injecting fingerprint json.
Todo: Make it work with page.add_init_script
Added dev test
TODO:
Add proper tests
Sync header generation with fingerprint generation (use gen from browserforge)
Added more tests.
Use browser_pool_options from crawler to pass fingerprint related stuff to be similar to JS
@Pijukatel Pijukatel added enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. labels Dec 18, 2024
@barjin
Copy link
Contributor

barjin commented Dec 18, 2024

Good job! Why don't we use browserforge's injector module for this? Do we need more control over the fingerprint than browserforge can offer us?

@Pijukatel
Copy link
Contributor Author

Good job! Why don't we use browserforge's injector module for this? Do we need more control over the fingerprint than browserforge can offer us?

I was not aware of that. Wow, that guy did everything. I will try to use that injector,

@Pijukatel Pijukatel changed the title Integrate browserforge fingerprints feat: Integrate browserforge fingerprints Dec 19, 2024
@@ -152,8 +143,45 @@ async def _create_browser_context(self, proxy_info: ProxyInfo | None = None) ->
else None
)

return await self._browser.new_context(
if self._use_fingerprints:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This can be simpler if we decide to rely purely on browserforge and remove our own header generation. )

@github-actions github-actions bot added this to the 105th sprint - Tooling team milestone Dec 19, 2024
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Dec 19, 2024
@Pijukatel Pijukatel force-pushed the integrate-browserforge-fingerprints branch from 7bd985c to 25aa4e2 Compare January 14, 2025 10:13
(Hint about browserforge being just implementation detail and not core functionality.)
@vdusek
Copy link
Collaborator

vdusek commented Jan 17, 2025

  1. How can users use it?

Why not expose the entire class directly, instead of creating an additional one just for the options and exposing that?

fingerprint_generator = FingerprintGenerator(
    header_generator=HeaderGenerator(...),
    screen=Screen(...),
)

crawler = PlaywrightCrawler(
    headless=True,
    fingerprint_generator=fingerprint_generator,
)
  1. Keep our own header generation or rely on browserforge?

I would completely remove our implementation of HeaderGenerator and rely on the one from browserforge to avoid mixing the two.

@Pijukatel
Copy link
Contributor Author

  1. How can users use it?

Why not expose the entire class directly, instead of creating an additional one just for the options and exposing that?

At least for the HeaderGenaratorOptions that I think is necessary to avoid users mixing various header generators with fingerprint generators. HeaderGenerator is internal to FingerprintGenerator to ensure that generated fingerprint and headers are consistent.

FingerprintGeneratorOptions can be removed and the inputs defined in abstract class instead.

@Pijukatel
Copy link
Contributor Author

  1. Keep our own header generation or rely on browserforge?

I would completely remove our implementation of HeaderGenerator and rely on the one from browserforge to avoid mixing the two.

I agree, but that step I would rather take in separate follow up PR to not make this PR too large.

@Pijukatel Pijukatel marked this pull request as ready for review January 23, 2025 08:20
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know this part of the codebase that well, so I found just some minor design nitpicks

src/crawlee/fingerprint_suite/_fingerprint_generator.py Outdated Show resolved Hide resolved
src/crawlee/fingerprint_suite/_fingerprint_generator.py Outdated Show resolved Hide resolved
@Pijukatel Pijukatel requested a review from janbuchar January 23, 2025 13:20
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but I'd prefer to wait for @vdusek's approval as well

@vdusek
Copy link
Collaborator

vdusek commented Jan 24, 2025

I agree, but that step I would rather take in separate follow up PR to not make this PR too large.

Do I understand it correctly that it will not require any changes on the interface?

Also, could you open an issue for it?

Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be documented later by #481 , which already has decent PR #576

I would prefer to include the documentation along with this PR. It could provide a clearer understanding and demonstrate how it is being used. This is one of the reasons we keep the documentation in the same repository as the source code.

Otherwise it's great.

src/crawlee/fingerprint_suite/_fingerprint_generator.py Outdated Show resolved Hide resolved
src/crawlee/browsers/_browser_pool.py Outdated Show resolved Hide resolved
src/crawlee/browsers/_playwright_browser_controller.py Outdated Show resolved Hide resolved
src/crawlee/browsers/_playwright_browser_plugin.py Outdated Show resolved Hide resolved
from browserforge.fingerprints import Fingerprint


class FingerprintGenerator(ABC):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing class docstring? It should explain what it is and what it is for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

from ._types import HeaderGeneratorOptions, ScreenOptions


class BrowserforgeFingerprintGenerator(FingerprintGenerator):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing class docstring?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@Pijukatel
Copy link
Contributor Author

I agree, but that step I would rather take in separate follow up PR to not make this PR too large.

Do I understand it correctly that it will not require any changes on the interface?

Also, could you open an issue for it?

Yes, I think it should be without any interface changes.
#937

Update docstrings.
Add example code + doc page.
(To avoid some import race conditions downloads when running pytest with multiple processes)
@Pijukatel
Copy link
Contributor Author

Will be documented later by #481 , which already has decent PR #576

I would prefer to include the documentation along with this PR. It could provide a clearer understanding and demonstrate how it is being used. This is one of the reasons we keep the documentation in the same repository as the source code.

Otherwise it's great.

I added example code + page.

I guess I should also make Python variant of this doc in other repo: https://docs.apify.com/academy/anti-scraping/mitigation/generating-fingerprints

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add init script for Playwright browser context
4 participants