feat: Integrate browserforge fingerprints #829

Merged
merged 33 commits into from
Feb 5, 2025
33 commits
b0d52f2
Draft of integration of browserforge fingerprint generation.
Pijukatel Dec 16, 2024
be15847
Works with page.evaluate.
Pijukatel Dec 17, 2024
a9415ec
Use add_init_script
Pijukatel Dec 17, 2024
36727a1
WIP
Pijukatel Dec 18, 2024
42eff80
Fix format, type check and tests.
Pijukatel Dec 18, 2024
998cbb6
Fix rootcause for flakiness in fingerprint generation
Pijukatel Dec 18, 2024
e1025c8
Use browserforge.injector code for fingerprints
Pijukatel Dec 19, 2024
33fdd6e
Merge remote-tracking branch 'origin/master' into integrate-browserfo…
Pijukatel Dec 19, 2024
85ba877
Regenerate poetry lock after merge
Pijukatel Dec 19, 2024
6e35c1d
Remove unintentional change to headless test
Pijukatel Dec 19, 2024
3f96456
Merge branch 'master' into integrate-browserforge-fingerprints
Pijukatel Jan 3, 2025
ddfabea
chore: revert React version bump
barjin Jan 3, 2025
3d37bca
Merge remote-tracking branch 'origin/master' into integrate-browserfo…
Pijukatel Jan 10, 2025
1b8e6a3
Add ScreenFingerprint and NavigatorFingerprint
Pijukatel Jan 10, 2025
9828a36
Add Fingerprint and their options types
Pijukatel Jan 13, 2025
f733c07
Add adapter tests
Pijukatel Jan 13, 2025
97011d9
Integrate into pw_crawler
Pijukatel Jan 13, 2025
debe900
Further integration into our code.
Pijukatel Jan 14, 2025
3d8340c
Finalize draft.
Pijukatel Jan 14, 2025
3d9b170
Set fiongerprint generator as top level argument to pw crawler
Pijukatel Jan 14, 2025
25aa4e2
Revert unnecessary change to function doc string.
Pijukatel Jan 14, 2025
5e46b78
Make test adapter-generic.
Pijukatel Jan 14, 2025
69b6974
Add types to __init__ if fingerprint_suite
Pijukatel Jan 14, 2025
27479be
Remove FingerprintGeneratorOptions
Pijukatel Jan 20, 2025
751f67c
Merge remote-tracking branch 'origin/master' into integrate-browserfo…
Pijukatel Jan 23, 2025
1cbadb0
Review commnets
Pijukatel Jan 23, 2025
8e44acd
Handle inconsistent result from browserforge fingerprint generator
Pijukatel Jan 24, 2025
d8001e7
Apply suggestions from code review
Pijukatel Jan 27, 2025
07acbfa
Docs
Pijukatel Jan 27, 2025
866fe98
Make sure browserforge files are downloaded before tests.
Pijukatel Jan 27, 2025
b3eee4f
Merge remote-tracking branch 'origin/master' into integrate-browserfo…
Pijukatel Jan 31, 2025
5c15db1
Review comments
Pijukatel Feb 4, 2025
fa9b0f9
Merge remote-tracking branch 'origin/master' into integrate-browserfo…
Pijukatel Feb 5, 2025
1 change: 1 addition & 0 deletions Makefile
@@ -10,6 +10,7 @@ install-dev:
 	poetry install --all-extras
 	poetry run pre-commit install
 	poetry run playwright install
+	poetry run python -m browserforge update

 build:
 	poetry build --no-interaction -vv
@@ -0,0 +1,40 @@
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.fingerprint_suite import DefaultFingerprintGenerator, HeaderGeneratorOptions, ScreenOptions


async def main() -> None:
    # Use default fingerprint generator with desired fingerprint options.
    # Generator will try to generate real looking browser fingerprint based on the options.
    # Unspecified fingerprint options will be automatically selected by the generator.
    fingerprint_generator = DefaultFingerprintGenerator(
        header_options=HeaderGeneratorOptions(browsers=['chromium']),
        screen_options=ScreenOptions(min_width=400),
    )

    crawler = PlaywrightCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
        # Headless mode, set to False to see the browser in action.
        headless=False,
        # Browser types supported by Playwright.
        browser_type='chromium',
        # Fingerprint generator to be used. By default no fingerprint generation is done.
        fingerprint_generator=fingerprint_generator,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Find a link to the next page and enqueue it if it exists.
        await context.enqueue_links(selector='.morelink')

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://news.ycombinator.com/'])


if __name__ == '__main__':
    asyncio.run(main())
17 changes: 17 additions & 0 deletions docs/examples/playwright_crawler_with_fingerprint_generator.mdx
@@ -0,0 +1,17 @@
---
id: playwright-crawler-with-fingeprint-generator
title: Playwright crawler with fingerprint generator
---

import ApiLink from '@site/src/components/ApiLink';
import CodeBlock from '@theme/CodeBlock';

import PlaywrightCrawlerExample from '!!raw-loader!./code/playwright_crawler_with_fingerprint_generator.py';

This example demonstrates how to use <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> together with a <ApiLink to="class/FingerprintGenerator">`FingerprintGenerator`</ApiLink>, which populates several browser attributes to mimic a real browser fingerprint. To read more about fingerprints, see https://docs.apify.com/academy/anti-scraping/techniques/fingerprinting.

You can implement your own fingerprint generator or use <ApiLink to="class/BrowserforgeFingerprintGenerator">`DefaultFingerprintGenerator`</ApiLink>. To use the generator, initialize it with the desired fingerprint options. The generator will try to create a fingerprint based on those options. Unspecified options are selected automatically by the generator from a set of reasonable values. If an option is important to you, do not rely on the default; define it explicitly.

<CodeBlock className="language-python">
{PlaywrightCrawlerExample}
</CodeBlock>
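
Note on the docs text above: the behaviour it describes (explicit options are honoured, unspecified ones are chosen from reasonable values) can be pictured with a small standard-library sketch. This is only an illustration of the idea, not how browserforge or `DefaultFingerprintGenerator` is implemented. `ScreenConstraints`, `COMMON_SCREENS`, and `pick_screen` are hypothetical names invented here, and the screen sizes are made-up sample data.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScreenConstraints:
    """Hypothetical stand-in for screen options; None means 'let the generator choose'."""

    min_width: Optional[int] = None
    max_width: Optional[int] = None


# Illustrative pool of (width, height) pairs; sample data, not taken from browserforge.
COMMON_SCREENS = [(1280, 720), (1366, 768), (1536, 864), (1920, 1080), (2560, 1440)]


def pick_screen(constraints: ScreenConstraints, rng: random.Random) -> tuple[int, int]:
    """Pick a random screen that satisfies every explicitly set constraint."""
    candidates = [
        (width, height)
        for (width, height) in COMMON_SCREENS
        if (constraints.min_width is None or width >= constraints.min_width)
        and (constraints.max_width is None or width <= constraints.max_width)
    ]
    if not candidates:
        raise ValueError('No screen satisfies the given constraints.')
    return rng.choice(candidates)
```

In this sketch, a call such as `pick_screen(ScreenConstraints(min_width=1500), random.Random())` plays the same role as passing `ScreenOptions(min_width=400)` in the example file: the caller pins what matters and leaves the rest to the generator.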
109 changes: 64 additions & 45 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 3 additions & 2 deletions pyproject.toml
@@ -44,6 +44,7 @@ keywords = [
 python = "^3.9"
 apify = { version = ">=2.0.0", optional = true }
 beautifulsoup4 = { version = ">=4.12.0", optional = true }
+browserforge = { version = ">=1.2.3", optional = true }
 colorama = ">=0.4.0"
 cookiecutter = ">=2.6.0"
 curl-cffi = { version = ">=0.7.2", optional = true }
@@ -100,10 +101,10 @@ types-python-dateutil = "~2.9.0.20240316"
 # Support for re-using groups in other groups https://peps.python.org/pep-0735/ in poetry:
 # https://github.com/python-poetry/poetry/issues/9751
 adaptive-playwright = ["jaro-winkler", "playwright", "scikit-learn"]
-all = ["beautifulsoup4", "curl-cffi", "html5lib", "jaro-winkler", "lxml", "parsel", "playwright", "scikit-learn"]
+all = ["beautifulsoup4", "browserforge", "curl-cffi", "html5lib", "jaro-winkler", "lxml", "parsel", "playwright", "scikit-learn"]
 beautifulsoup = ["beautifulsoup4", "lxml", "html5lib"]
 curl-impersonate = ["curl-cffi"]
-playwright = ["playwright"]
+playwright = ["browserforge", "playwright"]
 parsel = ["parsel"]

 [tool.poetry.scripts]
9 changes: 8 additions & 1 deletion src/crawlee/browsers/_browser_pool.py
@@ -23,6 +23,7 @@
     from types import TracebackType

     from crawlee.browsers._base_browser_plugin import BaseBrowserPlugin
+    from crawlee.fingerprint_suite import FingerprintGenerator
     from crawlee.proxy_configuration import ProxyInfo

 logger = getLogger(__name__)
@@ -103,6 +104,7 @@ def with_default_plugin(
         browser_launch_options: Mapping[str, Any] | None = None,
         browser_new_context_options: Mapping[str, Any] | None = None,
         headless: bool | None = None,
+        fingerprint_generator: FingerprintGenerator | None = None,
         use_incognito_pages: bool | None = False,
         **kwargs: Any,
     ) -> BrowserPool:
@@ -117,6 +119,8 @@ def with_default_plugin(
                 are provided directly to Playwright's `browser.new_context` method. For more details, refer to the
                 Playwright documentation: https://playwright.dev/python/docs/api/class-browser#browser-new-context.
             headless: Whether to run the browser in headless mode.
+            fingerprint_generator: An optional instance of implementation of `FingerprintGenerator` that is used
+                to generate browser fingerprints together with consistent headers.
             use_incognito_pages: By default pages share the same browser context. If set to True each page uses its
                 own context that is destroyed once the page is closed or crashes.
             kwargs: Additional arguments for default constructor.
@@ -134,7 +138,10 @@ def with_default_plugin(
         if browser_type:
             plugin_options['browser_type'] = browser_type

-        plugin = PlaywrightBrowserPlugin(**plugin_options)
+        plugin = PlaywrightBrowserPlugin(
+            **plugin_options,
+            fingerprint_generator=fingerprint_generator,
+        )
         return cls(plugins=[plugin], **kwargs)

     @property
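
Note on the last hunk: the new constructor call unpacks `plugin_options` and then passes `fingerprint_generator` as an explicit keyword. A minimal sketch of that call shape follows, using only the standard library. `build_plugin` and `make_plugin` are hypothetical stand-ins written for this note, not crawlee functions; `build_plugin` simply records the keyword arguments it receives.

```python
from typing import Any, Mapping, Optional


def build_plugin(**options: Any) -> dict:
    """Hypothetical stand-in for a plugin constructor; it only records its keyword arguments."""
    return dict(options)


def make_plugin(plugin_options: Mapping[str, Any], fingerprint_generator: Optional[object] = None) -> dict:
    # Same call shape as in the diff: unpack the mapping, then add one explicit keyword.
    return build_plugin(**plugin_options, fingerprint_generator=fingerprint_generator)


def collides(plugin_options: Mapping[str, Any]) -> bool:
    """Return True if the explicit keyword clashes with a key already in the mapping."""
    try:
        make_plugin(plugin_options, fingerprint_generator=None)
    except TypeError:
        # Python rejects the same keyword arriving both via ** and explicitly.
        return True
    return False
```

The design point is that this pattern is only safe while the unpacked mapping never carries the same key. Judging by the hunk, `plugin_options` is assembled inside `with_default_plugin`, so a clash looks unlikely there, but the same shape applied to a caller-supplied mapping would raise `TypeError` on a duplicate key.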