Scraper uc #294

aravindkarnam · 2024-11-26T11:37:10Z

No description provided.

- Implement playwright_stealth for better bot detection avoidance
- Add user simulation and navigator override options
- Improve iframe processing and browser selection
- Enhance error reporting and debugging capabilities
- Optimize image processing and parallel crawling
- Add new example for user simulation feature
- Add support for including links in Markdown content by defining a new flag `include_links_on_markdown` in the `crawl` method

- Update version number to 0.3.71
- Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy
- Enhance context creation with additional options
- Improve error message formatting and visibility
- Update quickstart documentation

- Add OverlappingWindowChunking and improve SlidingWindowChunking
- Update CHUNK_TOKEN_THRESHOLD to 2048 tokens
- Optimize AsyncPlaywrightCrawlerStrategy close method
- Enhance flexibility in CosineStrategy with generic embedding model loading
- Improve JSON-based extraction strategies
- Add knowledge graph generation example
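To illustrate what overlapping-window chunking generally means, here is a minimal sketch. The function name `overlapping_window_chunks` and its parameters are hypothetical and are not taken from the crawl4ai code; the actual `OverlappingWindowChunking` class may behave differently.

```python
from typing import List


def overlapping_window_chunks(text: str, window_size: int = 5, overlap: int = 2) -> List[str]:
    """Split text into word windows where consecutive chunks share `overlap` words.

    Hypothetical helper for illustration only; not the crawl4ai implementation.
    """
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    words = text.split()
    if not words:
        return []
    step = window_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once the current window has reached the end of the text.
        if start + window_size >= len(words):
            break
    return chunks
```

With `window_size=3` and `overlap=1`, each chunk repeats the last word of the previous chunk, which keeps some context across chunk boundaries.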

- Integrate customized html2text library for flexible Markdown output
- Add options to exclude external links and images
- Improve content scraping efficiency and error handling
- Update AsyncPlaywrightCrawlerStrategy for faster closing
- Enhance CosineStrategy with generic embedding model loading

- Add ContentCleaningStrategy for improved content extraction
- Implement advanced proxy configuration with authentication
- Enhance image source detection and handling
- Add fit_markdown and fit_html for refined content output
- Improve external link and image handling flexibility

…pabilities

- Add smart overlay removal system for handling popups and modals
- Improve screenshot functionality with configurable timing controls
- Implement URL normalization and enhanced link processing
- Add custom base directory support for cache storage
- Refine external content filtering and social media domain handling

This commit significantly improves the crawler's ability to handle modern websites by automatically removing intrusive overlays and providing better screenshot capabilities. URL handling is now more robust with proper normalization and duplicate detection. The cache system is more flexible with customizable base directory support.

Breaking changes: None
Issue numbers: None
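The URL normalization and duplicate detection mentioned above could look roughly like the following sketch. The helper names `normalize_url` and `dedupe_links` are assumptions made for illustration; the commit's actual implementation is not shown on this page and may apply different rules.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_url(href: str, base_url: str) -> str:
    """Resolve href against base_url and return a canonical form (illustrative only)."""
    absolute = urljoin(base_url, href.strip())
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    # Simplification: .hostname drops any user:password@ prefix and IPv6 brackets.
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def dedupe_links(hrefs, base_url):
    """Return normalized URLs in first-seen order, skipping duplicates."""
    seen = set()
    unique = []
    for href in hrefs:
        url = normalize_url(href, base_url)
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

Normalizing before comparing is what makes `/a`, `/a#x`, and the absolute form of the same page count as one link.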

According to unclecode#102, the versions listed in the requirements are minimum versions. They are currently pinned to exact versions in requirements.txt and setup.py, so projects consuming this package are limited to exactly those versions instead of a more flexible range. This PR addresses this.

…ategy, which allows us to expand our future strategies to generate better markdown. For now, raw markdown is generated as before. There is a new BM25-based algorithm for fitting markdown, and markdown can now be refined into a citation form: each link is extracted and replaced by a citation reference number, and a references section at the very end lists all the links with their descriptions. This format is more suitable for large language models. When links do not need to be passed, the markdown becomes significantly smaller, and the list of references can be attached to a large language model as a separate file. This commit contains changes in this direction.
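The citation form described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `to_citations`, the regular expression, and the exact reference layout are hypothetical and are not taken from the commit.

```python
import re

# Matches inline links like [text](url) but not images like ![alt](src).
LINK_RE = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def to_citations(markdown: str):
    """Replace inline links with numbered citations and build a reference list.

    Returns (body, references). Repeated URLs reuse the same number.
    Hypothetical helper for illustration only.
    """
    refs = {}  # url -> (number, first link text seen for that url)

    def replace(match):
        text, url = match.group(1), match.group(2)
        if url not in refs:
            refs[url] = (len(refs) + 1, text)
        return f"{text} [{refs[url][0]}]"

    body = LINK_RE.sub(replace, markdown)
    references = "\n".join(f"[{n}] {url} ({text})" for url, (n, text) in refs.items())
    return body, references
```

Because the body and the references are returned separately, the references can be appended to the end of the document or kept in a separate file, as the commit message suggests.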

… created in the correct event loop
- Explicitly retrieve and use the correct event loop when creating tasks to avoid cross-loop issues.
- Ensures proper task scheduling in environments with multiple event loops.

unclecode and others added 30 commits October 17, 2024 21:37

Update gitignore

dbb587d

Rename some flags name, introducing magic flag.

dd17ed0

Update requirements and switch to 0.3.8

aab6ea0

Fix the model name in quick start example

b309bc3

Update Changelog

e7cd8a1

Refactor content scrapping strategy and improve error handling

1dd36f9

Fix Base64 image parsing in WebScrappingStrategy (issue 182)

04d16e6

- Add support for extracting Base64 encoded images
- Improve image format detection to include Base64 images
- Enhance compatibility with locally saved HTML files using Base64 image encoding
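As a rough sketch of how a Base64 image in an `src` attribute can be detected and decoded, the snippet below parses a `data:` URI. The helper name `parse_base64_image` is hypothetical; the detection logic in `WebScrappingStrategy` is not shown on this page and may differ.

```python
import base64
import binascii
import re

# data:image/<format>;base64,<payload>
DATA_URI_RE = re.compile(r"^data:image/(?P<fmt>[a-zA-Z0-9.+-]+);base64,(?P<data>.+)$")


def parse_base64_image(src: str):
    """Return (format, raw_bytes) for a Base64 image data URI, else None.

    Hypothetical helper for illustration only.
    """
    match = DATA_URI_RE.match(src.strip())
    if not match:
        return None  # a regular URL or an unsupported data URI
    try:
        raw = base64.b64decode(match.group("data"), validate=True)
    except (binascii.Error, ValueError):
        return None  # payload is not valid Base64
    return match.group("fmt").lower(), raw
```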

feat: customize crawl base directory

a5f627b

Merge pull request unclecode#194 from IdrisHanafi/feat/customize-craw…

32f57c4

…l-base-directory Support for custom crawl base directory

Update version

38474bd

Update Documentation

4239654

Merge branch 'main' of https://github.com/unclecode/crawl4ai

ff9149b

Update gitignore

ac9d83c

Merge branch '0.3.72'

d61615e

Update Docs folder, prepare branch for new version 0.3.73

c2a71a5

Update Readme

d913e20

Add badges to README

b2800fe

Fix README badge

d9e0b7a

Update new tutorial documents and added to the docs folder.

3529c2e

Merge branch '0.3.73'

e9f7d5e

fix dev requirements and lock playwright due to failing tests

605a827

Update documents, upload new version of quickstart.

9307c19

Merge branch '0.3.73'

982d203

unclecode and others added 30 commits November 19, 2024 19:02

Merge branch 'main' of https://github.com/unclecode/crawl4ai

788c67c

Remove test files

fbcff85

test: trying to push to 0.3.74

a6dad3f

Delete test3.txt

f2cb7d5

Update .gitignore to exclude additional scripts and files

b654c49

chore: add manage-collab.sh to .gitignore

2bdec1f

Merge branch '0.3.74' of https://github.com/unclecode/crawl4ai into 0…

7047422

….3.74

Fix unclecode#260 prevent pass duplicated kwargs to scrapping_strategy (

d418a04

unclecode#269) Thank you for the suggestions. It totally makes sense now. Change to pop operator.
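The "pop operator" change refers to a common Python pitfall: reading a value from `kwargs` with `get` and then forwarding both the value and `**kwargs` passes the same keyword twice. The sketch below uses hypothetical function names (`scrap`, `process`, `process_buggy`) to show the pattern; it is not the project's code.

```python
def scrap(url, html, word_count_threshold=10, **kwargs):
    """Stand-in for a scraping function that accepts extra options."""
    return {"url": url, "word_count_threshold": word_count_threshold, "extra": kwargs}


def process_buggy(url, html, **kwargs):
    # get() leaves the key inside kwargs, so it is forwarded a second time below.
    threshold = kwargs.get("word_count_threshold", 10)
    return scrap(url, html, word_count_threshold=threshold, **kwargs)


def process(url, html, **kwargs):
    # pop() removes the key from kwargs, so it is only passed once.
    threshold = kwargs.pop("word_count_threshold", 10)
    return scrap(url, html, word_count_threshold=threshold, **kwargs)
```

Calling `process_buggy(..., word_count_threshold=5)` raises a `TypeError` about multiple values for the keyword argument, while `process` forwards it cleanly.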

fix: crawler strategy exception handling and fixes (unclecode#271)

3439f78

feat: enhance image processing capabilities

006bee4

- Enhanced image processing with srcset support and validation checks for better image selection.

Update Readme

571dda6

feat: enhance Markdown generation to include fit_html attribute

24ad2fe

chore: update README to reflect new features and improvements in vers…

e02935d

…ion 0.3.74

chore: update README to include new features and improvements for ver…

8dea3f4

…sion 0.3.74

Merge branch '0.3.74'

a5decaa

Merge branch 'main' of https://github.com/unclecode/crawl4ai

d7a112f

feat: add enhanced markdown generation example with citations and fil…

0d0cef3

…e output

Fixed a few bugs, import errors and changed to asyncio wait_for inste…

c179703

…ad of timeout to support python versions < 3.11
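For context, `asyncio.timeout()` was added in Python 3.11, whereas `asyncio.wait_for()` is available on earlier versions. A minimal sketch of the `wait_for` pattern follows; `slow_operation` and `run_with_timeout` are hypothetical names used only for this example.

```python
import asyncio


async def slow_operation(delay: float) -> str:
    """Stand-in for a crawl step that may take too long."""
    await asyncio.sleep(delay)
    return "done"


async def run_with_timeout(delay: float, limit: float) -> str:
    try:
        # wait_for cancels the inner coroutine if it exceeds `limit` seconds.
        return await asyncio.wait_for(slow_operation(delay), timeout=limit)
    except asyncio.TimeoutError:
        return "timed out"
```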

Fixed a bug in _process_links, handled condition for when url_scorer …

f8e85b1

…is passed as None, renamed the scrapper folder to scraper.

Merge pull request #8 from aravindkarnam/main

3d52b55

Pulling in 0.3.74

fix: Exempting the start_url from can_process_url

2226ef5

chore: 1. Expose process_external_links as a param

b13fd71

2. Removed a few unused imports
3. Removed URL normalisation for external links separately as that won't be necessary

fix: moved depth as a param to can_process_url and applying filter ch…

ee3001b

…ain only when depth is not zero. This way filter chain is skipped but other validations are in place even for start URL
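The behaviour described in this commit message can be sketched as below. The signature and the validation rule are assumptions for illustration; only the idea (basic validation always runs, the filter chain is skipped at depth zero) comes from the commit message.

```python
from urllib.parse import urlsplit


def can_process_url(url: str, depth: int, filters) -> bool:
    """Decide whether a URL should be crawled (illustrative sketch only).

    `filters` is assumed to be an iterable of callables taking a URL and
    returning True when the URL passes.
    """
    parts = urlsplit(url)
    # Basic validation applies to every URL, including the start URL.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return False
    # The filter chain is skipped at depth 0 so the start URL is not filtered out.
    if depth == 0:
        return True
    return all(f(url) for f in filters)
```

With a filter that only accepts `/blog/` paths, a start URL such as the site root is still processed at depth 0, while non-matching links found at depth 1 or deeper are rejected.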

Remove the can_process_url check from _process_links since it's alrea…

a98d51a

…dy being checked in process_url

<Future pending> issue fix was incorrect. Reverting

155c756

fixed the final scraper_quickstart.py example

9530ded

fixed the final scraper_quickstart.py example

ff731e4

updated definition of can_process_url to include depth as an argument,…

2f5e059

… as it's needed to skip filters for start_url
