Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scraper uc #294

Open
wants to merge 142 commits into
base: scraper-uc
Choose a base branch
from
Open

Conversation

aravindkarnam
Copy link
Contributor

No description provided.

unclecode and others added 30 commits October 17, 2024 21:37
- Implement playwright_stealth for better bot detection avoidance
- Add user simulation and navigator override options
- Improve iframe processing and browser selection
- Enhance error reporting and debugging capabilities
- Optimize image processing and parallel crawling
- Add new example for user simulation feature
- Added support for including links in Markdown content, by definin g a new flag `include_links_on_markdown` in `crawl` method.
- Update version number to 0.3.71
- Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy
- Enhance context creation with additional options
- Improve error message formatting and visibility
- Update quickstart documentation
- Add OverlappingWindowChunking and improve SlidingWindowChunking
- Update CHUNK_TOKEN_THRESHOLD to 2048 tokens
- Optimize AsyncPlaywrightCrawlerStrategy close method
- Enhance flexibility in CosineStrategy with generic embedding model loading
- Improve JSON-based extraction strategies
- Add knowledge graph generation example
- Integrate customized html2text library for flexible Markdown output
- Add options to exclude external links and images
- Improve content scraping efficiency and error handling
- Update AsyncPlaywrightCrawlerStrategy for faster closing
- Enhance CosineStrategy with generic embedding model loading
- Add support for extracting Base64 encoded images
- Improve image format detection to include Base64 images
- Enhance compatibility with locally saved HTML files using Base64 image encoding
- Add ContentCleaningStrategy for improved content extraction
- Implement advanced proxy configuration with authentication
- Enhance image source detection and handling
- Add fit_markdown and fit_html for refined content output
- Improve external link and image handling flexibility
…l-base-directory

Support for custom crawl base directory
…pabilities

• Add smart overlay removal system for handling popups and modals
• Improve screenshot functionality with configurable timing controls
• Implement URL normalization and enhanced link processing
• Add custom base directory support for cache storage
• Refine external content filtering and social media domain handling

This commit significantly improves the crawler's ability to handle modern
websites by automatically removing intrusive overlays and providing better
screenshot capabilities. URL handling is now more robust with proper
normalization and duplicate detection. The cache system is more flexible
with customizable base directory support.

Breaking changes: None
Issue numbers: None
According to unclecode#102 the requirements specified are minimum version. Currently they are defined as fixed versions in requirements.txt and setup.py leading to projects consuming this package are limited to using exactly these requirements instead of a more flexible range. This PR addresses this.
unclecode and others added 30 commits November 19, 2024 19:02
unclecode#269)

Thank you for the suggestions. It totally makes sense now. Change to pop operator.
…ategy, which allows us to expand our future strategies to generate better markdown. Right now, we generate raw markdown as we were doing before. We have a new algorithm for fitting markdown based on BM25, and now we add the ability to refine markdown into a citation form. Our links will be extracted and replaced by a citation reference number, and then we will have reference sections at the very end; we add all the links with the descriptions. This format is more suitable for large language models. In case we don't need to pass links, we can reduce the size of the markdown significantly and also attach the list of references as a separate file to a large language model. This commit contains changes for this direction.
- Enhanced image processing with srcset support and validation checks for better image selection.
…ad of timeout to support python versions < 3.11
…is passed as None, renamed the scrapper folder to scraper.
2. Removed a few unused imports
3. Removed URL normalisation for external links separately as that won't be necessary
…ain only when depth is not zero. This way

filter chain is skipped but other validations are in place even for start URL
… created in the correct event loop

- Explicitly retrieve and use the correct event loop when creating tasks to avoid cross-loop issues.
- Ensures proper task scheduling in environments with multiple event loops.
… as it's needed to skip filters for start_url
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants