Scraper uc #294

Open
wants to merge 142 commits into
base: scraper-uc
142 commits
768aa06
feat(crawler): Enhance stealth and flexibility, improve error handling
unclecode Oct 17, 2024
dbb587d
Update gitignore
unclecode Oct 17, 2024
dd17ed0
Rename some flags name, introducing magic flag.
unclecode Oct 18, 2024
aab6ea0
Update requirements and switch to 0.3.8
unclecode Oct 18, 2024
b8147b6
chore: Bump version to 0.3.71 and improve error handling
unclecode Oct 18, 2024
b309bc3
Fix the model name in quick start example
unclecode Oct 18, 2024
4e2852d
[v0.3.71] Enhance chunking strategies and improve overall performance
unclecode Oct 19, 2024
e7cd8a1
Update Changelog
unclecode Oct 19, 2024
6ec4cb3
Enhance Markdown generation and external content control
unclecode Oct 20, 2024
1dd36f9
Refactor content scrapping strategy and improve error handling
unclecode Oct 20, 2024
04d16e6
Fix Base64 image parsing in WebScrappingStrategy (issue 182)
unclecode Oct 20, 2024
a5f627b
feat: customize crawl base directory
IdrisHanafi Oct 21, 2024
60ba131
[v0.3.72] Enhance content extraction and proxy support
unclecode Oct 22, 2024
32f57c4
Merge pull request #194 from IdrisHanafi/feat/customize-crawl-base-di…
unclecode Oct 24, 2024
bcfe83f
feat: enhance crawler with overlay removal and improved screenshot ca…
unclecode Oct 24, 2024
38474bd
Update version
unclecode Oct 24, 2024
4239654
Update Documentation
unclecode Oct 27, 2024
ff9149b
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Oct 27, 2024
ac9d83c
Update gitignore
unclecode Oct 27, 2024
d61615e
Merge branch '0.3.72'
unclecode Oct 27, 2024
c2a71a5
Update Docs folder, prepare branch for new version 0.3.73
unclecode Oct 27, 2024
d913e20
Update Readme
unclecode Oct 28, 2024
b2800fe
Add badges to README
unclecode Oct 28, 2024
d9e0b7a
Fix README badge
unclecode Oct 28, 2024
3529c2e
Update new tutorial documents and added to the docs folder.
unclecode Oct 29, 2024
e9f7d5e
Merge branch '0.3.73'
unclecode Oct 29, 2024
df9ee44
build: make requirements more flexible
mjvankampen Oct 30, 2024
605a827
fix dev requirements and lock playwright due to failing tests
mjvankampen Oct 30, 2024
9307c19
Update documents, upload new version of quickstart.
unclecode Oct 30, 2024
982d203
Merge branch '0.3.73'
unclecode Oct 30, 2024
47464ce
Update README
unclecode Oct 30, 2024
cb6f532
Update README
unclecode Oct 30, 2024
e97e8df
Update README: Fix typo in project name
unclecode Oct 30, 2024
19c3f3e
Refactor tutorial markdown files: Update numbering and formatting
unclecode Oct 30, 2024
0a09d78
chore(docs): fix documentation links + markdown lint
timoa Oct 31, 2024
6c7235d
Add mission.md file
unclecode Oct 31, 2024
d8eef02
Add link to mission statement in README
unclecode Oct 31, 2024
492ada0
Add mission diagram to MISSION.md
unclecode Oct 31, 2024
62a86db
Refactor mission section in README and add mission diagram
unclecode Oct 31, 2024
07f508b
Merge pull request #218 from timoa/main
unclecode Nov 3, 2024
de6b43f
Merge pull request #215 from mjvankampen/build/flexible-requirements
unclecode Nov 3, 2024
54d5a3a
Improved database management and error handling, updated README instr…
unclecode Nov 4, 2024
e28c49a
Refactor .gitignore.dev file: Add ignore patterns for various files a…
unclecode Nov 4, 2024
42f1c67
Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0…
unclecode Nov 4, 2024
33d0e9e
Update dev gitignore
unclecode Nov 4, 2024
7b0cca4
Update gitignore
unclecode Nov 4, 2024
fbdf870
Update CHANGELOG
unclecode Nov 4, 2024
be8f4fc
Merge branch '0.3.73' of https://github.com/unclecode/crawl4ai into 0…
unclecode Nov 4, 2024
e6c914d
Refactor version management and remove deprecated gitignore.dev file
unclecode Nov 4, 2024
c4c6227
Creating the API server component
unclecode Nov 4, 2024
0bba0e0
Preventing NoneType has no attribute get Errors
bizrockman Nov 4, 2024
a28046c
Rename episode_08_Media_Handling:_Images,_Videos,_and_Audio.md to epi…
bizrockman Nov 4, 2024
870296f
Rename episode_11_1_Extraction_Strategies:_JSON_CSS.md to episode_11_…
bizrockman Nov 4, 2024
3a3c88a
Rename episode_11_2_Extraction_Strategies:_LLM.md to episode_11_2_Ext…
bizrockman Nov 4, 2024
796dbaf
Rename episode_11_3_Extraction_Strategies:_Cosine.md to episode_11_3_…
bizrockman Nov 4, 2024
67a23c3
feat(core): Release v0.3.73 with Browser Takeover and Docker Support
unclecode Nov 5, 2024
3cf19a1
chore(version): bump version to 0.3.73
unclecode Nov 5, 2024
43a2b26
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Nov 5, 2024
1c20b81
docs(README): update Docker usage instructions and add deployment opt…
unclecode Nov 5, 2024
2a54f3c
refactor(core): remove main_v0.py file and associated functionality
unclecode Nov 5, 2024
1e7db0d
docs(README): update release notes for version 0.3.73 with new featur…
unclecode Nov 5, 2024
b512636
feat(api): add CORS support and static file serving, update root redi…
unclecode Nov 5, 2024
c5aa1be
Merge pull request #229 from bizrockman/main
unclecode Nov 6, 2024
9f5eef1
Refactored the `CustomHTML2Text` class in `content_scrapping_strategy…
unclecode Nov 6, 2024
2879344
Update README.md
devatnull Nov 6, 2024
f757423
Update API server request object. text_docker file and Readme
unclecode Nov 7, 2024
16f9186
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Nov 7, 2024
b120965
Fixed issues with the Manage Browser, including its inability to conn…
unclecode Nov 7, 2024
bcdd809
Remove some old files.
unclecode Nov 8, 2024
f9a297e
Add Docker example script for testing Crawl4AI functionality
unclecode Nov 8, 2024
a098483
Update Roadmap
unclecode Nov 9, 2024
b6d6631
Enhance Async Crawler with Playwright support
unclecode Nov 12, 2024
8c22396
Merge pull request #234 from devatnull/patch-1
unclecode Nov 12, 2024
00026b5
feat(config): Adding a configurable way of setting the cache director…
Nov 12, 2024
bf91adf
fix: Resolve unexpected BrowserContext closure during crawl in Docker
unclecode Nov 13, 2024
61b93eb
Update change log
unclecode Nov 13, 2024
38044d4
Merge pull request #255 from maheshpec/feature/configure-cache-directory
unclecode Nov 13, 2024
c38ac29
perf(crawler): major performance improvements & raw HTML support
unclecode Nov 13, 2024
17913f5
feat(crawler): support local files and raw HTML input in AsyncWebCrawler
unclecode Nov 13, 2024
3d00fee
- In this commit, the library is updated to process file downloads. U…
unclecode Nov 14, 2024
7f1ae5a
Update changelog
unclecode Nov 14, 2024
1f269f9
test(content_filter): add comprehensive tests for BM25ContentFilter f…
unclecode Nov 15, 2024
ae7ebc0
chore: update .gitignore and enhance changelog with major feature add…
unclecode Nov 15, 2024
60670b2
Merge pull request #7 from aravindkarnam/main
aravindkarnam Nov 15, 2024
d0014c6
New async database manager and migration support
unclecode Nov 16, 2024
5098442
refactor: migrate versioning to __version__.py and remove deprecated …
unclecode Nov 16, 2024
90df692
feat(crawl_sync): add synchronous crawl endpoint and corresponding test
unclecode Nov 16, 2024
e62c807
feat(deploy): add Railway deployment configuration and setup instruct…
unclecode Nov 16, 2024
f77f06a
feat(deploy): add deployment configuration and templates for crawl4ai
unclecode Nov 16, 2024
fca1319
feat(docker): add MkDocs installation and build step for documentation
unclecode Nov 16, 2024
6f2fe59
feat(deploy): update instance size to professional-xs and add memory …
unclecode Nov 16, 2024
6b569cc
feat(deploy): update branch to 0.3.74 and change instance size to bas…
unclecode Nov 16, 2024
67edc2d
feat(deploy): update instance size to professional-xs and add memory …
unclecode Nov 16, 2024
5d0b132
feat(deploy): change instance size to professional-xs and update memo…
unclecode Nov 16, 2024
79feab8
refactor(deploy): remove memory utilization alert configuration from …
unclecode Nov 16, 2024
1961adb
refactor(docker): remove shared memory size configuration to streamli…
unclecode Nov 16, 2024
6360d05
feat(api): add API token authentication and update Dockerfile descrip…
unclecode Nov 16, 2024
9139ef3
feat(docker): update Dockerfile for improved installation process and…
unclecode Nov 16, 2024
4b45b28
feat(docs): enhance deployment documentation with one-click setup, AP…
unclecode Nov 16, 2024
3a66aa8
feat(cache): introduce CacheMode and CacheContext for enhanced cachin…
unclecode Nov 17, 2024
3a524a3
fix(docs): remove unnecessary blank line in README for improved reada…
unclecode Nov 17, 2024
2a82455
feat(crawl): implement direct crawl functionality and introduce Cache…
unclecode Nov 17, 2024
f9fe6f8
feat(database): implement version management and migration checks dur…
unclecode Nov 17, 2024
a59c107
Update changelog for 0.3.74
unclecode Nov 17, 2024
df63a40
feat(docs): update examples and documentation to replace bypass_cache…
unclecode Nov 17, 2024
152ac35
feat(docs): update README for version 0.3.74 with new features and im…
unclecode Nov 17, 2024
852729f
feat(docker): add Docker Compose configurations for local and hub dep…
unclecode Nov 18, 2024
b6af94c
Merge remote-tracking branch 'origin/main' into 0.3.74
unclecode Nov 18, 2024
73658c7
chore: update .gitignore to include manage-collab.sh
unclecode Nov 19, 2024
593c7ad
test: trying to push to main
Nov 19, 2024
3aae30e
test1: trying to push to main
Nov 19, 2024
2f19d38
Update .gitignore to include .gitboss/ and todo_executor.md
unclecode Nov 19, 2024
788c67c
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Nov 19, 2024
fbcff85
Remove test files
unclecode Nov 19, 2024
a6dad3f
test: trying to push to 0.3.74
Nov 19, 2024
f2cb7d5
Delete test3.txt
unclecode Nov 19, 2024
b654c49
Update .gitignore to exclude additional scripts and files
unclecode Nov 19, 2024
2bdec1f
chore: add manage-collab.sh to .gitignore
unclecode Nov 19, 2024
7047422
Merge branch '0.3.74' of https://github.com/unclecode/crawl4ai into 0…
unclecode Nov 19, 2024
d418a04
Fix #260 prevent pass duplicated kwargs to scrapping_strategy (#269)
darwing1210 Nov 20, 2024
3439f78
fix: crawler strategy exception handling and fixes (#271)
NanmiCoder Nov 20, 2024
dbb751c
In this commit, we introduce the new concept of MakrdownGenerationStr…
unclecode Nov 21, 2024
006bee4
feat: enhance image processing capabilities
unclecode Nov 22, 2024
571dda6
Update Readme
unclecode Nov 22, 2024
24ad2fe
feat: enhance Markdown generation to include fit_html attribute
unclecode Nov 22, 2024
e02935d
chore: update README to reflect new features and improvements in vers…
unclecode Nov 22, 2024
8dea3f4
chore: update README to include new features and improvements for ver…
unclecode Nov 22, 2024
a5decaa
Merge branch '0.3.74'
unclecode Nov 22, 2024
d7a112f
Merge branch 'main' of https://github.com/unclecode/crawl4ai
unclecode Nov 22, 2024
0d0cef3
feat: add enhanced markdown generation example with citations and fil…
unclecode Nov 22, 2024
c179703
Fixed a few bugs, import errors and changed to asyncio wait_for inste…
aravindkarnam Nov 23, 2024
f8e85b1
Fixed a bug in _process_links, handled condition for when url_scorer …
aravindkarnam Nov 23, 2024
3d52b55
Merge pull request #8 from aravindkarnam/main
aravindkarnam Nov 23, 2024
2226ef5
fix: Exempting the start_url from can_process_url
aravindkarnam Nov 23, 2024
b13fd71
chore: 1. Expose process_external_links as a param
aravindkarnam Nov 26, 2024
ee3001b
fix: moved depth as a param to can_process_url and applying filter ch…
aravindkarnam Nov 26, 2024
a98d51a
Remove the can_process_url check from _process_links since it's alrea…
aravindkarnam Nov 26, 2024
a888c91
Fix "Future attached to a different loop" error by ensuring tasks are…
aravindkarnam Nov 26, 2024
155c756
<Future pending> issue fix was incorrect. Reverting
aravindkarnam Nov 26, 2024
9530ded
fixed the final scraper_quickstart.py example
aravindkarnam Nov 26, 2024
ff731e4
fixed the final scraper_quickstart.py example
aravindkarnam Nov 26, 2024
2f5e059
updated definition of can_process_url to include dept as an argument,…
aravindkarnam Nov 26, 2024
19 changes: 19 additions & 0 deletions .do/app.yaml
@@ -0,0 +1,19 @@
alerts:
- rule: DEPLOYMENT_FAILED
- rule: DOMAIN_FAILED
name: crawl4ai
region: nyc
services:
- dockerfile_path: Dockerfile
  github:
    branch: 0.3.74
    deploy_on_push: true
    repo: unclecode/crawl4ai
  health_check:
    http_path: /health
  http_port: 11235
  instance_count: 1
  instance_size_slug: professional-xs
  name: web
  routes:
  - path: /
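
The spec above points the platform health check at `/health` on port 11235. As a rough illustration, that endpoint could be probed by hand as sketched below. The diff does not show the handler, so the host name and the "HTTP 200 means healthy" criterion are assumptions, and `health_url` and `is_healthy` are hypothetical helper names rather than anything in crawl4ai.

```python
import http.client
import urllib.error
import urllib.request

# Values copied from .do/app.yaml; the host passed in by the caller is an assumption.
HEALTH_PATH = "/health"
HTTP_PORT = 11235


def health_url(host: str, port: int = HTTP_PORT, path: str = HEALTH_PATH) -> str:
    """Build the URL the health check would request for a given host."""
    return f"http://{host}:{port}{path}"


def is_healthy(host: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 (assumed success criterion)."""
    try:
        with urllib.request.urlopen(health_url(host), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx status, or a malformed response.
        return False


if __name__ == "__main__":
    print(health_url("localhost"))
```

Running `is_healthy("localhost")` against a locally started container would be one way to check the port and path before pushing to the deploy branch.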
22 changes: 22 additions & 0 deletions .do/deploy.template.yaml
@@ -0,0 +1,22 @@
spec:
  name: crawl4ai
  services:
    - name: crawl4ai
      git:
        branch: 0.3.74
        repo_clone_url: https://github.com/unclecode/crawl4ai.git
      dockerfile_path: Dockerfile
      http_port: 11235
      instance_count: 1
      instance_size_slug: professional-xs
      health_check:
        http_path: /health
      envs:
        - key: INSTALL_TYPE
          value: "basic"
        - key: PYTHON_VERSION
          value: "3.10"
        - key: ENABLE_GPU
          value: "false"
      routes:
        - path: /
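
The three `envs` values in the template are quoted, so they reach the build as strings (unquoted, `3.10` would be read by a YAML parser as the float 3.1). A minimal sketch of how a consumer might interpret them follows. How the Dockerfile actually uses these variables is not shown in this diff, so `BuildSettings`, `load_settings`, and the fallback defaults are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BuildSettings:
    """Hypothetical container for the values set in .do/deploy.template.yaml."""
    install_type: str
    python_version: str
    enable_gpu: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> BuildSettings:
    """Read the settings from a mapping, defaulting to the process environment."""
    env = os.environ if env is None else env
    return BuildSettings(
        # Defaults mirror the template values; that choice is an assumption.
        install_type=env.get("INSTALL_TYPE", "basic"),
        # Kept as a string so "3.10" is not collapsed to 3.1.
        python_version=env.get("PYTHON_VERSION", "3.10"),
        # Only the literal text "true" (any case) enables the flag.
        enable_gpu=env.get("ENABLE_GPU", "false").strip().lower() == "true",
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing an explicit mapping instead of relying on `os.environ` keeps the parsing easy to check in isolation.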
7 changes: 6 additions & 1 deletion .gitignore
@@ -199,6 +199,7 @@ test_env/
**/.DS_Store

todo.md
todo_executor.md
git_changes.py
git_changes.md
pypi_build.sh
@@ -208,4 +209,8 @@ git_issues.md
.tests/
.issues/
.docs/
.issues/
.issues/
.gitboss/
todo_executor.md
protect-all-except-feature.sh
manage-collab.sh