A Python web crawler script that explores web pages, finds video files, and downloads them. The script supports customising the HTML tags and attributes used to discover videos and links, and it can explore URLs in parallel.
- Download Videos: Finds and downloads video files from specified HTML tags and attributes.
- Explore Links: Follows links on the page to find more videos, with support for recursive crawling.
- Parallel Processing: Uses threading to explore and download in parallel for faster execution.
- Configurable: Allows customisation of HTML tags and attributes for video sources and links.
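As a rough illustration of the recursive, parallel exploration described above, here is a minimal sketch that walks a link graph level by level with a thread pool. It is not the script's actual implementation: `FAKE_SITE`, `fetch_links`, and `crawl` are hypothetical names, the in-memory dictionary stands in for real HTTP requests, and the meaning of "depth" (number of link hops from the start URL) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "site": page URL -> links found on that page.
# A real crawler would fetch each page over HTTP instead.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": ["https://example.com/d"],
    "https://example.com/d": [],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_SITE.get(url, [])


def crawl(start_url, max_depth=2, max_workers=4):
    """Return every URL reachable within max_depth link hops of start_url."""
    visited = {start_url}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(max_depth):
            if not frontier:
                break
            next_frontier = []
            # Pages on the same level are fetched in parallel; pool.map
            # yields results in input order, so the merge below is
            # done by a single thread and needs no locking.
            for links in pool.map(fetch_links, frontier):
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return visited


if __name__ == "__main__":
    for url in sorted(crawl("https://example.com/", max_depth=2)):
        print(url)
```

Tracking visited URLs in a set keeps the crawler from looping when pages link back to each other, and processing one level at a time makes the depth limit easy to enforce.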
- Python 3.6 or higher
- requests library
- beautifulsoup4 library
- Clone the Repository
  git clone https://github.com/pandaind/web-crawler-downloader.git
  cd web-crawler-downloader
- Install Required Libraries
  You can install the necessary Python libraries using pip:
  pip install -r requirements.txt
./run.sh
To run the web crawler script, use the following command:
python crawler.py [START_URL] [FOLDER_PATH] [KEYWORDS] [--max_depth MAX_DEPTH] [--download_tag TAG:ATTRIBUTE] [--explore_tag TAG:ATTRIBUTE]
- START_URL: The starting URL for the web crawler.
- FOLDER_PATH: The folder path where downloaded videos will be saved.
- KEYWORDS: Space-separated keywords to filter links and videos.
- --max_depth: (Optional) Maximum depth to crawl. Default is 2.
- --download_tag: (Optional) Tag and attribute used to find video sources (format: tag:attribute). Default is source:src.
- --explore_tag: (Optional) Tag and attribute used to find links to explore (format: tag:attribute). Default is a:href.
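For readers curious how a command line like this could be parsed, the following is a minimal sketch using the standard library's argparse. It mirrors the arguments and defaults listed above, but it is not taken from crawler.py: the helper names `tag_attr` and `build_parser` are hypothetical, and treating KEYWORDS as one or more separate arguments is an assumption.

```python
import argparse


def tag_attr(value):
    """Split a 'tag:attribute' string into a (tag, attribute) tuple."""
    tag, sep, attr = value.partition(":")
    if not sep or not tag or not attr:
        raise argparse.ArgumentTypeError(
            f"expected tag:attribute, got {value!r}"
        )
    return tag, attr


def build_parser():
    """Build a parser that mirrors the documented arguments (a sketch)."""
    parser = argparse.ArgumentParser(prog="crawler.py")
    parser.add_argument("start_url", help="starting URL for the crawler")
    parser.add_argument("folder_path", help="folder for downloaded videos")
    # Assumption: keywords are passed as one or more separate arguments.
    parser.add_argument("keywords", nargs="+", help="keywords to filter on")
    parser.add_argument("--max_depth", type=int, default=2)
    parser.add_argument("--download_tag", type=tag_attr,
                        default=("source", "src"))
    parser.add_argument("--explore_tag", type=tag_attr,
                        default=("a", "href"))
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Validating the tag:attribute format in a dedicated type function means a malformed value is reported by argparse as a usage error before any crawling starts.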
To start crawling from https://example.com and download videos to the ./videos folder, with a maximum depth of 3 and using default tags and attributes:
python crawler.py https://example.com ./videos "video" --max_depth 3
To specify custom tags and attributes for finding videos and links:
python crawler.py https://example.com ./videos "video" --max_depth 3 --download_tag "source:src" --explore_tag "a:href"
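To make the tag:attribute idea concrete, here is a small sketch of how such a setting could drive URL discovery. The project lists beautifulsoup4 as a dependency; this example deliberately substitutes the standard library's html.parser so it runs without extra installs. `AttrCollector`, `find_urls`, and `PAGE` are hypothetical names, and the keyword filter (a case-insensitive substring match on the URL) is an assumption about how filtering might work.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AttrCollector(HTMLParser):
    """Collect the values of one attribute on one kind of tag."""

    def __init__(self, tag, attr):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == self.tag:
            for name, value in attrs:
                if name == self.attr and value:
                    self.values.append(value)


def find_urls(html, base_url, spec, keywords=()):
    """Return absolute URLs matching a 'tag:attribute' spec and keywords."""
    tag, attr = spec.split(":", 1)
    collector = AttrCollector(tag, attr)
    collector.feed(html)
    urls = [urljoin(base_url, value) for value in collector.values]
    if keywords:
        # Assumption: keep a URL if any keyword appears in it.
        urls = [u for u in urls
                if any(k.lower() in u.lower() for k in keywords)]
    return urls


# Hypothetical page content used only for illustration.
PAGE = """
<video><source src="/media/video-intro.mp4"></video>
<a href="/video/page2">More</a>
<a href="/about">About</a>
"""

if __name__ == "__main__":
    print(find_urls(PAGE, "https://example.com", "source:src"))
    print(find_urls(PAGE, "https://example.com", "a:href", ["video"]))
```

Resolving each value with urljoin turns relative paths into absolute URLs, so the same function can serve both the download step and the link-exploration step.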
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a feature branch (git checkout -b feature/YourFeature).
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature/YourFeature).
- Create a new Pull Request.
This project is licensed under the MIT License.