- Python 3.8+
- pip
- venv (virtual environment module)
```bash
# Create a virtual environment
python3 -m venv webcrawler-env

# Activate the virtual environment
# On macOS and Linux:
source webcrawler-env/bin/activate

# On Windows:
webcrawler-env\Scripts\activate
```
```bash
# Ensure you're in the project directory and the virtual environment is active
pip3 install -r requirements.txt
```

```bash
python3 -m app.main
```

Begins crawling the configured e-commerce domains.

```bash
python3 -m app.update_result
```

Synchronizes and appends results from the cache.

```bash
python3 -m app.clear_cache
```

Removes all cached crawling data.
When you're done working on the project:

```bash
deactivate
```
The crawler is configured through `config.json`. Each entry in the `domains` array supports the following keys:

- `domain_url` (required): Base URL of the e-commerce website
- `max_page_visits_in_iteration`: Maximum number of pages to visit in one crawling iteration
- `purchase_button_keywords`: Array of keywords found on product purchase buttons
- `product_page_regex_patterns`: Optional regex patterns to identify product pages
```json
{
  "domains": [
    {
      "purchase_button_keywords": [
        "add to cart",
        "add to bag",
        "buy now",
        "add to basket"
      ],
      "search_placeholder_keywords": [
        "search",
        "Search",
        "SEARCH"
      ],
      "max_page_visits_in_iteration": 100,
      "product_page_regex_patterns": [],
      "next_page_button_keywords": ["next"],
      "domain_url": "www.amazon.in"
    }
  ]
}
```
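For reference, a minimal sketch of how a configuration like this might be loaded and validated in Python. The file path and key names follow the example above, but the `load_config` helper itself is illustrative, not part of the project's actual code:

```python
import json

REQUIRED_KEYS = {"domain_url"}  # per the docs, only domain_url is required


def load_config(path="config.json"):
    """Load the crawler configuration and check required keys (illustrative)."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    for domain in config["domains"]:
        missing = REQUIRED_KEYS - domain.keys()
        if missing:
            raise ValueError(f"domain entry missing required keys: {missing}")
    return config


if __name__ == "__main__":
    cfg = load_config()
    print([d["domain_url"] for d in cfg["domains"]])
```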
Results are written to a CSV file:

- Filename: `final.csv`
- Columns:
  - `domain_url`: Website domain
  - `product_url`: URL of a discovered product page
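As an illustration, the output can be consumed with Python's standard `csv` module. This snippet assumes `final.csv` sits in the working directory and has the two columns listed above:

```python
import csv

# Read the crawler output and group product URLs by domain (illustrative)
with open("final.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)  # expects domain_url and product_url columns
    products_by_domain = {}
    for row in reader:
        products_by_domain.setdefault(row["domain_url"], []).append(row["product_url"])

for domain, urls in products_by_domain.items():
    print(f"{domain}: {len(urls)} product pages")
```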
- By default, the crawler visits 100 pages per domain
- This can be customized via the `max_page_visits_in_iteration` key in the configuration (see the sketch below)
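A sketch of how that default might be applied when the key is absent; the helper below is a hypothetical illustration of the behavior, not the crawler's actual internals:

```python
# Hypothetical: resolve the per-domain page budget with the documented default
def page_budget(domain_config: dict) -> int:
    return domain_config.get("max_page_visits_in_iteration", 100)


print(page_budget({"domain_url": "www.amazon.in"}))  # -> 100 (default)
print(page_budget({"domain_url": "www.amazon.in",
                   "max_page_visits_in_iteration": 250}))  # -> 250
```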
In summary, the workflow is:

- Create the virtual environment
- Activate the environment
- Install the requirements
- Configure `config.json`
- Run the crawler
- Deactivate the environment when done
A few things to keep in mind:

- Always activate the virtual environment before working on the project
- Ensure `config.json` is properly configured before running the crawler
- Adjust `purchase_button_keywords` and `product_page_regex_patterns` to match the structure of each target e-commerce site (see the example below)
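To make the last point concrete, here is a hedged example of what `product_page_regex_patterns` entries might look like and how such patterns could be applied. The patterns shown are hypothetical and would need to be tailored to the target site's URL scheme:

```python
import re

# Hypothetical patterns: many storefronts embed a product id in the URL path,
# e.g. an Amazon-style "/dp/<ASIN>" or a generic "/product/<slug>" segment.
product_page_regex_patterns = [
    r"/dp/[A-Z0-9]{10}",
    r"/product/[\w-]+",
]


def looks_like_product_page(url: str) -> bool:
    """Return True if any configured pattern matches the URL (illustrative)."""
    return any(re.search(p, url) for p in product_page_regex_patterns)


print(looks_like_product_page("https://www.amazon.in/dp/B0ABCDEFGH"))  # True
print(looks_like_product_page("https://www.amazon.in/gp/help"))        # False
```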