This project is a web scraper for Books to Scrape, an online bookstore built for testing web scraping scripts. The scraper downloads HTML pages from the site, extracts product information such as book titles, prices, and URLs, and saves the data to a CSV file.
- Download HTML Pages: Retrieve pages from the bookstore site based on category and pagination.
- Extract Product Information: Parse downloaded pages to extract book details.
- Save to CSV: Export the extracted data into a CSV file for further analysis.
- Python 3.x
- Dependencies:
  - playwright
  - beautifulsoup4
  - pandas
Install dependencies using `pip install playwright beautifulsoup4 pandas`.
Playwright also needs its browser binaries; if they are not installed yet, run `playwright install`.
**bookstore_page_downloader.py**
This script downloads the HTML pages for a given book category and saves them locally.

**main.py**
This script processes the downloaded HTML files to extract product details and saves them to a CSV file.
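
As a rough sketch of what the downloader does, the snippet below builds one listing URL per page and saves each rendered page with Playwright. The URL layout (`index.html` for the first page, `page-N.html` after it), the `html` output folder, the file naming, and the helper names `page_url` and `download_pages` are assumptions for illustration, not taken from the script itself.

```python
from pathlib import Path

# Assumed URL layout for Books to Scrape category listings.
BASE_URL = "https://books.toscrape.com/catalogue/category/books"


def page_url(query: str, page: int) -> str:
    """Build the listing URL for one page of a category (hypothetical helper)."""
    name = "index.html" if page == 1 else f"page-{page}.html"
    return f"{BASE_URL}/{query}/{name}"


def download_pages(query: str, page_from: int, page_to: int, out_dir: str = "html") -> list:
    """Save pages page_from..page_to (inclusive) as local HTML files."""
    # Imported here so the URL helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    target = Path(out_dir)  # assumed output folder
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        for n in range(page_from, page_to + 1):
            page.goto(page_url(query, n))
            path = target / f"{query}_page_{n}.html"  # assumed file naming
            path.write_text(page.content(), encoding="utf-8")
            saved.append(path)
        browser.close()
    return saved
```

Under these assumptions, `download_pages("nonfiction_13", 1, 2)` would fetch the first two listing pages of the default category.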
- Download Pages: Modify the `query`, `page_from`, and `page_to` parameters in `bookstore_page_downloader.py` to specify the book category and page range to download, then run `python bookstore_page_downloader.py`.
- Extract Data: Modify the `query` and `source_dir` parameters in `main.py` to specify the category and location of the downloaded HTML files, then run `python main.py`.
- The resulting CSV file will be saved in the project directory.
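
The extraction step might look roughly like the sketch below, which parses one downloaded listing page with BeautifulSoup. It assumes the markup Books to Scrape uses for listings (an `article.product_pod` card holding an `h3 a` link and a `p.price_color` element); the function name `extract_books` and the sample HTML are hypothetical and not taken from `main.py`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_books(html: str, page_url: str) -> list:
    """Return one dict (title, price, url) per book card found in the page."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for card in soup.select("article.product_pod"):
        link = card.select_one("h3 a")
        price = card.select_one("p.price_color")
        if link is None or price is None:
            continue  # skip cards that do not match the assumed markup
        books.append({
            # The visible link text may be truncated, so prefer the title attribute.
            "title": link.get("title") or link.get_text(strip=True),
            "price": price.get_text(strip=True),
            # Listing pages use relative links; resolve them against the page URL.
            "url": urljoin(page_url, link.get("href", "")),
        })
    return books


# Made-up sample card in the assumed markup, for illustration only.
SAMPLE = """
<article class="product_pod">
  <h3><a href="../../../sample-book_1/index.html" title="Sample Book">Sample ...</a></h3>
  <p class="price_color">£10.00</p>
</article>
"""

rows = extract_books(
    SAMPLE,
    "https://books.toscrape.com/catalogue/category/books/nonfiction_13/index.html",
)
```

Each resulting row is a plain dict, so a list of rows can be passed straight to `pandas.DataFrame` before exporting.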
- Ensure the `export` folder exists in the project root or specify a valid path.
- The script works with the `nonfiction_13` category by default. Adjust the `query` parameter for other categories.
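
One way to avoid the missing-folder problem is to create the folder just before writing, as in the minimal sketch below. It uses the standard library's `csv` module rather than pandas so it runs without extra packages; the function name `save_books_csv`, the column names, and the `<query>.csv` file name are assumptions, not the script's actual behavior.

```python
import csv
from pathlib import Path


def save_books_csv(rows, query, export_dir="export"):
    """Write rows (dicts with title, price, url) to <export_dir>/<query>.csv."""
    folder = Path(export_dir)
    # Create the folder if it is missing instead of failing on write.
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{query}.csv"  # assumed file naming
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["title", "price", "url"])
        writer.writeheader()
        writer.writerows(rows)
    return path
```

With pandas, `pandas.DataFrame(rows).to_csv(path, index=False)` would write an equivalent file once the folder exists.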