This project involves extracting data about the largest universities from Wikipedia, saving the data in CSV and JSON formats, and loading it into various databases for further analysis. The main features include:
- Data extraction from Wikipedia using BeautifulSoup.
- Data storage in CSV and JSON formats.
- Data cleaning of the extracted data.
- Loading cleaned data into MySQL, PostgreSQL, and SQL Server databases.
- Programming Language: Python
- Libraries: requests, BeautifulSoup, pandas, csv, json
- IDE: VS Code
- Databases: MySQL, PostgreSQL, SQL Server
- Python: Make sure Python is installed on your system. You can download it from python.org.
- Virtual Environment: It is recommended to use a virtual environment to manage dependencies.
- Database Setup: Ensure MySQL, PostgreSQL, and SQL Server are installed and running on your system.
- Clone the Repository
git clone https://github.com/OlawumiSalaam/wikipedia-scraping-project.git
cd wikipedia-scraping-project
- Set Up Virtual Environment
python -m venv venv
source venv/bin/activate   # On Windows use venv\Scripts\activate
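- Install Dependencies
A requirements.txt is assumed here; if the repository does not ship one, install the libraries listed above (requests, beautifulsoup4, pandas) and the database drivers used by the load scripts individually.
pip install -r requirements.txt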
- Database Configuration
Create a database named universities in each of MySQL, PostgreSQL, and SQL Server.
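For example, running CREATE DATABASE universities; from each system's own command-line client (mysql, psql, or sqlcmd) is sufficient; any other admin tool works as well.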
- Environment Variables
Create a .env file in the project root and add your database configuration; a sketch of how the scripts might read these values follows the list below.
MYSQL_USER=<your_mysql_username>
MYSQL_PASSWORD=<your_mysql_password>
MYSQL_DB=<your_mysql_database>
MYSQL_HOST=<your_mysql_host>
POSTGRES_USER=<your_postgres_username>
POSTGRES_PASSWORD=<your_postgres_password>
POSTGRES_DB=<your_postgres_database>
POSTGRES_HOST=<your_postgres_host>
SQLSERVER_USER=<your_sqlserver_username>
SQLSERVER_PASSWORD=<your_sqlserver_password>
SQLSERVER_DB=<your_sqlserver_database>
SQLSERVER_HOST=<your_sqlserver_host>
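As a rough illustration of how the scripts might consume these variables, the sketch below loads the .env file and builds SQLAlchemy connection URLs. python-dotenv, SQLAlchemy, and the driver names (pymysql, psycopg2, pyodbc) are assumptions for this example, not necessarily what the repository's load scripts actually use.

```python
# Hypothetical sketch: read the .env values and build SQLAlchemy connection URLs.
# python-dotenv, the SQLAlchemy URL formats, and the driver names are assumptions.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the project root / current directory

MYSQL_URL = (
    f"mysql+pymysql://{os.getenv('MYSQL_USER')}:{os.getenv('MYSQL_PASSWORD')}"
    f"@{os.getenv('MYSQL_HOST')}/{os.getenv('MYSQL_DB')}"
)
POSTGRES_URL = (
    f"postgresql+psycopg2://{os.getenv('POSTGRES_USER')}:{os.getenv('POSTGRES_PASSWORD')}"
    f"@{os.getenv('POSTGRES_HOST')}/{os.getenv('POSTGRES_DB')}"
)
SQLSERVER_URL = (
    f"mssql+pyodbc://{os.getenv('SQLSERVER_USER')}:{os.getenv('SQLSERVER_PASSWORD')}"
    f"@{os.getenv('SQLSERVER_HOST')}/{os.getenv('SQLSERVER_DB')}"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
```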
Extract the data from Wikipedia:
python extract.py
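extract.py is not reproduced here. As a hedged sketch of the kind of logic it likely contains, the example below fetches a Wikipedia list of largest universities, locates the first wikitable with BeautifulSoup, and saves it as CSV and JSON with pandas; the exact URL, table selection, and output file names are assumptions.

```python
# Illustrative sketch only: the URL, table selection, and file names are assumptions.
# Requires requests, beautifulsoup4, pandas, and an HTML parser backend such as lxml.
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_largest_universities_and_university_systems_by_enrollment"

response = requests.get(URL, timeout=30)
response.raise_for_status()

# Locate the first enrollment table on the page.
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find("table", {"class": "wikitable"})

# Parse the HTML table into a DataFrame and save it in both formats.
df = pd.read_html(StringIO(str(table)))[0]
df.to_csv("universities.csv", index=False)
df.to_json("universities.json", orient="records", indent=2)
```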
Clean the extracted data:
python data_cleaning.py
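The exact steps in data_cleaning.py are not shown here; a typical pandas cleaning pass over the scraped table might look like the sketch below. The column names and file paths are assumptions.

```python
# Illustrative cleaning sketch: column names and file paths are assumptions.
import re

import pandas as pd

df = pd.read_csv("universities.csv")

# Normalize column names: lowercase with underscores instead of punctuation.
df.columns = [re.sub(r"\W+", "_", str(c).strip().lower()).strip("_") for c in df.columns]

# Strip footnote markers such as "[1]" and thousands separators, then convert
# the enrollment figures to numbers (assuming an "enrollment" column exists).
if "enrollment" in df.columns:
    df["enrollment"] = (
        df["enrollment"]
        .astype(str)
        .str.replace(r"\[.*?\]", "", regex=True)
        .str.replace(",", "", regex=False)
    )
    df["enrollment"] = pd.to_numeric(df["enrollment"], errors="coerce")

# Drop exact duplicates and fully empty rows, then save the cleaned data.
df = df.drop_duplicates().dropna(how="all")
df.to_csv("universities_clean.csv", index=False)
```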
Load the data into MySQL:
python mysql-load.py
Load the data into PostgreSQL:
python postgres-load.py
Load the data into SQL Server:
python sqlserver-load.py
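The three load scripts are not reproduced here. A common pattern is to read the cleaned CSV and write it through a SQLAlchemy engine with pandas' to_sql, as sketched below for MySQL; the PostgreSQL and SQL Server versions would differ only in the connection URL (see the .env section above). SQLAlchemy, pymysql, the table name, and the CSV path are assumptions.

```python
# Illustrative loading sketch (MySQL shown): SQLAlchemy, pymysql, the table name,
# and the CSV path are assumptions, not necessarily what mysql-load.py uses.
import os

import pandas as pd
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()

url = (
    f"mysql+pymysql://{os.getenv('MYSQL_USER')}:{os.getenv('MYSQL_PASSWORD')}"
    f"@{os.getenv('MYSQL_HOST')}/{os.getenv('MYSQL_DB')}"
)
engine = create_engine(url)

# Write the cleaned data; replace the table if it already exists.
df = pd.read_csv("universities_clean.csv")
df.to_sql("largest_universities", engine, if_exists="replace", index=False)
```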
For an interactive view of the data, open Jupyter Notebook and run the cells in the notebooks provided in the notebooks directory.
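Assuming Jupyter is installed in the active virtual environment, the notebook server can be started from the project root with:
jupyter notebook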