PrivacyCourseExtraction

This repository contains scripts to extract privacy-related content from university course listings. The process involves web scraping to gather HTML pages of course listings, downloading linked pages, and analyzing the content for privacy-related keywords.

Repository Structure

webScraper.py: A script to perform Google searches for university course listings and save the HTML pages.
downloadHop1Links.py: A script to download all linked pages from the main course listing pages.
findPrivacyContent.py: A script to analyze the downloaded HTML pages for privacy-related content and save the results to CSV files.

Prerequisites

Python 3.6+
Selenium
BeautifulSoup
Pandas
Requests
WebDriver Manager for Chrome
ChromeDriver

Setup

Install required Python packages: pip install selenium beautifulsoup4 pandas requests webdriver-manager chromedriver-autoinstaller
Ensure ChromeDriver is installed and compatible with your version of Chrome. The scripts use chromedriver-autoinstaller to handle this automatically.
Directory Structure: Ensure the courseListings/ directory exists in the root of your repository. This is where HTML files will be saved and read from.
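If you would rather create the directory from Python than by hand, a minimal sketch follows. The helper name ensure_listings_dir is hypothetical and is not part of the repository's scripts.

```python
from pathlib import Path


def ensure_listings_dir(repo_root: str = ".") -> Path:
    """Create courseListings/ under repo_root if it is missing and return its path."""
    listings = Path(repo_root) / "courseListings"
    # exist_ok=True makes the call safe to repeat on every run.
    listings.mkdir(parents=True, exist_ok=True)
    return listings
```

Calling ensure_listings_dir() from the repository root before running the scripts should be enough to satisfy this requirement.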

Usage

Step 1: Scrape University Course Listings

Run webScraper.py to perform Google searches for the list of universities and save the HTML pages of the course listings.

This script:

Reads a list of university names.
Searches for the undergraduate computer science courses page for each university on Google.
Saves the HTML of the first search result to the courseListings/ directory.
Updates the universityLinkMapping.json with the URLs of the course listings.
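The save-and-record part of the steps above can be sketched as follows. This is an illustration only, not the code in webScraper.py: the function names, the file-naming scheme, and the assumption that universityLinkMapping.json is a flat object mapping university name to URL are all hypothetical. The Google search and page fetch are left out.

```python
import json
import re
from pathlib import Path


def slugify(name: str) -> str:
    """Turn a university name into a filesystem-friendly stem (hypothetical scheme)."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")


def save_listing(university: str, url: str, html: str,
                 root: str = "courseListings",
                 mapping_file: str = "universityLinkMapping.json") -> Path:
    """Write the listing HTML to disk and record its URL in the mapping file."""
    root_path = Path(root)
    root_path.mkdir(parents=True, exist_ok=True)
    html_path = root_path / f"{slugify(university)}.html"
    html_path.write_text(html, encoding="utf-8")

    # Assumed layout: {"<university name>": "<course listing URL>", ...}
    mapping_path = Path(mapping_file)
    mapping = {}
    if mapping_path.exists():
        mapping = json.loads(mapping_path.read_text(encoding="utf-8"))
    mapping[university] = url
    mapping_path.write_text(json.dumps(mapping, indent=2, sort_keys=True),
                            encoding="utf-8")
    return html_path
```

Reading the existing mapping before writing keeps earlier entries intact when the scraper is re-run for a subset of universities.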

Step 2: Download Linked Pages

Run downloadHop1Links.py to download all linked pages from the main course listing pages saved in Step 1.

This script:

Reads the main course listing HTML files.
Extracts all links from each main page.
Downloads the content of each link and saves it to the respective university directory under courseListings/.
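The link-extraction step might look roughly like the sketch below. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages; downloadHop1Links.py itself may be implemented differently. The names LinkCollector and extract_hop1_links are hypothetical, and the filtering rules (HTTP(S) only, no duplicates, no self-links) are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute http(s) URLs from <a href="..."> tags, in page order."""

    def __init__(self, base_url: str):
        super().__init__()
        # Drop any fragment so same-page anchors can be recognized and skipped.
        self.base_url = urldefrag(base_url)[0]
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and strip "#fragment".
        absolute = urldefrag(urljoin(self.base_url, href))[0]
        if not absolute.startswith(("http://", "https://")):
            return  # skip mailto:, javascript:, and similar schemes
        if absolute == self.base_url or absolute in self.links:
            return  # skip self-links and duplicates
        self.links.append(absolute)


def extract_hop1_links(html: str, base_url: str) -> list:
    """Return the de-duplicated hop-1 links found in one course listing page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Each returned URL could then be fetched (for example with requests.get) and written to that university's directory under courseListings/.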

Step 3: Find Privacy-Related Content

Run findPrivacyContent.py to analyze the downloaded HTML pages for privacy-related content and save the results to CSV files.

This script:

Reads the main and linked HTML files.
Searches for privacy-related keywords in the content.
Saves the extracted content to privacyContent.csv, privacyRelatedTitleContent.csv, and universitiesWithNoPrivacyRelatedContent.csv.
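As a rough illustration of the keyword search and CSV output, consider the sketch below. The keyword list, the function names, and the CSV column names are all assumptions: this README does not list the keywords or the columns that findPrivacyContent.py actually uses. The sketch works on text that has already been extracted from the HTML.

```python
import csv
import re

# Hypothetical keyword list, for illustration only.
PRIVACY_KEYWORDS = ["privacy", "data protection", "surveillance"]
PATTERN = re.compile("|".join(re.escape(k) for k in PRIVACY_KEYWORDS),
                     re.IGNORECASE)


def find_privacy_snippets(text: str) -> list:
    """Return the lines of text that mention any keyword (case-insensitive)."""
    return [line.strip() for line in text.splitlines() if PATTERN.search(line)]


def write_results(rows, csv_path: str) -> None:
    """Write (university, content) pairs to a CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["university", "content"])  # assumed column names
        writer.writerows(rows)
```

A plain keyword match like this also picks up unrelated lines such as a site's "Privacy Policy" footer link, so its output is a starting point rather than a clean list of courses.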

Note:

First, extract the university course catalog pages using webScraper.py.
Next, run downloadHop1Links.py to download all hop-1 links, that is, the pages linked directly from the course catalog pages.
Finally, run findPrivacyContent.py to extract the course titles and descriptions from the pages downloaded by downloadHop1Links.py. Important: manually review and filter the output of findPrivacyContent.py to get the best results.

Contributors

Kristen Vaccaro
Aishwarya Manjunath

ucsd-smollab/PrivacySyllabus-PrivacyCourseExtraction
