feat: websoc scraper enhancements #83

ecxyzzy · 2025-01-14T20:11:59Z

Description

This PR implements two key enhancements to the WebSoc scraper, which should result in fewer errors and enable faster data retrieval.

Chunk-wise scraping

Instead of scraping each term department by department, we implement a method of scraping based on chunks of section codes. There are far fewer chunks than there are departments, so this yields a significant speedup; performance increases of 4x were observed when testing locally.

Chunks are computed based on the contents of the websoc_section table. Therefore, this method is only available if the term has been scraped at least once. The scraper will fall back to the original method of scraping department by department if it detects that it is scraping a term for the first time.

Since WebSoc allows us to fetch up to 900 sections per request, chunks are 891 sections "wide". This leaves a 1% margin of error (9 sections) in case sections that are not yet in the database appear on WebSoc between computing the chunks and executing the actual scraping batch.
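As a rough illustration of the chunking described above, here is a minimal sketch. It is not the code from this PR: `computeChunks`, the `Chunk` shape, and the assumption that section codes are zero-padded strings (so they sort lexicographically) are all hypothetical. Only the 891-section width comes from the description.

```typescript
// Chunk width from the PR description: 891 of WebSoc's 900-section cap.
const CHUNK_WIDTH = 891;

// Hypothetical representation of a chunk: an inclusive range of section codes.
interface Chunk {
  lower: string;
  upper: string;
  count: number;
}

// Split the known section codes for a term into consecutive ranges.
// Assumes codes are zero-padded strings, so a plain sort orders them correctly.
function computeChunks(
  sectionCodes: readonly string[],
  width: number = CHUNK_WIDTH,
): Chunk[] {
  const sorted = [...new Set(sectionCodes)].sort();
  const chunks: Chunk[] = [];
  for (let i = 0; i < sorted.length; i += width) {
    const slice = sorted.slice(i, i + width);
    chunks.push({
      lower: slice[0],
      upper: slice[slice.length - 1],
      count: slice.length,
    });
  }
  return chunks;
}
```

An empty result means no sections are known for the term yet, which corresponds to the case where the scraper falls back to scraping department by department.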

Materialized view refresh deferral

Previously, the WebSoc scraper would refresh the materialized views that supply data to the courses and instructors endpoints every time it completed a scrape. This is a slow and blocking process, and more importantly it does not need to be run every time WebSoc is scraped.

To remedy this, a new cron-triggered Worker @apps/mview-refresher has been implemented, whose sole purpose is to refresh the materialized views on a nightly basis. The frequency can be adjusted in its wrangler.toml if a lower lag time is desired, but nightly is probably sufficient.
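For readers unfamiliar with cron-triggered Workers, the general shape might look like the sketch below. It is an illustration under stated assumptions, not the contents of @apps/mview-refresher: the view names, the `Execute` type, `makeWorker`, the `DB_URL` binding, and a PostgreSQL backend are all assumed. The only platform detail relied on is that Cloudflare calls a Worker's `scheduled` handler on each cron trigger.

```typescript
// Hypothetical view names; the PR does not list the real ones.
const VIEWS: readonly string[] = ["course_view", "instructor_view"];

// Minimal stand-in for a database client: runs one SQL statement.
type Execute = (sql: string) => Promise<void>;

// Refresh each view in turn and report which ones were refreshed.
async function refreshViews(
  execute: Execute,
  views: readonly string[] = VIEWS,
): Promise<string[]> {
  const refreshed: string[] = [];
  for (const view of views) {
    // REFRESH MATERIALIZED VIEW is PostgreSQL syntax (assumed backend).
    await execute(`REFRESH MATERIALIZED VIEW ${view};`);
    refreshed.push(view);
  }
  return refreshed;
}

// Builds an object exposing the `scheduled` handler invoked on each cron trigger.
// `connect` and the `DB_URL` binding are assumptions for illustration.
function makeWorker(connect: (dbUrl: string) => Execute) {
  return {
    async scheduled(_controller: unknown, env: { DB_URL: string }): Promise<void> {
      await refreshViews(connect(env.DB_URL));
    },
  };
}
```

In a deployed Worker, an object of this shape would be the module's default export, and the cron schedule itself would live in wrangler.toml as the description notes.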

Related Issue

Closes #49.

How Has This Been Tested?

Tested locally with the following procedure:

Clean up your local database using the following statements.
TRIPLE CHECK THAT YOU ARE RUNNING THESE STATEMENTS LOCALLY IF YOU HAVE ACCESS TO PRODUCTION!!!

BEGIN TRANSACTION;
UPDATE websoc_meta SET last_dept_scraped = NULL WHERE name = '2025 Winter';
DELETE FROM websoc_section_meeting WHERE section_id IN (SELECT id FROM websoc_section WHERE year = '2025' AND quarter = 'Winter');
DELETE FROM websoc_section_enrollment WHERE year = '2025' AND quarter = 'Winter';
DELETE FROM websoc_section WHERE year = '2025' AND quarter = 'Winter';
COMMIT;

Paste the following into apps/data-pipeline/websoc-scraper/src/index.ts.

import { exit } from "node:process";
import { doScrape } from "$lib";
import { database } from "@packages/db";

async function main() {
  const url = process.env.DB_URL;
  if (!url) throw new Error("DB_URL not found");
  const db = database(url);
  await doScrape(db);
  exit(0);
}

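// Surface failures instead of leaving the rejection unhandled.
main().catch((err) => {
  console.error(err);
  exit(1);
});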

Check again that your DB_URL environment variable points to your local development database.
Run pnpm start once. Verify that it is scraping by department. Once completed, run it again.
Verify that the second run has switched to chunk-wise scraping and that it is faster.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code involves a change to the database schema.
My code requires a change to the documentation.

andrew-wang0 · 2025-01-15T00:19:34Z

Blazingly fast 🔥

This script in /mview-refresher/package.json

{
    "start": "dotenv -e ../../../.env -- tsx src/index.ts"
}

is missing the index.ts entrypoint.

We could possibly add that in or remove the script.

ecxyzzy · 2025-01-15T00:27:20Z

Good catch, done 🙌

andrew-wang0

Looks good to me 👌

ecxyzzy added 2 commits January 14, 2025 11:51

feat(websoc-scraper): implement chunk-wise scraping

a553f26

feat: implement mview-refresher

0d645bf

ecxyzzy requested review from laggycomputer and andrew-wang0 January 14, 2025 20:11

ecxyzzy temporarily deployed to staging-83 January 14, 2025 20:12 — with GitHub Actions Inactive

chore(mview-refresher): remove start script

27db922

ecxyzzy temporarily deployed to staging-83 January 15, 2025 00:27 — with GitHub Actions Inactive

andrew-wang0 approved these changes Jan 15, 2025

ecxyzzy merged commit 8f198a3 into main Jan 15, 2025
1 check passed

ecxyzzy deleted the enhance-websoc-scraper branch January 15, 2025 00:33
