Description
This PR implements two key enhancements to the WebSoc scraper, which should result in fewer errors and enable faster data retrieval.
Chunk-wise scraping
Instead of scraping each term department by department, we implement a method of scraping based on chunks of section codes. There are far fewer chunks than there are departments, so this yields a significant speedup; performance increases of 4x were observed when testing locally.
Chunks are computed based on the contents of the websoc_section table. Therefore, this method is only available if the term has been scraped at least once. The scraper will fall back to the original method of scraping department by department if it detects that it is scraping a term for the first time.
Since WebSoc will allow us to fetch up to 900 sections, chunks are 891 sections "wide". This provides a 1% margin of error, in case sections that do not exist in the database magically appear between computing the chunks and the actual scraping batch being executed.
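The chunk computation described above could look roughly like the following sketch. It is not the code from this PR: the names computeChunks, SectionCodeRange, WEBSOC_MAX_SECTIONS, and CHUNK_WIDTH are hypothetical, and it assumes the section codes have already been read from the websoc_section table into an array of numbers and that each chunk is requested as an inclusive range of section codes.

```typescript
// Hypothetical sketch; names and structure are assumptions, not this PR's code.

// Upper limit on sections per WebSoc request, as stated in the description.
const WEBSOC_MAX_SECTIONS = 900;
// 1% below the limit: 900 - 9 = 891 known sections per chunk.
const CHUNK_WIDTH = WEBSOC_MAX_SECTIONS - WEBSOC_MAX_SECTIONS / 100;

// An inclusive range of section codes covered by one request.
type SectionCodeRange = { lower: number; upper: number };

/**
 * Splits the known section codes into inclusive ranges, each spanning
 * at most `width` known codes. Duplicates are removed and the codes
 * are sorted before slicing.
 */
function computeChunks(
  sectionCodes: number[],
  width: number = CHUNK_WIDTH,
): SectionCodeRange[] {
  const sorted = [...new Set(sectionCodes)].sort((a, b) => a - b);
  const chunks: SectionCodeRange[] = [];
  for (let i = 0; i < sorted.length; i += width) {
    const slice = sorted.slice(i, i + width);
    chunks.push({ lower: slice[0], upper: slice[slice.length - 1] });
  }
  return chunks;
}

// An empty result would mean the term has never been scraped, which is
// the case where the scraper falls back to department-by-department.
const example = computeChunks([34000, 34010, 34020], 2);
console.log(example);
```

Because each range spans only 891 known codes, up to 9 sections that appear inside a range after the chunks were computed still fit under the 900-section limit. Note that this sketch leaves gaps between consecutive ranges, so a new code that falls between two chunks would not be picked up until the chunks are recomputed.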
Materialized view refresh deferral
Previously, the WebSoc scraper would refresh the materialized views that supply data to the courses and instructors endpoints every time it completed a scrape. This is a slow and blocking process, and more importantly it does not need to be run every time WebSoc is scraped.
To remedy this, a new cron-triggered Worker @apps/mview-refresher has been implemented, whose sole purpose is to refresh the materialized views on a nightly basis. The frequency can be adjusted in its wrangler.toml if a lower lag time is desired, but nightly is probably sufficient.
Related Issue
Closes #49.
How Has This Been Tested?
Tested locally with the following procedure:
TRIPLE CHECK THAT YOU ARE RUNNING THESE STATEMENTS LOCALLY IF YOU HAVE ACCESS TO PRODUCTION!!!
apps/data-pipeline/websoc-scraper/src/index.ts
Ensure that the DB_URL environment variable points to your local development database.
Run pnpm start once. Verify that it is scraping by department. Once completed, run it again.
Types of changes
Checklist: