
Investigate WebSoc/related scraper enhancements #49

Closed · 3 tasks done
ecxyzzy opened this issue Dec 10, 2024 · 0 comments · Fixed by #83

ecxyzzy commented Dec 10, 2024

Summary

Currently, the WebSoc scraper goes department by department when scraping a term. The problem is that there are many departments, some of which no longer exist or have no offerings for the given term, so many requests are wasted and the scrape is slower than it needs to be.

The key observation is that WebSoc supports returning up to 900 sections per request, and departments often have far fewer sections than that. Furthermore, section codes are not necessarily contiguous, so a chunk (a range of section codes) can span more than 900 codes while still containing at most 900 sections. This means we can cover all of WebSoc with far fewer requests than the department method requires.

If we assume that section codes are invariant once published to WebSoc, we can also generate an optimal mapping from the section codes in a term to a set of ranges, minimizing the number of requests even further.
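
For illustration, here is a minimal sketch of that range mapping (the names are hypothetical, not existing code): sort the known section codes, cut them into groups of at most 900, and query each group by its inclusive bounds.

```ts
// Hypothetical sketch: map a term's known section codes to query ranges.
// 900 is WebSoc's per-request section cap mentioned above.
interface CodeRange {
  min: number;
  max: number;
}

function buildRanges(sortedCodes: number[], width = 900): CodeRange[] {
  const ranges: CodeRange[] = [];
  for (let i = 0; i < sortedCodes.length; i += width) {
    const group = sortedCodes.slice(i, i + width);
    ranges.push({ min: group[0], max: group[group.length - 1] });
  }
  return ranges;
}

// e.g. 10,000 known codes => ceil(10000 / 900) = 12 requests,
// however sparsely the codes are distributed.
```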

We should also consider moving the logic for refreshing materialized views to another cron-triggered worker since it is also blocking, and realistically the WebSoc data should not impact course/instructor data every scrape.

Action Items

ecxyzzy self-assigned this Dec 10, 2024
ecxyzzy added a commit that referenced this issue Jan 15, 2025
## Description

This PR implements two key enhancements to the WebSoc scraper, which
should result in fewer errors and enable faster data retrieval.

### Chunk-wise scraping

Instead of scraping each term department by department, we implement a
method of scraping based on chunks of section codes. There are far
fewer chunks than there are departments, so this yields a significant
speedup; performance increases of roughly 4x were observed when testing locally.

Chunks are computed based on the contents of the `websoc_section` table.
Therefore, this method is only available if the term has been scraped at
least once. The scraper will fall back to the original method of
scraping department by department if it detects that it is scraping a
term for the first time.

Since WebSoc allows us to fetch up to 900 sections per request, chunks are
891 sections "wide". This provides a 1% margin of error, in case sections
that do not exist in the database magically appear between the time the
chunks are computed and the time the scraping batch actually executes.
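
As a sketch of how the two paths fit together (the helper names below are illustrative, not the actual implementation; `buildRanges` is the sketch from the issue above):

```ts
// Hypothetical sketch of the scrape dispatch; helper names are illustrative.
declare function getStoredSectionCodes(db: unknown, term: string): Promise<number[]>;
declare function scrapeByDepartment(db: unknown, term: string): Promise<void>;
declare function scrapeCodeRange(db: unknown, term: string, min: number, max: number): Promise<void>;
declare function buildRanges(codes: number[], width: number): { min: number; max: number }[];

const CHUNK_WIDTH = 891; // 900 less the 1% margin described above

async function scrapeTerm(db: unknown, term: string): Promise<void> {
  // Chunks can only be derived from codes already in websoc_section.
  const codes = await getStoredSectionCodes(db, term);
  if (codes.length === 0) {
    // First scrape of this term: fall back to department-by-department.
    return scrapeByDepartment(db, term);
  }
  for (const { min, max } of buildRanges(codes, CHUNK_WIDTH)) {
    await scrapeCodeRange(db, term, min, max);
  }
}
```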

### Materialized view refresh deferral

Previously, the WebSoc scraper would refresh the _materialized views_
that supply data to the courses and instructors endpoints every time it
completed a scrape. This is a slow, blocking process, and more
importantly, it does not need to run every time WebSoc is scraped.

To remedy this, a new cron-triggered Worker, `@apps/mview-refresher`, has
been implemented, whose sole purpose is to refresh the materialized
views on a nightly basis. The frequency can be adjusted in its
`wrangler.toml` if a lower lag time is desired, but nightly is probably
sufficient.
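
For reference, a minimal sketch of what such a Worker's entry point could look like (the view name and `Env` shape are placeholders, and this assumes `database` returns a Drizzle client; the real implementation lives in `@apps/mview-refresher`):

```ts
// Hypothetical sketch of a cron-triggered Worker that refreshes the
// materialized views. The view name and Env shape are placeholders.
import { database } from "@packages/db";
import { sql } from "drizzle-orm";

interface Env {
  DB_URL: string;
}

export default {
  // Invoked on the schedule declared under [triggers] in wrangler.toml,
  // e.g. crons = ["0 8 * * *"] for a nightly refresh.
  async scheduled(_controller: ScheduledController, env: Env, _ctx: ExecutionContext) {
    const db = database(env.DB_URL);
    // CONCURRENTLY lets readers keep querying the view while it rebuilds;
    // it requires a unique index on the view.
    await db.execute(sql`REFRESH MATERIALIZED VIEW CONCURRENTLY courses_mview`);
  },
};
```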

## Related Issue

Closes #49.

## How Has This Been Tested?

Tested locally with the following procedure:

* Clean up your local database using the following statements.
**TRIPLE CHECK THAT YOU ARE RUNNING THESE STATEMENTS LOCALLY IF YOU HAVE
ACCESS TO PRODUCTION!!!**
```sql
BEGIN TRANSACTION;
-- Reset the scraper's progress marker for the term.
UPDATE websoc_meta SET last_dept_scraped = NULL WHERE name = '2025 Winter';
-- Delete dependent rows before the sections they reference.
DELETE FROM websoc_section_meeting WHERE section_id IN (SELECT id FROM websoc_section WHERE year = '2025' AND quarter = 'Winter');
DELETE FROM websoc_section_enrollment WHERE year = '2025' AND quarter = 'Winter';
DELETE FROM websoc_section WHERE year = '2025' AND quarter = 'Winter';
COMMIT;
```

* Paste the following into
`apps/data-pipeline/websoc-scraper/src/index.ts`.
```ts
import { exit } from "node:process";
import { doScrape } from "$lib";
import { database } from "@packages/db";

async function main() {
  // DB_URL must point to your local development database.
  const url = process.env.DB_URL;
  if (!url) throw new Error("DB_URL not found");
  const db = database(url);
  await doScrape(db);
  exit(0);
}

main().catch((err) => {
  console.error(err);
  exit(1);
});
```

* Check **again** that your `DB_URL` environment variable points to your
local development database.
* Run `pnpm start` once. Verify that it is scraping by department. Once
completed, run it again.
* Verify that the second run has switched to chunk-wise scraping and
that it is faster.

## Types of changes

<!--- What types of changes does your code introduce? Put an `x` in all
the boxes that apply: -->

- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to change)

## Checklist:

<!--- Go over all the following points, and put an `x` in all the boxes
that apply. -->
<!--- If you're unsure about any of these, don't hesitate to ask. We're
here to help! -->

- [ ] My code involves a change to the database schema.
- [ ] My code requires a change to the documentation.