A periodic web crawler to feed course data into GT Scheduler.
Sample: 202302.json
To report a bug or request a new feature, please create a new Issue in the GT Scheduler website repository.
This work is a derivative of the original and spectacular GT Schedule Crawler project created by Jinseo Park (as a part of the overall GT Scheduler project). The original work and all modifications are licensed under the AGPL v3.0 license.
Copyright (c) 2020 Jinseo Park
Copyright (c) 2020 the Bits of Good "GT Scheduler" team
The crawler is a command-line application written in TypeScript (a typed superset of JavaScript) that runs using Node.js to crawl schedule data from Oscar (Georgia Tech's registration management system).
It operates as a series of steps that are processed one after another (see src/index.ts) for each current "term" (a combination of year and semester, e.g. Fall 2021).
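The step-by-step flow can be pictured with a minimal sketch. Every name here (Term, parseTermCode, Step, runSteps) is hypothetical and not taken from src/index.ts, and the sketch assumes that term codes such as 202008 are laid out as YYYYMM, with 02, 05, and 08 standing for Spring, Summer, and Fall.

```typescript
// Hypothetical names throughout; see src/index.ts for the real steps.
interface Term {
  year: number;
  semester: "Spring" | "Summer" | "Fall";
}

// Assumption: term codes are YYYYMM, with 02/05/08 marking the semester.
function parseTermCode(code: string): Term {
  if (!/^\d{6}$/.test(code)) {
    throw new Error(`Unrecognized term code: ${code}`);
  }
  const year = Number(code.slice(0, 4));
  const month = code.slice(4);
  switch (month) {
    case "02":
      return { year, semester: "Spring" };
    case "05":
      return { year, semester: "Summer" };
    case "08":
      return { year, semester: "Fall" };
    default:
      throw new Error(`Unrecognized semester in term code: ${code}`);
  }
}

// A step receives the term and the previous step's output.
type Step = (term: string, data: unknown) => Promise<unknown>;

// Runs the steps strictly in order, feeding each result into the next step.
async function runSteps(term: string, steps: Step[]): Promise<unknown> {
  let data: unknown = undefined;
  for (const step of steps) {
    data = await step(term, data);
  }
  return data;
}
```

Under these assumptions, parseTermCode("202008") describes Fall 2020, and runSteps would be called once per current term with the list of steps to apply.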
To process the prerequisites data for each course (which comes as a string like "Undergraduate Semester level CS 2340 Minimum Grade of C and Undergraduate Semester level LMC 3432 Minimum Grade of C" and can become much more complex), the crawler also uses an ANTLR grammar and generated parser to convert the prerequisites data retrieved from Oscar into a normalized tree structure. The grammar itself and the generated parser/lexer code can be found in the src/steps/prereqs/grammar folder.
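To give a feel for what "normalized tree structure" means, here is a small hand-written recursive-descent sketch. It is not the ANTLR grammar the crawler uses, and its assumptions are illustrative only: the node shapes (PrereqCourse, PrereqGroup), the course pattern (2 to 4 capital letters, a 4-digit number with an optional letter suffix, and a single-letter grade), and the choice that "and" binds tighter than "or" are not taken from the source.

```typescript
// Hypothetical node shapes; the crawler's real output format may differ.
interface PrereqCourse { id: string; grade: string }
interface PrereqGroup { op: "and" | "or"; children: PrereqNode[] }
type PrereqNode = PrereqCourse | PrereqGroup;

type Token =
  | { kind: "lparen" }
  | { kind: "rparen" }
  | { kind: "op"; op: "and" | "or" }
  | { kind: "course"; course: PrereqCourse };

// Picks out parentheses, operators, and "SUBJ 1234 Minimum Grade of X";
// everything else (e.g. "Undergraduate Semester level") is skipped.
function tokenize(text: string): Token[] {
  const pattern = /\(|\)|\band\b|\bor\b|([A-Z]{2,4}) (\d{4}[A-Z]?) Minimum Grade of ([A-Z])/g;
  const tokens: Token[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const raw = match[0];
    if (raw === "(") tokens.push({ kind: "lparen" });
    else if (raw === ")") tokens.push({ kind: "rparen" });
    else if (raw === "and" || raw === "or") tokens.push({ kind: "op", op: raw });
    else tokens.push({ kind: "course", course: { id: `${match[1]} ${match[2]}`, grade: match[3] ?? "" } });
  }
  return tokens;
}

function parsePrereqs(text: string): PrereqNode {
  const tokens = tokenize(text);
  let pos = 0;

  // atom := course | "(" orChain ")"
  function parseAtom(): PrereqNode {
    const token = tokens[pos];
    if (token === undefined) throw new Error("Unexpected end of input");
    pos += 1;
    if (token.kind === "course") return token.course;
    if (token.kind === "lparen") {
      const inner = parseChain("or");
      const closing = tokens[pos];
      if (closing === undefined || closing.kind !== "rparen") throw new Error("Expected ')'");
      pos += 1;
      return inner;
    }
    throw new Error(`Unexpected token: ${token.kind}`);
  }

  // orChain := andChain ("or" andChain)*, andChain := atom ("and" atom)*
  function parseChain(op: "and" | "or"): PrereqNode {
    const parseOperand = op === "or" ? () => parseChain("and") : parseAtom;
    const first = parseOperand();
    const children: PrereqNode[] = [first];
    for (let next = tokens[pos]; next !== undefined && next.kind === "op" && next.op === op; next = tokens[pos]) {
      pos += 1;
      children.push(parseOperand());
    }
    // A single operand needs no wrapping group.
    return children.length === 1 ? first : { op, children };
  }

  const tree = parseChain("or");
  if (pos !== tokens.length) throw new Error("Unexpected trailing tokens");
  return tree;
}
```

Applied to the example string above, this sketch produces one "and" group holding CS 2340 and LMC 3432, each with a minimum grade of C. Real Oscar strings contain more variations than this handles, which is the reason a full grammar is used in the crawler.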
The crawler is run every 30 minutes using a GitHub Action workflow, which then publishes the resulting JSON to the gh-pages branch, where it can be downloaded by the frontend app: https://gt-scheduler.github.io/crawler/202008.json.
- Node.js (any recent version will probably work)
- Installation of the yarn package manager version 1 (support for version 2 is untested)
After cloning the repository to your local computer, run the following command in the repo folder:
yarn install
This may take a couple of minutes and will create a new folder called node_modules with all of the dependencies installed within. This only needs to be run once.
Then, to run the crawler, run:
yarn start
After the crawler runs, a series of JSON files should have been created in a new data directory in the project root.
By default, the crawler outputs standard log lines to the terminal in development. However, it also supports outputting structured JSON log events that can be more easily parsed and analyzed when debugging. This is turned on by default when the crawler is running in a GitHub Action (where the LOG_FORMAT environment variable is set to json), but it can also be enabled for development.
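As a rough illustration of switching between the two output styles, the sketch below chooses a format from the value of LOG_FORMAT. The function names, the level and message field names, and the plain-text layout are all assumptions made for this example; the crawler's actual logger may look quite different.

```typescript
type LogFormat = "text" | "json";

// Anything other than "json" (including an unset variable) means plain text.
// A caller would typically pass process.env.LOG_FORMAT here.
function resolveLogFormat(value: string | undefined): LogFormat {
  return value === "json" ? "json" : "text";
}

// Hypothetical field names ("level", "message"); extra fields are merged in.
function formatLogLine(
  format: LogFormat,
  level: string,
  message: string,
  fields: Record<string, unknown> = {}
): string {
  if (format === "json") {
    // One JSON object per line keeps the output easy to parse later.
    return JSON.stringify({ level, message, ...fields });
  }
  const suffix = Object.keys(fields)
    .map((key) => `${key}=${String(fields[key])}`)
    .join(" ");
  return suffix.length > 0 ? `[${level}] ${message} ${suffix}` : `[${level}] ${message}`;
}
```

Emitting one self-contained JSON object per line is what makes the output easy to filter with a tool such as jq.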
The utility script yarn start-logged can be used to run the crawler and output JSON log lines to a logfile in the current working directory:
yarn start-logged
To analyze the JSON log lines, I recommend using jq, a powerful tool for parsing/analyzing JSON in the shell. The following command imports all lines in the latest log file and loads them as one large array for further processing. Note: this command will probably only work on Unix-like systems (Linux and probably macOS), so your mileage may vary. If you're running into issues, try running it on a Linux computer and make sure you have jq installed:
cat $(find . -type f -name "*.log" | sort -n | tail -1) | jq -cs '.'
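Where the shell command is not an option, the "load every line into one array" part can be approximated in a few lines of TypeScript. This is only a sketch: parseJsonLines is a hypothetical helper, it assumes the log file holds exactly one JSON object per line, and reading the file contents is left to the caller.

```typescript
// Parses newline-delimited JSON into one array, skipping blank lines.
// Assumes each non-empty line is a complete JSON value.
function parseJsonLines(text: string): unknown[] {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim().length > 0)
    .map((line, index) => {
      try {
        return JSON.parse(line) as unknown;
      } catch {
        // Report which line was malformed instead of failing opaquely.
        throw new Error(`Invalid JSON on line ${index + 1}`);
      }
    });
}
```

The resulting array can then be filtered or grouped with ordinary array methods, much like piping the jq output into further queries.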
For some useful queries on the log data, see 📚 Useful queries on crawler logs.
First, ensure Python 3.9 or newer is installed. Then, install the necessary Python modules with the included requirements.txt file:
pip install -r requirements.txt
Run the reviser to augment the previously scraped data with the new finals data:
python ./src/Revise.py
The JSON files in the data folder will now contain updated information regarding the finals date and time.
More information can be found here
The Registrar publishes a PDF with the Finals schedule at the start of each semester. The page with the PDF for the Fall 2022 semester can be found here
The matrix.json file contains a mapping from term to the PDF file.
The key is one of the terms identified by the scraper here.
The value is the direct address for the PDF file such as this
This mapping needs to be updated each semester when a new schedule is posted
More information can be found on the wiki
The project uses pre-commit hooks using Husky and lint-staged to run linting (via ESLint) and formatting (via Prettier). These can be run manually from the command line to format/lint the code on-demand, using the following commands:
- yarn run lint - runs ESLint and reports all linting errors without fixing them
- yarn run lint:fix - runs ESLint and reports all linting errors, attempting to fix any auto-fixable ones
- yarn run format - runs Prettier and automatically formats the entire codebase
- yarn run format:check - runs Prettier and reports formatting errors without fixing them
The GT Scheduler project welcomes (and encourages) contributions from the community. Regular development is performed by the project owners (Jason Park and Bits of Good), but we still encourage others to work on adding new features or fixing existing bugs and make the registration process better for the Georgia Tech community.
More information on how to contribute can be found in the contributing guide.