(docs) datapuller docs init (#766)

* modify getting started docs
* remove wip warning from infra section
* add datapuller documentation
* datapuller local development docs
* polish
asuc-octo · Jan 25, 2025 · 5708bce
1 parent 390f9bf

Showing 6 changed files with 52 additions and 11 deletions.
diff --git a/docs/src/README.md b/docs/src/README.md
@@ -21,19 +21,19 @@ Using Docker allows us to build the docs without downloading dependencies on our
 git pull
 
 # Build the container (only needed once every time docs/Dockerfile changes!)
-docker build --target docs-dev --tag="docs:dev" ./docs
+docker build --target=docs-dev --tag="docs:dev" ./docs
 
 # Run the container
 docker run --publish 3000:3000 --volume ./docs:/docs "docs:dev"
 ```
 
-The docs should be available at `http://localhost:3000/` with live reload. To change the port to port `XXXX`, modify the last command:
-```sh
-# Run the container and publish the docs to http://localhost:XXXX/
-docker run --publish XXXX:3000 --volume ./docs:/docs "docs:dev"
-```
+The docs should be available at `http://localhost:3000/` with live reload. To kill the container, you can use the Docker Desktop UI or run `docker kill [container id]`. You can find the container ID from `docker ps`.
 
-To kill the container, you can use the Docker Desktop UI or run `docker kill [container id]`. You can find the container ID from `docker ps`.
+> [!TIP]
+> To change the port from the above `3000`, modify the `docker run` command as follows, replacing the `XXXX` with your desired port:
+> ```sh
+> docker run --publish XXXX:3000 --volume ./docs:/docs "docs:dev"
+> ```
 
 #### Without Containerization
 

diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md
@@ -16,6 +16,7 @@
 - [Backend](./core/backend/README.md)
 
 - [Datapuller](./core/datapuller/README.md)
+    - [Local & Remote Development](./core/datapuller/local-remote-development.md)
 
 - [Frontend](./core/frontend/README.md)
 

diff --git a/docs/src/core/datapuller/README.md b/docs/src/core/datapuller/README.md
@@ -1,3 +1,15 @@
 # Datapuller
 
-TODO
+Welcome to the `datapuller` section.
+
+## What is the `datapuller`?
+
+The `datapuller` is a modular collection of data-pulling scripts responsible for populating Berkeleytime's databases with course, class, section, grades, and enrollment data from the official university-provided APIs. This collection of pullers is unified through a single entrypoint, which makes it easy to develop new pullers. The original proposal can be found [here](https://docs.google.com/document/d/1EdfI5Cmsk91LwZtUN0VSC5HEKy4RRuMhLhw8TRKRQrM/edit?tab=t.0#heading=h.c6lfrfjeglpv)[^1].
+
+### Motivation
+
+Before the `datapuller`, all data updates were done through a single script run every day. The lack of modularity made it difficult to increase or decrease the update frequency for specific data types. For example, enrollment data changes rapidly during enrollment season, so it would be beneficial to update it more often than once a day. Course data, however, seldom changes, so it would be more efficient to update it less frequently.
+
+Thus, the `datapuller` was born. It splits each puller into a separate script, which gives us more control over how often each data type is updated and increases the fault tolerance of each script.
+
+[^1]: Modifications to the initial proposal are not included in the document. However, the motivation remains relatively consistent.
diff --git a/docs/src/core/datapuller/local-remote-development.md b/docs/src/core/datapuller/local-remote-development.md
@@ -0,0 +1,30 @@
+# Local & Remote Development
+
+## Local Development
+
+The nature of the `datapuller` separates it from the backend and frontend services. Thus, when testing locally, it is quicker and easier to build and run the `datapuller` separately from the backend/frontend stack.
+
+To run a specific puller, the `datapuller` must first be built, then the specific puller must be passed as a command[^1]. In addition, a Mongo instance must be running in the same Docker network, and `.env` must contain the correct `MONGO_URI`.
+
+```sh
+# ./berkeleytime
+
+# Run a Mongo instance. The --name value is the hostname used in MONGO_URI.
+# Here, it would be mongodb://mongodb:27017/bt.
+docker run --name mongodb --network bt --detach "mongo:7.0.5"
+
+# Build the datapuller-dev image
+docker build --target datapuller-dev --tag "datapuller-dev" .
+
+# Run the desired puller. The default puller is main.
+docker run --volume ./.env:/datapuller/apps/datapuller/.env --network bt \
+    "datapuller-dev" "--puller=courses"
+```
+
+The valid pullers are `courses`, `classes`, `sections`, `grade-distributions`, and `main`.
+
+[^1]: This uses Docker's terminology. The `ENTRYPOINT` instruction sets the executable that runs when the container starts; arguments passed to `docker run` after the image name do not replace it (only the `--entrypoint` flag does). The `CMD` instruction sets default arguments, which those `docker run` arguments replace. In Kubernetes, the analogue of `ENTRYPOINT` is the `command` field, and the analogue of `CMD` is the `args` field.
+
+## Remote Development
+
+The development CI/CD pipeline marks all `datapuller` CronJobs as [suspended](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/cron-job-v1/#CronJobSpec), which prevents the `datapuller` jobs from being scheduled. To test a change, [manually run the desired puller](../infrastructure/runbooks.md#manually-run-datapuller).
diff --git a/docs/src/core/infrastructure/README.md b/docs/src/core/infrastructure/README.md
@@ -1,8 +1,5 @@
 # Infrastructure
 
-> [!WARNING]
-> The infrastructure section is currently under construction.
-
 Welcome to the infrastructure section.
 
 > [!NOTE]

diff --git a/docs/src/core/infrastructure/cicd-workflow.md b/docs/src/core/infrastructure/cicd-workflow.md
@@ -17,6 +17,7 @@ The differences between the three environments are managed by each individual wo
 | Helm Chart Versions[^3] | `0.1.0-dev-[commit hash]` | `0.1.0-stage` | `1.0.0` |
 | TTL (Time to Live) | `[GitHub Action input]` | N/A | N/A |
 | Deployment Count Limit | 8 | 1 | 1 |
+| Datapuller `suspend` | `true` | `false` | `false` |
 
 [^1]: In the past, we have used a self-hosted GitLab instance. However, the CI/CD pipeline was obscured behind an admin login page. Hopefully, with GitHub Actions, the deployment process will be more transparent and accessible to all engineers. Please don't break anything though!