(docs) datapuller docs init (#766)
* modify getting started docs

* remove wip warning from infra section

* add datapuller documentation

* datapuller local development docs

* polish
maxmwang authored Jan 25, 2025
1 parent 390f9bf commit 5708bce
Showing 6 changed files with 52 additions and 11 deletions.
14 changes: 7 additions & 7 deletions docs/src/README.md
@@ -21,19 +21,19 @@ Using Docker allows us to build the docs without downloading dependencies on our
git pull

# Build the container (only needed once every time docs/Dockerfile changes!)
docker build --target docs-dev --tag="docs:dev" ./docs
docker build --target=docs-dev --tag="docs:dev" ./docs

# Run the container
docker run --publish 3000:3000 --volume ./docs:/docs "docs:dev"
```

The docs should be available at `http://localhost:3000/` with live reload. To change the port to port `XXXX`, modify the last command:
```sh
# Run the container and publish the docs to http://localhost:XXXX/
docker run --publish XXXX:3000 --volume ./docs:/docs "docs:dev"
```
The docs should be available at `http://localhost:3000/` with live reload. To kill the container, you can use the Docker Desktop UI or run `docker kill [container id]`. You can find the container ID from `docker ps`.

To kill the container, you can use the Docker Desktop UI or run `docker kill [container id]`. You can find the container ID from `docker ps`.
> [!TIP]
> To change the port from the above `3000`, modify the `docker run` command as follows, replacing the `XXXX` with your desired port:
> ```sh
> docker run --publish XXXX:3000 --volume ./docs:/docs "docs:dev"
> ```
#### Without Containerization
1 change: 1 addition & 0 deletions docs/src/SUMMARY.md
@@ -16,6 +16,7 @@
- [Backend](./core/backend/README.md)

- [Datapuller](./core/datapuller/README.md)
- [Local & Remote Development](./core/datapuller/local-remote-development.md)

- [Frontend](./core/frontend/README.md)

14 changes: 13 additions & 1 deletion docs/src/core/datapuller/README.md
@@ -1,3 +1,15 @@
# Datapuller

TODO
Welcome to the `datapuller` section.

## What is the `datapuller`?

The `datapuller` is a modular collection of data-pulling scripts responsible for populating Berkeleytime's databases with course, class, section, grade, and enrollment data from the official university-provided APIs. These pullers are unified behind a single entrypoint, which makes it easy to develop new pullers. The original proposal can be found [here](https://docs.google.com/document/d/1EdfI5Cmsk91LwZtUN0VSC5HEKy4RRuMhLhw8TRKRQrM/edit?tab=t.0#heading=h.c6lfrfjeglpv)[^1].

### Motivation

Before the `datapuller`, all data updates were done through a single script run every day. The lack of modularity made it difficult to increase or decrease the update frequency for specific data types. For example, enrollment data changes rapidly during enrollment season, so it would be beneficial to update it more than once a day. Course data, however, seldom changes, so it would be more efficient to update it less frequently.

Thus, the `datapuller` was born, modularizing each puller into a separate script, giving us more control over how often each one runs and increasing the fault tolerance of each script.

[^1]: Modifications to the initial proposal are not included in the document. However, the motivation remains relatively consistent.
30 changes: 30 additions & 0 deletions docs/src/core/datapuller/local-remote-development.md
@@ -0,0 +1,30 @@
# Local & Remote Development

## Local Development

The nature of the `datapuller` separates it from the backend and frontend services. Thus, when testing locally, it is quicker and easier to build and run the `datapuller` separately from the backend/frontend stack.

To run a specific puller, the `datapuller` must first be built, then the specific puller must be passed as a command[^1]. In addition, a Mongo instance should be running in the same network, and the correct `MONGO_URI` should be set in `.env`.

```sh
# ./berkeleytime

# Run a Mongo instance. The --name value becomes the hostname in MONGO_URI.
# Here, it would be mongodb://mongodb:27017/bt.
docker run --name mongodb --network bt --detach "mongo:7.0.5"

# Build the datapuller-dev image
docker build --target datapuller-dev --tag "datapuller-dev" .

# Run the desired puller. The default puller is main.
docker run --volume ./.env:/datapuller/apps/datapuller/.env --network bt \
"datapuller-dev" "--puller=courses"
```
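
The commands above assume that the `bt` network already exists; `docker run --network bt` fails if it does not. The following is a minimal sketch, not part of the repository, of how the network could be created on demand and how the `MONGO_URI` follows from the container name. The helper names `mongo_uri` and `ensure_network` are hypothetical.

```sh
# Hypothetical helper: derive the MONGO_URI from the Mongo container name.
# On a user-defined network, other containers reach it by that name.
mongo_uri() {
  printf 'mongodb://%s:27017/bt\n' "$1"
}

# Hypothetical helper: create the network only if it does not exist yet.
ensure_network() {
  docker network inspect "$1" >/dev/null 2>&1 || docker network create "$1"
}

# Value to place in .env for a container started with --name mongodb.
mongo_uri mongodb
```

Calling `ensure_network bt` once before the first `docker run` should be enough; if the network was already created (for example, by an earlier run of the rest of the stack), the call does nothing.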

The valid pullers are `courses`, `classes`, `sections`, `grade-distributions`, and `main`.
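
To avoid starting a container with a mistyped puller name, the list above can be checked in a small wrapper. This is only a sketch: `is_valid_puller` and `run_puller` are hypothetical names, and the `docker run` invocation is copied from the example above.

```sh
# Hypothetical helper: succeed only for the puller names listed above.
is_valid_puller() {
  case "$1" in
    courses|classes|sections|grade-distributions|main) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: validate the name, then run the puller.
# Falls back to main, the default puller, when no name is given.
run_puller() {
  puller="${1:-main}"
  if ! is_valid_puller "$puller"; then
    echo "unknown puller: $puller" >&2
    return 1
  fi
  docker run --volume ./.env:/datapuller/apps/datapuller/.env --network bt \
    "datapuller-dev" "--puller=$puller"
}
```

For example, `run_puller sections` would run the `sections` puller, while a name outside the list is rejected before Docker is invoked.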

[^1]: Here, I reference the Docker world's terminology. In the Docker world, the `ENTRYPOINT` instruction denotes the executable that is not replaced by arguments passed to `docker run` after the image is built (it can only be changed with the explicit `--entrypoint` flag). The `CMD` instruction denotes default arguments that can be overridden after the image is built. In the Kubernetes world, the `ENTRYPOINT` analogue is the `command` field, while the `CMD` equivalent is the `args` field.

## Remote Development

The development CI/CD pipeline marks all `datapuller` CronJobs as [suspended](https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/cron-job-v1/#CronJobSpec), preventing the `datapuller` jobs from being scheduled. To test a change, [manually run the desired puller](../infrastructure/runbooks.md#manually-run-datapuller).
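
As a rough illustration of what a manual run can look like, the sketch below checks a CronJob's `suspend` field and creates a one-off Job from it with `kubectl create job --from`. The CronJob name `bt-dev-datapuller-courses` and both helper functions are assumptions made for this example; the linked runbook remains the authoritative procedure.

```sh
# Hypothetical CronJob name; list the real ones with `kubectl get cronjobs`.
CRONJOB="${CRONJOB:-bt-dev-datapuller-courses}"

# Hypothetical helper: build a Job name from the CronJob name and a suffix.
manual_job_name() {
  printf '%s-manual-%s\n' "$1" "$2"
}

# Hypothetical helper: print the CronJob's suspend field, then create a
# one-off Job from its job template, using a timestamp as the suffix.
run_cronjob_once() {
  kubectl get cronjob "$1" --output jsonpath='{.spec.suspend}{"\n"}'
  kubectl create job "$(manual_job_name "$1" "$(date +%s)")" --from="cronjob/$1"
}
```

Running `run_cronjob_once "$CRONJOB"` against the development cluster would then start the Job immediately, even though the CronJob itself stays suspended.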
3 changes: 0 additions & 3 deletions docs/src/core/infrastructure/README.md
@@ -1,8 +1,5 @@
# Infrastructure

> [!WARNING]
> The infrastructure section is currently under construction.
Welcome to the infrastructure section.

> [!NOTE]
1 change: 1 addition & 0 deletions docs/src/core/infrastructure/cicd-workflow.md
@@ -17,6 +17,7 @@ The differences between the three environments are managed by each individual wo
| Helm Chart Versions[^3] | `0.1.0-dev-[commit hash]` | `0.1.0-stage` | `1.0.0` |
| TTL (Time to Live) | `[GitHub Action input]` | N/A | N/A |
| Deployment Count Limit | 8 | 1 | 1 |
| Datapuller `suspend` | `true` | `false` | `false` |

[^1]: In the past, we have used a self-hosted GitLab instance. However, the CI/CD pipeline was obscured behind an admin login page. Hopefully, with GitHub Actions, the deployment process will be more transparent and accessible to all engineers. Please don't break anything though!
