chore: Docs for apiv2 and importers. (#400)
Showing 3 changed files with 73 additions and 2 deletions.
# Database Ingestion

S3 is the source of truth for all the data in the CryoET data portal, but we provide a GraphQL API that makes it easy to query the portal for specific metadata. To accomplish this, we've built a utility that's aware of the portal's expected directory structure and writes most of the metadata that it finds to a PostgreSQL database. The code for this utility is in this directory.
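For a rough sense of what "query the portal for specific metadata" looks like, here is a hypothetical query against that GraphQL API. The endpoint path and field names below are assumptions for illustration only; consult the apiv2 GraphQL schema for the real query shape.

```python
# Hypothetical sketch of a metadata query against the portal's GraphQL API.
# The endpoint path and field names below are assumptions for illustration;
# consult the apiv2 GraphQL schema for the real query shape.
import json
import urllib.request

GRAPHQL_URL = "https://localhost:8080/graphql"  # assumed local apiv2 endpoint

query = """
{
  datasets {
    id
    title
  }
}
"""

request = urllib.request.Request(
    GRAPHQL_URL,
    data=json.dumps({"query": query}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
```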
## How to run this tool
To run database ingestion in staging/prod, please see the [docs for launching db ingestion in our batch cluster](https://github.com/chanzuckerberg/cryoet-data-portal-backend/blob/main/ingestion_tools/docs/enqueue_runs.md#database-ingestion-db-import-subcommand).
To run database ingestion locally against production data, for testing or demonstration purposes:
```
$ make apiv2-init  # From the root of the repo, if you don't have a local dev environment running already
$ export AWS_PROFILE=cryoet-dev  # ANY valid AWS credentials will do, since the data we're reading is in a public bucket
$ cd apiv2
$ python3 -m db_import.importer load --postgres_url postgresql://postgres:postgres@localhost:5432/cryoetv2 cryoet-data-portal-public https://localhost:8080 --import-everything --s3-prefix 10002 --import-depositions --deposition-id 10301 --deposition-id 10303
```
## Architecture
**NOTE - we're currently partway through a refactor!** The old base importer classes in `importers/base_importer.py` are being deprecated in favor of the base classes in `importers/base.py`, and the docs here reflect the newer functionality.
The architecture of the database importer tool is *similar* to the [architecture of the S3 ingestion tool](https://github.com/chanzuckerberg/cryoet-data-portal-backend/blob/main/ingestion_tools/docs/s3_ingestion.md#how-are-the-different-entities-related):
1. We have a top-level script, `importer.py`, that processes the data in the portal hierarchically.
2. For each object in our schema/database, we have a subclass of the `ItemDBImporter` class that gets instantiated for each instance of an object (e.g. one `TomogramItem` instance represents one tomogram in the data portal), and one subclass of `IntegratedDBImporter` that represents a group of items at a particular point in our object hierarchy (see the sketch after this list).
   a. `IntegratedDBImporter` is responsible for finding all relevant metadata at a particular S3 prefix, instantiating the appropriate `ItemDBImporter` objects, and deleting any stale DB rows once all items have been processed.
   b. `ItemDBImporter` is responsible for transforming JSON input data into database fields and performing any necessary calculations.
3. Similar to the S3 ingestion tool, database ingestion has a set of `Finder` classes in `common/finders.py` that provide reusable helpers for finding files in S3 or finding specific fields within JSON files in S3. Each `IntegratedDBImporter` class specifies the relevant finder for its data type and provides arguments for how to find data.
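The sketch below shows how those two roles fit together. It is a self-contained toy, not the real implementation: the class and method names are illustrative and do not mirror the actual signatures in `importers/base.py`, and a plain dict stands in for the database.

```python
# Illustrative sketch only: the real base classes live in importers/base.py and
# their names and signatures differ. This shows the division of labor between
# the per-item and per-group importers described above.
from typing import Any, Iterator


class ItemImporterSketch:
    """Turns one JSON metadata blob into the fields for one database row."""

    def __init__(self, metadata: dict[str, Any]):
        self.metadata = metadata

    def to_db_fields(self) -> dict[str, Any]:
        # Map JSON input to DB columns, computing any derived values.
        return {
            "name": self.metadata["name"],
            "voxel_spacing": float(self.metadata["voxel_spacing"]),
        }


class IntegratedImporterSketch:
    """Finds all items under an S3 prefix, imports them, and prunes stale rows."""

    def find_metadata(self, prefix: str) -> Iterator[dict[str, Any]]:
        # In the real code, a Finder class from common/finders.py locates the
        # relevant metadata files in S3; hard-coded metadata stands in here.
        yield {"name": f"{prefix}/tomogram_1", "voxel_spacing": "13.48"}

    def import_items(self, db: dict[str, dict[str, Any]], prefix: str) -> None:
        seen = set()
        for metadata in self.find_metadata(prefix):
            row = ItemImporterSketch(metadata).to_db_fields()
            db[row["name"]] = row  # upsert one row per item
            seen.add(row["name"])
        # Remove stale rows under this prefix that weren't seen on this run.
        for name in [n for n in db if n.startswith(prefix) and n not in seen]:
            del db[name]


# Toy usage: import everything under dataset prefix "10002" into a dict "db".
db: dict[str, dict[str, Any]] = {}
IntegratedImporterSketch().import_items(db, "10002")
print(db)
```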
## Testing
Tests for DB importers generally do something along the lines of:
1. Populate the database with some stale data, meaning some rows that should be modified by the relevant importer and some that should be deleted.
2. Perform ingestion of a given type. The data we ingest comes from `test_infra/test_files` at the root of this repo, and it deliberately **does not match** the data seeded in step 1; this is because we want to ensure that we can accurately **update**, **delete**, and **create** database rows to match the data in the "fake" S3 bucket represented by our `test_infra` directory (see the sketch after this list).
3. Validate that new rows were created, stale rows were updated, and unneeded rows were deleted during ingestion, and that all fields match the expected values from `test_infra/test_files`.
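As a toy illustration of that seed / ingest / verify flow, here is a self-contained sketch in which a plain dict stands in for Postgres and hard-coded metadata stands in for `test_infra/test_files`; the real tests use the actual database and a moto-backed S3 bucket instead.

```python
# Self-contained toy of the seed / ingest / verify pattern. A dict stands in
# for the database; the "ingested" metadata stands in for test_infra/test_files.

# 1. Seed stale rows: one that ingestion should update, one it should delete.
db = {
    "existing-tomogram": {"voxel_spacing": 0.0},  # expected to be updated
    "orphaned-tomogram": {"voxel_spacing": 0.0},  # expected to be deleted
}

# 2. "Ingest" metadata that deliberately does not match the seeded rows.
ingested = {
    "existing-tomogram": {"voxel_spacing": 13.48},
    "new-tomogram": {"voxel_spacing": 7.56},
}
for name, fields in ingested.items():
    db[name] = fields                 # create or update rows
for name in set(db) - set(ingested):
    del db[name]                      # delete stale rows

# 3. Verify create / update / delete behavior and field values.
assert "orphaned-tomogram" not in db
assert db["existing-tomogram"]["voxel_spacing"] == 13.48
assert db["new-tomogram"]["voxel_spacing"] == 7.56
print("ok")
```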
**NOTE** that if you change any files in `test_infra/test_files`, you'll need to run `test_infra/seed_moto.sh` to upload the changes to the moto server we use for testing.