CHAI is an attempt at an open-source data pipeline for package managers. The goal is a pipeline that can ingest data from any package manager and provide a normalized data source for a myriad of use cases.
Use Docker
- Install Docker
- Clone the chai repository (https://docs.github.com/en/repositories/creating-and-managing-repositories/cloning-a-repository)
- Using a terminal, navigate to the cloned repository directory
- Run
docker compose build
to create the latest Docker images
- Then, run
docker compose up
to launch.
Note
This will run CHAI for all package managers. For example, crates by itself takes over an hour and consumes more than 5 GB of storage.
Currently, we support only two package managers:
- crates
- Homebrew
You can run a single package manager by running
docker compose up -e ... <package_manager>
We are planning on supporting NPM, PyPI, and rubygems next.
Specify these as environment variables, e.g. FOO=bar docker compose up:
- FREQUENCY: Sets how often (in hours) the pipeline should run.
- TEST: Runs the loader in test mode when set to true, skipping certain data insertions.
- FETCH: Determines whether to fetch new data from the source when set to true.
- NO_CACHE: When set to true, deletes temporary files after processing.
Note
The flag NO_CACHE does not mean that files will not get downloaded to your local storage, just that we'll delete the files once we're done with them.
These arguments are all configurable in the docker-compose.yml file.
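As a rough illustration of how a loader might interpret these variables, the sketch below reads them from a mapping of environment values. The names `PipelineConfig` and `read_pipeline_config` and all default values are hypothetical and chosen for illustration; they are not taken from CHAI's code.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PipelineConfig:
    frequency_hours: int  # FREQUENCY
    test: bool            # TEST
    fetch: bool           # FETCH
    no_cache: bool        # NO_CACHE


def _as_bool(value: str) -> bool:
    # Treat "true" (any casing) as True and everything else as False.
    return value.strip().lower() == "true"


def read_pipeline_config(env: Mapping[str, str] = os.environ) -> PipelineConfig:
    # The defaults below are illustrative assumptions, not CHAI's real defaults.
    return PipelineConfig(
        frequency_hours=int(env.get("FREQUENCY", "24")),
        test=_as_bool(env.get("TEST", "false")),
        fetch=_as_bool(env.get("FETCH", "true")),
        no_cache=_as_bool(env.get("NO_CACHE", "false")),
    )
```

For instance, `read_pipeline_config({"FREQUENCY": "12", "TEST": "true"})` gives a config with `frequency_hours=12` and `test=True`, leaving the other fields at their assumed defaults.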
- db: PostgreSQL database for the reduced package data
- alembic: handles migrations
- package_managers: fetches and writes data for each package manager
- api: a simple REST API for reading from the db
Stuff happens. To start over:
- rm -rf ./data: removes all the data the fetcher has written.
Our goal is to build a data schema that looks like this:
You can read more about specific data models in the db README.
Our specific application extracts the dependency graph to understand which pieces of the open-source graph are critical. We also built a simple example that displays sbom-metadata for your repository.
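One simple proxy for how critical a package is in a dependency graph is the number of packages that depend on it, directly or indirectly. The sketch below computes that on a toy graph. The package names and edges are invented for illustration, and this is not necessarily the ranking method our application uses.

```python
from collections import defaultdict


def transitive_dependents(edges):
    """Map each package to the set of packages that depend on it,
    directly or indirectly.

    `edges` is an iterable of (package, dependency) pairs.
    """
    reverse = defaultdict(set)  # dependency -> direct dependents
    nodes = set()
    for package, dependency in edges:
        reverse[dependency].add(package)
        nodes.update((package, dependency))

    result = {}
    for node in nodes:
        seen = set()
        stack = [node]
        # Depth-first walk over reverse edges.
        while stack:
            current = stack.pop()
            for dependent in reverse.get(current, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    stack.append(dependent)
        seen.discard(node)  # do not count a package as its own dependent
        result[node] = seen
    return result


# Hypothetical toy graph: (package, dependency)
edges = [("app", "web"), ("web", "http"), ("cli", "http"), ("http", "bytes")]
dependents = transitive_dependents(edges)

# Most depended-upon first; ties broken alphabetically.
ranking = sorted(dependents, key=lambda name: (-len(dependents[name]), name))
```

In this toy graph `bytes` ranks first because every other package reaches it through `http`.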
There are many other potential use cases for this data:
- License compatibility checker
- Developer publications
- Package popularity
- Dependency analysis vulnerability tool (requires translating semver)
Tip
Help us add the above to the examples folder.
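The semver translation mentioned for the dependency analysis use case could start from something like the sketch below. It expands a Cargo-style caret requirement into an explicit version range. The function name `caret_range` is hypothetical. It only handles full MAJOR.MINOR.PATCH versions and ignores pre-release tags. Other ecosystems define their range operators differently.

```python
from typing import Tuple


def caret_range(requirement: str) -> Tuple[str, str]:
    """Return (inclusive lower bound, exclusive upper bound) for a
    Cargo-style caret requirement such as "^1.2.3"."""
    major, minor, patch = (int(p) for p in requirement.lstrip("^").split("."))
    # The leftmost non-zero component is the one allowed to stay fixed.
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    lower = (major, minor, patch)
    return ".".join(map(str, lower)), ".".join(map(str, upper))
```

With both bounds explicit, checking whether a published version falls inside a dependency requirement becomes a plain tuple comparison.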
- The database URL is postgresql://postgres:s3cr3t@localhost:5435/chai, and is used as CHAI_DATABASE_URL in the environment.
- psql "$CHAI_DATABASE_URL" will connect you to the database.
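If you want to hand the connection details to a client library, a minimal sketch of splitting the URL into its parts with the Python standard library follows. The helper name `database_settings` and the fallback constant `DEFAULT_URL` are hypothetical. The fallback is the local development URL quoted above.

```python
import os
from urllib.parse import urlsplit

# Local development URL from this README; CHAI_DATABASE_URL overrides it when set.
DEFAULT_URL = "postgresql://postgres:s3cr3t@localhost:5435/chai"


def database_settings(url: str) -> dict:
    """Split a PostgreSQL connection URL into its components."""
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "dbname": parts.path.lstrip("/"),
    }


settings = database_settings(os.environ.get("CHAI_DATABASE_URL", DEFAULT_URL))
```

Note that the port is 5435 rather than the PostgreSQL default of 5432, so clients that assume the default need it set explicitly.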
These are tasks that can be run using xcfile.dev. If you use pkgx, typing dev loads the environment. Alternatively, run them manually.
rm -rf db/data data .venv
docker compose build
Requires: build
docker compose up -d
Env: TEST=true
Env: DEBUG=true
docker compose up
Requires: build
Env: TEST=true
Env: DEBUG=true
docker compose up
docker compose down
docker compose logs
Requires: stop
rm -rf db/data
Inputs: MIGRATION_NAME
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic revision --autogenerate -m "$MIGRATION_NAME"
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic upgrade head
Inputs: STEP
Env: CHAI_DATABASE_URL=postgresql://postgres:s3cr3t@localhost:5435/chai
cd alembic
alembic downgrade -$STEP
psql "postgresql://postgres:s3cr3t@localhost:5435/chai"
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT count(id) FROM packages;"
psql "postgresql://postgres:s3cr3t@localhost:5435/chai" -c "SELECT * FROM load_history;"
Refreshes table knowledge from the db.
docker compose restart api
docker compose down --remove-orphans