Complex workload for demonstrating the GLACIATION project.
This test workload runs as a Spark application.
The workload repeats the same action for a specified duration. During each iteration of the loop, a dataset is read from the source bucket in MinIO, the sum of the values in each column is calculated, and the result is written to the target bucket.

The dataset folder contains an example dataset, example.csv, taken from the Machinery Fault Dataset. The dataset should be a table of numerical data without headers.
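The per-iteration computation can be sketched in plain Python. This is only an illustration using the standard csv module on a headerless numeric table, not the actual Spark code from the repository:

```python
import csv
import io


def column_sums(csv_text: str) -> list[float]:
    """Sum each column of a headerless CSV table of numbers."""
    reader = csv.reader(io.StringIO(csv_text))
    sums: list[float] = []
    for row in reader:
        if not sums:
            # First row fixes the number of columns.
            sums = [0.0] * len(row)
        for i, value in enumerate(row):
            sums[i] += float(value)
    return sums


# A tiny headerless table, standing in for example.csv:
print(column_sums("1,2\n3,4\n5,6\n"))  # -> [9.0, 12.0]
```

In the real workload the same aggregation runs on a Spark DataFrame, so it scales across the configured executors.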
The repository contains Helm charts for running the Spark application in Kubernetes:

```shell
helm install complex-workload <helm-chart> \
  --set arguments.time=<duration of work in seconds, 3600 by default> \
  --set arguments.minioHost=<URL of the MinIO API, localhost:9000 by default> \
  --set arguments.minioAccessKey=<your_access_key> \
  --set arguments.minioSecretKey=<your_secret_key> \
  --set arguments.sourceBucket=<name of the source bucket, "source" by default> \
  --set arguments.targetBucket=<name of the target bucket, "target" by default> \
  --set arguments.datasetName=<name of the dataset file, "example.csv" by default> \
  --set spark.driver.cores=<CPU cores for the Spark driver, 1 by default> \
  --set spark.driver.coreLimit=<CPU core limit, "1200m" by default> \
  --set spark.driver.memory=<RAM for the Spark driver, "512m" by default> \
  --set spark.driver.serviceAccount=<Kubernetes service account, "spark-operator" by default> \
  --set spark.executor.cores=<CPU cores per Spark executor, 1 by default> \
  --set spark.executor.instances=<number of Spark executor instances, 3 by default> \
  --set spark.executor.memory=<RAM per Spark executor, "512m" by default>
```
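Instead of repeating --set flags, the same settings can be kept in a values file. This is a sketch whose keys mirror the flags above, with the documented defaults filled in:

```yaml
# my-values.yaml -- pass with: helm install complex-workload <helm-chart> -f my-values.yaml
arguments:
  time: 3600
  minioHost: localhost:9000
  minioAccessKey: <your_access_key>
  minioSecretKey: <your_secret_key>
  sourceBucket: source
  targetBucket: target
  datasetName: example.csv
spark:
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    serviceAccount: spark-operator
  executor:
    cores: 1
    instances: 3
    memory: "512m"
```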
Example:

```shell
helm repo add complex-workload-repo https://glaciation-heu.github.io/complex-workload/helm-charts/
helm repo update
helm search repo complex-workload
helm install complex-workload complex-workload-repo/complex-workload \
  --set arguments.minioAccessKey=<your_access_key> \
  --set arguments.minioSecretKey=<your_secret_key>
```
- Python 3.10+ (see detailed instructions below)
- pipx
- Poetry
- Docker
- Helm
- Minikube
- MinIO
- spark-operator
- If you don't have Python installed, follow the steps below.

These instructions are for Ubuntu 22.04 and may not work for other versions. They also assume Poetry is used with a pyenv-managed (non-system) Python.
Before we install pyenv, we need to update our package lists for upgrades and new package installations. We also need to install dependencies for pyenv.
Open your terminal and type:
```shell
sudo apt-get update
sudo apt-get install -y make build-essential libssl-dev zlib1g-dev libbz2-dev \
  libreadline-dev libsqlite3-dev wget curl llvm libncursesw5-dev xz-utils \
  tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
```
We will clone pyenv from the official GitHub repository and add it to our system path.
```shell
git clone https://github.com/pyenv/pyenv.git ~/.pyenv
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
exec "$SHELL"
```
For additional information, visit the official pyenv docs.
Now that pyenv is installed, we can install different Python versions. To install Python 3.12, use the following command:
```shell
pyenv install 3.12
```
Connect Poetry to the installed Python. Run this in the project directory; PyCharm will pick the environment up automatically later:

```shell
poetry env use ~/.pyenv/versions/3.12.1/bin/python
```

(Change the version number according to what is installed.)

Finally, verify that Poetry is indeed connected to the proper version:

```shell
poetry env info
```
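As an extra sanity check, you can ask the interpreter inside the environment which version it resolved to (run it with poetry run python):

```python
import sys

# The first three components of the running interpreter's version.
version = tuple(sys.version_info[:3])
print("Python", ".".join(map(str, version)))
```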
- If you don't have Poetry installed, run:

  ```shell
  pipx install poetry
  ```
- Install dependencies:

  ```shell
  poetry config virtualenvs.in-project true
  poetry install --no-root --with dev
  ```
- Install pre-commit hooks:

  ```shell
  poetry run pre-commit install
  ```
- Start Minikube:

  ```shell
  minikube start --cpus 4 --memory 8192
  ```
- Create a minio-dev.yaml file containing the following Kubernetes resources:
  ```yaml
  apiVersion: v1
  kind: Pod
  metadata:
    labels:
      app: minio
    name: minio
    namespace: default
  spec:
    containers:
      - name: minio
        image: quay.io/minio/minio:latest
        command:
          - /bin/bash
          - -c
        args:
          - minio server /mnt/disk1/minio-data --console-address :9090
        volumeMounts:
          - mountPath: /mnt/disk1/minio-data
            name: localvolume
    nodeSelector:
      kubernetes.io/hostname: minikube
    volumes:
      - name: localvolume
        hostPath:
          path: /mnt/disk1/minio-data
          type: DirectoryOrCreate
  ```
- Apply minio-dev.yaml:

  ```shell
  kubectl apply -f minio-dev.yaml
  ```

- Forward ports:

  ```shell
  kubectl port-forward minio 9000:9000
  kubectl port-forward minio 9090:9090
  ```
- Open the MinIO UI (login minioadmin, password minioadmin) and:
  - Create access and secret keys on the Access Keys page;
  - Create the source and target buckets on the Buckets page;
  - Upload the dataset dataset/example.csv to the source bucket.
- Deploy the spark-operator:

  ```shell
  git clone https://github.com/kubeflow/spark-operator
  kubectl create serviceaccount spark
  kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
  helm install spark-operator spark-operator/charts/spark-operator-chart
  ```
- Build the Docker image:

  ```shell
  docker build . -t complex-workload:latest
  ```

- Load the Docker image into Minikube:

  ```shell
  minikube image load complex-workload:latest
  ```
- Deploy the job:

  ```shell
  helm install complex-workload ./charts/complex-workload \
    --version 0.1.0 \
    --set image.repository=complex-workload \
    --set image.tag=latest \
    --set arguments.time=60 \
    --set arguments.minioHost=<minio_host> \
    --set arguments.minioAccessKey=<your_access_key> \
    --set arguments.minioSecretKey=<your_secret_key>
  ```
- Delete the complex-workload and the spark-operator:

  ```shell
  helm delete complex-workload
  helm delete spark-operator
  ```
To create a release, add a Git tag in the format a.a.a, where each a is an integer:

```shell
git tag 0.1.0
git push origin 0.1.0
```
The release version for branches, pull requests, and other tags is generated from the last tag of the form a.a.a.

The Helm chart version is updated automatically when a new release is created; the chart version equals the release version.

GitHub Actions triggers testing and builds for each release.
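Before pushing, the a.a.a tag format can be checked locally. A minimal sketch (a hypothetical helper, not part of the repository's tooling):

```python
import re

# Release tags must be three dot-separated integers, e.g. "0.1.0".
TAG_RE = re.compile(r"^\d+\.\d+\.\d+$")


def is_release_tag(tag: str) -> bool:
    """Return True if the tag matches the a.a.a release format."""
    return bool(TAG_RE.fullmatch(tag))


print(is_release_tag("0.1.0"))   # -> True
print(is_release_tag("v0.1.0"))  # -> False
```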
Initial setup

Create the gh-pages branch and use it as a GitHub Pages branch.

After execution:

- The index.yaml file containing the list of Helm charts will be available at https://glaciation-heu.github.io/complex-workload/helm-charts/index.yaml.
- The Docker image will be available at https://github.com/orgs/glaciation-heu/packages?repo_name=complex-workload.
HIRO uses, and requires from its partners, the GitFlow-with-forks workflow.