Running InstructLab Pipeline with Data Science Pipelines on RHOAI

This file provides step-by-step instructions for setting up and using the Data Science Pipelines (DSP) for InstructLab iterations.

Pre-requisites

  • An OpenShift cluster with

    • Sufficient GPUs available for training.
      • At a minimum, a node with at least 4 GPUs, such as NVIDIA A100s
    • The following Operators already installed:
      • Red Hat - Authorino
      • Red Hat OpenShift Serverless
      • Red Hat OpenShift Service Mesh v2
        • NOTE: v3 is not compatible with RHOAI
      • Red Hat OpenShift AI and its operator dependencies, documented at OpenShift AI Supported Configurations
  • Teacher and Judge models with a serving endpoint

    • If these are already set up, you will need the endpoint, API key, and any CA bundles (if required) for each model
    • If setting up your own using these instructions, you will need additional multi-node A100s or L40s for each model
  • SDG taxonomy tree to utilize for Synthetic Data Generation (SDG)

    • See the instructions for creating a taxonomy tree to set up your own taxonomy tree.
  • An OpenShift AI installation, with the Training Operator and KServe components set to Managed

    • A data science project/namespace, in this document this will be referred to as <data-science-project-name/namespace>
  • A StorageClass that supports dynamic provisioning with ReadWriteMany access mode (see step 3 below).

  • An S3 object store such as AWS S3, or an alternative S3-compatible object storage solution such as Ceph, NooBaa, or MinIO.

  • A locally installed oc command line tool to create and manage Kubernetes resources.

  • The ilab CLI (or Skopeo, ORAS, etc.) for model downloads

  • For Disconnected Clusters:

    • Mirror Required Images:

      In a disconnected environment, you must mirror the following container images to your internal registry before running the pipeline. Use tools such as oc adm release mirror, skopeo, or oras to mirror these images (see the skopeo example after this list):

      • registry.redhat.io/ubi9/toolbox@sha256:da31dee8904a535d12689346e65e5b00d11a6179abf1fa69b548dbd755fa2770
      • registry.redhat.io/openshift4/ose-cli@sha256:1d5c8442a6ec745e6ae44a7738c0681f1e21aac8be76ba826c2ddf2eed8475db
      • registry.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:b3dc9af0244aa6b84e6c3ef53e714a316daaefaae67e28de397cd71ee4b2ac7e
      • registry.redhat.io/rhelai1/skills-adapter-v3@sha256:53dd11a762bb39fc33c15499891309f0cdc8dbfd02abf94c9c60aad643aca255
      • registry.redhat.io/rhelai1/knowledge-adapter-v3@sha256:ef1608ec78d5e39655b505544c0f30a015a6c9cb7e2b2deffe394791f8c76c6f
      • registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1@sha256:bc08e466aa35352a621d0ad221c2e247ff9751f4cb6cffe00d5894ce6bfd3fd7
      • registry.redhat.io/rhelai1/prometheus-8x7b-v2-0@sha256:9fcb49c230f6e73ff944377307bb83a05ae3ac20300af75e429151f4f8bf4285
      • quay.io/modh/odh-generic-data-science-notebook@sha256:7c1a4ca213b71d342a2d1366171304e469da06d5f15710fab5dd3ce013aa1b73
      • quay.io/modh/vllm@sha256:3c56d4c2a5a9565e8b07ba17a6624290c4fb39ac9097b99b946326c09a8b40c8
      • quay.io/modh/vllm@sha256:97b91f9bd71202f5de8d379cfb61baec887b47f836a2ff8b158c946196de5660
      • quay.io/opendatahub/workbench-images@sha256:7f26f5f2bec4184af15acd95f29b3450526c5c28c386b6cb694fbe82d71d0b41
      • ghcr.io/oras-project/oras:main@sha256:8859e7e3ae510fb921ebeb109ac9d3e3bb91799e0d52001ae456df33929029db
    • 500GB PersistentVolumeClaim (PVC) for Mixtral:

      The proposed method to deploy Mixtral requires a 500GB PVC.

      • In a disconnected cluster, ensure that your OpenShift environment has sufficient storage capacity and a StorageClass configured to provision this PVC.
      • If automatic PVC creation fails, you may need to manually create a PersistentVolume (PV) and bind it to a PVC.
    • Accessible git repository with the taxonomy:

      • The iLab pipeline uses a taxonomy git repository, which must be accessible from the disconnected cluster.
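For example, one of the listed images can be mirrored with skopeo as shown below; registry.example.com is a placeholder for your internal registry, and the same command pattern applies to each image in the list:

skopeo copy --all --preserve-digests \
  docker://registry.redhat.io/ubi9/toolbox@sha256:da31dee8904a535d12689346e65e5b00d11a6179abf1fa69b548dbd755fa2770 \
  docker://registry.example.com/ubi9/toolbox:latest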

Steps

Before running the training and evaluation steps, we must complete the following:

  1. Prepare base model and push to object store
  2. Setting up Judge & Teacher model
  3. Setup NFS StorageClass (Optional)
  4. Set Up Data Science Pipelines Server and Run InstructLab Pipeline

Prepare base model and push to object store

The ilab pipeline needs a base model to train on, so to begin, upload the granite-7b-starter model to your object store.

$ mkdir -p s3-data/

Download the model repository with ilab and copy it into the s3-data directory

# You can also use Oras or Skopeo cli tools to download the model
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
$ ilab model download --repository docker://registry.redhat.io/rhelai1/granite-7b-starter --release 1.2
$ cp -r <path-to-model-downloaded-dir>/rhelai1/granite-7b-starter s3-data/granite-7b-starter

Generate tar archive

$ cd s3-data
$ tar -czvf rhelai.tar.gz *

Upload the created tar archive to your object store.

# Default cache location for ilab model download is ~/.cache/instructlab/models
# The model should be copied in such a way that the *.safetensors files are found in s3://<your-bucket-name>/granite-7b-starter/*.safetensors
s3cmd sync s3-data/granite-7b-starter s3://<your-bucket-name>/granite-7b-starter
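To verify the upload, you can list the prefix in your bucket (this assumes s3cmd is already configured for your object store):

s3cmd ls s3://<your-bucket-name>/granite-7b-starter/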

Setting up Judge & Teacher model

The Teacher model is used for Synthetic Data Generation (SDG) while the Judge model is used for model evaluation.

For the Teacher model you need mixtral-8x7b-instruct-v0-1 deployed with skills-adapter-v3:1.2 and knowledge-adapter-v3:1.2 LoRA layered skills and knowledge adapters.

For the Judge model you will need the prometheus-8x7b-v2-0 model.

If you already have these models deployed, you can skip the deployment steps and go straight to the secret setup for the Judge and Teacher models respectively.

Deploy a judge model server (optional)

Create a service account to be used for token authentication

apiVersion: v1
kind: ServiceAccount
metadata:
  name: judge-sa
  namespace: <data-science-project-name/namespace>
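If you saved the ServiceAccount manifest above to a file (for example, judge-sa.yaml), apply it with oc:

oc -n <data-science-project-name/namespace> apply -f judge-sa.yaml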

Upload prometheus-8x7b-v2-0 model (Judge-Model) to the same object storage as before.

For example, using ilab to download and s3cmd to sync to the object store:

# You can also use Oras or Skopeo cli tools to download the model
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
ilab model download --repository docker://registry.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.2

# Default cache location for ilab model download is ~/.cache/instructlab/models
s3cmd sync path/to/model s3://your-bucket-name/judge-model/

Navigate to the OpenShift AI dashboard

  • Choose Data Science Projects from the left hand menu and choose your data science project/namespace.
  • Select the Connections tab, and then click on the Add connection button. Enter the details of your S3 bucket (object store) and click Add data connection.

Note

Before following the next step, ensure that the CapabilityServiceMeshAuthorization condition is True in the DSCInitialization resource.
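One way to check this from the CLI (assuming the default DSCInitialization resource name, default-dsci):

oc get dscinitialization default-dsci -o jsonpath='{.status.conditions[?(@.type=="CapabilityServiceMeshAuthorization")].status}'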

Create a model server instance

  • Navigate back to Data Science Projects page, select your namespace again, and then select the Models tab
  • On the right hand side select Deploy model under Single-model serving platform
  • Under Serving runtime, choose the serving runtime vLLM Serving Runtime for KServe.
  • Check the Make deployed models available through an external route box.
  • Under Token authentication, check the Require token authentication box and enter the name of the service account created above (judge-sa).
  • Choose the existing data connection created earlier.
  • Click Deploy.

Deploy judge model serving details

Create a secret containing the judge model serving details

apiVersion: v1
kind: Secret
metadata:
  name: <judge-model-details-k8s-secret>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  JUDGE_NAME: <judge-model-name>                              # Name of the judge model or deployment
  JUDGE_ENDPOINT: <judge-model-endpoint>                      # Model serving endpoint, Sample format - `https://<deployed-model-server-endpoint>/v1`
  JUDGE_API_KEY: <judge-model-api-key>                        # Deployed model-server auth token
  JUDGE_CA_CERT: <judge-model-ca-cert-config-map-name>        # Configmap containing CA cert for the judge model (optional - required if using custom CA cert), Example - `kube-root-ca.crt`
  JUDGE_CA_CERT_CM_KEY: <judge-model-ca-cert-config-map-key>  # Name of key inside configmap (optional - required if using custom CA cert), Example - `ca.crt`

Note

If using a custom CA certificate, you must provide the relevant data in a ConfigMap. The ConfigMap name and key are then provided as parameters to the pipeline as well as in the judge-serving-details secret above.

If you deployed the Judge model server using the optional instructions above, then you can retrieve the JUDGE_API_KEY by running the following command:

JUDGE_API_KEY=$(oc -n <data-science-project-name/namespace> create token judge-sa)
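As an alternative to applying the YAML above, the secret can be created directly from the CLI. The secret name judge-serving-details is only an example; substitute your own values:

oc -n <data-science-project-name/namespace> create secret generic judge-serving-details \
  --from-literal=JUDGE_NAME=<judge-model-name> \
  --from-literal=JUDGE_ENDPOINT=https://<deployed-model-server-endpoint>/v1 \
  --from-literal=JUDGE_API_KEY="$JUDGE_API_KEY"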

Deploy a teacher model server (Optional)

Unlike the Judge model, the Teacher model has to be deployed manually on RHOAI; this consists of creating the Kubernetes resources with oc.

First, upload the Teacher model to s3 if it does not already exist there:

# You can also use ORAS or Skopeo cli tools to download the model
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.2

# Default cache location for ilab model download is ~/.cache/instructlab/models
# The model should be copied in such a way that the *.safetensors are found in s3://your-bucket-name/teach-model/*.safetensors
s3cmd sync path/to/model s3://your-bucket-name/teach-model/

Deploy the following YAML, called pre_requisites.yaml, to the <data-science-project-name/namespace> namespace:

pre_requisites.yaml
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: mixtral-sa
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mixtral-view-role
  labels:
    opendatahub.io/dashboard: 'true'
rules:
  - verbs:
      - get
    apiGroups:
      - serving.kserve.io
    resources:
      - inferenceservices
    resourceNames:
      - mixtral
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mixtral-view
  labels:
    opendatahub.io/dashboard: 'true'
subjects:
  - kind: ServiceAccount
    name: mixtral-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: mixtral-view-role
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mixtral-serving-ilab
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
  storageClassName: standard-csi
  volumeMode: Filesystem
Apply the file:

oc -n <data-science-project-name/namespace> apply -f pre_requisites.yaml
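To verify that the resources were created (the names below come from pre_requisites.yaml):

oc -n <data-science-project-name/namespace> get serviceaccount mixtral-sa
oc -n <data-science-project-name/namespace> get pvc mixtral-serving-ilab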

You will need to ensure that the storage-config secret exists in the <data-science-project-name/namespace> namespace and that it contains the configuration for the bucket where the teacher model is stored.

apiVersion: v1
stringData:
  aws-connection-my-bucket: |
    {
      "type": "s3",
      "access_key_id": "your_accesskey",
      "secret_access_key": "your_secretkey",
      "endpoint_url": "https://s3-us-east-2.amazonaws.com",
      "bucket": "mybucket",
      "default_bucket": "mybucket",
      "region": "us-east-2"
    }
kind: Secret
metadata:
  name: storage-config
type: Opaque

If this secret does not exist in this namespace, then create it. If it does exist, then ensure there is an entry for the bucket that stores the teacher model. The key is used in the InferenceService spec below.
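You can check whether the secret already exists and inspect its entries with:

oc -n <data-science-project-name/namespace> get secret storage-config -o yaml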

Next we need to create the custom ServingRuntime and InferenceService.

Similar to above, deploy the following yaml files to the namespace <data-science-project-name/namespace>

You will need to update the spec.model.storage.path in the InferenceService to match the path where the model files are stored in your bucket. The key should match the value in your storage-config secret that has the bucket credentials. In our example above we use aws-connection-my-bucket.

servingruntime.mixtral.yaml
---
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/accelerator-name: migrated-gpu
    opendatahub.io/apiProtocol: REST
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    opendatahub.io/template-display-name: Mixtral ServingRuntime
    opendatahub.io/template-name: vllm-runtime
    openshift.io/display-name: mixtral
  labels:
    opendatahub.io/dashboard: "true"
  name: mixtral
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
  containers:
  - args:
    - --port=8080
    - --model=/mnt/models
    - --served-model-name={{.Name}}
    - --distributed-executor-backend=mp
    command:
    - python
    - -m
    - vllm.entrypoints.openai.api_server
    env:
    - name: HF_HOME
      value: /tmp/hf_home
    image: quay.io/modh/vllm@sha256:3c56d4c2a5a9565e8b07ba17a6624290c4fb39ac9097b99b946326c09a8b40c8
    name: kserve-container
    ports:
    - containerPort: 8080
      protocol: TCP
    volumeMounts:
    - mountPath: /dev/shm
      name: shm
    - mountPath: /mnt
      name: mixtral-serve
  multiModel: false
  storageHelper:
    disabled: true
  supportedModelFormats:
  - autoSelect: true
    name: vLLM
  volumes:
  - name: mixtral-serve
    persistentVolumeClaim:
      claimName: mixtral-serving-ilab
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: shm
inferenceservice.mixtral.yaml
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: mixtral
    security.opendatahub.io/enable-auth: "true"
    serving.knative.openshift.io/enablePassthrough: "true"
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/rewriteAppHTTPProbers: "true"
  finalizers:
  - inferenceservice.finalizers
  labels:
    opendatahub.io/dashboard: "true"
  name: mixtral
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      args:
      - --dtype=bfloat16
      - --tensor-parallel-size=4
      - --enable-lora
      - --max-lora-rank=64
      - --lora-dtype=bfloat16
      - --fully-sharded-loras
      - --lora-modules
      - skill-classifier-v3-clm=/mnt/models/skills
      - text-classifier-knowledge-v3-clm=/mnt/models/knowledge
      modelFormat:
        name: vLLM
      name: ""
      resources:
        limits:
          cpu: "4"
          memory: 60Gi
          nvidia.com/gpu: "4"
        requests:
          cpu: "4"
          memory: 60Gi
          nvidia.com/gpu: "4"
      runtime: mixtral
      storage:
        # the secret name of the secret deployed earlier
        key: aws-connection-my-bucket
        # update this to match the path in your bucket
        path: <prefix-path-to-mixtral-model-in-s3>
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
Apply both files:

oc -n <data-science-project-name/namespace> apply -f servingruntime.mixtral.yaml
oc -n <data-science-project-name/namespace> apply -f inferenceservice.mixtral.yaml

A new pod named mixtral-predictor-0000#-deployment-<hash> should be created and reach the Running state. If the pod does not come up successfully, you can inspect the .status field of the InferenceService for issues.

oc -n <data-science-project-name/namespace> get inferenceservice mixtral -o yaml
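For example, the following commands check the Ready condition and list the predictor pods; KServe labels the pods it creates with the InferenceService name:

oc -n <data-science-project-name/namespace> get inferenceservice mixtral -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
oc -n <data-science-project-name/namespace> get pods -l serving.kserve.io/inferenceservice=mixtral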

Deploy teacher model serving details

Create a secret containing the Teacher model serving details

apiVersion: v1
kind: Secret
metadata:
  name: <teacher-model-details-k8s-secret>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  api_key:  <teacher-model-api-key>                      # Deployed model-server auth token
  endpoint: <teacher-model-endpoint>                     # Model serving endpoint, Sample format - `https://<deployed-model-server-endpoint>/v1`
  model: <teacher-model-name>         # Name of the teacher model or deployment
  SDG_CA_CERT:  <teacher-model-ca-config-map-name>       # Configmap containing CA cert for the teacher model (optional - required if using custom CA cert), Example - `kube-root-ca.crt`
  SDG_CA_CERT_CM_KEY: <teacher-model-ca-config-map-key>  # Name of key inside configmap (optional - required if using custom CA cert), Example - `ca.crt`

Note

If using a custom CA certificate, you must provide the relevant data in a ConfigMap. The ConfigMap name and key are then provided as parameters to the pipeline as well as in the teacher-model-details-k8s-secret secret above.

If you deployed the Teacher model server using the optional instructions above, then you can retrieve the api_key by running the following command:

SDG_API_KEY=$(oc -n <data-science-project-name/namespace> create token mixtral-sa)
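As with the Judge model, the Teacher serving details secret can also be created directly from the CLI. The secret name teacher-serving-details is only an example; substitute your own values:

oc -n <data-science-project-name/namespace> create secret generic teacher-serving-details \
  --from-literal=api_key="$SDG_API_KEY" \
  --from-literal=endpoint=https://<deployed-model-server-endpoint>/v1 \
  --from-literal=model=<teacher-model-name>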

(Optional) - Setup NFS StorageClass

Caution

The image provided here is for test purposes only. Users must provide a production-ready StorageClass with ReadWriteMany capability.

This step is only needed when the cluster does not have a storage provisioner capable of provisioning PersistentVolumeClaims with ReadWriteMany capability.

Installing the NFS CSI driver

$ curl -skSL https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/v4.9.0/deploy/install-driver.sh | bash -s v4.9.0 --

For deploying an in-cluster NFS server, apply nfs-server-deployment.yaml file

oc new-project nfs
oc apply -f ./standalone/nfs-server-deployment.yaml

Note

Check the root PersistentVolumeClaim that will be created and the requested storage.

For creating NFS storage-class, apply nfs-storage-class.yaml file

oc apply -f ./standalone/nfs-storage-class.yaml
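To confirm the driver and the new StorageClass are in place (the install script deploys the driver into the kube-system namespace by default):

oc -n kube-system get pods -l app=csi-nfs-controller
oc get storageclass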

Accelerator Profile:

An accelerator profile must also be defined within the RHOAI dashboard or via the CLI to enable GPU acceleration for model serving with KServe.

apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: gpu
  namespace: redhat-ods-applications
spec:
  displayName: gpu
  enabled: true
  identifier: nvidia.com/gpu
  tolerations: []
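If you prefer the CLI over the dashboard, save the manifest to a file (for example, accelerator-profile.yaml) and apply it:

oc apply -f accelerator-profile.yaml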

Signed Certificate:

Using a signed certificate avoids unnecessary TLS issues when running the training pipeline.

To deploy a signed certificate in your cluster follow trusted cluster cert documentation.

This will create the required resources in the cluster, including the required StorageClass.

Set Up Data Science Pipelines Server and Run InstructLab Pipeline

Now we can continue to set up the required resources in our cluster.

The following resources will be created:

  1. Secret
  2. ClusterRole
  3. ClusterRoleBinding
  4. Pod

Create a secret resource that contains the credentials for your Object Storage (AWS S3 Bucket)

apiVersion: v1
kind: Secret
metadata:
  name: sdg-object-store-credentials
type: Opaque
stringData:
  bucket: <s3-bucket-name>             # The object store bucket containing SDG+Model+Taxonomy data. (Name of S3 bucket)
  access_key: <s3-access-key>          # The object store access key (AWS Access key ID)
  secret_key: <s3-secret-key>          # The object store secret key (AWS Secret Access Key)
  data_key: <s3-path-to-teacher-model-files>                  # The name of the tarball that contains SDG data.
  endpoint: <s3-endpoint>              # The object store endpoint
  region: <s3-region>                  # The region for the object store.
  verify_tls: "true"                   # Verify TLS for the object store.

Apply the yaml file to the cluster
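For example, if the secret manifest above is saved as sdg-object-store-credentials.yaml:

oc -n <data-science-project-name/namespace> apply -f sdg-object-store-credentials.yaml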

Create a ServiceAccount, ClusterRole and ClusterRoleBinding

Grant the service account that runs the pipeline access to the related resources it needs to read and manipulate.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  namespace: <data-science-project-name/namespace>
  name: secret-access-role
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "configmaps", "persistentvolumeclaims", "secrets","events"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]

  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "watch"]

  - apiGroups: ["kubeflow.org"]
    resources: ["pytorchjobs"]
    verbs: ["get", "list", "create", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: secret-access-binding
subjects:
  - kind: ServiceAccount
    name: <workbench-service-account-name> # created above in Step-2
    namespace: <data-science-project-name/namespace>
roleRef:
  kind: ClusterRole
  name: secret-access-role
  apiGroup: rbac.authorization.k8s.io

This is the RBAC configuration required by the ServiceAccount that runs the pipeline.

Apply the YAML to the cluster.
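For example, if the RBAC manifests above are saved as rbac.yaml:

oc apply -f rbac.yaml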

Configure Data Science Pipeline Server:

From within the RHOAI dashboard, navigate to the Data Science Pipelines page and click Configure pipeline server. This presents a form where you can enter the credentials for your S3 bucket (object store).

Run the Pipeline

Now that all the cluster requirements have been set up, we are ready to upload and run our InstructLab pipeline!

Upload the Pipeline:

Now we can go back to our RHOAI Data Science Pipelines dashboard and select Import pipeline. We recommend importing the pipeline YAML directly from the GitHub repo using: https://raw.githubusercontent.com/opendatahub-io/ilab-on-ocp/refs/heads/main/pipeline.yaml

Note: When using a disconnected cluster, the pipeline should be imported from a file instead of a URL. Using a URL fails in disconnected clusters because they cannot connect to GitHub.

Create a Run:

Once the pipeline is uploaded we will be able to select Create run from the Actions dropdown. This will present us with a number of parameters we can set to customize our run. Click Create run at the bottom of the page to kick off your InstructLab pipeline.

Available Pipeline Parameters:

Parameter Definition
sdg_repo_url SDG parameter. Points to a taxonomy git repository
sdg_repo_branch SDG parameter. Points to a branch within the taxonomy git repository. If set, has priority over sdg_repo_pr
sdg_repo_pr SDG parameter. Points to a pull request against the taxonomy git repository
sdg_base_model SDG parameter. LLM model used to generate the synthetic dataset
sdg_scale_factor SDG parameter. The total number of instructions to be generated
sdg_pipeline SDG parameter. Data generation pipeline to use. Available: 'simple', 'full', or a valid path to a directory of pipeline workflow YAML files. Note that 'full' requires a larger teacher model, Mixtral-8x7b.
sdg_max_batch_len SDG parameter. Maximum tokens per GPU for each batch that will be handled in a single step.
sdg_sample_size SDG parameter. Sampling size used for Synthetic Data Generation
train_nproc_per_node Training parameter. Number of GPUs per each node/worker to use for training.
train_nnodes Training parameter. Number of nodes/workers to train on.
train_num_epochs_phase_1 Training parameter for Phase 1. Number of epochs to run training.
train_num_epochs_phase_2 Training parameter for Phase 2. Number of epochs to run training.
train_effective_batch_size_phase_1 Training parameter for Phase 1. The number of samples in a batch that the model should see before its parameters are updated.
train_effective_batch_size_phase_2 Training parameter for Phase 2. The number of samples in a batch that the model should see before its parameters are updated.
train_learning_rate_phase_1 Training parameter for Phase 1. How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size.
train_learning_rate_phase_2 Training parameter for Phase 2. How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size.
train_num_warmup_steps_phase_1 Training parameter for Phase 1. The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to train_learning_rate.
train_num_warmup_steps_phase_2 Training parameter for Phase 2. The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to train_learning_rate.
train_save_samples Training parameter. Number of samples the model should see before saving a checkpoint.
train_max_batch_len Training parameter. Maximum tokens per GPU for each batch that will be handled in a single step.
train_seed Training parameter. Random seed for initializing training.
mt_bench_max_workers MT Bench parameter. Number of workers to use for evaluation with mt_bench or mt_bench_branch. Must be a positive integer or 'auto'.
mt_bench_merge_system_user_message MT Bench parameter. Boolean indicating whether to merge system and user messages (required for Mistral based judges)
final_eval_max_workers Final model evaluation parameter for MT Bench Branch. Number of workers to use for evaluation with mt_bench or mt_bench_branch. Must be a positive integer or 'auto'.
final_eval_few_shots Final model evaluation parameter for MMLU. Number of question-answer pairs provided in the context preceding the question used for evaluation.
final_eval_batch_size Final model evaluation parameter for MMLU. Batch size for evaluation. Valid values are a positive integer or 'auto' to select the largest batch size that will fit in memory.
final_eval_merge_system_user_message Final model evaluation parameter for MT Bench Branch. Boolean indicating whether to merge system and user messages (required for Mistral based judges)
k8s_storage_class_name A Kubernetes StorageClass name for persistent volumes. Selected StorageClass must support RWX PersistentVolumes.
Suggested Parameters: Full Pipeline

To run the ilab Pipeline at full capabilities, we suggest using these values:

Parameter Suggested Value
sdg_repo_url https://github.com/instructlab/taxonomy.git
sdg_repo_branch ""
sdg_repo_pr 0
sdg_base_model s3:///<PATH_TO_MODEL>
sdg_scale_factor 30
sdg_pipeline "full"
sdg_max_batch_len 5000
sdg_sample_size 1.0
train_nproc_per_node 2
train_nnodes 2
train_num_epochs_phase_1 7
train_num_epochs_phase_2 10
train_effective_batch_size_phase_1 128
train_effective_batch_size_phase_2 3840
train_learning_rate_phase_1 2e-05
train_learning_rate_phase_2 6e-06
train_num_warmup_steps_phase_1 1000
train_num_warmup_steps_phase_2 1000
train_save_samples 250000
train_max_batch_len 5000
train_seed 42
mt_bench_max_workers "auto"
mt_bench_merge_system_user_message False
final_eval_max_workers "auto"
final_eval_few_shots 5
final_eval_batch_size "auto"
final_eval_merge_system_user_message False
k8s_storage_class_name standard
Note that this will take a very long time, on the scale of double-digit hours of runtime.
Suggested Parameters: Development

Running the ilab pipeline at full capability takes a very long time and consumes a significant amount of resources. To create an e2e run that completes much more quickly (at the expense of output quality) and with fewer resources (namely, GPU nodes), we suggest using these values instead:

Parameter Suggested Value
sdg_repo_url https://github.com/instructlab/taxonomy.git
sdg_repo_branch ""
sdg_repo_pr 0
sdg_base_model s3:///<PATH_TO_MODEL>
sdg_scale_factor 30
sdg_pipeline "simple"
sdg_max_batch_len 5000
sdg_sample_size 0.0002
train_nproc_per_node 1
train_nnodes 1
train_num_epochs_phase_1 2
train_num_epochs_phase_2 2
train_effective_batch_size_phase_1 3840
train_effective_batch_size_phase_2 3840
train_learning_rate_phase_1 .0001
train_learning_rate_phase_2 .0001
train_num_warmup_steps_phase_1 800
train_num_warmup_steps_phase_2 800
train_save_samples 0
train_max_batch_len 20000
train_seed 42
mt_bench_max_workers "auto"
mt_bench_merge_system_user_message False
final_eval_max_workers "auto"
final_eval_few_shots 5
final_eval_batch_size "auto"
final_eval_merge_system_user_message False
k8s_storage_class_name standard

Using these parameters allows a user to run the complete pipeline much more quickly; in testing we have found this to take about 90 minutes. Additionally, the judge-server and teacher-server can point to the same Mistral model, which only uses 1 GPU, and the PyTorchJob configuration specified here also only uses 2 training nodes of 1 GPU each, so a total of 3 GPUs is required, rather than the 8-9 GPUs required for the full pipeline. That said, the output model quality will likely be very poor, and these values should only be used for testing purposes.