Running InstructLab Pipeline with Data Science Pipelines on RHOAI

This file provides step-by-step instructions for setting up and using the Data Science Pipelines (DSP) for InstructLab iterations.

Pre-requisites

  • An OpenShift cluster with

    • Sufficient GPUs available for training.
      • At a minimum, a node with at least 4 GPUs, such as NVIDIA A100s
    • The following Operators already installed:
      • Red Hat - Authorino
      • Red Hat OpenShift Serverless
      • Red Hat OpenShift Service Mesh v2
        • NOTE: v3 is not compatible with RHOAI
      • Red Hat OpenShift AI and its operator dependencies, documented at OpenShift AI Supported Configurations
  • Teacher and Judge models with a serving endpoint

    • If these are already set up, you will need the endpoint, API key, and any CA bundles (if required) for each model
    • If setting up your own using these instructions, you will need additional multi-node A100s or L40s for each model
  • SDG taxonomy tree to utilize for Synthetic Data Generation (SDG)

    • See the instructions for creating a taxonomy tree to set up your own taxonomy tree.
  • An OpenShift AI installation, with the Training Operator and KServe components set to Managed

    • A data science project/namespace, in this document this will be referred to as <data-science-project-name/namespace>
  • A StorageClass that supports dynamic provisioning with ReadWriteMany access mode (see step 3 below).

  • An S3 object store such as AWS S3, or an alternative S3-compatible object storage solution such as Ceph, NooBaa, or MinIO.

  • A locally installed oc command line tool to create and manage Kubernetes resources.

  • The ilab CLI (or Skopeo, ORAS, etc.) for model downloads

  • For Disconnected Clusters:

    • Mirror Required Images:

      In a disconnected environment, you must mirror the following container images to your internal registry before running the pipeline. Use tools such as oc adm release mirror, skopeo, or oras to mirror these images (see the skopeo example after this list):

      • registry.redhat.io/ubi9/toolbox@sha256:da31dee8904a535d12689346e65e5b00d11a6179abf1fa69b548dbd755fa2770
      • registry.redhat.io/openshift4/ose-cli@sha256:1d5c8442a6ec745e6ae44a7738c0681f1e21aac8be76ba826c2ddf2eed8475db
      • registry.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:b3dc9af0244aa6b84e6c3ef53e714a316daaefaae67e28de397cd71ee4b2ac7e
      • registry.redhat.io/rhelai1/skills-adapter-v3@sha256:53dd11a762bb39fc33c15499891309f0cdc8dbfd02abf94c9c60aad643aca255
      • registry.redhat.io/rhelai1/knowledge-adapter-v3@sha256:ef1608ec78d5e39655b505544c0f30a015a6c9cb7e2b2deffe394791f8c76c6f
      • registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1@sha256:bc08e466aa35352a621d0ad221c2e247ff9751f4cb6cffe00d5894ce6bfd3fd7
      • registry.redhat.io/rhelai1/prometheus-8x7b-v2-0@sha256:9fcb49c230f6e73ff944377307bb83a05ae3ac20300af75e429151f4f8bf4285
      • quay.io/modh/odh-generic-data-science-notebook@sha256:7c1a4ca213b71d342a2d1366171304e469da06d5f15710fab5dd3ce013aa1b73
      • quay.io/modh/vllm@sha256:3c56d4c2a5a9565e8b07ba17a6624290c4fb39ac9097b99b946326c09a8b40c8
      • quay.io/modh/vllm@sha256:97b91f9bd71202f5de8d379cfb61baec887b47f836a2ff8b158c946196de5660
      • quay.io/opendatahub/workbench-images@sha256:7f26f5f2bec4184af15acd95f29b3450526c5c28c386b6cb694fbe82d71d0b41
      • ghcr.io/oras-project/oras:main@sha256:8859e7e3ae510fb921ebeb109ac9d3e3bb91799e0d52001ae456df33929029db
    • 500GB PersistentVolumeClaim (PVC) for Mixtral:

      The proposed method to deploy Mixtral requires a 500GB PVC.

      • In a disconnected cluster, ensure that your OpenShift environment has sufficient storage capacity and a StorageClass configured to provision this PVC.
      • If automatic PVC creation fails, you may need to manually create a PersistentVolume (PV) and bind it to a PVC.
    • Accessible git repository with the taxonomy:

      • The iLab pipeline uses a taxonomy git repository, which must be accessible from the disconnected cluster.
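For example, one of the listed images can be mirrored with skopeo as shown below; registry.example.com is a placeholder for your internal registry, and the same command pattern applies to each image in the list:

skopeo copy --all --preserve-digests \
  docker://registry.redhat.io/ubi9/toolbox@sha256:da31dee8904a535d12689346e65e5b00d11a6179abf1fa69b548dbd755fa2770 \
  docker://registry.example.com/ubi9/toolbox:latest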

Steps

Before running the training and evaluation steps, we must complete the following:

  1. Prepare base model and push to object store
  2. Setting up Judge & Teacher model
  3. Setup NFS StorageClass (Optional)
  4. Set Up Data Science Pipelines Server and Run InstructLab Pipeline

Prepare base model and push to object store

The ilab pipeline needs a base model to train on, so to begin, upload the granite-7b-starter model to your object store.

$ mkdir -p s3-data/

Download the model repository with ilab and copy it into the s3-data directory

# You can also use Oras or Skopeo cli tools to download the model
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
$ ilab model download --repository docker://registry.redhat.io/rhelai1/granite-7b-starter --release 1.2
$ cp -r <path-to-model-downloaded-dir>/rhelai1/granite-7b-starter s3-data/granite-7b-starter

Generate tar archive

$ cd s3-data
$ tar -czvf rhelai.tar.gz *

Upload the created tar archive to your object store.

# Default cache location for ilab model download is ~/.cache/instructlab/models
# The model should be copied in such a way that the *.safetensors files are found in s3://<your-bucket-name>/granite-7b-starter/*.safetensors
s3cmd sync s3-data/granite-7b-starter s3://<your-bucket-name>/granite-7b-starter
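To verify the upload, you can list the prefix in your bucket (this assumes s3cmd is already configured for your object store):

s3cmd ls s3://<your-bucket-name>/granite-7b-starter/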

Setting up Judge & Teacher model

The Teacher model is used for Synthetic Data Generation (SDG) while the Judge model is used for model evaluation.

For the Teacher model you need mixtral-8x7b-instruct-v0-1 deployed with skills-adapter-v3:1.2 and knowledge-adapter-v3:1.2 LoRA layered skills and knowledge adapters.

For the Judge model you will need the prometheus-8x7b-v2-0 model.

If you already have these models deployed, you can skip the deployment steps and go straight to the secret setup for the Judge and Teacher models respectively.

Deploy a judge model server (optional)

Create a service account to be used for token authentication

apiVersion: v1
kind: ServiceAccount
metadata:
  name: judge-sa
  namespace: <data-science-project-name/namespace>
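If you saved the ServiceAccount manifest above to a file (for example, judge-sa.yaml), apply it with oc:

oc -n <data-science-project-name/namespace> apply -f judge-sa.yaml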

Upload prometheus-8x7b-v2-0 model (Judge-Model) to the same object storage as before.

For example, using ilab to download and s3cmd to sync to the object store:

# You can also use Oras or Skopeo cli tools to download the model
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
ilab model download --repository docker://registry.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.2

# Default cache location for ilab model download is ~/.cache/instructlab/models
s3cmd sync path/to/model s3://your-bucket-name/judge-model/

Navigate to the OpenShift AI dashboard

  • Choose Data Science Projects from the left hand menu and choose your data science project/namespace.
  • Select the Connections tab, and then click on the Add connection button. Enter the details of your S3 bucket (object store) and click Add data connection.

Note

Before following the next step, ensure that the CapabilityServiceMeshAuthorization condition is True in the DSCInitialization resource.
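One way to check this from the CLI (assuming the default DSCInitialization resource name, default-dsci):

oc get dscinitialization default-dsci -o jsonpath='{.status.conditions[?(@.type=="CapabilityServiceMeshAuthorization")].status}'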

Create a model server instance

  • Navigate back to Data Science Projects page, select your namespace again, and then select the Models tab
  • On the right hand side select Deploy model under Single-model serving platform
  • Under Serving runtime, choose the serving runtime vLLM Serving Runtime for KServe.
  • Check the Make deployed models available through an external route box.
  • Under Token authentication, check the Require token authentication box and enter the name of the service account created above (judge-sa).
  • Choose the existing data connection created earlier.
  • Click Deploy.

Deploy judge model serving details

Create a secret containing the judge model serving details

apiVersion: v1
kind: Secret
metadata:
  name: <judge-model-details-k8s-secret>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  JUDGE_NAME: <judge-model-name>                              # Name of the judge model or deployment
  JUDGE_ENDPOINT: <judge-model-endpoint>                      # Model serving endpoint, Sample format - `https://<deployed-model-server-endpoint>/v1`
  JUDGE_API_KEY: <judge-model-api-key>                        # Deployed model-server auth token
  JUDGE_CA_CERT: <judge-model-ca-cert-config-map-name>        # Configmap containing CA cert for the judge model (optional - required if using custom CA cert), Example - `kube-root-ca.crt`
  JUDGE_CA_CERT_CM_KEY: <judge-model-ca-cert-config-map-key>  # Name of key inside configmap (optional - required if using custom CA cert), Example - `ca.crt`

Note

If using a custom CA certificate, you must provide the relevant data in a ConfigMap. The ConfigMap name and key are then provided as parameters to the pipeline as well as in the judge-serving-details secret above.

If you deployed the Judge model server using the optional instructions above, then you can retrieve the JUDGE_API_KEY by running the following command:

JUDGE_API_KEY=$(oc -n <data-science-project-name/namespace> create token judge-sa)
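As an alternative to applying the YAML above, the secret can be created directly from the CLI. The secret name judge-serving-details is only an example; substitute your own values:

oc -n <data-science-project-name/namespace> create secret generic judge-serving-details \
  --from-literal=JUDGE_NAME=<judge-model-name> \
  --from-literal=JUDGE_ENDPOINT=https://<deployed-model-server-endpoint>/v1 \
  --from-literal=JUDGE_API_KEY="$JUDGE_API_KEY"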

Deploy a teacher model server (Optional)

Unlike the Judge model, the Teacher model has to be deployed manually on RHOAI; this consists of creating the Kubernetes resources with oc.

First, upload the Teacher model to s3 if it does not already exist there:

# You can also use ORAS or Skopeo cli tools to download the model
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.2

# Default cache location for ilab model download is ~/.cache/instructlab/models
# The model should be copied in such a way that the *.safetensors are found in s3://your-bucket-name/teach-model/*.safetensors
s3cmd sync path/to/model s3://your-bucket-name/teach-model/

Deploy the following YAML, called pre_requisites.yaml, to the <data-science-project-name/namespace> namespace:

pre_requisites.yaml
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: mixtral-sa
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mixtral-view-role
  labels:
    opendatahub.io/dashboard: 'true'
rules:
  - verbs:
      - get
    apiGroups:
      - serving.kserve.io
    resources:
      - inferenceservices
    resourceNames:
      - mixtral
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mixtral-view
  labels:
    opendatahub.io/dashboard: 'true'
subjects:
  - kind: ServiceAccount
    name: mixtral-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: mixtral-view-role
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mixtral-serving-ilab
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
  storageClassName: standard-csi
  volumeMode: Filesystem
Apply the file:

oc -n <data-science-project-name/namespace> apply -f pre_requisites.yaml
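To verify that the resources were created (the names below come from pre_requisites.yaml):

oc -n <data-science-project-name/namespace> get serviceaccount mixtral-sa
oc -n <data-science-project-name/namespace> get pvc mixtral-serving-ilab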

You will need to ensure that the storage-config secret exists in the <data-science-project-name/namespace> namespace and that it contains the configuration for the bucket where the teacher model is stored.

apiVersion: v1
stringData:
  aws-connection-my-bucket: |
    {
      "type": "s3",
      "access_key_id": "your_accesskey",
      "secret_access_key": "your_secretkey",
      "endpoint_url": "https://s3-us-east-2.amazonaws.com",
      "bucket": "mybucket",
      "default_bucket": "mybucket",
      "region": "us-east-2"
    }
kind: Secret
metadata:
  name: storage-config
type: Opaque

If this secret does not exist in this namespace, then create it. If it does exist, then ensure there is an entry for the bucket that stores the teacher model. The key is used in the InferenceService spec below.
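You can check whether the secret already exists and inspect its entries with:

oc -n <data-science-project-name/namespace> get secret storage-config -o yaml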

Next we need to create the custom ServingRuntime and InferenceService.

Similar to above, deploy the following yaml files to the namespace <data-science-project-name/namespace>

You will need to update the spec.model.storage.path in the InferenceService to match the path where the model files are stored in your bucket. The key should match the value in your storage-config secret that has the bucket credentials. In our example above we use aws-connection-my-bucket.

servingruntime.mixtral.yaml
---
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/accelerator-name: migrated-gpu
    opendatahub.io/apiProtocol: REST
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    opendatahub.io/template-display-name: Mixtral ServingRuntime
    opendatahub.io/template-name: vllm-runtime
    openshift.io/display-name: mixtral
  labels:
    opendatahub.io/dashboard: "true"
  name: mixtral
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
  containers:
  - args:
    - --port=8080
    - --model=/mnt/models
    - --served-model-name={{.Name}}
    - --distributed-executor-backend=mp
    command:
    - python
    - -m
    - vllm.entrypoints.openai.api_server
    env:
    - name: HF_HOME
      value: /tmp/hf_home
    image: quay.io/modh/vllm@sha256:3c56d4c2a5a9565e8b07ba17a6624290c4fb39ac9097b99b946326c09a8b40c8
    name: kserve-container
    ports:
    - containerPort: 8080
      protocol: TCP
    volumeMounts:
    - mountPath: /dev/shm
      name: shm
    - mountPath: /mnt
      name: mixtral-serve
  multiModel: false
  storageHelper:
    disabled: true
  supportedModelFormats:
  - autoSelect: true
    name: vLLM
  volumes:
  - name: mixtral-serve
    persistentVolumeClaim:
      claimName: mixtral-serving-ilab
  - emptyDir:
      medium: Memory
      sizeLimit: 2Gi
    name: shm
inferenceservice.mixtral.yaml
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: mixtral
    security.opendatahub.io/enable-auth: "true"
    serving.knative.openshift.io/enablePassthrough: "true"
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/rewriteAppHTTPProbers: "true"
  finalizers:
  - inferenceservice.finalizers
  labels:
    opendatahub.io/dashboard: "true"
  name: mixtral
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      args:
      - --dtype=bfloat16
      - --tensor-parallel-size=4
      - --enable-lora
      - --max-lora-rank=64
      - --lora-dtype=bfloat16
      - --fully-sharded-loras
      - --lora-modules
      - skill-classifier-v3-clm=/mnt/models/skills
      - text-classifier-knowledge-v3-clm=/mnt/models/knowledge
      modelFormat:
        name: vLLM
      name: ""
      resources:
        limits:
          cpu: "4"
          memory: 60Gi
          nvidia.com/gpu: "4"
        requests:
          cpu: "4"
          memory: 60Gi
          nvidia.com/gpu: "4"
      runtime: mixtral
      storage:
        # the secret name of the secret deployed earlier
        key: aws-connection-my-bucket
        # update this to match the path in your bucket
        path: <prefix-path-to-mixtral-model-in-s3>
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
Apply both files:

oc -n <data-science-project-name/namespace> apply -f servingruntime.mixtral.yaml
oc -n <data-science-project-name/namespace> apply -f inferenceservice.mixtral.yaml

A new pod named mixtral-predictor-0000#-deployment-<hash> should be created and reach the Running state. If the pod does not come up successfully, you can inspect the .status field of the InferenceService for issues.

oc -n <data-science-project-name/namespace> get inferenceservice mixtral -o yaml
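For example, the following commands check the Ready condition and list the predictor pods; KServe labels the pods it creates with the InferenceService name:

oc -n <data-science-project-name/namespace> get inferenceservice mixtral -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
oc -n <data-science-project-name/namespace> get pods -l serving.kserve.io/inferenceservice=mixtral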

Deploy teacher model serving details

Create a secret containing the Teacher model serving details

apiVersion: v1
kind: Secret
metadata:
  name: <teacher-model-details-k8s-secret>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  api_key:  <teacher-model-api-key>                      # Deployed model-server auth token
  endpoint: <teacher-model-endpoint>                     # Model serving endpoint, Sample format - `https://<deployed-model-server-endpoint>/v1`
  model: <teacher-model-name>         # Name of the teacher model or deployment
  SDG_CA_CERT:  <teacher-model-ca-config-map-name>       # Configmap containing CA cert for the teacher model (optional - required if using custom CA cert), Example - `kube-root-ca.crt`
  SDG_CA_CERT_CM_KEY: <teacher-model-ca-config-map-key>  # Name of key inside configmap (optional - required if using custom CA cert), Example - `ca.crt`

Note

If using a custom CA certificate, you must provide the relevant data in a ConfigMap. The ConfigMap name and key are then provided as parameters to the pipeline as well as in the teacher-model-details-k8s-secret secret above.

If you deployed the Teacher model server using the optional instructions above, then you can retrieve the api_key by running the following command:

SDG_API_KEY=$(oc -n <data-science-project-name/namespace> create token mixtral-sa)
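As with the Judge model, the Teacher serving details secret can also be created directly from the CLI. The secret name teacher-serving-details is only an example; substitute your own values:

oc -n <data-science-project-name/namespace> create secret generic teacher-serving-details \
  --from-literal=api_key="$SDG_API_KEY" \
  --from-literal=endpoint=https://<deployed-model-server-endpoint>/v1 \
  --from-literal=model=<teacher-model-name>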

(Optional) - Setup NFS StorageClass

Caution

The image provided here is for test purposes only. Users must provide a production-ready StorageClass with ReadWriteMany capability.

This step is only needed when the cluster does not have a storage provisioner capable of provisioning PersistentVolumeClaims with ReadWriteMany capability.

Installing the NFS CSI driver

$ curl -skSL https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/v4.9.0/deploy/install-driver.sh | bash -s v4.9.0 --

For deploying an in-cluster NFS server, apply nfs-server-deployment.yaml file

oc new-project nfs
oc apply -f ./standalone/nfs-server-deployment.yaml

Note

Check the root PersistentVolumeClaim that will be created and the requested storage.

For creating NFS storage-class, apply nfs-storage-class.yaml file

oc apply -f ./standalone/nfs-storage-class.yaml
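To confirm the driver and the new StorageClass are in place (the install script deploys the driver into the kube-system namespace by default):

oc -n kube-system get pods -l app=csi-nfs-controller
oc get storageclass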

Accelerator Profile:

An accelerator profile must also be defined within the RHOAI dashboard or via the CLI to enable GPU acceleration for model serving with KServe.

apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: gpu
  namespace: redhat-ods-applications
spec:
  displayName: gpu
  enabled: true
  identifier: nvidia.com/gpu
  tolerations: []
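If you prefer the CLI over the dashboard, save the manifest to a file (for example, accelerator-profile.yaml) and apply it:

oc apply -f accelerator-profile.yaml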

Signed Certificate:

Using a signed certificate avoids unnecessary TLS issues when running the training pipeline.

To deploy a signed certificate in your cluster follow trusted cluster cert documentation.

This will create the required resources in the cluster, including the required StorageClass.

Set Up Data Science Pipelines Server and Run InstructLab Pipeline

Now we can continue to set up the required resources in our cluster.

The following resources will be created:

  1. Secret
  2. ClusterRole
  3. ClusterRoleBinding
  4. Pod

Create a secret resource that contains the credentials for your Object Storage (AWS S3 Bucket)

apiVersion: v1
kind: Secret
metadata:
  name: sdg-object-store-credentials
type: Opaque
stringData:
  bucket: <s3-bucket-name>             # The object store bucket containing SDG+Model+Taxonomy data. (Name of S3 bucket)
  access_key: <s3-access-key>          # The object store access key (AWS Access key ID)
  secret_key: <s3-secret-key>          # The object store secret key (AWS Secret Access Key)
  data_key: <s3-path-to-teacher-model-files>                  # The name of the tarball that contains SDG data.
  endpoint: <s3-endpoint>              # The object store endpoint
  region: <s3-region>                  # The region for the object store.
  verify_tls: "true"                   # Verify TLS for the object store.

Apply the yaml file to the cluster
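For example, if the secret manifest above is saved as sdg-object-store-credentials.yaml:

oc -n <data-science-project-name/namespace> apply -f sdg-object-store-credentials.yaml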

Create a ServiceAccount, ClusterRole and ClusterRoleBinding

Grant the service account that runs the pipeline access to the related resources it needs to read and manipulate.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  namespace: <data-science-project-name/namespace>
  name: secret-access-role
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "configmaps", "persistentvolumeclaims", "secrets","events"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]

  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "watch"]

  - apiGroups: ["kubeflow.org"]
    resources: ["pytorchjobs"]
    verbs: ["get", "list", "create", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: secret-access-binding
subjects:
  - kind: ServiceAccount
    name: <workbench-service-account-name> # created above in Step-2
    namespace: <data-science-project-name/namespace>
roleRef:
  kind: ClusterRole
  name: secret-access-role
  apiGroup: rbac.authorization.k8s.io

This is the RBAC configuration required by the ServiceAccount that runs the pipeline.

Apply the YAML to the cluster.
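For example, if the RBAC manifests above are saved as rbac.yaml:

oc apply -f rbac.yaml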

Configure Data Science Pipeline Server:

From within the RHOAI dashboard, navigate to the Data Science Pipelines page and click Configure pipeline server. This presents a form where you can enter the credentials for your S3 bucket (object store).

Run the Pipeline

Now that all the cluster requirements have been set up, we are ready to upload and run our InstructLab pipeline!

Upload the Pipeline:

Now we can go back to our RHOAI Data Science Pipelines dashboard and select Import pipeline. We recommend importing the pipeline YAML directly from the GitHub repo using: https://raw.githubusercontent.com/opendatahub-io/ilab-on-ocp/refs/heads/main/pipeline.yaml

Note: When using a disconnected cluster, the pipeline should be imported from a file instead of a URL. Using a URL fails in disconnected clusters because they cannot connect to GitHub.

Create a Run:

Once the pipeline is uploaded we will be able to select Create run from the Actions dropdown. This will present us with a number of parameters we can set to customize our run. Click Create run at the bottom of the page to kick off your InstructLab pipeline.

Available Pipeline Parameters:

Parameter Definition
sdg_repo_url SDG parameter. Points to a taxonomy git repository
sdg_repo_branch SDG parameter. Points to a branch within the taxonomy git repository. If set, has priority over sdg_repo_pr
sdg_repo_pr SDG parameter. Points to a pull request against the taxonomy git repository
sdg_base_model SDG parameter. LLM model used to generate the synthetic dataset
sdg_scale_factor SDG parameter. The total number of instructions to be generated
sdg_pipeline SDG parameter. Data generation pipeline to use. Available: 'simple', 'full', or a valid path to a directory of pipeline workflow YAML files. Note that 'full' requires a larger teacher model, Mixtral-8x7b.
sdg_max_batch_len SDG parameter. Maximum tokens per GPU for each batch that will be handled in a single step.
sdg_sample_size SDG parameter. Sampling size used for Synthetic Data Generation
train_nproc_per_node Training parameter. Number of GPUs per each node/worker to use for training.
train_nnodes Training parameter. Number of nodes/workers to train on.
train_num_epochs_phase_1 Training parameter for Phase 1. Number of epochs to run training.
train_num_epochs_phase_2 Training parameter for Phase 2. Number of epochs to run training.
train_effective_batch_size_phase_1 Training parameter for Phase 1. The number of samples in a batch that the model should see before its parameters are updated.
train_effective_batch_size_phase_2 Training parameter for Phase 2. The number of samples in a batch that the model should see before its parameters are updated.
train_learning_rate_phase_1 Training parameter for Phase 1. How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size.
train_learning_rate_phase_2 Training parameter for Phase 2. How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size.
train_num_warmup_steps_phase_1 Training parameter for Phase 1. The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to train_learning_rate.
train_num_warmup_steps_phase_2 Training parameter for Phase 2. The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to train_learning_rate.
train_save_samples Training parameter. Number of samples the model should see before saving a checkpoint.
train_max_batch_len Training parameter. Maximum tokens per GPU for each batch that will be handled in a single step.
train_seed Training parameter. Random seed for initializing training.
mt_bench_max_workers MT Bench parameter. Number of workers to use for evaluation with mt_bench or mt_bench_branch. Must be a positive integer or 'auto'.
mt_bench_merge_system_user_message MT Bench parameter. Boolean indicating whether to merge system and user messages (required for Mistral based judges)
final_eval_max_workers Final model evaluation parameter for MT Bench Branch. Number of workers to use for evaluation with mt_bench or mt_bench_branch. Must be a positive integer or 'auto'.
final_eval_few_shots Final model evaluation parameter for MMLU. Number of question-answer pairs provided in the context preceding the question used for evaluation.
final_eval_batch_size Final model evaluation parameter for MMLU. Batch size for evaluation. Valid values are a positive integer or 'auto' to select the largest batch size that will fit in memory.
final_eval_merge_system_user_message Final model evaluation parameter for MT Bench Branch. Boolean indicating whether to merge system and user messages (required for Mistral based judges)
k8s_storage_class_name A Kubernetes StorageClass name for persistent volumes. Selected StorageClass must support RWX PersistentVolumes.
Suggested Parameters: Full Pipeline

To run the ilab Pipeline at full capabilities, we suggest using these values:

Parameter Suggested Value
sdg_repo_url https://github.com/instructlab/taxonomy.git
sdg_repo_branch ""
sdg_repo_pr 0
sdg_base_model s3:///<PATH_TO_MODEL>
sdg_scale_factor 30
sdg_pipeline "full"
sdg_max_batch_len 5000
sdg_sample_size 1.0
train_nproc_per_node 2
train_nnodes 2
train_num_epochs_phase_1 7
train_num_epochs_phase_2 10
train_effective_batch_size_phase_1 128
train_effective_batch_size_phase_2 3840
train_learning_rate_phase_1 2e-05
train_learning_rate_phase_2 6e-06
train_num_warmup_steps_phase_1 1000
train_num_warmup_steps_phase_2 1000
train_save_samples 250000
train_max_batch_len 5000
train_seed 42
mt_bench_max_workers "auto"
mt_bench_merge_system_user_message False
final_eval_max_workers "auto"
final_eval_few_shots 5
final_eval_batch_size "auto"
final_eval_merge_system_user_message False
k8s_storage_class_name standard
Note that this will take a very long time, on the scale of double-digit hours of runtime.
Suggested Parameters: Development

Running the ilab pipeline at full capability takes a very long time and consumes a significant amount of resources. To create an e2e run that completes much more quickly (at the expense of output quality) and with fewer resources (namely, GPU nodes), we suggest using these values instead:

Parameter Suggested Value
sdg_repo_url https://github.com/instructlab/taxonomy.git
sdg_repo_branch ""
sdg_repo_pr 0
sdg_base_model s3:///<PATH_TO_MODEL>
sdg_scale_factor 30
sdg_pipeline "simple"
sdg_max_batch_len 5000
sdg_sample_size 0.0002
train_nproc_per_node 1
train_nnodes 1
train_num_epochs_phase_1 2
train_num_epochs_phase_2 2
train_effective_batch_size_phase_1 3840
train_effective_batch_size_phase_2 3840
train_learning_rate_phase_1 .0001
train_learning_rate_phase_2 .0001
train_num_warmup_steps_phase_1 800
train_num_warmup_steps_phase_2 800
train_save_samples 0
train_max_batch_len 20000
train_seed 42
mt_bench_max_workers "auto"
mt_bench_merge_system_user_message False
final_eval_max_workers "auto"
final_eval_few_shots 5
final_eval_batch_size "auto"
final_eval_merge_system_user_message False
k8s_storage_class_name standard

Using these parameters allows a user to run the complete pipeline much more quickly; in testing we have found this to take about 90 minutes. Additionally, the judge-server and teacher-server can point to the same Mistral model, which only uses 1 GPU, and the PyTorchJob configuration specified here also only uses 2 training nodes of 1 GPU each, so a total of 3 GPUs is required, rather than the 8-9 GPUs required for the full pipeline. That said, the output model quality will likely be very poor, and these values should only be used for testing purposes.