This file provides step-by-step instructions for setting up and using the Data Science Pipelines (DSP) for InstructLab iterations. You will need the following prerequisites:
- An OpenShift cluster with:
  - Sufficient GPUs available for training.
    - At a minimum, a node with at least 4 GPUs, such as NVIDIA A100s.
  - The following Operators already installed:
    - Red Hat - Authorino
    - Red Hat OpenShift Serverless
    - Red Hat OpenShift Service Mesh v2
      - NOTE: v3 is not compatible with RHOAI
    - Red Hat OpenShift AI and the operator dependencies documented in OpenShift AI Supported Configurations
- Teacher and Judge models with a serving endpoint
  - If these are already set up, you will need the endpoint, API key, and any CA bundles (if needed) for each model.
  - If setting up your own using these instructions, you will need additional multi-node A100s or L40s for each model.
- An SDG taxonomy tree to utilize for Synthetic Data Generation (SDG)
  - See the instructions for creating a taxonomy tree to set up your own.
- An OpenShift AI installation, with the Training Operator and KServe components set to `Managed`
  - A data science project/namespace; in this document this will be referred to as `<data-science-project-name/namespace>`.
- A StorageClass that supports dynamic provisioning with ReadWriteMany access mode (see step 3 below).
- An S3 object store such as AWS S3, or an alternative S3-compatible object storage solution such as Ceph, NooBaa, or MinIO.
- A locally installed `oc` command line tool to create and manage Kubernetes resources.
- The `ilab` CLI (or Skopeo/Oras/etc.) for model downloads.
- For Disconnected Clusters:
  - Mirror Required Images: In a disconnected environment, you must mirror the following container images to your internal registry before running the pipeline. Use tools such as `oc adm release mirror`, `skopeo`, or `oras` to mirror these images (a skopeo example follows after this list):
    - `registry.redhat.io/ubi9/toolbox@sha256:da31dee8904a535d12689346e65e5b00d11a6179abf1fa69b548dbd755fa2770`
    - `registry.redhat.io/openshift4/ose-cli@sha256:1d5c8442a6ec745e6ae44a7738c0681f1e21aac8be76ba826c2ddf2eed8475db`
    - `registry.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:b3dc9af0244aa6b84e6c3ef53e714a316daaefaae67e28de397cd71ee4b2ac7e`
    - `registry.redhat.io/rhelai1/skills-adapter-v3@sha256:53dd11a762bb39fc33c15499891309f0cdc8dbfd02abf94c9c60aad643aca255`
    - `registry.redhat.io/rhelai1/knowledge-adapter-v3@sha256:ef1608ec78d5e39655b505544c0f30a015a6c9cb7e2b2deffe394791f8c76c6f`
    - `registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1@sha256:bc08e466aa35352a621d0ad221c2e247ff9751f4cb6cffe00d5894ce6bfd3fd7`
    - `registry.redhat.io/rhelai1/prometheus-8x7b-v2-0@sha256:9fcb49c230f6e73ff944377307bb83a05ae3ac20300af75e429151f4f8bf4285`
    - `quay.io/modh/odh-generic-data-science-notebook@sha256:7c1a4ca213b71d342a2d1366171304e469da06d5f15710fab5dd3ce013aa1b73`
    - `quay.io/modh/vllm@sha256:3c56d4c2a5a9565e8b07ba17a6624290c4fb39ac9097b99b946326c09a8b40c8`
    - `quay.io/modh/vllm@sha256:97b91f9bd71202f5de8d379cfb61baec887b47f836a2ff8b158c946196de5660`
    - `quay.io/opendatahub/workbench-images@sha256:7f26f5f2bec4184af15acd95f29b3450526c5c28c386b6cb694fbe82d71d0b41`
    - `ghcr.io/oras-project/oras:main@sha256:8859e7e3ae510fb921ebeb109ac9d3e3bb91799e0d52001ae456df33929029db`
  - 500GB PersistentVolumeClaim (PVC) for Mixtral: The proposed method to deploy Mixtral requires a 500GB PVC.
    - In a disconnected cluster, ensure that your OpenShift environment has sufficient storage capacity and a StorageClass configured to provision this PVC.
    - If automatic PVC creation fails, you may need to manually create a PersistentVolume (PV) and bind it to a PVC.
  - Accessible git repository with the taxonomy:
    - The iLab pipeline uses a taxonomy git repository, which must be accessible from the disconnected cluster.
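A minimal example of mirroring one of the images above with `skopeo`, assuming you are already logged in to both registries; the internal registry hostname, repository, and tag are placeholders you must replace:

```bash
# Copy a single image (all architectures) to the internal registry,
# failing if the digest would change, since the pipeline pulls by digest.
skopeo copy --all --preserve-digests \
  docker://registry.redhat.io/ubi9/toolbox@sha256:da31dee8904a535d12689346e65e5b00d11a6179abf1fa69b548dbd755fa2770 \
  docker://registry.internal.example.com/ubi9/toolbox:mirrored

# Repeat for each image in the list above.
```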
Before running the training and evaluation steps, we must complete the following steps:
- Prepare base model and push to object store
- Setting up Judge & Teacher model
- Setup NFS StorageClass (Optional)
- Set Up Data Science Pipelines Server and Run InstructLab Pipeline
You will need a base model to train the ilab pipeline on, so to begin, upload the granite-7b-starter model to your object store.
$ mkdir -p s3-data/
Download the ilab model repository and copy it into the s3-data model directory
# You can also use Oras or Skopeo cli tools to download the model
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
$ ilab model download --repository docker://registry.redhat.io/rhelai1/granite-7b-starter --release 1.2
$ cp -r <path-to-model-downloaded-dir>/rhelai1/granite-7b-starter s3-data/granite-7b-starter
Generate tar archive
$ cd s3-data
$ tar -czvf rhelai.tar.gz *
Upload the created tar archive to your object store.
# Default cache location for ilab model download is ~/.cache/instructlab/models
# The model should be copied in such a way that the *.safetensors are found in s3://your-bucket-name/granite-7b-starter/*.safetensors
s3cmd sync s3-data/granite-7b-starter s3://<your-bucket-name>/granite-7b-starter
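To sanity-check the upload (optional), list the prefix and confirm the `*.safetensors` files are present; the bucket name is a placeholder:

```bash
# You should see the model's *.safetensors shards listed here
s3cmd ls s3://<your-bucket-name>/granite-7b-starter/
```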
The Teacher model is used for Synthetic Data Generation (SDG) while the Judge model is used for model evaluation.
For the Teacher model you need mixtral-8x7b-instruct-v0-1 deployed with skills-adapter-v3:1.2 and knowledge-adapter-v3:1.2 LoRA layered skills and knowledge adapters.
For the Judge model you will need the prometheus-8x7b-v2-0 model.
If you already have these models deployed you can skip the deployment steps and go straight to the secret setup for Judge and Teacher respectively.
Create a service account to be used for token authentication
apiVersion: v1
kind: ServiceAccount
metadata:
  name: judge-sa
  namespace: <data-science-project-name/namespace>
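Assuming you save the manifest above as `judge-sa.yaml` (a filename chosen here for illustration), apply it and confirm the ServiceAccount exists:

```bash
oc apply -f judge-sa.yaml
oc -n <data-science-project-name/namespace> get serviceaccount judge-sa
```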
Upload prometheus-8x7b-v2-0 model (Judge-Model) to the same object storage as before.
For example, using `ilab` to download and `s3cmd` to sync to the object store, you can do:
# You can also use Oras or Skopeo cli tools to download the model
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
ilab model download --repository docker://registry.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.2
# Default cache location for ilab model download is ~/.cache/instructlab/models
s3cmd sync path/to/model s3://your-bucket-name/judge-model/
Navigate to the OpenShift AI dashboard
- Choose Data Science Projects from the left hand menu and choose your data science project/namespace.
- Select the Connections tab, and then click on the Add connection button. Enter the details of your S3 bucket (object store) and click Add data connection.
Note: Before following the next step, ensure that the `CapabilityServiceMeshAuthorization` status is `True` in the `DSCInitialization` resource.
Create a model server instance
- Navigate back to the Data Science Projects page, select your namespace again, and then select the Models tab.
- On the right hand side select Deploy model under Single-model serving platform.
- Under Serving runtime, choose the serving runtime `vLLM Serving Runtime for Kserve`.
- Check the `Make deployed models available through an external route` box.
- Under token authentication, check the `Require token authentication` box and enter the name of the service account that we created above.
- Choose the existing data connection created earlier.
- Click Deploy.
Create a secret containing the judge model serving details
apiVersion: v1
kind: Secret
metadata:
  name: <judge-model-details-k8s-secret>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  JUDGE_NAME: <judge-model-name>         # Name of the judge model or deployment
  JUDGE_ENDPOINT: <judge-model-endpoint> # Model serving endpoint, Sample format - `https://<deployed-model-server-endpoint>/v1`
  JUDGE_API_KEY: <judge-model-api-key>   # Deployed model-server auth token
  JUDGE_CA_CERT: <judge-model-ca-cert-config-map-name>       # ConfigMap containing the CA cert for the judge model (optional - required if using a custom CA cert), Example - `kube-root-ca.crt`
  JUDGE_CA_CERT_CM_KEY: <judge-model-ca-cert-config-map-key> # Name of the key inside the ConfigMap (optional - required if using a custom CA cert), Example - `ca.crt`
Note: If using a custom CA certificate, you must provide the relevant data in a ConfigMap. The ConfigMap name and key are then provided as a parameter to the pipeline as well as in the `judge-serving-details` secret above.
If you deployed the Judge model server using the optional instructions above, then you can retrieve `JUDGE_API_KEY` by running the following command:
JUDGE_API_KEY=$(oc -n <data-science-project-name/namespace> create token judge-sa)
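As an optional smoke test (assuming the model was deployed with an external route and token authentication as described above), you can call the OpenAI-compatible `/v1/models` endpoint with that token; `-k` skips TLS verification in case a custom CA is in use:

```bash
# Expect a JSON response listing the deployed judge model
curl -sk -H "Authorization: Bearer ${JUDGE_API_KEY}" \
  https://<deployed-model-server-endpoint>/v1/models
```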
Unlike the Judge model, we have to deploy the Teacher model manually on RHOAI; this consists of deploying the K8s resources using `oc`.
First, upload the Teacher model to s3 if it does not already exist there:
# You can also use ORAS or Skopeo cli tools to download the model
# If using other tools besides ilab, ensure that filenames are mapped
# appropriately
ilab model download --repository docker://registry.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.2
# Default cache location for ilab model download is ~/.cache/instructlab/models
# The model should be copied in such a way that the *.safetensors are found in s3://your-bucket-name/teach-model/*.safetensors
s3cmd sync path/to/model s3://your-bucket-name/teach-model/
Deploy the following YAML, called `pre_requisites.yaml`, to the `<data-science-project-name/namespace>` namespace.
pre_requisites.yaml
---
kind: ServiceAccount
apiVersion: v1
metadata:
  name: mixtral-sa
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mixtral-view-role
  labels:
    opendatahub.io/dashboard: 'true'
rules:
  - verbs:
      - get
    apiGroups:
      - serving.kserve.io
    resources:
      - inferenceservices
    resourceNames:
      - mixtral
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: mixtral-view
  labels:
    opendatahub.io/dashboard: 'true'
subjects:
  - kind: ServiceAccount
    name: mixtral-sa
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: mixtral-view-role
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: mixtral-serving-ilab
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
  storageClassName: standard-csi
  volumeMode: Filesystem
oc -n <data-science-project-name/namespace> apply -f pre_requisites.yaml
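Before moving on, you can optionally confirm that the resources were created and that the PVC can bind (with a WaitForFirstConsumer StorageClass the PVC may stay Pending until the predictor pod is scheduled):

```bash
oc -n <data-science-project-name/namespace> get serviceaccount mixtral-sa
oc -n <data-science-project-name/namespace> get pvc mixtral-serving-ilab
```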
You will need to ensure that the `storage-config` secret exists in the `<data-science-project-name/namespace>` namespace, and that this `storage-config` has the configuration for the bucket where the teacher model is stored.
apiVersion: v1
stringData:
  aws-connection-my-bucket: |
    {
      "type": "s3",
      "access_key_id": "your_accesskey",
      "secret_access_key": "your_secretkey",
      "endpoint_url": "https://s3-us-east-2.amazonaws.com",
      "bucket": "mybucket",
      "default_bucket": "mybucket",
      "region": "us-east-2"
    }
kind: Secret
metadata:
  name: storage-config
type: Opaque
If this secret does not exist in this namespace, then create it. If it does exist, then ensure there is an entry for the bucket that stores the teacher model. The key is used in the `InferenceService` spec below.
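For example, a quick way to check whether the secret exists and which bucket entries (keys) it contains:

```bash
# If this errors with NotFound, create the secret; otherwise check that your
# bucket entry (e.g. aws-connection-my-bucket) appears among the keys.
oc -n <data-science-project-name/namespace> get secret storage-config -o jsonpath='{.data}'
```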
Next we need to create the custom `ServingRuntime` and `InferenceService`.
Similar to above, deploy the following YAML files to the namespace `<data-science-project-name/namespace>`.
You will need to update the `spec.model.storage.path` in the `InferenceService` to match the path where the model files are stored in your bucket. The `key` should match the value in your `storage-config` secret that has the bucket credentials. In our example above we use `aws-connection-my-bucket`.
servingruntime.mixtral.yaml
---
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  annotations:
    opendatahub.io/accelerator-name: migrated-gpu
    opendatahub.io/apiProtocol: REST
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    opendatahub.io/template-display-name: Mixtral ServingRuntime
    opendatahub.io/template-name: vllm-runtime
    openshift.io/display-name: mixtral
  labels:
    opendatahub.io/dashboard: "true"
  name: mixtral
spec:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
  containers:
    - args:
        - --port=8080
        - --model=/mnt/models
        - --served-model-name={{.Name}}
        - --distributed-executor-backend=mp
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      image: quay.io/modh/vllm@sha256:3c56d4c2a5a9565e8b07ba17a6624290c4fb39ac9097b99b946326c09a8b40c8
      name: kserve-container
      ports:
        - containerPort: 8080
          protocol: TCP
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
        - mountPath: /mnt
          name: mixtral-serve
  multiModel: false
  storageHelper:
    disabled: true
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  volumes:
    - name: mixtral-serve
      persistentVolumeClaim:
        claimName: mixtral-serving-ilab
    - emptyDir:
        medium: Memory
        sizeLimit: 2Gi
      name: shm
inferenceservice.mixtral.yaml
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: mixtral
    security.opendatahub.io/enable-auth: "true"
    serving.knative.openshift.io/enablePassthrough: "true"
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/rewriteAppHTTPProbers: "true"
  finalizers:
    - inferenceservice.finalizers
  labels:
    opendatahub.io/dashboard: "true"
  name: mixtral
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      args:
        - --dtype=bfloat16
        - --tensor-parallel-size=4
        - --enable-lora
        - --max-lora-rank=64
        - --lora-dtype=bfloat16
        - --fully-sharded-loras
        - --lora-modules
        - skill-classifier-v3-clm=/mnt/models/skills
        - text-classifier-knowledge-v3-clm=/mnt/models/knowledge
      modelFormat:
        name: vLLM
      name: ""
      resources:
        limits:
          cpu: "4"
          memory: 60Gi
          nvidia.com/gpu: "4"
        requests:
          cpu: "4"
          memory: 60Gi
          nvidia.com/gpu: "4"
      runtime: mixtral
      storage:
        # the key inside the storage-config secret deployed earlier
        key: aws-connection-my-bucket
        # update this to match the path in your bucket
        path: <prefix-path-to-mixtral-model-in-s3>
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
oc -n <data-science-project-name/namespace> apply -f servingruntime.mixtral.yaml
oc -n <data-science-project-name/namespace> apply -f inferenceservice.mixtral.yaml
A new pod named `mixtral-predictor-0000#-deployment-<hash>` should be created. This should result in a successfully running pod. If the pod does not come up successfully, you can inspect the `.status` field of the `InferenceService` for issues.
oc -n <data-science-project-name/namespace> get inferenceservice mixtral -o yaml
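A couple of optional checks while waiting for the predictor to come up (the pod name pattern is the one mentioned above):

```bash
# Watch the predictor pod start
oc -n <data-science-project-name/namespace> get pods | grep mixtral-predictor
# Inspect the readiness conditions of the InferenceService
oc -n <data-science-project-name/namespace> get inferenceservice mixtral -o jsonpath='{.status.conditions}'
```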
Create a secret containing the Teacher model serving details
apiVersion: v1
kind: Secret
metadata:
  name: <teacher-model-details-k8s-secret>
  namespace: <data-science-project-name/namespace>
type: Opaque
stringData:
  api_key: <teacher-model-api-key>   # Deployed model-server auth token
  endpoint: <teacher-model-endpoint> # Model serving endpoint, Sample format - `https://<deployed-model-server-endpoint>/v1`
  model: <teacher-model-name>        # Name of the teacher model or deployment
  SDG_CA_CERT: <teacher-model-ca-config-map-name>        # ConfigMap containing the CA cert for the teacher model (optional - required if using a custom CA cert), Example - `kube-root-ca.crt`
  SDG_CA_CERT_CM_KEY: <teacher-model-ca-config-map-key>  # Name of the key inside the ConfigMap (optional - required if using a custom CA cert), Example - `ca.crt`
Note: If using a custom CA certificate, you must provide the relevant data in a ConfigMap. The ConfigMap name and key are then provided as a parameter to the pipeline as well as in the `teacher-model-details-k8s-secret` secret above.
If you deployed the Teacher model server using the optional instructions above, then you can retrieve `api_key` by running the following command:
SDG_API_KEY=$(oc -n <data-science-project-name/namespace> create token mixtral-sa)
Caution: The image provided here is for test purposes only. Users must provide a production-ready StorageClass with ReadWriteMany capability.
This step is only needed when the cluster doesn't have a storage provisioner capable of provisioning PersistentVolumeClaims with ReadWriteMany capability.
Installing the NFS CSI driver
$ curl -skSL https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/v4.9.0/deploy/install-driver.sh | bash -s v4.9.0 --
To deploy an in-cluster NFS server, apply the nfs-server-deployment.yaml file:
oc new-project nfs
oc apply -f ./standalone/nfs-server-deployment.yaml
Note: Check the root PersistentVolumeClaim that will be created and the requested storage.
To create the NFS StorageClass, apply the nfs-storage-class.yaml file:
oc apply -f ./standalone/nfs-storage-class.yaml
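To verify that the new StorageClass can actually provision ReadWriteMany volumes, you can create a small throwaway PVC; the PVC name and size here are arbitrary, and the StorageClass name must match the one defined in nfs-storage-class.yaml (commonly `nfs-csi`):

```bash
cat <<'EOF' | oc -n <data-science-project-name/namespace> apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rwx-smoke-test
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: nfs-csi
EOF

# The PVC should reach the Bound phase; delete it afterwards
oc -n <data-science-project-name/namespace> get pvc rwx-smoke-test
oc -n <data-science-project-name/namespace> delete pvc rwx-smoke-test
```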
An accelerator profile must also be defined within the RHOAI dashboard or via the CLI to enable GPU acceleration for model serving with KServe.
apiVersion: v1
kind: List
items:
  - apiVersion: dashboard.opendatahub.io/v1
    kind: AcceleratorProfile
    metadata:
      name: gpu
      namespace: redhat-ods-applications
    spec:
      displayName: gpu
      enabled: true
      identifier: nvidia.com/gpu
      tolerations: []
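Assuming you save the manifest above as `accelerator-profile.yaml` (an illustrative filename), you can apply and verify it from the CLI:

```bash
oc apply -f accelerator-profile.yaml
oc -n redhat-ods-applications get acceleratorprofiles.dashboard.opendatahub.io
```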
A signed certificate ensures that there are no unnecessary issues when running the training pipeline.
To deploy a signed certificate in your cluster, follow the trusted cluster cert documentation.
This will create the required resources in the cluster, including the required StorageClass.
Now we can continue to set up the required resources in our cluster.
The following resources will be created:
- Secret
- ClusterRole
- ClusterRoleBinding
- Pod
Create a secret resource that contains the credentials for your Object Storage (AWS S3 Bucket)
apiVersion: v1
kind: Secret
metadata:
  name: sdg-object-store-credentials
type: Opaque
stringData:
  bucket: <s3-bucket-name>                   # The object store bucket containing SDG+Model+Taxonomy data (name of the S3 bucket)
  access_key: <s3-access-key>                # The object store access key (AWS Access Key ID)
  secret_key: <s3-secret-key>                # The object store secret key (AWS Secret Access Key)
  data_key: <s3-path-to-teacher-model-files> # The name of the tarball that contains SDG data.
  endpoint: <s3-endpoint>                    # The object store endpoint
  region: <s3-region>                        # The region for the object store.
  verify_tls: "true"                         # Verify TLS for the object store (stringData values must be strings).
Apply the yaml file to the cluster
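For example, assuming the secret manifest above was saved as `sdg-object-store-credentials.yaml` (an illustrative filename):

```bash
oc -n <data-science-project-name/namespace> apply -f sdg-object-store-credentials.yaml
oc -n <data-science-project-name/namespace> get secret sdg-object-store-credentials
```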
Create a ServiceAccount, ClusterRole and ClusterRoleBinding
Provide access to the service account running the pipeline for accessing and manipulating related resources.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  namespace: <data-science-project-name/namespace>
  name: secret-access-role
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "configmaps", "persistentvolumeclaims", "secrets", "events"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "watch"]
  - apiGroups: ["kubeflow.org"]
    resources: ["pytorchjobs"]
    verbs: ["get", "list", "create", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: secret-access-binding
subjects:
  - kind: ServiceAccount
    name: <workbench-service-account-name> # created above in Step 2
    namespace: <data-science-project-name/namespace>
roleRef:
  kind: ClusterRole
  name: secret-access-role
  apiGroup: rbac.authorization.k8s.io
Apply the yaml to the cluster.
This is the required RBAC configuration that we are applying to the ServiceAccount.
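For example, assuming the RBAC manifest above was saved as `rbac.yaml` (an illustrative filename), you can apply it and spot-check the permissions with `oc auth can-i`:

```bash
oc apply -f rbac.yaml
# Verify the pipeline's ServiceAccount can now create PyTorchJobs in the namespace
oc auth can-i create pytorchjobs.kubeflow.org \
  --as=system:serviceaccount:<data-science-project-name/namespace>:<workbench-service-account-name> \
  -n <data-science-project-name/namespace>
```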
From within the RHOAI dashboard, navigate to the Data Science Pipelines page and click Configure pipeline server. This will present you with a form where you can upload the credentials for the S3 bucket you created in the previous step.
Now that all the cluster requirements have been set up, we are ready to upload and run our InstructLab pipeline!
Now we can go back to our RHOAI Data Science Pipelines dashboard and select Import pipeline. We recommend importing the pipeline yaml directly from the github repo using: https://raw.githubusercontent.com/opendatahub-io/ilab-on-ocp/refs/heads/main/pipeline.yaml
Note: When using a disconnected cluster, the pipeline should be imported from a file instead of a URL. Importing from a URL fails in disconnected clusters because they cannot connect to GitHub (see the example below for fetching the file from a connected machine).
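For example, you can fetch `pipeline.yaml` on a connected machine and then transfer it to your environment for import as a file:

```bash
curl -L -o pipeline.yaml \
  https://raw.githubusercontent.com/opendatahub-io/ilab-on-ocp/refs/heads/main/pipeline.yaml
```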
Once the pipeline is uploaded we will be able to select Create run from the Actions dropdown. This will present us with a number of parameters we can set to customize our run. Click Create run at the bottom of the page to kick off your InstructLab pipeline.
| Parameter | Definition |
| --- | --- |
| `sdg_repo_url` | SDG parameter. Points to a taxonomy git repository |
| `sdg_repo_branch` | SDG parameter. Points to a branch within the taxonomy git repository. If set, has priority over `sdg_repo_pr` |
| `sdg_repo_pr` | SDG parameter. Points to a pull request against the taxonomy git repository |
| `sdg_base_model` | SDG parameter. LLM model used to generate the synthetic dataset |
| `sdg_scale_factor` | SDG parameter. The total number of instructions to be generated |
| `sdg_pipeline` | SDG parameter. Data generation pipeline to use. Available: 'simple', 'full', or a valid path to a directory of pipeline workflow YAML files. Note that 'full' requires a larger teacher model, Mixtral-8x7b. |
| `sdg_max_batch_len` | SDG parameter. Maximum tokens per GPU for each batch that will be handled in a single step. |
| `sdg_sample_size` | SDG parameter. Sampling size used for Synthetic Data Generation |
| `train_nproc_per_node` | Training parameter. Number of GPUs per node/worker to use for training. |
| `train_nnodes` | Training parameter. Number of nodes/workers to train on. |
| `train_num_epochs_phase_1` | Training parameter for Phase 1. Number of epochs to run training. |
| `train_num_epochs_phase_2` | Training parameter for Phase 2. Number of epochs to run training. |
| `train_effective_batch_size_phase_1` | Training parameter for Phase 1. The number of samples in a batch that the model should see before its parameters are updated. |
| `train_effective_batch_size_phase_2` | Training parameter for Phase 2. The number of samples in a batch that the model should see before its parameters are updated. |
| `train_learning_rate_phase_1` | Training parameter for Phase 1. How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size. |
| `train_learning_rate_phase_2` | Training parameter for Phase 2. How fast we optimize the weights during gradient descent. Higher values may lead to unstable learning performance. It's generally recommended to have a low learning rate with a high effective batch size. |
| `train_num_warmup_steps_phase_1` | Training parameter for Phase 1. The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to `train_learning_rate`. |
| `train_num_warmup_steps_phase_2` | Training parameter for Phase 2. The number of steps a model should go through before reaching the full learning rate. We start at 0 and linearly climb up to `train_learning_rate`. |
| `train_save_samples` | Training parameter. Number of samples the model should see before saving a checkpoint. |
| `train_max_batch_len` | Training parameter. Maximum tokens per GPU for each batch that will be handled in a single step. |
| `train_seed` | Training parameter. Random seed for initializing training. |
| `mt_bench_max_workers` | MT Bench parameter. Number of workers to use for evaluation with mt_bench or mt_bench_branch. Must be a positive integer or 'auto'. |
| `mt_bench_merge_system_user_message` | MT Bench parameter. Boolean indicating whether to merge system and user messages (required for Mistral-based judges) |
| `final_eval_max_workers` | Final model evaluation parameter for MT Bench Branch. Number of workers to use for evaluation with mt_bench or mt_bench_branch. Must be a positive integer or 'auto'. |
| `final_eval_few_shots` | Final model evaluation parameter for MMLU. Number of question-answer pairs provided in the context preceding the question used for evaluation. |
| `final_eval_batch_size` | Final model evaluation parameter for MMLU. Batch size for evaluation. Valid values are a positive integer or 'auto' to select the largest batch size that will fit in memory. |
| `final_eval_merge_system_user_message` | Final model evaluation parameter for MT Bench Branch. Boolean indicating whether to merge system and user messages (required for Mistral-based judges) |
| `k8s_storage_class_name` | A Kubernetes StorageClass name for persistent volumes. The selected StorageClass must support RWX PersistentVolumes. |
To run the ilab Pipeline at full capabilities, we suggest using these values:
| Parameter | Suggested Value |
| --- | --- |
| `sdg_repo_url` | `https://github.com/instructlab/taxonomy.git` |
| `sdg_repo_branch` | `""` |
| `sdg_repo_pr` | `0` |
| `sdg_base_model` | `s3:///<PATH_TO_MODEL>` |
| `sdg_scale_factor` | `30` |
| `sdg_pipeline` | `"full"` |
| `sdg_max_batch_len` | `5000` |
| `sdg_sample_size` | `1.0` |
| `train_nproc_per_node` | `2` |
| `train_nnodes` | `2` |
| `train_num_epochs_phase_1` | `7` |
| `train_num_epochs_phase_2` | `10` |
| `train_effective_batch_size_phase_1` | `128` |
| `train_effective_batch_size_phase_2` | `3840` |
| `train_learning_rate_phase_1` | `2e-05` |
| `train_learning_rate_phase_2` | `6e-06` |
| `train_num_warmup_steps_phase_1` | `1000` |
| `train_num_warmup_steps_phase_2` | `1000` |
| `train_save_samples` | `250000` |
| `train_max_batch_len` | `5000` |
| `train_seed` | `42` |
| `mt_bench_max_workers` | `"auto"` |
| `mt_bench_merge_system_user_message` | `False` |
| `final_eval_max_workers` | `"auto"` |
| `final_eval_few_shots` | `5` |
| `final_eval_batch_size` | `"auto"` |
| `final_eval_merge_system_user_message` | `False` |
| `k8s_storage_class_name` | `standard` |

Note that this will take a very long time, on the scale of double-digit hours of runtime.
Running the ilab pipeline at full capabilities takes a very long time and consumes a good amount of resources. To create an e2e run that completes much more quickly (at the expense of output quality) and with fewer resources (namely, GPU nodes), we suggest using these values instead:
| Parameter | Suggested Value |
| --- | --- |
| `sdg_repo_url` | `https://github.com/instructlab/taxonomy.git` |
| `sdg_repo_branch` | `""` |
| `sdg_repo_pr` | `0` |
| `sdg_base_model` | `s3:///<PATH_TO_MODEL>` |
| `sdg_scale_factor` | `30` |
| `sdg_pipeline` | `"simple"` |
| `sdg_max_batch_len` | `5000` |
| `sdg_sample_size` | `0.0002` |
| `train_nproc_per_node` | `1` |
| `train_nnodes` | `1` |
| `train_num_epochs_phase_1` | `2` |
| `train_num_epochs_phase_2` | `2` |
| `train_effective_batch_size_phase_1` | `3840` |
| `train_effective_batch_size_phase_2` | `3840` |
| `train_learning_rate_phase_1` | `.0001` |
| `train_learning_rate_phase_2` | `.0001` |
| `train_num_warmup_steps_phase_1` | `800` |
| `train_num_warmup_steps_phase_2` | `800` |
| `train_save_samples` | `0` |
| `train_max_batch_len` | `20000` |
| `train_seed` | `42` |
| `mt_bench_max_workers` | `"auto"` |
| `mt_bench_merge_system_user_message` | `False` |
| `final_eval_max_workers` | `"auto"` |
| `final_eval_few_shots` | `5` |
| `final_eval_batch_size` | `"auto"` |
| `final_eval_merge_system_user_message` | `False` |
| `k8s_storage_class_name` | `standard` |
Using these parameters will allow a user to run the complete pipeline much more quickly; in testing we have found this to take about 90 minutes.
Additionally, we can point the `judge-server` and `teacher-server` to the same Mistral model, which only uses 1 GPU, and the PyTorchJob configuration specified here also only uses 2 training nodes of 1 GPU each, so a total of 3 GPUs is required, rather than the 8-9 GPUs required for the full pipeline.
With that said, the output model quality is likely very poor, and these settings should only be used for testing purposes.