LMEvalJobs with Kueue suspend enabled start immediately #362

ruivieira · 2024-11-14T15:10:32Z

Tested with:

ODH 2.19
TrustyAI latest

DataScienceCluster:

apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/part-of: opendatahub-operator
spec:
  components:
    codeflare:
      managementState: Removed
    kserve:
      serving:
        ingressGateway:
          certificate:
            type: OpenshiftDefaultIngress
        managementState: Managed
        name: knative-serving
      managementState: Removed
      defaultDeploymentMode: Serverless
    modelregistry:
      registriesNamespace: odh-model-registries
      managementState: Removed
    trustyai:
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: ''
            uri: 'https://github.com/trustyai-explainability/trustyai-service-operator/tarball/main'
      managementState: Managed
    ray:
      managementState: Removed
    kueue:
      managementState: Managed
    workbenches:
      managementState: Removed
    dashboard:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    datasciencepipelines:
      managementState: Removed
    trainingoperator:
      managementState: Removed

LMEvalJob CR:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  suspend: true
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true

yhwang · 2024-11-14T17:35:42Z

One quick question: is JOB_MGR included in the enable-services list here? https://github.com/trustyai-explainability/trustyai-service-operator/blob/main/config/manager/manager.yaml#L36

JOB_MGR is not in the deployment yet, and the overlays that enable it are not there yet either. For now, this feature has to be enabled manually.
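
For anyone who wants to try this before an overlay lands, a minimal sketch of what enabling it manually could look like in the operator Deployment. Only the enable-services flag and the JOB_MGR value come from this thread; the container name, the --leader-elect argument, and the TAS and LMES service names are assumptions, so check the linked manager.yaml for the actual form.

```yaml
# Hypothetical excerpt of the operator Deployment (config/manager/manager.yaml).
# Assumption: services are passed as a comma-separated list after --enable-services.
spec:
  template:
    spec:
      containers:
        - name: manager            # assumed container name
          args:
            - --leader-elect       # assumed existing argument
            - --enable-services
            - TAS,LMES,JOB_MGR     # JOB_MGR appended; TAS and LMES are assumed defaults
```

The operator pod has to be restarted (or the Deployment re-applied) for a changed argument list to take effect.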

yhwang · 2024-11-15T17:59:07Z

Let me re-post my comment from another PR here, so that all the settings needed to enable the Kueue integration with LMEvalJob are in one place:

When enabling Kueue for LMES, one extra update to the kueue-manager-config ConfigMap is needed: LMEvalJob must be added to externalFrameworks, like this:

integrations:
  .......
  externalFrameworks:
  - "trustyai.opendatahub.io/lmevaljob"
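
With both settings in place, Kueue still needs to know which queue a job belongs to. A minimal sketch of the LMEvalJob from the issue, assuming the integration follows Kueue's usual kueue.x-k8s.io/queue-name label convention. The queue name user-queue is hypothetical, and a matching LocalQueue (backed by a ClusterQueue) is assumed to exist in the same namespace.

```yaml
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
  labels:
    # Assumption: standard Kueue label; "user-queue" is a hypothetical LocalQueue name.
    kueue.x-k8s.io/queue-name: user-queue
spec:
  # Created suspended; Kueue is expected to unsuspend the job once it is admitted.
  suspend: true
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true
```

If the job still starts immediately with this label set, the first thing to check is whether JOB_MGR is actually enabled in the running operator.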

ruivieira added the kind/bug and lm-eval labels Nov 14, 2024

ruivieira added this to TrustyAI planning Nov 14, 2024

github-project-automation bot moved this to Todo in TrustyAI planning Nov 14, 2024

ruivieira added this to the LM-Eval milestone Nov 14, 2024

ruivieira changed the title ~~LMEvalJobs with Keueu suspend enabled start immediately~~ LMEvalJobs with Kueue suspend enabled start immediately Nov 14, 2024

This was linked to pull requests Nov 14, 2024

feat(lmeval): Enable Kueue Job Manager in LMEval #363

Closed

feat(lmeval): Add overlay for Kueue support #364

Merged

ruivieira closed this as completed in #364 Nov 15, 2024

github-project-automation bot moved this from Todo to Done in TrustyAI planning Nov 15, 2024
