LMEvalJobs with Kueue suspend enabled start immediately #362

Closed
ruivieira opened this issue Nov 14, 2024 · 2 comments · Fixed by #364
Labels
kind/bug (Something isn't working), lm-eval (Issues related to LM-Eval)
Milestone

Comments

@ruivieira
Member

Tested with:

  • ODH 2.19
  • TrustyAI latest

DataScienceCluster:

kind: DataScienceCluster
metadata:
  name: default-dsc
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/part-of: opendatahub-operator
spec:
  components:
    codeflare:
      managementState: Removed
    kserve:
      serving:
        ingressGateway:
          certificate:
            type: OpenshiftDefaultIngress
        managementState: Managed
        name: knative-serving
      managementState: Removed
      defaultDeploymentMode: Serverless
    modelregistry:
      registriesNamespace: odh-model-registries
      managementState: Removed
    trustyai:
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: ''
            uri: 'https://github.com/trustyai-explainability/trustyai-service-operator/tarball/main'
      managementState: Managed
    ray:
      managementState: Removed
    kueue:
      managementState: Managed
    workbenches:
      managementState: Removed
    dashboard:
      managementState: Managed   
    modelmeshserving:
      managementState: Managed
    datasciencepipelines:
      managementState: Removed
    trainingoperator:
      managementState: Removed

LMEvalJob CR:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  suspend: true
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base 
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli" 
      template: "templates.classification.multi_class.relation.default" 
  logSamples: true
@ruivieira ruivieira added kind/bug Something isn't working lm-eval Issues related to LM-Eval labels Nov 14, 2024
@ruivieira ruivieira added this to the LM-Eval milestone Nov 14, 2024
@ruivieira ruivieira changed the title LMEvalJobs with Keueu suspend enabled start immediately LMEvalJobs with Kueue suspend enabled start immediately Nov 14, 2024
@yhwang
Collaborator

yhwang commented Nov 14, 2024

One quick question: is JOB_MGR added to the enable-services list here? https://github.com/trustyai-explainability/trustyai-service-operator/blob/main/config/manager/manager.yaml#L36

JOB_MGR is not in the deployment yet, and the overlays that enable it are not there yet either, so this feature needs to be enabled manually.
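
For reference, a minimal sketch of what enabling it manually might look like, assuming the operator takes the service list as the value following an `--enable-services` argument in the linked manager.yaml. Only the `JOB_MGR` entry comes from this thread; the container name, the other services and the comma-separated format are assumptions, so check them against the actual file before applying.

```yaml
# Hypothetical excerpt of config/manager/manager.yaml (Deployment pod spec).
# Only JOB_MGR is taken from this thread; everything else is an assumption.
spec:
  template:
    spec:
      containers:
        - name: manager              # assumed container name
          args:
            - --enable-services
            - "TAS,LMES,JOB_MGR"     # assumed existing services, with JOB_MGR appended
```

On a running cluster the same change would have to be made to the operator's Deployment, since the overlays do not set it yet.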

@yhwang
Collaborator

yhwang commented Nov 15, 2024

Let me re-post my comment from another PR here, so that all the settings needed to enable the Kueue integration with LMEvalJob are in one place:

When enabling Kueue for LMES, one extra update to the kueue-manager-config ConfigMap is needed: LMEvalJob must be added to externalFrameworks like this:

integrations:
  .......
  externalFrameworks:
  - "trustyai.opendatahub.io/lmevaljob"
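
To show where that fragment sits, here is a rough sketch of the surrounding ConfigMap. The namespace, the `controller_manager_config.yaml` data key, the `Configuration` API version and the `frameworks` entry are assumptions based on a typical Kueue install, not something confirmed in this thread; only the `externalFrameworks` entry is taken from the snippet above.

```yaml
# Sketch of the kueue-manager-config ConfigMap with the LMEvalJob entry added.
# Namespace, data key, apiVersion and the frameworks list are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kueue-manager-config
  namespace: opendatahub               # assumed install namespace
data:
  controller_manager_config.yaml: |
    apiVersion: config.kueue.x-k8s.io/v1beta1
    kind: Configuration
    integrations:
      frameworks:
        - "batch/job"                  # placeholder for whatever is already enabled
      externalFrameworks:
        - "trustyai.opendatahub.io/lmevaljob"
```

The Kueue controller most likely needs a restart after the ConfigMap changes before it picks up the new integration.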

@github-project-automation github-project-automation bot moved this from Todo to Done in TrustyAI planning Nov 15, 2024
Projects
Status: Done
2 participants