
Unable to load/run LoRA adapters on Llama-2-7B #2727

Open
kaushikmitr opened this issue Nov 5, 2024 · 0 comments


kaushikmitr commented Nov 5, 2024


System Info

TGI Versions:

  • 2.4.0: Deployment fails with the error:
    ERROR text_generation_launcher: Error when initializing model
  • 2.1.1: Deployment succeeds, but CURL requests fail with the error:
    ERROR text_generation_launcher: Method Prefill encountered an error

Both of the following LoRA adapters exhibit the same issue on both versions:

  1. vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
  2. xtuner/Llama-2-7b-qlora-moss-003-sft

Attached is the error log archive.
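
As a quick sanity check that the adapter repositories themselves are well-formed, their adapter_config.json files can be fetched straight from the Hub (a diagnostic sketch, assuming both repos follow the standard PEFT layout; both are expected to declare a Llama-2-7b base model):

# Print each adapter's declared base model and PEFT type
curl -s https://huggingface.co/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm/raw/main/adapter_config.json
curl -s https://huggingface.co/xtuner/Llama-2-7b-qlora-moss-003-sft/raw/main/adapter_config.json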


Configuration Details

  • Platform: Docker
  • Execution Method: Command Line Interface (CLI); an equivalent docker run invocation is sketched below
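
For anyone reproducing outside GKE, this is a minimal docker run equivalent of the container spec in the Deployment below (a sketch: the host port 8080, the local ./data volume path, and passing the token via the HUGGING_FACE_HUB_TOKEN environment variable are local stand-ins for the manifest's probe, hostPath volume, and Secret):

# Same launcher arguments as the Kubernetes Deployment below
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  ghcr.io/huggingface/text-generation-inference:2.4.0 \
  --model-id meta-llama/Llama-2-7b-hf \
  --num-shard 1 \
  --max-concurrent-requests 128 \
  --lora-adapters vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm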

Deployment Setup

This YAML configuration was deployed on a single A100-40GB node (a2-highgpu-2g) within GKE:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
  labels:
    app: tgi
spec:
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
        ai.gke.io/inference-server: text-generation-inference
        examples.ai.gke.io/source: ai-on-gke-benchmarks
    spec:
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
        - name: data
          hostPath:
            path: /mnt/stateful_partition/kube-ephemeral-ssd/data
      containers:
        - name: text-generation-inference
          ports:
            - containerPort: 80
          image: "ghcr.io/huggingface/text-generation-inference:2.4.0"  # or "ghcr.io/huggingface/text-generation-inference:2.1.1"
          args: ["--model-id", "meta-llama/Llama-2-7b-hf", "--num-shard", "1", "--max-concurrent-requests", "128", "--lora-adapters", "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"]
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              cpu: "1"
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /data
              name: data
          startupProbe:
            failureThreshold: 3600
            periodSeconds: 10
            timeoutSeconds: 60
            exec:
              command:
                - /usr/bin/curl
                - http://localhost:80/generate
                - -X
                - POST
                - -d
                - '{"inputs":"test", "parameters":{"max_new_tokens":1, "adapter_id" : "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"}}'
                - -H
                - 'Content-Type: application/json'
                - --fail
---
apiVersion: v1
kind: Service
metadata:
  name: tgi
  labels:
    app: tgi
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: tgi
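
Once the Service has an external IP, the request that fails on 2.1.1 is the same one the startup probe issues (EXTERNAL_IP below is a placeholder for the LoadBalancer address):

curl http://EXTERNAL_IP/generate \
  -X POST \
  -d '{"inputs":"test", "parameters":{"max_new_tokens":1, "adapter_id": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"}}' \
  -H 'Content-Type: application/json' \
  --fail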

Expected Behavior

Both deployment and startup should succeed without errors, as the same LoRA adapters are known to work with vLLM and LoRAX.

