
Unable to load/run LoRA adapters on Llama-2-7B #2727

Open
kaushikmitr opened this issue Nov 5, 2024 · 0 comments


kaushikmitr commented Nov 5, 2024


System Info

TGI Versions:

  • 2.4.0: Deployment fails with the error:
    ERROR text_generation_launcher: Error when initializing model
  • 2.1.1: Deployment succeeds, but CURL requests fail with the error:
    ERROR text_generation_launcher: Method Prefill encountered an error

Both of the following LoRA adapters exhibit the same issue on both versions:

  1. vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm
  2. xtuner/Llama-2-7b-qlora-moss-003-sft

Attached is the error log archive.
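
As a quick sanity check that the adapter repositories themselves are well-formed, their adapter_config.json files can be fetched straight from the Hub (a diagnostic sketch, assuming both repos follow the standard PEFT layout; both are expected to declare a Llama-2-7b base model):

# Print each adapter's declared base model and PEFT type
curl -s https://huggingface.co/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm/raw/main/adapter_config.json
curl -s https://huggingface.co/xtuner/Llama-2-7b-qlora-moss-003-sft/raw/main/adapter_config.json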


Configuration Details

  • Platform: Docker
  • Execution Method: Command Line Interface (CLI); an equivalent docker run invocation is sketched below
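
For anyone reproducing outside GKE, this is a minimal docker run equivalent of the container spec in the Deployment below (a sketch: the host port 8080, the local ./data volume path, and passing the token via the HUGGING_FACE_HUB_TOKEN environment variable are local stand-ins for the manifest's probe, hostPath volume, and Secret):

# Same launcher arguments as the Kubernetes Deployment below
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  ghcr.io/huggingface/text-generation-inference:2.4.0 \
  --model-id meta-llama/Llama-2-7b-hf \
  --num-shard 1 \
  --max-concurrent-requests 128 \
  --lora-adapters vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm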

Deployment Setup

This YAML configuration was deployed on a single A100-40GB node (a2-highgpu-2g) within GKE:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
  labels:
    app: tgi
spec:
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
        ai.gke.io/inference-server: text-generation-inference
        examples.ai.gke.io/source: ai-on-gke-benchmarks
    spec:
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
        - name: data
          hostPath:
            path: /mnt/stateful_partition/kube-ephemeral-ssd/data
      containers:
        - name: text-generation-inference
          ports:
            - containerPort: 80
          image: "ghcr.io/huggingface/text-generation-inference:2.4.0"  # or "ghcr.io/huggingface/text-generation-inference:2.1.1"
          args: ["--model-id", "meta-llama/Llama-2-7b-hf", "--num-shard", "1", "--max-concurrent-requests", "128", "--lora-adapters", "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"]
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              cpu: "1"
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /data
              name: data
          startupProbe:
            failureThreshold: 3600
            periodSeconds: 10
            timeoutSeconds: 60
            exec:
              command:
                - /usr/bin/curl
                - http://localhost:80/generate
                - -X
                - POST
                - -d
                - '{"inputs":"test", "parameters":{"max_new_tokens":1, "adapter_id" : "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"}}'
                - -H
                - 'Content-Type: application/json'
                - --fail
---
apiVersion: v1
kind: Service
metadata:
  name: tgi
  labels:
    app: tgi
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
  selector:
    app: tgi
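
Once the Service has an external IP, the request that fails on 2.1.1 is the same one the startup probe issues (EXTERNAL_IP below is a placeholder for the LoadBalancer address):

curl http://EXTERNAL_IP/generate \
  -X POST \
  -d '{"inputs":"test", "parameters":{"max_new_tokens":1, "adapter_id": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm"}}' \
  -H 'Content-Type: application/json' \
  --fail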

Expected Behavior

Both deployment and startup should succeed without errors, as the same LoRA adapters are known to work with vLLM and LoRAX.

