
Using the Mistral-7B Model in Production

To use the Mistral-7B model from the Hugging Face Hub in production without relying on Hugging Face's servers, you can download the model weights once and serve the model from a local environment. Here's a step-by-step guide:

Step 1: Authenticate and Accept the User Agreement

Before downloading the model, you need to authenticate with Hugging Face and accept the user agreement for the Mistral-7B model.

  1. Create a Hugging Face account if you don't already have one.
  2. Accept the model's user agreement by visiting the model page on Hugging Face Hub and clicking on "Agree and Access".

Step 2: Install Hugging Face Transformers and Other Dependencies

You need to install the transformers library along with torch (PyTorch) to download and use the model. You can install these using pip:

pip install transformers torch
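
Step 6 below serves the model with a FastAPI app run by Uvicorn; if you plan to follow that step end-to-end, install those packages as well:

pip install fastapi uvicorn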

Step 3: Authenticate with Hugging Face CLI

You need to log in to the Hugging Face CLI to download the model weights.

huggingface-cli login

When prompted, paste an access token generated in your Hugging Face account settings (under Access Tokens).
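
If the machine you are provisioning cannot run an interactive login (for example in CI or an automated setup script), the huggingface_hub library, which is installed alongside transformers, also provides a programmatic login. A minimal sketch, assuming the token is supplied via an environment variable (HF_TOKEN is just an illustrative name):

import os
from huggingface_hub import login

# Read the token from an environment variable instead of typing it in;
# HF_TOKEN is an illustrative name, not a requirement.
login(token=os.environ["HF_TOKEN"])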

Step 4: Download the Model

Use the transformers library to download the model and tokenizer. This will download the model weights to your local machine.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"

# Download and cache the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Save the model and tokenizer locally
model.save_pretrained("./mistral-7b-model")
tokenizer.save_pretrained("./mistral-7b-tokenizer")
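
Loading the full model only to re-save it requires a large amount of RAM. As an alternative sketch, the huggingface_hub library can copy the repository files (weights, config, and tokenizer) straight to disk without instantiating the model; note that with this approach the tokenizer files end up in the same directory as the weights, so you would load both from that single path in the later steps:

from huggingface_hub import snapshot_download

# Download the raw repository files without loading the model into memory.
snapshot_download(repo_id="mistralai/Mistral-7B-v0.1", local_dir="./mistral-7b-model")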

Step 5: Load the Model Locally

Now, you can load the model from the saved directory without needing to connect to Hugging Face.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./mistral-7b-model"
tokenizer_dir = "./mistral-7b-tokenizer"

# Load the model and tokenizer from the local directory
model = AutoModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
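
Because from_pretrained is given a local directory, no network access is required here. If you want a hard guarantee that transformers never falls back to the Hub (for example on an air-gapped host), you can pass the local_files_only flag; a minimal sketch:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./mistral-7b-model"
tokenizer_dir = "./mistral-7b-tokenizer"

# local_files_only raises an error instead of ever contacting the Hub.
model = AutoModelForCausalLM.from_pretrained(model_dir, local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir, local_files_only=True)

Setting the HF_HUB_OFFLINE=1 environment variable achieves a similar effect process-wide.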

Step 6: Serve the Model

To serve the model in production, set up a web service with a framework such as Flask or FastAPI. Here's an example using FastAPI:

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

model_dir = "./mistral-7b-model"
tokenizer_dir = "./mistral-7b-tokenizer"

# Load the locally saved model and tokenizer once at startup.
model = AutoModelForCausalLM.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
model.eval()

class TextGenerationRequest(BaseModel):
    text: str
    max_length: int = 50

@app.post("/generate")
async def generate_text(request: TextGenerationRequest):
    inputs = tokenizer(request.text, return_tensors="pt")
    # Inference only, so gradients are not needed.
    with torch.no_grad():
        outputs = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=request.max_length,
        )
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"generated_text": generated_text}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
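
Once the server is running, you can exercise the endpoint with any HTTP client, for example with curl (the prompt text is arbitrary):

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of France is", "max_length": 50}'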

Step 7: Deploy the Service

You can deploy this FastAPI app with an ASGI server such as Uvicorn (optionally managed by Gunicorn using its Uvicorn worker class), and use a process manager like Supervisor or systemd to keep it running in production. You may also consider containerizing the application with Docker for easier deployment and scaling.
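
As an illustration of the Docker route, here is a minimal Dockerfile sketch. It assumes the FastAPI code above is saved as app.py (a hypothetical file name) and that the saved model and tokenizer directories sit next to it; given the size of the weights, you may prefer to mount them as a volume instead of baking them into the image:

FROM python:3.11-slim

WORKDIR /app

# Dependencies used throughout this guide
RUN pip install --no-cache-dir transformers torch fastapi uvicorn

# Application code and the locally saved weights (see Steps 4 and 5)
COPY app.py ./
COPY mistral-7b-model ./mistral-7b-model
COPY mistral-7b-tokenizer ./mistral-7b-tokenizer

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]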

Summary

By following these steps, you can download the model weights, set up a local environment, and serve the model in production without depending on Hugging Face's servers. This ensures that your service remains available even if Hugging Face experiences downtime.