Unlock Powerful Large Language Model Inference at Argonne Leadership Computing Facility (ALCF)
The ALCF Inference Endpoints provide a robust API for running Large Language Model (LLM) inference using Globus Compute on ALCF HPC Clusters.
| Cluster | Endpoint |
|---|---|
| Sophia | https://data-portal-dev.cels.anl.gov/resource_server/sophia |
🔒 Access Note:
- Endpoints are restricted. You must be on Argonne's network (use a VPN, Dash, or SSH into an ANL machine).
- You will need to authenticate with Argonne or ALCF SSO (Single Sign-On) using your credentials. See Authentication.
Supported frameworks and their base URLs:
- vLLM - https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm
- Infinity - https://data-portal-dev.cels.anl.gov/resource_server/sophia/infinity

Available API endpoints:
- https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions
- https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions
- https://data-portal-dev.cels.anl.gov/resource_server/sophia/infinity/v1/embeddings
📝 Note: Currently, embeddings are only supported by the Infinity framework. See Usage and/or refer to the OpenAI API docs for examples.
Available models:
- Qwen/Qwen2.5-14B-Instruct
- Qwen/Qwen2.5-7B-Instruct
- Qwen/QwQ-32B-Preview
- meta-llama/Meta-Llama-3-70B-Instruct
- meta-llama/Meta-Llama-3-8B-Instruct
- meta-llama/Meta-Llama-3.1-70B-Instruct
- meta-llama/Meta-Llama-3.1-8B-Instruct
- meta-llama/Meta-Llama-3.1-405B-Instruct
- meta-llama/Llama-3.3-70B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3
- mistralai/Mistral-Large-Instruct-2407
- mistralai/Mixtral-8x22B-Instruct-v0.1
- mgoin/Nemotron-4-340B-Instruct-hf
- auroragpt/auroragpt-0.1-chkpt-7B-Base
- Qwen/Qwen2-VL-72B-Instruct (Ranked #1 on the vision leaderboard)
- meta-llama/Llama-3.2-90B-Vision-Instruct
- nvidia/NV-Embed-v2 (Ranked #1 on the embedding leaderboard)
📝 Want to add a model? Place the HF-compatible, framework-supported model weights under /eagle/argonne_tpc/model_weights/ and contact Aditya Tanikanti.
When interacting with the inference endpoints, it's crucial to understand the system's operational characteristics:
- Initial Model Loading
  - The first query for a "cold" model takes approximately 10-15 minutes
  - Loading time depends on the specific model's size
  - A node must first be acquired and the model loaded into memory
- Cluster Resource Constraints
  - These endpoints run on a High-Performance Computing (HPC) cluster as PBS jobs
  - The cluster is used for multiple tasks beyond inference
  - During high-demand periods, your job might be queued
  - You may need to wait until computational resources become available
- Job and Model Running Status
  - To view currently running jobs along with the models served on the cluster, run:
    curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs" -H "Authorization: Bearer ${access_token}"
  - See Authentication for how to obtain access_token.
🚧 Future Improvements:
- The team is actively working on a node reservation system to reduce wait times and improve the user experience.
- If you’re interested in extended model runtimes, reservations, or private model deployments, please get in touch with us.
The models are currently run as part of a 24-hour job on Sophia. Here's how the endpoint activation works:
- The first query by an authorized user dynamically acquires and activates the endpoints
- Subsequent queries by authorized users will re-use the running job/endpoint
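Because a cold model can take 10-15 minutes to load, client calls against a cold endpoint may need a generous timeout. Below is a minimal sketch using requests; the timeout value is an illustrative assumption, and access_token.txt is generated as described under Authentication below:

import requests

# Load the access token (see Authentication)
with open('access_token.txt', 'r') as file:
    access_token = file.read().strip()

url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
headers = {
    'Authorization': f'Bearer {access_token}',
    'Content-Type': 'application/json'
}
data = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
}

# Allow up to ~20 minutes so a cold model has time to be scheduled and loaded
# (illustrative value; tune to your workload)
response = requests.post(url, headers=headers, json=data, timeout=1200)
print(response.json())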
# Create a new Conda environment
conda create -n globus_env python=3.11.9 -y
conda activate globus_env
# Install required package
pip install globus_sdk
# Install optional package
pip install openai
Generate an access token:
wget https://raw.githubusercontent.com/argonne-lcf/inference-endpoints/refs/heads/main/generate_auth_token.py
python3 generate_auth_token.py
access_token=$(cat access_token.txt)
⏰ Token Validity: Active for 48 hours
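generate_auth_token.py writes the token to access_token.txt. For long-running workflows you may want to regenerate it automatically as it nears the 48-hour limit; here is a minimal sketch (the 1-hour safety margin is an illustrative assumption, and regeneration may require an interactive login):

import os
import subprocess
import time

TOKEN_FILE = "access_token.txt"
MAX_AGE_SECONDS = 48 * 3600  # tokens are valid for 48 hours

def get_token():
    # Regenerate the token if the file is missing or close to expiry
    # (the 1-hour margin is an illustrative choice; the auth flow may prompt you to log in)
    age_limit = MAX_AGE_SECONDS - 3600
    if not os.path.exists(TOKEN_FILE) or time.time() - os.path.getmtime(TOKEN_FILE) > age_limit:
        subprocess.run(["python3", "generate_auth_token.py"], check=True)
    with open(TOKEN_FILE) as f:
        return f.read().strip()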
List the status of running jobs/endpoints on the cluster
#!/bin/bash
# Define the access token
access_token=$(cat access_token.txt)
curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/sophia/jobs" \
-H "Authorization: Bearer ${access_token}"
List all available endpoints
#!/bin/bash
# Define the access token
access_token=$(cat access_token.txt)
curl -X GET "https://data-portal-dev.cels.anl.gov/resource_server/list-endpoints" \
-H "Authorization: Bearer ${access_token}"
Chat Completions Curl Example
#!/bin/bash
# Define the access token
access_token=$(cat access_token.txt)
# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
# Define the model and parameters
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature=0.2
max_tokens=150
# Define an array of messages
messages=(
"List all proteins that interact with RAD51"
"What are the symptoms of diabetes?"
"How does photosynthesis work?"
)
# Loop through the messages and send a POST request for each
for message in "${messages[@]}"; do
curl -X POST "$base_url" \
-H "Authorization: Bearer ${access_token}" \
-H "Content-Type: application/json" \
-d '{
"model": "'$model'",
"temperature": '$temperature',
"max_tokens": '$max_tokens',
"messages":[{"role": "user", "content": "'"$message"'"}]
}'
done
Completions Curl Example
#!/bin/bash
# Define the access token
access_token=$(cat access_token.txt)
# Define the base URL
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/completions"
# Define the model and parameters
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
temperature=0.2
max_tokens=150
# Define an array of prompts
prompts=(
"List all proteins that interact with RAD51"
"What are the symptoms of diabetes?"
"How does photosynthesis work?"
)
# Loop through the prompts and send a POST request for each
for prompt in "${prompts[@]}"; do
echo "Prompt: $prompt"
curl -X POST "$base_url" \
-H "Authorization: Bearer ${access_token}" \
-H "Content-Type: application/json" \
-d '{
"model": "'$model'",
"temperature": '$temperature',
"max_tokens": '$max_tokens',
"prompt":"'"$prompt"'"
}'
done
Using Requests
import requests
import json
# Load access token
with open('access_token.txt', 'r') as file:
access_token = file.read().strip()
# Chat Completions Example
def send_chat_request(message):
url = "https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1/chat/completions"
headers = {
'Authorization': f'Bearer {access_token}',
'Content-Type': 'application/json'
}
data = {
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": message}]
}
response = requests.post(url, headers=headers, data=json.dumps(data))
return response.json()
output = send_chat_request("What is the purpose of life?")
print(output)
Using OpenAI Package
from openai import OpenAI
# Load access token
with open('access_token.txt', 'r') as file:
access_token = file.read().strip()
client = OpenAI(
api_key=access_token,
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1"
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response)
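vLLM's OpenAI-compatible server also supports streaming; a minimal sketch reusing the client from the example above (assuming streaming is enabled on the deployment):

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()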
Using a Vision Model
from openai import OpenAI
import base64
# Load access token
with open('access_token.txt', 'r') as file:
access_token = file.read().strip()
# Initialize the client
client = OpenAI(
api_key=access_token,
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/vllm/v1"
)
# Function to encode image to base64
def encode_image(image_path):
with open(image_path, "rb") as image_file:
return base64.b64encode(image_file.read()).decode('utf-8')
# Prepare the image
image_path = "scientific_diagram.png"
base64_image = encode_image(image_path)
# Create vision model request
response = client.chat.completions.create(
model="Qwen/Qwen2-VL-72B-Instruct",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the key components in this scientific diagram"},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}}
]
}
],
max_tokens=300
)
# Print the model's analysis
print(response.choices[0].message.content)
Using an Embedding Model
from openai import OpenAI
# Load access token
with open('access_token.txt', 'r') as file:
access_token = file.read().strip()
# Initialize the client
client = OpenAI(
api_key=access_token,
base_url="https://data-portal-dev.cels.anl.gov/resource_server/sophia/infinity/v1"
)
# Create Embeddings
completion = client.embeddings.create(
model="nvidia/NV-Embed-v2",
input="The food was delicious and the waiter...",
encoding_format="float"
)
# Print the embedding response
print(completion)
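The response carries the embedding vectors, which you can compare directly; a minimal sketch computing cosine similarity between two texts, reusing the client above (assumes numpy is installed):

import numpy as np

# Embed two texts in one call and compare them
result = client.embeddings.create(
    model="nvidia/NV-Embed-v2",
    input=["The food was delicious", "The meal tasted great"],
    encoding_format="float"
)
a = np.array(result.data[0].embedding)
b = np.array(result.data[1].embedding)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"Cosine similarity: {cosine:.4f}")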
- Connection Timeout?
  - Regenerate your access token
  - Verify you are on the Argonne network
  - Your job may be queued if the cluster has too many pending jobs
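If requests time out while a model is cold or your job is queued, a simple retry loop can help; a minimal sketch (the retry count, delay, and timeout are illustrative assumptions):

import time
import requests

def post_with_retries(url, headers, data, retries=5, timeout=1200):
    # A cold model can take 10-15 minutes to load, so retry patiently
    for attempt in range(retries):
        try:
            return requests.post(url, headers=headers, json=data, timeout=timeout)
        except requests.exceptions.Timeout:
            print(f"Attempt {attempt + 1} timed out; retrying in 60s...")
            time.sleep(60)
    raise RuntimeError("All retries timed out")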