Basic Demo

In this demo, you will experience how to use the GLM-4-9B open source model to perform basic tasks.

Please follow the steps in the document strictly to avoid unnecessary errors.

Device and dependency check

Related inference test data

The data in this document are tested in the following hardware environment. The actual operating environment requirements and the GPU memory occupied by the operation are slightly different. Please refer to the actual operating environment.

Test hardware information:

OS: Ubuntu 22.04
Memory: 512GB
Python: 3.10.12 (recommend) / 3.12.3 have been tested
CUDA Version: 12.3
GPU Driver: 535.104.05
GPU: NVIDIA A100-SXM4-80GB * 8

The stress test data of relevant inference are as follows:

All tests are performed on a single GPU, and all GPU memory consumption is calculated based on the peak value

GLM-4-9B-Chat

Dtype	GPU Memory	Prefilling	Decode Speed	Remarks
BF16	19 GB	0.2s	27.8 tokens/s	Input length is 1000
BF16	21 GB	0.8s	31.8 tokens/s	Input length is 8000
BF16	28 GB	4.3s	14.4 tokens/s	Input length is 32000
BF16	58 GB	38.1s	3.4 tokens/s	Input length is 128000

Dtype	GPU Memory	Prefilling	Decode Speed	Remarks
INT4	8 GB	0.2s	23.3 tokens/s	Input length is 1000
INT4	10 GB	0.8s	23.4 tokens/s	Input length is 8000
INT4	17 GB	4.3s	14.6 tokens/s	Input length is 32000

GLM-4-9B-Chat-1M

Dtype	GPU Memory	Prefilling	Decode Speed	Remarks
BF16	74497MiB	98.4s	2.3653 tokens/s	Input length is 200000

If your input exceeds 200K, we recommend that you use the vLLM backend with multi gpus for inference to get better performance.

GLM-4V-9B

Dtype	GPU Memory	Prefilling	Decode Speed	Remarks
BF16	28 GB	0.1s	33.4 tokens/s	Input length is 1000
BF16	33 GB	0.7s	39.2 tokens/s	Input length is 8000

Dtype	GPU Memory	Prefilling	Decode Speed	Remarks
INT4	10 GB	0.1s	28.7 tokens/s	Input length is 1000
INT4	15 GB	0.8s	24.2 tokens/s	Input length is 8000

Minimum hardware requirements

If you want to run the most basic code provided by the official (transformers backend) you need:

Python >= 3.10
Memory of at least 32 GB

If you want to run all the codes in this folder provided by the official, you also need:

Linux operating system (Debian series is best)
GPU device with more than 8GB GPU memory, supporting CUDA or ROCM and supporting BF16 reasoning (FP16 precision cannot be finetuned, and there is a small probability of problems in infering)

Install dependencies

pip install -r requirements.txt

Basic function calls

**Unless otherwise specified, all demos in this folder do not support advanced usage such as Function Call and All Tools **

Use transformers backend code

Use the command line to communicate with the GLM-4-9B model.

python trans_cli_demo.py # GLM-4-9B-Chat
python trans_cli_vision_demo.py # GLM-4V-9B

Use the Gradio web client to communicate with the GLM-4-9B model.

python trans_web_demo.py  # GLM-4-9B-Chat
python trans_web_vision_demo.py # GLM-4V-9B

Use Batch inference.

python trans_batch_demo.py

Use vLLM backend code

Use the command line to communicate with the GLM-4-9B-Chat model.

python vllm_cli_demo.py

use LoRA adapters with vLLM on GLM-4-9B-Chat model.

# vllm_cli_demo.py
# add LORA_PATH = ''

Build the server by yourself and use the request format of OpenAI API to communicate with the glm-4-9b model. This demo supports Function Call and All Tools functions.
Modify the MODEL_PATH in open_api_server.py, and you can choose to build the GLM-4-9B-Chat or GLM-4v-9B server side.

Start the server:

python openai_api_server.py

Client request:

python openai_api_request.py

Stress test

Users can use this code to test the generation speed of the model on the transformers backend on their own devices:

python trans_stress_test.py

Use Ascend card to run code

Users can run the above code in the Ascend hardware environment. They only need to change the transformers to openmind and the cuda device in device to npu.

#from transformers import AutoModelForCausalLM, AutoTokenizer
from openmind import AutoModelForCausalLM, AutoTokenizer

#device = 'cuda'
device = 'npu'

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README_en.md

README_en.md

Basic Demo

Device and dependency check

Related inference test data

GLM-4-9B-Chat

GLM-4-9B-Chat-1M

GLM-4V-9B

Minimum hardware requirements

Basic function calls

Use transformers backend code

Use vLLM backend code

Stress test

Use Ascend card to run code

Files

README_en.md

Latest commit

History

README_en.md

File metadata and controls

Basic Demo

Device and dependency check

Related inference test data

GLM-4-9B-Chat

GLM-4-9B-Chat-1M

GLM-4V-9B

Minimum hardware requirements

Basic function calls

Use transformers backend code

Use vLLM backend code

Stress test

Use Ascend card to run code