Add llama.cpp backend #2723

mfuntowicz · 2024-11-04T22:25:32Z

This PR is an initial implementation of llama.cpp as a potential backend for TGI.

It mostly targets CPU inference with single-stream or multi-stream scheduling, potentially spawning multiple instances of the same model, each running on a non-overlapping subset of the CPU cores.
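As a rough illustration of the "non-overlapping subset of the CPU cores" idea, here is a minimal sketch that splits a core count into contiguous ranges, one per model instance. The function name `partition_cores` and the contiguous-range policy are assumptions made for this example; they are not taken from the PR, which may assign cores differently (for instance, following NUMA topology).

```rust
use std::ops::Range;

/// Split `num_cores` logical cores into contiguous, non-overlapping ranges,
/// one per worker. Hypothetical helper for illustration, not the PR's code.
/// When the split is uneven, the first workers each receive one extra core.
fn partition_cores(num_cores: usize, num_workers: usize) -> Vec<Range<usize>> {
    // Nothing to assign without cores or workers.
    if num_cores == 0 || num_workers == 0 {
        return Vec::new();
    }
    // Never create more workers than there are cores.
    let workers = num_workers.min(num_cores);
    let base = num_cores / workers;
    let extra = num_cores % workers;

    let mut start = 0;
    (0..workers)
        .map(|i| {
            // The first `extra` workers absorb the remainder.
            let len = base + usize::from(i < extra);
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}
```

For example, `partition_cores(8, 2)` gives one instance cores 0 to 3 and the other cores 4 to 7. Each range could then be passed to whatever affinity mechanism the backend uses.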

The current implementation only allows a single request to run on a worker at a time; this constraint will be removed later on.
The current implementation also duplicates the weights for each worker; this constraint can potentially be removed later on.
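The one-request-per-worker constraint can be pictured as several workers pulling from a shared queue, each handling a single request before taking the next. The sketch below uses only the Rust standard library; the `Request` type, the `run_workers` function, and the `Arc<Mutex<Receiver>>` sharing scheme are assumptions for illustration and do not reflect the PR's actual types or queue implementation.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical request type; a real request would carry tokens and
/// sampling parameters.
struct Request {
    id: usize,
    prompt: String,
}

/// Spawn `num_workers` threads sharing one queue. Each worker processes
/// exactly one request at a time. Returns `(worker_index, request_id)` pairs.
fn run_workers(num_workers: usize, requests: Vec<Request>) -> Vec<(usize, usize)> {
    let (req_tx, req_rx) = mpsc::channel::<Request>();
    // std's Receiver is single-consumer, so share it behind a mutex.
    let req_rx = Arc::new(Mutex::new(req_rx));
    let (done_tx, done_rx) = mpsc::channel::<(usize, usize)>();

    let handles: Vec<_> = (0..num_workers)
        .map(|worker| {
            let req_rx = Arc::clone(&req_rx);
            let done_tx = done_tx.clone();
            thread::spawn(move || loop {
                // The lock guard is dropped at the end of this statement,
                // so the queue is free while this worker is busy.
                let next = req_rx.lock().unwrap().recv();
                match next {
                    Ok(req) => {
                        // Placeholder for the actual inference call.
                        let _work = req.prompt.len();
                        done_tx.send((worker, req.id)).unwrap();
                    }
                    // All senders dropped and queue drained: shut down.
                    Err(_) => break,
                }
            })
        })
        .collect();
    // Drop the original sender so `done_rx` ends once all workers exit.
    drop(done_tx);

    for request in requests {
        req_tx.send(request).unwrap();
    }
    // Closing the queue lets idle workers leave their loop.
    drop(req_tx);

    for handle in handles {
        handle.join().unwrap();
    }
    done_rx.iter().collect()
}
```

In this sketch a busy worker never holds a second request, so throughput scales with the number of workers rather than with batching inside a worker.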

mfuntowicz added 30 commits November 14, 2024 08:42

feat(llamacpp): initial commit

aa1fcba

# Conflicts: # Cargo.lock

feat(llamacpp): correctly handle CMAKE_BUILD_TYPE for spdlog macros

7d1f8a2

feat(llamacpp): initial end2end build

52d57dc

misc(cmake): add parameter to build specific cuda arch

e4432d3

misc(cmake): wut

fa89d1e

feat(llamacpp): enable cuda

05ad684

feat(backend): correctly load llama.cpp model from llama api and not gpt2

0911076

feat(backend): tell cmake to build llama-common and link to it

098c669

feat(backend): add some initial decoding steps

45d5a6a

feat(backend): use llama_token as TokenId type

92bb113

feat(backend): minor refactor

d4b5be1

feat(backend): expose frequency and repetition penalties

37faeb3

chore(backend): minor formatting

f9c2486

feat(backend): wip Rust binding

355d8a5

feat(backend): build and link through build.rs

e4d803c

misc(build): handle different lib destination folder lib/lib64

f0859c2

misc(build): refactor build type detection in cmake

179309b

feat(llamacpp): expose number of threads for the backend when constructing the model

a316c53

feat(llamacpp): wip explosion

0c1dd0e

misc(offline): link correctly

dbc5b7a

misc(offline): expose more parameters for generate

6115904

feat(backend): entirely rewrite backend

b98c635

misc(offline): update offline tester

6a5f6b0

feat(backend): full rework of the backend internal to safer c++

d52b4c4

misc(offline): match rework

3af2c68

feat(backend): add mapping for ignore_eos_token stopping criteria

f39edc7

feat(backend): add logit parameter in the callback fn

d4aee42

feat(backend): bind incoming request to the server

612f2f9

feat(backend): avoid dropping the boxed stream at the end of the callback

b50dcdd

feat(backend): somewhat generates the final infer response

3e82f14

mfuntowicz added 12 commits November 14, 2024 08:42

feat(backend): handle all the tokenization failure and send back to the client

26d0266

misc(cmake): remove dependency on fmt

cf17928

misc(cmake): use URL base llama.cpp repo

4f5397c

feat(backend): simplify overall cpp structure

86d30ae

feat(backend): remove reinterpret_cast converting from uint32_t to llama_token(int32_t)

6915fa3

feat(backend): remove unused function

7e2890f

feat(backend): fix invalid reference to context in release mode

488ba93

feat(backend): use std::ranges to map uint32_t to llama_token

363d5e4

chore(backend): minor improvements

02cd6fe

dockerfile(backend): initial working version of llama.cpp container

daf1631

feat(backend): simplify Rust callback

57b2154

feat(backend): wrap Arc tokenizer to avoid duplicating

6f059c4

mfuntowicz force-pushed the feat-backend-llamacpp branch from 85da2b1 to 6f059c4 November 14, 2024 07:42

mfuntowicz added 8 commits November 14, 2024 09:04

feat(backend): update llamacpp to 4077

70c90ad

misc(build): improve build process

23d2bcf

feat(backend): multistream inference on CPU

5335bf9

feat(backend): bind thread and memory affinity for thread

50c3766

feat(backend): correctly setup llama_context providing n_threads and n_ubatch

84eead2

feat(backend): rely on multi consumer queue to scheduler workers

5a85661

misc(docker): add numa lib as dependency

30ae996

misc(backend): allow rebinding numa core affinity

2d9465d

mfuntowicz marked this pull request as ready for review November 22, 2024 13:47

mfuntowicz added 3 commits November 22, 2024 14:48

misc(license): update LICENSE

4ee2ee5

misc(doc): c++ documentation

b9c04b9

misc(doc): rust documentation

862a519

mfuntowicz requested review from co42, Hugoch and OlivierDehaene November 22, 2024 14:37

chore: remove unrelated change to trtllm

9025a26

mfuntowicz requested a review from angt November 22, 2024 15:02
