diff --git a/README.md b/README.md
index 7dd961ea..018d1f62 100644
--- a/README.md
+++ b/README.md
@@ -37,13 +37,16 @@ The core of Mooncake is its KVCache-centric scheduler, which balances maximizing

🧩 Components

-
-
+![components](image/components.png)
-- The bottom part of Mooncake is **Transfer Engine**, which supports rapid, reliable and flexible data transfer over TCP, RDMA, NVIDIA GPUDirect-based RDMA and and NVMe over Fabric (NVMe-of) protocols. Comparing with gloo (used by Distributed PyTorch) and TCP, Mooncake Transfer Engine has the lowest I/O latency.
-- Based on **Transfer Engine**, we implemented the **P2P Store** library, supports sharing temporary objects (e.g., checkpoint files) among nodes in a cluster. It avoids bandwidth saturation on a single machine.
-- Additionally, we modified vLLM so that **Transfer Engine** is integrated. It makes prefill-decode disaggregation more efficient by utilizing RDMA devices.
-**Mooncake Store** is based on **Transfer Engine**, which supports distributed pooled KVCache for vLLM's xPyD disaggregation.
+**Mooncake Core Component: Transfer Engine (TE)**
+The core of Mooncake is the Transfer Engine (TE), which provides a unified interface for batched data transfer across various storage devices and network links. Supporting multiple protocols, including TCP, RDMA, CXL/shared-memory, and NVMe over Fabric (NVMe-of), TE is designed to deliver fast and reliable data transfer for AI workloads. Compared to Gloo (used by Distributed PyTorch) and traditional TCP, TE achieves significantly lower I/O latency, making it an efficient solution for data transmission.
+
+**P2P Store and Mooncake Store**
+Both P2P Store and Mooncake Store are built on the Transfer Engine and provide key/value caching for different scenarios. P2P Store focuses on sharing temporary objects (e.g., checkpoint files) across nodes in a cluster, preventing bandwidth saturation on a single machine. Mooncake Store, on the other hand, supports a distributed, pooled KVCache, specifically designed for xPyD disaggregation to enhance resource utilization and system performance.
+
+**Mooncake Integration with Leading LLM Inference Systems**
+Mooncake has been integrated with several popular large language model (LLM) inference systems. Through collaboration with the vLLM and SGLang teams, Mooncake officially supports prefill-decode disaggregation. By leveraging the high-efficiency communication capabilities of RDMA devices, Mooncake significantly improves inference efficiency in prefill-decode disaggregation scenarios, providing robust technical support for large-scale distributed inference.

🔥 Show Cases

@@ -73,13 +76,18 @@ P2P Store is built on the Transfer Engine and supports sharing temporary objects
 - **Efficient data distribution.** Designed to enhance the efficiency of large-scale data distribution, P2P Store *avoids bandwidth saturation* issues by allowing replicated nodes to share data directly. This reduces the CPU/RDMA NIC pressures of data providers (e.g., trainers).
 
-#### Performance
-Thanks to the high performance of Transfer Engine, P2P Stores can also distribute objects with full utilization of *hardware incoming bandwidth* (e.g., A 25Gbps NIC was used in the following figure, and the throughput of get replica is about 3.1 GB/s).
+
-![p2p-store.gif](image/p2p-store.gif)
+
 ### Mooncake Store ([Guide](doc/en/mooncake-store-preview.md))
-Mooncake Store is a distributed KVCache storage engine specialized for LLM inference. It offers object-level APIs (`Put`, `Get` and `Remove`), and we will soon release an new vLLM integration to demonstrate xPyD disaggregation. Mooncake Store is the central component of the KVCache-centric disaggregated architecture.
+Mooncake Store is a distributed KVCache storage engine built on Transfer Engine and specialized for LLM inference. It is the central component of the KVCache-centric disaggregated architecture. The goal of Mooncake Store is to store reusable KV caches across various locations in an inference cluster. Mooncake Store is already supported in [vLLM's prefill serving](https://docs.vllm.ai/en/latest/features/disagg_prefill.html).
+
+#### Highlights
+- **Multi-replica support**: Mooncake Store supports storing multiple data replicas for the same object, effectively alleviating hotspots in access pressure.
+
+- **High bandwidth utilization**: Mooncake Store supports striping and parallel I/O transfer of large objects, fully utilizing multi-NIC aggregated bandwidth for high-speed data reads and writes.
 
 ### vLLM Integration ([Guide v0.2](doc/en/vllm-integration-v0.2.md))
 To optimize LLM inference, the vLLM community is working on supporting [disaggregated prefilling (PR 10502)](https://github.com/vllm-project/vllm/pull/10502). This feature allows separating the **prefill** phase from the **decode** phase in different processes. The vLLM uses `nccl` and `gloo` as the transport layer by default, but currently it cannot efficiently decouple both phases in different machines.
diff --git a/doc/en/build.md b/doc/en/build.md
index ba1827ca..faaa6611 100644
--- a/doc/en/build.md
+++ b/doc/en/build.md
@@ -24,7 +24,7 @@ This document describes how to build Mooncake.
    ```
 3. Install Mooncake python package and mooncake_master executable
    ```bash
-   make install
+   sudo make install
    ```
 
 ## Manual
@@ -44,14 +44,21 @@ This document describes how to build Mooncake.
    ```bash
    # For debian/ubuntu
    apt-get install -y build-essential \
-      cmake \
-      libibverbs-dev \
-      libgoogle-glog-dev \
-      libgtest-dev \
-      libjsoncpp-dev \
-      libnuma-dev \
-      libcurl4-openssl-dev \
-      libhiredis-dev
+      cmake \
+      libibverbs-dev \
+      libgoogle-glog-dev \
+      libgtest-dev \
+      libjsoncpp-dev \
+      libnuma-dev \
+      libunwind-dev \
+      libpython3-dev \
+      libboost-all-dev \
+      libssl-dev \
+      pybind11-dev \
+      libcurl4-openssl-dev \
+      libhiredis-dev \
+      pkg-config \
+      patchelf
 
    # For centos/alibaba linux os
    yum install cmake \
@@ -107,14 +114,14 @@ This document describes how to build Mooncake.
 
 ## Use Mooncake in Docker Containers
 Mooncake supports Docker-based deployment. What you need is to get Docker image by `docker pull alogfans/mooncake`.
-For the the container to use the host's network resources, you need to add the option when starting the container. The following is an example.
+For the container to use the host's network resources and RDMA devices, you need to add the `--device` option when starting the container. The following is an example.
 ```
 # In host
 sudo docker run --net=host --device=/dev/infiniband/uverbs0 --device=/dev/infiniband/rdma_cm --ulimit memlock=-1 -t -i mooncake:v0.9.0 /bin/bash
-# In container
-root@vm-5-3-3:/# cd /Mooncake-main/build/mooncake-transfer-engine/example
-root@vm-5-3-3:/Mooncake-main/build/mooncake-transfer-engine/example# ./transfer_engine_bench --device_name=ibp6s0 --metadata_server=10.1.101.3:2379 --mode=target --local_server_name=10.1.100.3
+# Run Transfer Engine in the container
+cd /Mooncake-main/build/mooncake-transfer-engine/example
+./transfer_engine_bench --device_name=ibp6s0 --metadata_server=10.1.101.3:2379 --mode=target --local_server_name=10.1.100.3
 ```
 
 ## Advanced Compile Options
diff --git a/doc/en/transfer-engine.md b/doc/en/transfer-engine.md
index 1a5f0046..3b2cdc26 100644
--- a/doc/en/transfer-engine.md
+++ b/doc/en/transfer-engine.md
@@ -102,7 +102,6 @@ After successfully compiling Transfer Engine, the test program `transfer_engine_
    For example, the following command line can be used to start the etcd service:
    ```bash
-   # This is 10.0.0.1
    etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://10.0.0.1:2379
    ```
 
@@ -110,58 +109,44 @@ After successfully compiling Transfer Engine, the test program `transfer_engine_
    For example, you can use the `http` service in the `mooncake-transfer-engine/example/http-metadata-server` example:
    ```bash
-   # This is 10.0.0.1
    # cd mooncake-transfer-engine/example/http-metadata-server
    go run . --addr=:8080
    ```
 
 2. **Start the target node.**
    ```bash
-   # This is 10.0.0.2
-   export MC_GID_INDEX=n
    ./transfer_engine_bench --mode=target \
                            --metadata_server=etcd://10.0.0.1:2379 \
-                           --local_server_name=10.0.0.2:12345 \
-                           --device_name=erdma_0
+                           [--local_server_name=TARGET_NAME] \
+                           [--device_name=erdma_0 | --auto_discovery]
    ```
    The meanings of the various parameters are as follows:
-   - The default value of the parameter corresponding to the environment variable `MC_GID_INDEX` is 0, which means that the Transfer Engine selects a GID that is most likely to be connected. Since this parameter depends on the specific network environment, the user has to set the value of the environment variable manually if the connection is hung. The environment variable `NCCL_IB_GID_INDEX` is equivalent to this function.
    - `--mode=target` indicates the start of the target node. The target node does not initiate read/write requests; it passively supplies or writes data as required by the initiator node.
-     > Note: In actual applications, there is no need to distinguish between target nodes and initiator nodes; each node can freely initiate read/write requests to other nodes in the cluster.
+     > [!NOTE]
+     > In actual applications, there is no need to distinguish between target nodes and initiator nodes; each node can freely initiate read/write requests to other nodes in the cluster.
    - `--metadata_server` is the address of the metadata server. Its form is `[proto]://[hostname:port]`.
      For example, the following addresses are VALID:
      - Use `etcd` as metadata storage: `"10.0.0.1:2379"`, `"etcd://10.0.0.1:2379"` or `"etcd://10.0.0.1:2379,10.0.0.2:2379"`
      - Use `redis` as metadata storage: `"redis://10.0.0.1:6379"`
      - Use `http` as metadata storage: `"http://10.0.0.1:8080/metadata"`
-   - `--local_server_name` represents the address of this machine, which does not need to be set in most cases. If this option is not set, the value is equivalent to the hostname of this machine (i.e., `hostname(2)`). Other nodes in the cluster will use this address to attempt out-of-band communication with this node to establish RDMA connections.
-     > Note: If out-of-band communication fails, the connection cannot be established. Therefore, if necessary, you need to modify the `/etc/hosts` file on all nodes in the cluster to locate the correct node through the hostname.
-   - `--device_name` indicates the name of the RDMA network card used in the transfer process.
-     > Tip: Advanced users can also pass in a JSON file of the network card priority matrix through `--nic_priority_matrix`, for details, refer to the developer manual of Transfer Engine.
+   - `--local_server_name` represents the segment name of the current node, which does not need to be set in most cases. If this option is not set, the value is equivalent to the hostname of this machine (i.e., `hostname(2)`). This name should be unique within the cluster.
+   - `--device_name` indicates the name(s) of the RDMA network card(s) used in the transfer process (separated by commas, without spaces). You can also specify `--auto_discovery` to enable automatic topology discovery, which generates a network card priority matrix based on the operating system configuration.
    - In network environments that only support TCP, the `--protocol=tcp` parameter can be used; in this case, there is no need to specify the `--device_name` parameter.
-   You can also specify `--auto_discovery` to enable discovery topology automatically, which generates a network card priority matrix based on the operating system configuration. Then, `--device_name` parameter is not required.
-      ```
-      ./transfer_engine_bench --mode=target \
-                              --metadata_server=10.0.0.1:2379 \
-                              --local_server_name=10.0.0.2:12345 \
-                              --auto_discovery
-      ```
-
 1. **Start the initiator node.**
    ```bash
-   # This is 10.0.0.3
-   export MC_GID_INDEX=n
-   ./transfer_engine_bench --metadata_server=10.0.0.1:2379 \
-                           --segment_id=10.0.0.2:12345 \
-                           --local_server_name=10.0.0.3:12346 \
-                           --device_name=erdma_1
+   ./transfer_engine_bench --metadata_server=etcd://10.0.0.1:2379 \
+                           --segment_id=TARGET_NAME \
+                           [--local_server_name=INITIATOR_NAME] \
+                           [--device_name=erdma_1 | --auto_discovery]
    ```
    The meanings of the various parameters are as follows (the rest are the same as before):
-   - `--segment_id` can be simply understood as the hostname of the target node and needs to be consistent with the value passed to `--local_server_name` when starting the target node (if any).
+   - `--segment_id` is the segment name of the target node. It needs to be consistent with the value passed to `--local_server_name` when starting the target node (if any).
 
    Under normal circumstances, the initiator node will start the transfer operation, wait for 10 seconds, and then display the "Test completed" message, indicating that the test is complete.
 
    The initiator node can also configure the following test parameters: `--operation` (can be `"read"` or `"write"`), `batch_size`, `block_size`, `duration`, `threads`, etc.
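As an editorial aside, a concrete invocation that combines the options above may be easier to follow than the bracketed placeholders. This is only a minimal sketch: the segment name `TARGET_NAME`, the metadata address, and every parameter value below are illustrative assumptions, and the test flags are assumed to take the `--flag=value` form listed in the paragraph above.

```bash
# Illustrative initiator run against a target registered under the segment name TARGET_NAME.
# Values are examples, not defaults; --protocol=tcp is shown for environments without RDMA NICs.
./transfer_engine_bench --metadata_server=etcd://10.0.0.1:2379 \
                        --segment_id=TARGET_NAME \
                        --protocol=tcp \
                        --operation=read \
                        --batch_size=16 \
                        --block_size=65536 \
                        --duration=10 \
                        --threads=4
```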
+> [!NOTE]
 > If an exception occurs during execution, it is usually due to incorrect parameter settings. It is recommended to refer to the [troubleshooting document](troubleshooting.md) for preliminary troubleshooting.
 
 ### Sample Run
 
@@ -402,9 +387,7 @@ TransferEngine needs to initializing by calling the `init` method before further
 TransferEngine();
 
 int init(const std::string &metadata_conn_string,
-         const std::string &local_server_name,
-         const std::string &ip_or_host_name,
-         uint64_t rpc_port = 12345);
+         const std::string &local_server_name);
 ```
 
 - `metadata_conn_string`: Connecting string of metadata storage servers, i.e., the IP address/hostname of `etcd`/`redis` or the URI of the http service. The general form is `[proto]://[hostname:port]`. For example, the following metadata server addresses are legal:
@@ -413,9 +396,6 @@ The general form is `[proto]://[hostname:port]`. For example, the following meta
   - Using `http` as a metadata storage service: `"http://10.0.0.1:8080/metadata"`
 
 - `local_server_name`: The local server name, ensuring uniqueness within the cluster. It also serves as the name of the RAM Segment that other nodes refer to the current instance (i.e., Segment Name).
-- `ip_or_host_name`: The name used for other clients to connect, which can be a hostname or IP address.
-- `rpc_port`: The rpc port used for interaction with other clients.
-- Return value: If successful, returns 0; if TransferEngine has already been init, returns -1.
 
 ```cpp
 ~TransferEngine();
 ```
 
 Reclaims all allocated resources and also deletes the global meta data server information.
 
 ### Using C/C++ Interface
 After compiling Mooncake Store, you can move the compiled static library file `libtransfer_engine.a` and the C header file `transfer_engine_c.h` into your own project. There is no need to reference other files under `src/transfer_engine`.
-During the project build phase, you need to configure the following options for your application:
-```bash
--I/path/to/include
--L/path/to/lib -ltransfer_engine
--lnuma -lglog -libverbs -ljsoncpp -letcd-cpp-api -lprotobuf -lgrpc++ -lgrpc
-```
-
 ### Using Golang Interface
 To support the operational needs of P2P Store, Transfer Engine provides a Golang interface wrapper, see `mooncake-p2p-store/src/p2pstore/transfer_engine.go`.
diff --git a/doc/zh/transfer-engine.md b/doc/zh/transfer-engine.md
index 03c8a738..a44ebe6a 100644
--- a/doc/zh/transfer-engine.md
+++ b/doc/zh/transfer-engine.md
@@ -76,7 +76,6 @@ Transfer Engine 使用SIEVE算法来管理端点的逐出。如果由于链路
    例如,可使用如下命令行启动 `etcd` 服务:
    ```bash
-   # This is 10.0.0.1
    etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://10.0.0.1:2379
    ```
 
@@ -84,52 +83,39 @@ Transfer Engine 使用SIEVE算法来管理端点的逐出。如果由于链路
    例如,可使用 `mooncake-transfer-engine/example/http-metadata-server` 示例中的 `http` 服务:
    ```bash
-   # This is 10.0.0.1
    cd mooncake-transfer-engine/example/http-metadata-server
    go run . --addr=:8080
    ```
 
 2. **启动目标节点。**
    ```bash
-   # This is 10.0.0.2
    ./transfer_engine_bench --mode=target \
                            --metadata_server=etcd://10.0.0.1:2379 \
-                           --local_server_name=10.0.0.2:12345 \
-                           --device_name=erdma_0
+                           [--local_server_name=TARGET_NAME] \
+                           [--device_name=erdma_0 | --auto_discovery]
    ```
    各个参数的含义如下:
-   - 环境变量 `MC_GID_INDEX` 对应参数的默认值为 0,表示由 Transfer Engine 选取一个最可能连通的 GID。由于该参数取决于具体的网络环境存在差异,如果连接被挂起,用户需手工设置环境变量的值。环境变量 `NCCL_IB_GID_INDEX` 与此功能等价。
    - `--mode=target` 表示启动目标节点。目标节点不发起读写请求,只是被动按发起节点的要求供给或写入数据。
-     > 注意:实际应用中可不区分目标节点和发起节点,每个节点可以向集群内其他节点自由发起读写请求。
+     > [!NOTE]
+     > 实际应用中可不区分目标节点和发起节点,每个节点可以向集群内其他节点自由发起读写请求。
    - `--metadata_server` 为元数据服务器地址,一般形式是 `[proto]://[hostname:port]`。例如,下列元数据服务器地址是合法的:
      - 使用 `etcd` 作为元数据存储服务:`"10.0.0.1:2379"` 或 `"etcd://10.0.0.1:2379"` 或 `"etcd://10.0.0.1:2379,10.0.0.2:2379"`
      - 使用 `redis` 作为元数据存储服务:`"redis://10.0.0.1:6379"`
      - 使用 `http` 作为元数据存储服务:`"http://10.0.0.1:8080/metadata"`
-   - `--local_server_name` 表示本机器地址,大多数情况下无需设置。如果不设置该选项,则该值等同于本机的主机名(即 `hostname(2)` )。集群内的其它节点会使用此地址尝试与该节点进行带外通信,从而建立 RDMA 连接。
-     > 注意:若带外通信失败则连接无法建立。因此,若有必要需修改集群所有节点的 `/etc/hosts` 文件,使得可以通过主机名定位到正确的节点。
-   - `--device_name` 表示传输过程使用的 RDMA 网卡名称。
-     > 提示:高级用户还可通过 `--nic_priority_matrix` 传入网卡优先级矩阵 JSON 文件,详细参考 [Transfer Engine 的开发者手册](#transferengineinstallorgettransport)。
+   - `--local_server_name` 表示本节点创建段(segment)名称,供发起节点引用。大多数情况下无需设置。如果不设置该选项,则该值等同于本机的主机名(即 `hostname(2)` )。
+   - `--device_name` 表示传输期间所用的网卡列表(使用逗号分割,不要添加空格)。也可使用 `--auto_discovery` 检测安装的所有网卡列表并进行使用。
    - 在仅支持 TCP 的网络环境中,可使用 `--protocol=tcp` 参数,此时不需要指定 `--device_name` 参数。
-   也可通过拓扑自动发现功能基于操作系统配置自动生成网卡优先级矩阵,此时不需要指定传输过程使用的 RDMA 网卡名称。
-      ```
-      ./transfer_engine_bench --mode=target \
-                              --metadata_server=10.0.0.1:2379 \
-                              --local_server_name=10.0.0.2:12345 \
-                              --auto_discovery
-      ```
-
 1. **启动发起节点。**
    ```bash
-   # This is 10.0.0.3
    export MC_GID_INDEX=n
-   ./transfer_engine_bench --metadata_server=10.0.0.1:2379 \
-                           --segment_id=10.0.0.2:12345 \
-                           --local_server_name=10.0.0.3:12346 \
-                           --device_name=erdma_1
+   ./transfer_engine_bench --metadata_server=etcd://10.0.0.1:2379 \
+                           --segment_id=TARGET_NAME \
+                           [--local_server_name=INITIATOR_NAME] \
+                           [--device_name=erdma_1 | --auto_discovery]
    ```
    各个参数的含义如下(其余同前):
-   - `--segment_id` 可以简单理解为目标节点的主机名,需要和启动目标节点时 `--local_server_name` 传入的值(如果有)保持一致。
+   - `--segment_id` 可以简单理解为目标节点对应的段名称,需要和启动目标节点时 `--local_server_name` 传入的值(如果有)保持一致。
 
    正常情况下,发起节点将开始进行传输操作,等待 10s 后回显“Test completed”信息,表明测试完成。
 
@@ -141,6 +127,7 @@ Transfer Engine 使用SIEVE算法来管理端点的逐出。如果由于链路
 
 ![transfer-engine-running](../../image/transfer-engine-running.gif)
 
+> [!NOTE]
 > 如果在执行期间发生异常,大多数情况是参数设置不正确所致,建议参考[故障排除文档](troubleshooting.md)先行排查。
 
 ## C/C++ API
@@ -369,9 +356,7 @@ TransferEngine 在完成构造后需要调用 `init` 函数进行初始化:
 TransferEngine();
 
 int init(const std::string &metadata_conn_string,
-         const std::string &local_server_name,
-         const std::string &ip_or_host_name,
-         uint64_t rpc_port = 12345);
+         const std::string &local_server_name);
 ```
 
 - metadata_conn_string: 元数据存储服务连接字符串,表示 `etcd`/`redis` 的 IP 地址/主机名,或者 http 服务的 URI。一般形式是 `[proto]://[hostname:port]`。例如,下列元数据服务器地址是合法的:
@@ -380,8 +365,6 @@ int init(const std::string &metadata_conn_string,
   - 使用 `http` 作为元数据存储服务:`"http://10.0.0.1:8080/metadata"`
 
 - local_server_name: 本地的 server name,保证在集群内唯一。它同时作为其他节点引用当前实例所属 RAM Segment 的名称(即 Segment Name)
-- ip_or_host_name: 用于被其它 client 连接的 name,可为 hostname 或 ip 地址。
-- rpc_port:当前进程占用于与其它 client 交互的 rpc 端口。
 - 返回值:若成功,返回 0;若 TransferEngine 已被 init 过,返回 -1。
 
 ```cpp
@@ -394,13 +377,6 @@ int init(const std::string &metadata_conn_string,
 
 ### 使用 C/C++ 接口二次开发
 在完成 Mooncake Store 编译后,可将编译好的静态库文件 `libtransfer_engine.a` 及 C 头文件 `transfer_engine_c.h`,移入到你自己的项目里。不需要引用 `src/transfer_engine` 下的其他文件。
-在项目构建阶段,需要为你的应用配置如下选项:
-```
--I/path/to/include
--L/path/to/lib -ltransfer_engine
--lnuma -lglog -libverbs -ljsoncpp -letcd-cpp-api -lprotobuf -lgrpc++ -lgrpc
-```
-
 ### 使用 Golang 接口二次开发
 为了支撑 P2P Store 的运行需求,Transfer Engine 提供了 Golang 接口的封装,详见 `mooncake-p2p-store/src/p2pstore/transfer_engine.go`。
diff --git a/image/components.png b/image/components.png
index d28d26fa..0ee25e75 100644
Binary files a/image/components.png and b/image/components.png differ
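Taken together, the updated documents describe a three-step benchmark flow. The sketch below only consolidates the commands already shown in the hunks above; the host addresses, the segment name `node2`, and the choice of `--auto_discovery` over an explicit `--device_name` are illustrative assumptions, not defaults.

```bash
# On the metadata host (10.0.0.1): start etcd as the metadata service.
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://10.0.0.1:2379

# On the target host: register a segment named node2 and passively serve requests.
./transfer_engine_bench --mode=target \
                        --metadata_server=etcd://10.0.0.1:2379 \
                        --local_server_name=node2 \
                        --auto_discovery

# On the initiator host: run the benchmark against the segment registered as node2.
./transfer_engine_bench --metadata_server=etcd://10.0.0.1:2379 \
                        --segment_id=node2 \
                        --operation=read \
                        --auto_discovery
```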