[DOC] Update docs of README and Transfer Engine build guide to follow the latest code #228


Open · wants to merge 3 commits into main
28 changes: 18 additions & 10 deletions README.md
@@ -37,13 +37,16 @@ The core of Mooncake is its KVCache-centric scheduler, which balances maximizing

<h2 id="components">🧩 Components</h2>

<!-- ![components](image/components.png) -->
<img src=image/components.png width=75% />
![components](image/components.png)

- The bottom part of Mooncake is **Transfer Engine**, which supports rapid, reliable and flexible data transfer over TCP, RDMA, NVIDIA GPUDirect-based RDMA and NVMe over Fabric (NVMe-of) protocols. Compared with gloo (used by Distributed PyTorch) and TCP, Mooncake Transfer Engine has the lowest I/O latency.
- Based on **Transfer Engine**, we implemented the **P2P Store** library, which supports sharing temporary objects (e.g., checkpoint files) among nodes in a cluster. It avoids bandwidth saturation on a single machine.
- Additionally, we modified vLLM to integrate **Transfer Engine**. This makes prefill-decode disaggregation more efficient by utilizing RDMA devices.
- **Mooncake Store**, built on **Transfer Engine**, supports distributed pooled KVCache for vLLM's xPyD disaggregation.
**Mooncake Core Component: Transfer Engine (TE)**
The core of Mooncake is the Transfer Engine (TE), which provides a unified interface for batched data transfer across various storage devices and network links. Supporting multiple protocols including TCP, RDMA, CXL/shared-memory, and NVMe over Fabric (NVMe-of), TE is designed to enable fast and reliable data transfer for AI workloads. Compared to Gloo (used by Distributed PyTorch) and traditional TCP, TE achieves significantly lower I/O latency, making it a superior solution for efficient data transmission.
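
To make the batched-transfer idea concrete, here is a schematic C++ sketch of the request/batch pattern described above; every type and name below is an illustrative stand-in, not the actual Transfer Engine API (see the Transfer Engine guide for the real interface).

```cpp
// Illustrative stand-ins only: sketches the batched request pattern
// described above; this is NOT the actual Transfer Engine API.
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Opcode { Read, Write };

// One data movement: copy `length` bytes between a local buffer and an
// offset inside a remote memory segment.
struct RequestSketch {
    Opcode opcode;
    void* local_buffer;
    uint64_t remote_offset;
    size_t length;
};

int main() {
    std::vector<char> buffer(4096);

    // Batching many small requests lets the engine schedule them together
    // over whichever link (TCP, RDMA, NVMe-oF, ...) suits each transfer.
    std::vector<RequestSketch> batch;
    batch.push_back({Opcode::Read, buffer.data(), 0, buffer.size()});
    batch.push_back({Opcode::Write, buffer.data(), 4096, buffer.size()});

    // A real engine would submit the batch asynchronously and let the
    // caller poll per-request completion status.
    return 0;
}
```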

**P2P Store and Mooncake Store**
Both P2P Store and Mooncake Store are built on the Transfer Engine and provide key/value caching for different scenarios. P2P Store focuses on sharing temporary objects (e.g., checkpoint files) across nodes in a cluster, preventing bandwidth saturation on a single machine. Mooncake Store, on the other hand, supports distributed pooled KVCache, specifically designed for XpYd disaggregation to enhance resource utilization and system performance.

**Mooncake Integration with Leading LLM Inference Systems**
Mooncake has been seamlessly integrated with several popular large language model (LLM) inference systems. Through collaboration with the vLLM and SGLang teams, Mooncake now officially supports prefill-decode disaggregation. By leveraging the high-efficiency communication capabilities of RDMA devices, Mooncake significantly improves inference efficiency in prefill-decode disaggregation scenarios, providing robust technical support for large-scale distributed inference tasks.

<h2 id="show-cases">🔥 Show Cases</h2>

@@ -73,13 +76,18 @@ P2P Store is built on the Transfer Engine and supports sharing temporary objects

- **Efficient data distribution.** Designed to enhance the efficiency of large-scale data distribution, P2P Store *avoids bandwidth saturation* issues by allowing replicated nodes to share data directly. This reduces the CPU/RDMA NIC pressures of data providers (e.g., trainers).

#### Performance
Thanks to the high performance of Transfer Engine, P2P Store can also distribute objects with full utilization of *hardware incoming bandwidth* (e.g., a 25Gbps NIC was used in the following figure, and the throughput of get replica is about 3.1 GB/s).
<!-- #### Performance
Thanks to the high performance of Transfer Engine, P2P Store can also distribute objects with full utilization of *hardware incoming bandwidth* (e.g., a 25Gbps NIC was used in the following figure, and the throughput of get replica is about 3.1 GB/s). -->

![p2p-store.gif](image/p2p-store.gif)
<!-- ![p2p-store.gif](image/p2p-store.gif) -->

### Mooncake Store ([Guide](doc/en/mooncake-store-preview.md))
Mooncake Store is a distributed KVCache storage engine specialized for LLM inference. It offers object-level APIs (`Put`, `Get` and `Remove`), and we will soon release a new vLLM integration to demonstrate xPyD disaggregation. Mooncake Store is the central component of the KVCache-centric disaggregated architecture.
Mooncake Store is a distributed KVCache storage engine built on Transfer Engine and specialized for LLM inference. It is the central component of the KVCache-centric disaggregated architecture. The goal of Mooncake Store is to store the reusable KV caches across various locations in an inference cluster. Mooncake Store has been supported in [vLLM's prefill serving](https://docs.vllm.ai/en/latest/features/disagg_prefill.html).
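
As a concrete illustration of the object-level API (`Put`, `Get` and `Remove`) mentioned above, the following minimal C++ sketch mirrors that usage pattern with an in-memory stand-in class; it is illustrative only, not the actual Mooncake Store client (see the guide linked above for the real interface).

```cpp
// Illustrative only: an in-memory stand-in that mirrors the object-level
// API shape (Put/Get/Remove) described above. The real Mooncake Store
// client is distributed and built on Transfer Engine; the class and
// method signatures here are assumptions for illustration.
#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

class ObjectStoreSketch {
public:
    // Store a value under a key (in Mooncake Store, this would place one
    // or more replicas in the distributed KVCache pool).
    void Put(const std::string& key, const std::string& value) {
        store_[key] = value;
    }

    // Fetch a value; std::nullopt models a cache miss.
    std::optional<std::string> Get(const std::string& key) const {
        auto it = store_.find(key);
        if (it == store_.end()) return std::nullopt;
        return it->second;
    }

    // Drop an object from the store.
    void Remove(const std::string& key) { store_.erase(key); }

private:
    std::unordered_map<std::string, std::string> store_;
};

int main() {
    ObjectStoreSketch store;
    store.Put("kvcache/block/42", "serialized KV tensor bytes");
    if (auto v = store.Get("kvcache/block/42")) {
        std::cout << "hit: " << v->size() << " bytes\n";
    }
    store.Remove("kvcache/block/42");
}
```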

#### Highlights
- **Multi-replica support**: Mooncake Store supports storing multiple data replicas for the same object, effectively alleviating hotspots in access pressure.

- **High bandwidth utilization**: Mooncake Store supports striping and parallel I/O transfer of large objects, fully utilizing multi-NIC aggregated bandwidth for high-speed data reads and writes.

### vLLM Integration ([Guide v0.2](doc/en/vllm-integration-v0.2.md))
To optimize LLM inference, the vLLM community is working on supporting [disaggregated prefilling (PR 10502)](https://github.com/vllm-project/vllm/pull/10502). This feature allows separating the **prefill** phase from the **decode** phase into different processes. vLLM uses `nccl` and `gloo` as the transport layer by default, but currently it cannot efficiently decouple the two phases across different machines.
33 changes: 20 additions & 13 deletions doc/en/build.md
@@ -24,7 +24,7 @@ This document describes how to build Mooncake.
```
3. Install Mooncake python package and mooncake_master executable
```bash
make install
sudo make install
```

## Manual
@@ -44,14 +44,21 @@ This document describes how to build Mooncake.
```bash
# For debian/ubuntu
apt-get install -y build-essential \
cmake \
libibverbs-dev \
libgoogle-glog-dev \
libgtest-dev \
libjsoncpp-dev \
libnuma-dev \
libcurl4-openssl-dev \
libhiredis-dev
cmake \
libibverbs-dev \
libgoogle-glog-dev \
libgtest-dev \
libjsoncpp-dev \
libnuma-dev \
libunwind-dev \
libpython3-dev \
libboost-all-dev \
libssl-dev \
pybind11-dev \
libcurl4-openssl-dev \
libhiredis-dev \
pkg-config \
patchelf

# For centos/alibaba linux os
yum install cmake \
@@ -107,14 +114,14 @@

## Use Mooncake in Docker Containers
Mooncake supports Docker-based deployment. Pull the Docker image with `docker pull alogfans/mooncake`.
For the container to use the host's network resources, you need to add the option when starting the container. The following is an example.
For the container to use the host's network resources, you need to add the `--device` option when starting the container. The following is an example.

```
# In host
sudo docker run --net=host --device=/dev/infiniband/uverbs0 --device=/dev/infiniband/rdma_cm --ulimit memlock=-1 -t -i mooncake:v0.9.0 /bin/bash
# In container
root@vm-5-3-3:/# cd /Mooncake-main/build/mooncake-transfer-engine/example
root@vm-5-3-3:/Mooncake-main/build/mooncake-transfer-engine/example# ./transfer_engine_bench --device_name=ibp6s0 --metadata_server=10.1.101.3:2379 --mode=target --local_server_name=10.1.100.3
# Run transfer engine in container
cd /Mooncake-main/build/mooncake-transfer-engine/example
./transfer_engine_bench --device_name=ibp6s0 --metadata_server=10.1.101.3:2379 --mode=target --local_server_name=10.1.100.3
```

## Advanced Compile Options
53 changes: 13 additions & 40 deletions doc/en/transfer-engine.md
@@ -102,66 +102,51 @@ After successfully compiling Transfer Engine, the test program `transfer_engine_

For example, the following command line can be used to start the etcd service:
```bash
# This is 10.0.0.1
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://10.0.0.1:2379
```

1.2. **`http`**

For example, you can use the `http` service in the `mooncake-transfer-engine/example/http-metadata-server` example:
```bash
# This is 10.0.0.1
# cd mooncake-transfer-engine/example/http-metadata-server
go run . --addr=:8080
```

2. **Start the target node.**
```bash
# This is 10.0.0.2
export MC_GID_INDEX=n
./transfer_engine_bench --mode=target \
--metadata_server=etcd://10.0.0.1:2379 \
--local_server_name=10.0.0.2:12345 \
--device_name=erdma_0
[--local_server_name=TARGET_NAME] \
[--device_name=erdma_0 | --auto_discovery]
```
The meanings of the various parameters are as follows:
- The default value of the parameter corresponding to the environment variable `MC_GID_INDEX` is 0, which means that the Transfer Engine selects a GID that is most likely to be connectable. Since this parameter depends on the specific network environment, the user has to set the value of the environment variable manually if the connection hangs. The environment variable `NCCL_IB_GID_INDEX` serves the same function.
- `--mode=target` indicates the start of the target node. The target node does not initiate read/write requests; it passively supplies or writes data as required by the initiator node.
> Note: In actual applications, there is no need to distinguish between target nodes and initiator nodes; each node can freely initiate read/write requests to other nodes in the cluster.
> [!NOTE]
> In actual applications, there is no need to distinguish between target nodes and initiator nodes; each node can freely initiate read/write requests to other nodes in the cluster.
- `--metadata_server` is the address of the metadata server. Its form is `[proto]://[hostname:port]`. For example, the following addresses are VALID:
- Use `etcd` as metadata storage: `"10.0.0.1:2379"`, `"etcd://10.0.0.1:2379"` or `"etcd://10.0.0.1:2379,10.0.0.2:2379"`
- Use `redis` as metadata storage: `"redis://10.0.0.1:6379"`
- Use `http` as metadata storage: `"http://10.0.0.1:8080/metadata"`
- `--local_server_name` represents the address of this machine, which does not need to be set in most cases. If this option is not set, the value is equivalent to the hostname of this machine (i.e., `hostname(2)`). Other nodes in the cluster will use this address to attempt out-of-band communication with this node to establish RDMA connections.
> Note: If out-of-band communication fails, the connection cannot be established. Therefore, if necessary, you need to modify the `/etc/hosts` file on all nodes in the cluster to locate the correct node through the hostname.
- `--device_name` indicates the name of the RDMA network card used in the transfer process.
> Tip: Advanced users can also pass in a JSON file of the network card priority matrix through `--nic_priority_matrix`; for details, refer to the developer manual of Transfer Engine.
- `--local_server_name` represents the segment name of the current node, which does not need to be set in most cases. If this option is not set, the value is equivalent to the hostname of this machine (i.e., `hostname(2)`). It should be unique across the cluster.
- `--device_name` indicates the name of the RDMA network card used in the transfer process (multiple names separated by commas, without spaces). You can also specify `--auto_discovery` to enable automatic topology discovery, which generates a network card priority matrix based on the operating system configuration.
- In network environments that only support TCP, the `--protocol=tcp` parameter can be used; in this case, there is no need to specify the `--device_name` parameter.

You can also specify `--auto_discovery` to enable automatic topology discovery, which generates a network card priority matrix based on the operating system configuration. Then the `--device_name` parameter is not required.
```
./transfer_engine_bench --mode=target \
--metadata_server=10.0.0.1:2379 \
--local_server_name=10.0.0.2:12345 \
--auto_discovery
```

3. **Start the initiator node.**
```bash
# This is 10.0.0.3
export MC_GID_INDEX=n
./transfer_engine_bench --metadata_server=10.0.0.1:2379 \
--segment_id=10.0.0.2:12345 \
--local_server_name=10.0.0.3:12346 \
--device_name=erdma_1
./transfer_engine_bench --metadata_server=etcd://10.0.0.1:2379 \
--segment_id=TARGET_NAME \
[--local_server_name=INITIATOR_NAME] \
[--device_name=erdma_1 | --auto_discovery]
```
The meanings of the various parameters are as follows (the rest are the same as before):
- `--segment_id` can be simply understood as the hostname of the target node and needs to be consistent with the value passed to `--local_server_name` when starting the target node (if any).
- `--segment_id` is the segment name of the target node. It needs to be consistent with the value passed to `--local_server_name` when starting the target node (if any).

Under normal circumstances, the initiator node will start the transfer operation, wait for 10 seconds, and then display the "Test completed" message, indicating that the test is complete.

The initiator node can also configure the following test parameters: `--operation` (can be `"read"` or `"write"`), `batch_size`, `block_size`, `duration`, `threads`, etc.

> [!NOTE]
> If an exception occurs during execution, it is usually due to incorrect parameter settings. It is recommended to refer to the [troubleshooting document](troubleshooting.md) for preliminary troubleshooting.

### Sample Run
@@ -402,9 +387,7 @@ TransferEngine needs to be initialized by calling the `init` method before further
TransferEngine();

int init(const std::string &metadata_conn_string,
const std::string &local_server_name,
const std::string &ip_or_host_name,
uint64_t rpc_port = 12345);
const std::string &local_server_name);
```
- `metadata_conn_string`: Connection string of the metadata storage servers, i.e., the IP address/hostname of `etcd`/`redis` or the URI of the http service.
The general form is `[proto]://[hostname:port]`. For example, the following metadata server addresses are legal:
@@ -413,9 +396,6 @@ The general form is `[proto]://[hostname:port]`. For example, the following meta
- Using `http` as a metadata storage service: `"http://10.0.0.1:8080/metadata"`

- `local_server_name`: The local server name, ensuring uniqueness within the cluster. It also serves as the name of the RAM Segment that other nodes refer to the current instance (i.e., Segment Name).
- `ip_or_host_name`: The name used for other clients to connect, which can be a hostname or IP address.
- `rpc_port`: The rpc port used for interaction with other clients.
- Return value: If successful, returns 0; if TransferEngine has already been initialized, returns -1.
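
For reference, a minimal call sequence against the two-argument `init` above might look like the following sketch; the header path, namespace, and error handling are assumptions for illustration.

```cpp
#include <iostream>

#include "transfer_engine.h"  // header name assumed for illustration

int main() {
    // Construct the engine, then initialize it exactly once.
    mooncake::TransferEngine engine;  // namespace assumed

    // metadata_conn_string: etcd here; redis:// or http://... also work.
    // local_server_name: must be unique in the cluster; it doubles as the
    // RAM Segment name other nodes use to refer to this instance.
    int rc = engine.init("etcd://10.0.0.1:2379", "node-a");
    if (rc != 0) {  // returns -1 if the engine was already initialized
        std::cerr << "TransferEngine::init failed" << std::endl;
        return 1;
    }
    return 0;
}
```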

```cpp
~TransferEngine();
@@ -427,13 +407,6 @@ Reclaims all allocated resources and also deletes the global metadata server in
### Using C/C++ Interface
After compiling Mooncake Store, you can move the compiled static library file `libtransfer_engine.a` and the C header file `transfer_engine_c.h` into your own project. There is no need to reference other files under `src/transfer_engine`.

During the project build phase, you need to configure the following options for your application:
```bash
-I/path/to/include
-L/path/to/lib -ltransfer_engine
-lnuma -lglog -libverbs -ljsoncpp -letcd-cpp-api -lprotobuf -lgrpc++ -lgrpc
```

### Using Golang Interface
To support the operational needs of P2P Store, Transfer Engine provides a Golang interface wrapper, see `mooncake-p2p-store/src/p2pstore/transfer_engine.go`.
