* support float16 tensor
* use tensor.data_ptr directly
* add fp16 test
* fix
* add MAG240M Example
* fix bugs
* add fp16 test
* fix bugs
* update
* update distribute training
* update parameters
* add readme

Dalong authored on Jul 2, 2022
1 parent a1e2413 · commit 1cd9007
Showing 13 changed files with 601 additions and 27 deletions.
@@ -0,0 +1,45 @@
# Introduction

The distributed training setup on the MAG240M dataset is almost the same as the [official example in DGL](https://github.com/dmlc/dgl/tree/master/examples/pytorch/ogb_lsc/MAG240M), except that we use `Quiver-Feature` for distributed feature collection.

Our implementation is much faster than DGL's official example while achieving similar accuracy.
# Data Preprocess & Partition

First, please run [preprocess.py](./preprocess.py) to generate `graph.dgl` and `full.npy`; see [DGL's official guide](https://github.com/dmlc/dgl/tree/master/examples/pytorch/ogb_lsc/MAG240M) for more details.

Then we use [Range Partition](../../docs/partition_methods.md) to partition the feature data. It is very easy to understand; you can check [process_quiver.py](./process_quiver.py) for more details, and the sketch below illustrates the idea.

*(figure: Range Partition of the feature data across machines)*
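Range partition assigns each machine a contiguous slice of the node-ID space, so the owner of any node can be found with one integer division. Below is a minimal sketch of the idea; `range_partition` is a hypothetical helper written for illustration, not the actual code in `process_quiver.py`:

```python
import torch

def range_partition(features: torch.Tensor, world_size: int):
    """Split a [num_nodes, dim] feature tensor into contiguous ID ranges,
    one range per machine."""
    num_nodes = features.shape[0]
    per_part = (num_nodes + world_size - 1) // world_size  # ceil division
    parts = []
    for rank in range(world_size):
        start = rank * per_part
        end = min(start + per_part, num_nodes)
        parts.append(features[start:end])
    return parts

# With range partition, locating the owner of a node is a single division:
#   owner(node_id) = node_id // per_part
features = torch.randn(10, 4)  # toy stand-in for the real full.npy features
parts = range_partition(features, world_size=2)
assert torch.equal(parts[0], features[:5])
assert torch.equal(parts[1], features[5:])
```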
# Running Training Script

On each machine, please run:

```bash
python3 distributed_training.py \
        --rootdir . \
        --graph-path ./graph.dgl \
        --feature-partition-path ./feature_part.pt \
        --server_world_size 2 \
        --server_rank 0
```

`--server_rank` must be unique per machine: use `--server_rank 0` on the master node and `--server_rank 1` on the other machine.

Remember to:

- Set the shm size limit as large as your physical memory size (see the sanity-check sketch after this list). You can set it with:

  ```bash
  sudo mount -o remount,size=300G /dev/shm
  ```

- Set `MASTER_IP` in the config file to your master node's IP.
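Because training relies on a large `/dev/shm`, it is worth verifying its size before launching. A minimal sketch using only the Python standard library; the 300 GB threshold mirrors the mount example above and should be adjusted to your machine:

```python
import shutil

# Required shm capacity in bytes; matches the `size=300G` remount example above.
REQUIRED_BYTES = 300 * 1024**3

usage = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {usage.total / 1024**3:.1f} GiB, "
      f"free: {usage.free / 1024**3:.1f} GiB")
if usage.total < REQUIRED_BYTES:
    raise SystemExit("/dev/shm is too small; remount it with a larger size.")
```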

The validation accuracy is 0.680. We do not have ground-truth test labels, so we do not report test accuracy.
# Performance

With 2 machines and 1 GPU per machine, each epoch takes 2 minutes 10 seconds to train and 15 seconds to validate. This is 3x faster than [DGL's performance result](https://github.com/dmlc/dgl/tree/master/examples/pytorch/ogb_lsc/MAG240M).
# Hardware configurations

We have 2 machines, each with 377 GB of memory, connected by 100 Gbps InfiniBand. Running the training script consumes around 256 GB of memory.
@@ -0,0 +1,18 @@

PORT_NUMBER = 3344              # port used for inter-node connection setup
MASTER_IP = "155.198.152.17"    # IP address of the master node
#MASTER_IP = "127.0.0.1"        # loopback, for single-machine tests
HLPER_PORT = 5678               # auxiliary helper port
NODE_COUNT = 1200000
FEATURE_DIM = 128               # feature dimension
FEATURE_TYPE_SIZE = 4           # bytes per feature element (4 = float32)
SAMPLE_NUM = 80000
ITER_NUM = 10
POST_LIST_SIZE = 128            # RDMA work requests posted per batch
QP_NUM = 8                      # number of RDMA queue pairs
TX_DEPTH = 2048                 # RDMA send-queue depth
CTX_POLL_BATCH = TX_DEPTH // POST_LIST_SIZE   # completions polled per batch (= 16)
TEST_TLB_OPTIMIZATION = True

# For MAG240M Training
SAMPLE_PARAM = [15, 25]         # sampling fan-out per GNN layer
BATCH_SIZE = 1024
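A quick sanity check of the numbers above; this hedged sketch assumes the file is importable as a module named `config` (the module name is an assumption for illustration):

```python
import config

# Total bytes of the feature buffer implied by the settings:
# 1,200,000 nodes * 128 dims * 4 bytes/element.
feature_bytes = config.NODE_COUNT * config.FEATURE_DIM * config.FEATURE_TYPE_SIZE
print(f"feature buffer: {feature_bytes / 1024**3:.2f} GiB")  # ~0.57 GiB

# CTX_POLL_BATCH is derived, not free: 2048 // 128 = 16 completions per poll.
assert config.CTX_POLL_BATCH == config.TX_DEPTH // config.POST_LIST_SIZE == 16
```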