SWIM-based Distributed Deep Learning Cluster
- Ruipeng Han ([email protected])
- Tomoyoshi Kimura ([email protected])
For this project, we implemented a variant of the round-robin scheduling algorithm that fairly allocates inference tasks for multiple machine learning models across distributed nodes.
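To make the scheduling idea concrete, here is a minimal sketch of plain round-robin batch dispatch across several jobs. It is not the variant implemented in this repository; the `Job` and `Scheduler` types, their fields, and the job names are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// Job is a hypothetical record of one model's pending inference work.
type Job struct {
	ID        string
	BatchSize int // queries sent per dispatch
	Remaining int // queries not yet dispatched
}

// Scheduler hands out batches to jobs in round-robin order.
type Scheduler struct {
	jobs []*Job
	next int // index of the job to try first on the next call
}

// NextBatch returns the job ID and batch size for the next dispatch,
// skipping jobs that have no work left (or a non-positive batch size).
// ok is false once every job is finished.
func (s *Scheduler) NextBatch() (id string, n int, ok bool) {
	for i := 0; i < len(s.jobs); i++ {
		idx := (s.next + i) % len(s.jobs)
		j := s.jobs[idx]
		if j.Remaining <= 0 || j.BatchSize <= 0 {
			continue
		}
		n = j.BatchSize
		if n > j.Remaining {
			n = j.Remaining // final, partial batch
		}
		j.Remaining -= n
		s.next = (idx + 1) % len(s.jobs) // resume after the job just served
		return j.ID, n, true
	}
	return "", 0, false
}

func main() {
	s := &Scheduler{jobs: []*Job{
		{ID: "job1", BatchSize: 4, Remaining: 10},
		{ID: "job2", BatchSize: 8, Remaining: 8},
	}}
	for {
		id, n, ok := s.NextBatch()
		if !ok {
			break
		}
		fmt.Printf("dispatch %d queries for %s\n", n, id)
	}
}
```

Each call serves the next job that still has work, so no job can starve another regardless of batch size. A fairness-oriented variant could weight the turns, for example by measured query rate, but that is beyond this sketch.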
.
├── README.md
├── client
│   └── client.go
├── data
│   ├── ...
├── fileclient
│   └── fileclient.go
├── fileserver
│   └── fileserver.go
├── go.mod
├── go.sum
├── grep
│   ├── logger
│   └── querier
├── log
├── machine_i.log
├── node.go
├── proto
│   ├── filetransfer
│   │   ├── filetransfer.pb.go
│   │   ├── filetransfer.proto
│   │   └── filetransfer_grpc.pb.go
│   └── packet
│       ├── packet.pb.go
│       ├── packet.proto
│       └── packet_grpc.pb.go
├── report
│   └── CS 425 MP 3 Report.pdf
├── server
│   └── server.go
├── storage
│   ├── file.go
│   └── storage.go
├── targets
│   └── hrpmyson.dat
├── test
└── utils
    └── utils.go
To run the node directly:
go run node.go
To build a binary and then run it:
go build node.go
./node
Once built and running, the node drops you into a shell. The available commands are listed below by category.
Membership List Commands
- join: Makes the current process join the group. If this is the introducer process, it joins in the introducer role; otherwise it asks the existing introducer to admit it to the group.
- list_mem: Prints the members currently on this process's membership list.
- leave: Makes the process leave the group voluntarily. The process notifies its neighbors that it is leaving; they update their membership lists accordingly and propagate the leave message to their own neighbors.
- list_self: Prints the current process's id.
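The leave propagation described above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the repository's implementation: the `Node` type, its fields, and the method names are hypothetical, and direct method calls stand in for whatever network messages the real nodes exchange.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical group member holding its own view of the group.
type Node struct {
	ID        string
	Members   map[string]bool // this node's membership list
	Neighbors []*Node         // nodes this node forwards messages to
}

// NewNode creates a node whose membership list starts with the given ids.
func NewNode(id string, group ...string) *Node {
	n := &Node{ID: id, Members: map[string]bool{}}
	for _, m := range group {
		n.Members[m] = true
	}
	return n
}

// ListMem returns the membership list in sorted order.
func (n *Node) ListMem() []string {
	ids := make([]string, 0, len(n.Members))
	for id := range n.Members {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// HandleLeave removes id from the local list and forwards the message.
// A node that has already removed id stops forwarding, which is what
// keeps the propagation from looping forever.
func (n *Node) HandleLeave(id string) {
	if !n.Members[id] {
		return
	}
	delete(n.Members, id)
	for _, nb := range n.Neighbors {
		nb.HandleLeave(id)
	}
}

// Leave makes this node leave voluntarily by announcing its own id.
func (n *Node) Leave() { n.HandleLeave(n.ID) }

func main() {
	group := []string{"a", "b", "c", "d"}
	a, b := NewNode("a", group...), NewNode("b", group...)
	c, d := NewNode("c", group...), NewNode("d", group...)
	// Assumed topology: a ring in which each node forwards to its successor.
	a.Neighbors, b.Neighbors = []*Node{b}, []*Node{c}
	c.Neighbors, d.Neighbors = []*Node{d}, []*Node{a}

	d.Leave()
	fmt.Println("a's list after d leaves:", a.ListMem())
}
```

The sketch leaves stale neighbor pointers in place after a departure; a real node would presumably recompute its neighbors from the updated membership list.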
Filesystem Commands
- put localfilename sdfsfilename: Puts the local file localfilename on the filesystem as sdfsfilename.
- get sdfsfilename localfilename: Gets the file sdfsfilename from the filesystem and saves it as localfilename.
- delete sdfsfilename: Deletes sdfsfilename from the server.
- ls sdfsfilename: Lists the machines that have sdfsfilename.
- store: Lists the files stored on the current node as part of the filesystem.
- get-versions sdfsfilename num-versions localfilename: Gets the last num-versions versions of sdfsfilename from the filesystem.
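The put, get, delete, and get-versions semantics can be illustrated with a small single-process store that keeps every version of a file. This is only a sketch: the `VersionedStore` type and its methods are hypothetical, and replication across machines (which ls reports on) is deliberately left out.

```go
package main

import "fmt"

// VersionedStore is a hypothetical in-memory store that keeps every
// version written under a given SDFS file name, oldest first.
type VersionedStore struct {
	files map[string][][]byte
}

func NewVersionedStore() *VersionedStore {
	return &VersionedStore{files: map[string][][]byte{}}
}

// Put appends a new version and returns its 1-based version number.
func (s *VersionedStore) Put(name string, data []byte) int {
	cp := append([]byte(nil), data...) // copy so callers cannot mutate stored data
	s.files[name] = append(s.files[name], cp)
	return len(s.files[name])
}

// Get returns the latest version, or false if the file does not exist.
func (s *VersionedStore) Get(name string) ([]byte, bool) {
	v := s.files[name]
	if len(v) == 0 {
		return nil, false
	}
	return v[len(v)-1], true
}

// GetVersions returns up to n of the most recent versions, newest first.
func (s *VersionedStore) GetVersions(name string, n int) [][]byte {
	v := s.files[name]
	if n > len(v) {
		n = len(v)
	}
	out := make([][]byte, 0, max(n, 0))
	for i := len(v) - 1; i >= len(v)-n; i-- {
		out = append(out, v[i])
	}
	return out
}

// Delete removes the file and all of its versions.
func (s *VersionedStore) Delete(name string) { delete(s.files, name) }

// Store lists the names of all files currently held.
func (s *VersionedStore) Store() []string {
	names := make([]string, 0, len(s.files))
	for name := range s.files {
		names = append(names, name)
	}
	return names
}

func max(a, b int) int {
	if a > b {
		return a
	}
	return b
}

func main() {
	s := NewVersionedStore()
	s.Put("report.txt", []byte("v1"))
	s.Put("report.txt", []byte("v2"))
	s.Put("report.txt", []byte("v3"))
	for _, v := range s.GetVersions("report.txt", 2) {
		fmt.Println(string(v))
	}
}
```

Returning versions newest first is a design choice of the sketch; the source does not say in which order get-versions delivers them.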
Distributed ML Commands
- load_test_dataset: Loads the test dataset that all ML models use for inference.
- start_job job_id batch_size model_type model_name: Initializes and creates a job status for model_name with job_id and batch_size.
- inference job_id: Starts inference for the model with job id job_id.
- job_status job_id: Lists the current status of the job with id job_id.
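As an illustration of the start_job argument layout, the following sketch parses one shell line into a typed struct. It assumes that job_id is an opaque string and that batch_size must be a positive integer; neither is confirmed by the source. The `StartJobArgs` type, the `ParseStartJob` function, and the example model type and name are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// StartJobArgs holds the parsed arguments of a start_job command.
type StartJobArgs struct {
	JobID     string // assumed to be an opaque string
	BatchSize int
	ModelType string
	ModelName string
}

// ParseStartJob splits a shell line of the form
// "start_job job_id batch_size model_type model_name" into its fields.
func ParseStartJob(line string) (StartJobArgs, error) {
	f := strings.Fields(line)
	if len(f) != 5 || f[0] != "start_job" {
		return StartJobArgs{}, errors.New("usage: start_job job_id batch_size model_type model_name")
	}
	n, err := strconv.Atoi(f[2])
	if err != nil || n <= 0 {
		return StartJobArgs{}, fmt.Errorf("batch_size must be a positive integer, got %q", f[2])
	}
	return StartJobArgs{JobID: f[1], BatchSize: n, ModelType: f[3], ModelName: f[4]}, nil
}

func main() {
	// The model type and name here are made-up example values.
	args, err := ParseStartJob("start_job 1 16 image example_model")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", args)
}
```

Validating the line before any job state is created keeps a malformed command from leaving a half-initialized job status behind.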