diff --git a/README.md b/README.md
index c7984cc1..2ba66342 100644
--- a/README.md
+++ b/README.md
@@ -1,137 +1,254 @@
-
-
 TensorHive
 ===
-
-
-![](https://img.shields.io/badge/release-v0.2.4-brightgreen.svg?style=popout-square)
-![](https://img.shields.io/badge/pypi-v0.2.4-brightgreen.svg?style=popout-square)
+![](https://img.shields.io/badge/release-v0.3-brightgreen.svg?style=popout-square)
+![](https://img.shields.io/badge/pypi-v0.3-brightgreen.svg?style=popout-square)
+![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square)
 ![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square)
+![](https://img.shields.io/badge/hardware-Nvidia-green.svg?style=popout-square)
 ![](https://img.shields.io/badge/python-3.5%20|%203.6%20|%203.7-blue.svg?style=popout-square)
 ![](https://img.shields.io/badge/license-Apache%202.0-blue.svg?style=popout-square)
-TensorHive is an open source system for managing and monitoring your computing resources across multiple hosts.
-It solves the most common problems and nightmares about accessing and sharing your AI-focused infrastructure across multiple, often competing users.
+
+
+TensorHive is an open source system for monitoring and managing computing resources across multiple hosts.
+It solves the most common problems and nightmares about accessing and sharing your AI-oriented infrastructure across multiple, often competing users.
+
+It's designed with __simplicity, flexibility and configuration-friendliness__ in mind.
+
+Use cases
+----------------------
+Our goal is to provide solutions for painful problems that ML engineers often have to struggle with when working with remote machines in order to run neural network trainings.
+
+#### You should really consider using TensorHive if anything described in the profiles below matches you:
+1. You're an **admin**, who is responsible for managing a cluster (or multiple servers) with powerful GPUs installed.
+- :angry: There are more users than resources, so they have to compete for them, but you don't know how to deal with that chaos
+- :ocean: Other popular tools are simply overkill, have a different purpose or require a lot of time spent on reading documentation, installation and configuration (Grafana, Kubernetes, Slurm)
+- :penguin: People using your infrastructure expect only one interface for all the things related to training models (besides terminal): monitoring, reservation calendar and scheduling distributed jobs
+- :collision: You can't risk messing up sensitive configuration by installing software on each individual machine, and prefer a centralized solution which can be managed from one place
+
+2. You're a **standalone user** who has access to beefy GPUs scattered across multiple machines.
+- :part_alternation_mark: You want to be able to determine if batch size is too small or if there's a bottleneck when moving data from memory to GPU - charts with metrics such as `gpu_util`, `mem_util`, `mem_used` are great for this purpose
+- :date: Visualizing names of training experiments using a calendar helps you track how you're progressing on the project
+- :snake: Launching distributed trainings is essential for you, no matter what the framework is
+- :dizzy_face: Managing a list of training commands for all your distributed training experiments drives you nuts
+- :zzz: Remembering to manually launch the training before going to sleep is no fun anymore.
+
+What TensorHive has to offer
+-----------------------------
+:zero: Dead-simple one-machine installation and configuration, no `sudo` requirements
+
+:one: Users can make GPU reservations for a specific time range in advance via **reservation mechanism**
+
+     :arrow_right: no more frustration caused by rules: **"first come, first served"** or **"the law of the jungle"**.
-It's designed with __flexibility, lightness and configuration-friendliness__ in mind.
+:two: Users can prepare and schedule custom tasks (commands) to be run on selected GPUs and hosts
+
+     :arrow_right: automate and simplify **distributed trainings** - **"one button to rule them all"**
+
+:three: Gather all useful GPU metrics, from all configured hosts **in one dashboard**
+
+     :arrow_right: no more manually logging in to each individual machine in order to check whether a GPU is currently in use
+
+For more details, check out the [full list of features](#features)
 Getting started
 ---------------
 ### Prerequisites
-* Nodes should be accessible via SSH without password ([HOWTO set up SSH keys](https://www.ssh.com/ssh/keygen/))
-* Only NVIDIA GPUs are supported (```nvidia-smi``` is required)
+* All nodes must be accessible via SSH, without password, using SSH Key-Based Authentication ([How to set up SSH keys](https://www.shellhacks.com/ssh-login-without-password/) - explained in [Quickstart section](#basic-usage))
+* Only NVIDIA GPUs are supported (relying on ```nvidia-smi``` command)
+* Currently TensorHive assumes that all users who want to register into the system must have identical UNIX usernames on all nodes configured by TensorHive administrator (not relevant to standalone developers)
 ### Installation
-#### Via pip
+#### via pip (not updated yet)
 ```shell
-pip3 install tensorhive
+pip install tensorhive
 ```
-#### From source
-```
-git clone https://github.com/roscisz/TensorHive.git
-cd TensorHive
-pip install .
+#### via conda (not updated yet)
+```shell
+conda install tensorhive
 ```
-If you want to also build the web app manually:
+#### From source (recommended)
+(optional) For development purposes we encourage separation from your current python packages using e.g. [Miniconda](https://docs.conda.io/en/latest/miniconda.html)
+`conda create --name th_env python=3.5 pip; activate th_env`
+
 ```shell
-(cd tensorhive/app/web/dev && npm install && npm run build)
+git clone https://github.com/roscisz/TensorHive.git && cd TensorHive
+git checkout fixes/voicelab
+pip install -e .
 ```
-Usage
------
-#### Required configuration
-At first, you must tell TensorHive how it can establish SSH connections to hosts you want to work with.
+TensorHive already ships with the newest web app build, but in case you modify the source, you can build it with `make app` (currently on `master` branch). For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/tensorhive/Makefile).
+Build tested with `Node v11.14.0` and `npm 6.7.0`
-You can do this by editing `~/.config/TensorHive/hosts_config.ini` [(see example)](https://github.com/roscisz/TensorHive/blob/master/hosts_config.ini)
+Basic usage
+-----
+#### Quickstart
+Each command will guide you through the basic configuration process:
+```
+tensorhive init
+tensorhive key
+tensorhive test
+```
-#### Run TensorHive
-```shell
+Now you should be ready to finally launch a TensorHive instance
+```
 tensorhive
 ```
-Sample output:
-
-The Web application and API Documentation can be accessed through te given URLs.
+Web application and API Documentation can be accessed via URLs highlighted in green (Ctrl + click to open in browser)
+
+#### Advanced configuration
+You can fully customize TensorHive behaviours via INI configuration files (which will be created automatically after `tensorhive init`)
+```
+~/.config/TensorHive/main_config.ini
+~/.config/TensorHive/mailbot_config.ini
+~/.config/TensorHive/hosts_config.ini
+```
+[(see example)](https://github.com/roscisz/TensorHive/blob/master/tensorhive/main_config.ini)
+
+#### Infrastructure monitoring dashboard
+Accessible infrastructure can be monitored in the Nodes overview tab. Sample screenshot:
+Here you can add new watches, select metrics and monitor ongoing GPU processes and their owners.
-If you need the Web application to be accessible from remote machines, set the `host` and `port` fields in the
-`[web_app.server]` section in `~/.config/TensorHive/main_config.ini`. The host field should be set to a hostname
-or IP that resolves to an external network interface.
+![image](https://user-images.githubusercontent.com/12485656/61520152-d963bd80-aa0d-11e9-9caa-1f7203cc6b42.png)
-#### Monitor infrastructure
+#### GPU Reservation calendar
-The available infrastructure can be monitored in the Nodes overview tab. Sample screenshot:
+Each column represents all reservation events for a GPU on a given day.
+In order to make a new reservation simply click and drag with your mouse, select GPU(s), add some meaningful title, optionally adjust time range.
-
+If there are many hosts and GPUs in your infrastructure, you can use our simplified, horizontal calendar to quickly identify empty time slots and filter out already reserved GPUs.
-The "Add watch" button allows to add a new chart which can be configured to show chosen metrics of the selected devices. Currently, the metrics include GPU metrics from nvidia-smi and a process overview with corresponding usernames.
+(UI prototype: redesign is coming)
+![image](https://user-images.githubusercontent.com/12485656/61517527-bb935a00-aa07-11e9-8ea3-9db4a1529e24.png)
-#### Reserve resources
+From now on, **only your processes are eligible to run on reserved GPU(s)**. TensorHive periodically checks whether another user has violated the reservation. The violator will receive warnings on all their PTYs and periodic emails, and the admin can also be notified (it all depends on the configuration).
-The computing resource reservations can be viewed and managed in the Reservations overview tab. Sample screenshot:
+Terminal warning | Email warning
+:-------------------------:|:-------------------------:
+![image](https://user-images.githubusercontent.com/12485656/61520488-99e9a100-aa0e-11e9-8f35-b02c2e7de9ce.png) | ![image](https://user-images.githubusercontent.com/12485656/61520956-85f26f00-aa0f-11e9-8342-09023c93275b.png)
+
+#### What admin sees:
-
+![image](https://user-images.githubusercontent.com/12485656/61520807-4a57a500-aa0f-11e9-8a52-cb87208d6c71.png)
-The select boxes at the bottom of the page (easily accessible by the Adjust Filters button) allow to specify which nodes or devices should be visible in the view. Adding reservations is possible through selecting a time interval and filling the reservation details in a form. Cancelling reservations is possible for the reservation owner and admin user by clicking on a given reservation and confirming the cancellation.
+#### Task nursery
-#### Optional configuration
-You can fully customize TensorHive behaviour from `~/.config/TensorHive/main_config.ini`
-[(see example)](https://github.com/roscisz/TensorHive/blob/master/main_config.ini)
+Here you can define commands for tasks you want to run on any configured nodes. You can manage them manually or set spawn/terminate dates.
+Commands are run within a `screen` session, so attaching to it while they are running is a piece of cake.
+![image](https://user-images.githubusercontent.com/12485656/61518173-4163d500-aa09-11e9-9916-59c907c1590c.png)
+It provides a quite simple, but flexible (**framework-agnostic**) command templating mechanism that will help you automate multi-node trainings.
+Users who want to use this feature must append TensorHive's public key to their `~/.ssh/authorized_keys` on all nodes they want to connect to.
+![image](https://user-images.githubusercontent.com/12485656/61518418-bcc58680-aa09-11e9-8943-88bddc964417.png)
 Features
---------
+----------------------
 #### Core
-- [x] :mag_right: Monitor GPU parameters on each host
+- [x] :mag_right: Monitor metrics on each host
+    - [x] :tm: Nvidia GPUs
+    - [ ] :pager: CPU, RAM, HDD
 - [x] :customs: Protection of reserved resources
     - [x] :warning: Send warning messages to terminal of users who violate the rules
     - [x] :mailbox_with_no_mail: Send e-mail warnings
-    - [ ] :bomb: Kill unwated processes
-- [ ] :rocket: Automatic execution of user's predefined command
-- [x] :watch: Track wasted reservation time (idle)
-    - [ ] Remind user when his reservation starts and ends
-    - [ ] Send e-mail if idle for too long
-#### Dashboard
+    - [ ] :bomb: Kill unwanted processes
+- [X] :rocket: Task nursery and scheduling
+    - [x] :old_key: Execute any command in the name of a user
+    - [x] :alarm_clock: Schedule spawn and termination
+    - [x] :repeat: Synchronize process status
+    - [x] :factory: Use `screen` command as backend - user can easily attach to running task
+    - [x] :skull: Remote process interruption, termination and kill
+    - [x] :floppy_disk: Save stdout to disk
+    - [ ] :page_facing_up: Capture stderr
+- [x] :watch: Track wasted (idle) time during reservation
+    - [x] :hocho: Gather and calculate average gpu and mem utilization
+    - [ ] :loudspeaker: Remind user when his reservation starts and ends
+    - [ ] :incoming_envelope: Send e-mail if idle for too long
+
+#### Web
 - [x] :chart_with_downwards_trend: Configurable charts view
-    - [x] GPU metrics and active processes
-    - [ ] CPU, RAM, HDD metrics
+    - [x] Metrics and active processes
+    - [ ] Detailed hardware specification
 - [x] :calendar: Calendar view
     - [x] Allow making reservations for selected GPUs
     - [x] Edit reservations
     - [x] Cancel reservations
-- [ ] :scroll: Detailed hardware specification view
+    - [x] Attach jobs to reservation
+- [x] :baby_symbol: Task nursery
+    - [x] Create parametrized tasks and assign to hosts, automatically set `CUDA_VISIBLE_DEVICES`
+    - [x] Buttons for task spawning/scheduling/termination/killing actions
+    - [x] Fetch log produced by running task
+    - [x] Group actions (spawn, schedule, terminate, kill selected)
+- [ ] :straight_ruler: Detailed hardware specification panel (CPU clock speed, RAM, etc.)
 - [ ] :penguin: Admin panel
     - [ ] User banning
     - [ ] Accept/reject reservation requests
+    - [ ] Modify rules on-the-fly (without restarting)
+    - [ ] Show popups to users (something like message of the day - `motd`)
+
+#### CLI
+- [ ] Implement command-line app that communicates with core via API
+- [ ] Migrate all features from web app that don't require GUI (so no charts)
 #### API
 - [x] OpenAPI 2.0 specification with Swagger UI
 - [x] User authentication via JWT
+TensorHive is currently being used in production in the following environments:
+-----
+
+| Organization | Hardware | No. users |
+| ------ | -------- | --------- |
+| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB) | 30+ |
+| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | 20+ |
+| ![](http://martyniak.tech/images/gradient_logo_small-628ed211.png)[Gradient PG](http://gradient.eti.pg.gda.pl/en/) | TITAN X 12GB | 10+ |
+| ![](https://res-4.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_20,w_20,f_auto,q_auto:eco/v1444894092/jeuh0l6opc159e1ltzky.png) [VoiceLab - Conversational Intelligence](https://www.voicelab.ai) | 30+ GTX and RTX cards | 10+ |
+
 Application examples and benchmarks
 --------
-Along with TensorHive, we are developing a set of [**sample deep neural network training applications**](https://github.com/roscisz/TensorHive/tree/master/examples) in Distributed TensorFlow which will be used as test applications for the system. They can also serve as benchmarks for various GPU, distributed multiGPU and distributed multinode architectures. For each example, a full set of instructions to reproduce is given.
+Along with TensorHive, we are developing a set of [**sample deep neural network training applications**](https://github.com/roscisz/TensorHive/tree/master/examples) in Distributed TensorFlow which will be used as test applications for the system. They can also serve as benchmarks for single GPU, distributed multi-GPU and distributed multi-node architectures. For each example, a full set of instructions to reproduce is provided.
+TensorHive architecture (simplified)
+-----------------------
+
+This diagram will help you to grasp the rough concept of the system.
+
+![TensorHive_diagram _final](https://user-images.githubusercontent.com/12485656/59147556-7853cd80-89fd-11e9-80bc-5848e95c7574.png)
+
+
 Contibution and feedback
 ------------------------
-We'd :heart: to collect your observations, issues and pull requests.
+**The project is still in an early beta version**, so expect some inconveniences; please be patient and keep an eye on upcoming updates.
-You can do this by making use of our [**issue template**](https://gist.github.com/micmarty/396c649bf693688245731f35854bf971).
+We'd :heart: to collect your observations, issues and pull requests!
+
+Feel free to **report any configuration problems, we will help you**.
+We plan to redesign the UI/UX side as well as improve reliability of the system by September 2019 :shipit:, so stay tuned!
 Credits
 -------
+
+TensorHive has been created within a joint project between [**VoiceLab.ai**](https://voicelab.ai) and
+[**Gdańsk University of Technology**](https://pg.edu.pl/) titled: "Exploration and selection of methods
+for parallelization of neural network training using multiple GPUs".
+
 Project created and maintained by:
 - Paweł Rościszewski [(@roscisz)](https://github.com/roscisz)
 - ![](https://avatars2.githubusercontent.com/u/12485656?s=22&v=4) [Michał Martyniak (@micmarty)](http://martyniak.me)
 - Filip Schodowski [(@filschod)](https://github.com/filschod)
+
+
 Recent contributions:
 - Tomasz Menet [(@tomenet)](https://github.com/tomenet)
+- Dariusz Piotrowski [(@PiotrowskiD)](https://github.com/PiotrowskiD)
 - Karol Draszawka [(@szarakawka)](https://github.com/szarakawka)
+
 License
 -------
 [Apache License 2.0](https://github.com/roscisz/TensorHive/blob/master/LICENSE)
diff --git a/tensorhive/hosts_config.ini b/tensorhive/hosts_config.ini
index 49a219e4..65a8b6a4 100644
--- a/tensorhive/hosts_config.ini
+++ b/tensorhive/hosts_config.ini
@@ -1,12 +1,24 @@
-# Example:
-; [example_hostname]
-; user = example_username
+# Example SSH configuration
+# 1. Hosts must be accessible without password (key-based authentication)
+# You can check if you're good to go with this command: `ssh username@hostname`
+# If the command above does not work, make sure to check out our detailed instructions in `README.md`
+# 2. Hostname aliases defined in ~/.ssh/config are not respected, you must provide full hostname or address
+# 3. 22 is default port value, so in most cases including this line is optional
+# 4. Uncomment lines you need (starting with ;)
+; [example_hostname_0]
+; user = example_username_0
 ; port = 22
-# Here you can enable proxy for all defined hosts
-# ParallelSSHClient -> Proxy host -> Target hosts
+; [example_hostname_1]
+; user = example_username_1
+
+; [example_hostname_2]
+; user = example_username_2
+
+# Here you can configure proxy server for connecting to all hosts defined above
+# TensorHive -> Proxy host -> Target hosts
 ; [proxy_tunneling]
-; enabled = no
+; enabled = yes
 ; proxy_host = example_proxy_hostname
 ; proxy_user = example_proxy_username
 ; proxy_port = 22
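
Review note: the `hosts_config.ini` layout in this diff (one section per host, an optional `port`, and a `[proxy_tunneling]` section) can be read with Python's standard `configparser`. The snippet below is a minimal sketch of that idea, not TensorHive's actual loading code. The function name `parse_hosts`, the shape of the returned dictionaries, and falling back to port 22 when `port` is omitted (based on comment 3 in the file) are assumptions made for illustration.

```python
import configparser

# Uncommented version of the example entries from hosts_config.ini.
SAMPLE = """
[example_hostname_0]
user = example_username_0
port = 22

[example_hostname_1]
user = example_username_1

[proxy_tunneling]
enabled = yes
proxy_host = example_proxy_hostname
proxy_user = example_proxy_username
proxy_port = 22
"""


def parse_hosts(text):
    """Return (hosts, proxy) parsed from INI text (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    hosts = {}
    proxy = None
    for section in parser.sections():
        if section == "proxy_tunneling":
            # The proxy is only used when explicitly enabled.
            if parser.getboolean(section, "enabled", fallback=False):
                proxy = {
                    "host": parser.get(section, "proxy_host"),
                    "user": parser.get(section, "proxy_user"),
                    "port": parser.getint(section, "proxy_port", fallback=22),
                }
            continue
        # Every other section name is treated as a hostname.
        hosts[section] = {
            "user": parser.get(section, "user"),
            # Assumption: a missing port means the SSH default, 22.
            "port": parser.getint(section, "port", fallback=22),
        }
    return hosts, proxy


hosts, proxy = parse_hosts(SAMPLE)
```

Lines starting with `;` or `#` are skipped by `configparser` by default, which matches the "uncomment lines you need" convention used in the file.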