From a0fd283f4effa2bc8c18c1d91fb4c96180458554 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Martyniak?=
Date: Fri, 7 Jun 2019 17:23:36 +0200
Subject: [PATCH 01/37] Update README.md

---
 README.md | 69 ++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 48 insertions(+), 21 deletions(-)

diff --git a/README.md b/README.md
index c7984cc1..22000cb5 100644
--- a/README.md
+++ b/README.md
@@ -1,57 +1,82 @@
-
+
 TensorHive
 ===
-
-![](https://img.shields.io/badge/release-v0.2.4-brightgreen.svg?style=popout-square)
-![](https://img.shields.io/badge/pypi-v0.2.4-brightgreen.svg?style=popout-square)
+![](https://img.shields.io/badge/release-v0.3-brightgreen.svg?style=popout-square)
+![](https://img.shields.io/badge/pypi-v0.3-brightgreen.svg?style=popout-square)
 ![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square)
 ![](https://img.shields.io/badge/python-3.5%20|%203.6%20|%203.7-blue.svg?style=popout-square)
 ![](https://img.shields.io/badge/license-Apache%202.0-blue.svg?style=popout-square)
-TensorHive is an open source system for managing and monitoring your computing resources across multiple hosts.
+
+
+TensorHive is an open source system for monitoring and managing computing resources across multiple hosts.
 It solves the most common problems and nightmares about accessing and sharing your AI-focused infrastructure across multiple, often competing users.
-It's designed with __flexibility, lightness and configuration-friendliness__ in mind.
+It's designed with __flexibility, lightness and configuration-friendliness__ in mind.
+
+
+
+Common use cases
+----------------------
+TODO Product overview, use cases
+Generally speaking TensorHive will improve the experience of using
+
+Currently TensorHive is being used on production in these 4 environments:
+
+| Where | Hardware | No. users |
+| ------ | -------- | --------- |
+| [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB | TODO |
+| [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | TODO |
+| [Gradient PG](http://gradient.eti.pg.gda.pl/en/) | TITAN X 12GB | TODO |
+| [VoiceLab - Conversational Intelligence](voicelab.ai) | TODO | TODO
 Getting started
 ---------------
 ### Prerequisites
-* Nodes should be accessible via SSH without password ([HOWTO set up SSH keys](https://www.ssh.com/ssh/keygen/))
-* Only NVIDIA GPUs are supported (```nvidia-smi``` is required)
+* All hosts must be accessible via SSH, without password, using SSH Key-Based Authentication ([TODO OUR OWN LINK set up SSH keys](https://www.ssh.com/ssh/keygen/))
+* Only NVIDIA GPUs are supported (relying on ```nvidia-smi``` command)
 ### Installation
 #### Via pip
 ```shell
-pip3 install tensorhive
+pip install tensorhive
 ```
-#### From source
+#### Via conda
+```shell
+conda install tensorhive
 ```
-git clone https://github.com/roscisz/TensorHive.git
-cd TensorHive
-pip install .
+
+#### From source
+(optional) We encourage separation from your others python packages with Miniconda environment (TODO)
+
+```shell
+conda create --name th_env python=3.5 pip
+activate th_env
 ```
-If you want to also build the web app manually:
 ```shell
-(cd tensorhive/app/web/dev && npm install && npm run build)
+git clone https://github.com/roscisz/TensorHive.git && cd TensorHive
+make dev
 ```
+TensorHive is shipped with built web app distribution, but you can build it separately with `make app`, for more useful commands see `Makefile`.
-Usage
+Basic usage
 -----
 #### Required configuration
 At first, you must tell TensorHive how it can establish SSH connections to hosts you want to work with.
-You can do this by editing `~/.config/TensorHive/hosts_config.ini` [(see example)](https://github.com/roscisz/TensorHive/blob/master/hosts_config.ini)
+You can do this by editing `~/.config/TensorHive/hosts_config.ini` after first `tensorhive` launch [(see example)](https://github.com/roscisz/TensorHive/blob/master/TensorHive/hosts_config.ini). To add more hosts, just create new section.
+
 #### Run TensorHive
 ```shell
 tensorhive
 ```
-Sample output:
+Sample output TODO Update:
 The Web application and API Documentation can be accessed through te given URLs.
@@ -80,16 +105,15 @@ The select boxes at the bottom of the page (easily accessible by the Adjust Filt
 You can fully customize TensorHive behaviour from `~/.config/TensorHive/main_config.ini`
 [(see example)](https://github.com/roscisz/TensorHive/blob/master/main_config.ini)
-
 Features
---------
+----------------------
 #### Core
 - [x] :mag_right: Monitor GPU parameters on each host
 - [x] :customs: Protection of reserved resources
 - [x] :warning: Send warning messages to terminal of users who violate the rules
 - [x] :mailbox_with_no_mail: Send e-mail warnings
 - [ ] :bomb: Kill unwated processes
-- [ ] :rocket: Automatic execution of user's predefined command
+- [X] :rocket: Automatic execution of user's predefined command
 - [x] :watch: Track wasted reservation time (idle)
 - [ ] Remind user when his reservation starts and ends
 - [ ] Send e-mail if idle for too long
@@ -110,6 +134,9 @@ Features
 - [x] OpenAPI 2.0 specification with Swagger UI
 - [x] User authentication via JWT
+Deployment in production (for admins)
+-----
+TensorHive is being used with
 Application examples and benchmarks
 --------

From 7cd64045accf86a89a58c1d588bb4c68d8dad638 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Martyniak?=
Date: Fri, 7 Jun 2019 18:24:08 +0200
Subject: [PATCH 02/37] Update README.md

---
 README.md | 42 +++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 22000cb5..3197f70f 100644
--- a/README.md
+++ b/README.md
@@ -12,25 +12,28 @@ TensorHive
 TensorHive is an open source system for monitoring and managing computing resources across multiple hosts.
-It solves the most common problems and nightmares about accessing and sharing your AI-focused infrastructure across multiple, often competing users.
+It solves the most common problems and nightmares about accessing and sharing your AI-oriented infrastructure across multiple, often competing users.
-It's designed with __flexibility, lightness and configuration-friendliness__ in mind.
+It's designed with __simplicty, flexibility and configuration-friendliness__ in mind.
-Common use cases
+Top features
 ----------------------
-TODO Product overview, use cases
-Generally speaking TensorHive will improve the experience of using
-Currently TensorHive is being used on production in these 4 environments:
+:one: Users can make GPU reservations for specific time range in advance via **reservation mechanism***
-| Where | Hardware | No. users |
-| ------ | -------- | --------- |
-| [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB | TODO |
-| [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | TODO |
-| [Gradient PG](http://gradient.eti.pg.gda.pl/en/) | TITAN X 12GB | TODO |
-| [VoiceLab - Conversational Intelligence](voicelab.ai) | TODO | TODO
+     :arrow_right: no more frustration caused by rules: **"first come, first served"** or **"the law of the jungle"**.
+
+:two: Users can prepare and schedule custom tasks (commands) to be run on selected GPUs and hosts
+
+     :arrow_right: automate and simplify **distributed trainings** - **"one button to rule them all"***
+
+:three: Gather all useful GPU metrics, from all configured hosts **in one dashboard**
+
+     :arrow_right: no more manually logging in to each individual machine in order to check if GPU is currently taken or not
+
+**\*** For more details, check out the full list of [features](#features)
 Getting started
 ---------------
@@ -51,7 +54,7 @@ conda install tensorhive
 ```
 #### From source
-(optional) We encourage separation from your others python packages with Miniconda environment (TODO)
+(optional) For development purposes we encourage separation from your current python packages using e.g. Miniconda (TODO)
 ```shell
 conda create --name th_env python=3.5 pip
@@ -62,7 +65,7 @@ activate th_env
 git clone https://github.com/roscisz/TensorHive.git && cd TensorHive
 make dev
 ```
-TensorHive is shipped with built web app distribution, but you can build it separately with `make app`, for more useful commands see `Makefile`.
+TensorHive is already shipped with newest web app build, but in case you modify the source, you can can build it with `make app`. For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/TensorHive/Makefile).
 Basic usage
 -----
@@ -136,7 +139,16 @@ Features
 Deployment in production (for admins)
 -----
-TensorHive is being used with
+TODO Instructions
+
+Currently TensorHive is being used on production in these 4 environments:
+
+| Where | Hardware | No. users |
+| ------ | -------- | --------- |
+| [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB | TODO |
+| [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | TODO |
+| [Gradient PG](http://gradient.eti.pg.gda.pl/en/) | TITAN X 12GB | TODO |
+| [VoiceLab - Conversational Intelligence](voicelab.ai) | TODO | TODO
 Application examples and benchmarks
 --------

From fd7c67d7cf8633838480fe79e551b5633dd98796 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Martyniak?=
Date: Fri, 7 Jun 2019 18:57:42 +0200
Subject: [PATCH 03/37] Update README.md

---
 README.md | 58 ++++++++++++++++++++++++-------------------------------
 1 file changed, 25 insertions(+), 33 deletions(-)

diff --git a/README.md b/README.md
index 3197f70f..5d1fe0ff 100644
--- a/README.md
+++ b/README.md
@@ -43,23 +43,19 @@ Getting started
 ### Installation
-#### Via pip
+#### via pip
 ```shell
 pip install tensorhive
 ```
-#### Via conda
+#### via conda
 ```shell
 conda install tensorhive
 ```
 #### From source
 (optional) For development purposes we encourage separation from your current python packages using e.g. Miniconda (TODO)
-
-```shell
-conda create --name th_env python=3.5 pip
-activate th_env
-```
+`conda create --name th_env python=3.5 pip; activate th_env`
 ```shell
 git clone https://github.com/roscisz/TensorHive.git && cd TensorHive
@@ -69,44 +65,30 @@ TensorHive is already shipped with newest web app build, but in case you modify
 Basic usage
 -----
-#### Required configuration
-At first, you must tell TensorHive how it can establish SSH connections to hosts you want to work with.
-
-You can do this by editing `~/.config/TensorHive/hosts_config.ini` after first `tensorhive` launch [(see example)](https://github.com/roscisz/TensorHive/blob/master/TensorHive/hosts_config.ini). To add more hosts, just create new section.
-
-
 #### Run TensorHive
 ```shell
 tensorhive
 ```
-Sample output TODO Update:
-
-
-The Web application and API Documentation can be accessed through te given URLs.
-If you need the Web application to be accessible from remote machines, set the `host` and `port` fields in the
-`[web_app.server]` section in `~/.config/TensorHive/main_config.ini`. The host field should be set to a hostname
-or IP that resolves to an external network interface.
-
-#### Monitor infrastructure
+#### Required configuration
+As you see, you must configure TensorHive so it knows how to establish SSH connections to hosts you want to work with.
-The available infrastructure can be monitored in the Nodes overview tab. Sample screenshot:
+You can do this by editing `~/.config/TensorHive/hosts_config.ini` after first `tensorhive` launch [(see example)](https://github.com/roscisz/TensorHive/blob/master/TensorHive/hosts_config.ini). To configure more hosts, just add a new section for each.
-
-The "Add watch" button allows to add a new chart which can be configured to show chosen metrics of the selected devices. Currently, the metrics include GPU metrics from nvidia-smi and a process overview with corresponding usernames.
+Web application and API Documentation can be accessed via URLs highlighted in green (Ctrl + click to open in browser)
-#### Reserve resources
+#### Infrastructure monitoring dashboard
-The computing resource reservations can be viewed and managed in the Reservations overview tab. Sample screenshot:
+Accessible infrastructure can be monitored in the Nodes overview tab. Sample screenshot:
-
+Here you can add new watches, configure displayed metrics, monitor running GPU processes and its' owners
+TODO Update screenshot
-The select boxes at the bottom of the page (easily accessible by the Adjust Filters button) allow to specify which nodes or devices should be visible in the view. Adding reservations is possible through selecting a time interval and filling the reservation details in a form. Cancelling reservations is possible for the reservation owner and admin user by clicking on a given reservation and confirming the cancellation.
+#### GPU Reservation calendar
-#### Optional configuration
-You can fully customize TensorHive behaviour from `~/.config/TensorHive/main_config.ini`
-[(see example)](https://github.com/roscisz/TensorHive/blob/master/main_config.ini)
+TODO Update screenshot
+TODO Write new usage instructions
 Features
 ----------------------
@@ -139,7 +121,17 @@ Features
 Deployment in production (for admins)
 -----
-TODO Instructions
+#### Advanced configuration
+You can fully customize TensorHive behaviour from `~/.config/TensorHive/main_config.ini`
+[(see example)](https://github.com/roscisz/TensorHive/blob/master/TensorHive/main_config.ini)
+
+#### Database migration
+TODO
+
+#### Web
+The last step is to launch TensorHive to the public so it can be accessed by users.
+In order to do this you must open `~/.config/TensorHive/main_config.ini` and fill in `host` and `port` under `[web_app.server]` section (`host` field can be either a hostname or IP)
+
 Currently TensorHive is being used on production in these 4 environments:

From 9250619510084879a5338451654c03f66e1e5a14 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Martyniak?=
Date: Fri, 7 Jun 2019 19:39:46 +0200
Subject: [PATCH 04/37] Update README.md

---
 README.md | 38 +++++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 5d1fe0ff..fcbaf532 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,3 @@
-
-
 TensorHive
 ===
@@ -93,27 +91,45 @@ TODO Write new usage instructions
 Features
 ----------------------
 #### Core
-- [x] :mag_right: Monitor GPU parameters on each host
+- [x] :mag_right: Monitor metrics on each host
+ - [x] :tm: Nvidia GPUs
+ - [ ] :pager: CPU, RAM, HDD
 - [x] :customs: Protection of reserved resources
 - [x] :warning: Send warning messages to terminal of users who violate the rules
 - [x] :mailbox_with_no_mail: Send e-mail warnings
- - [ ] :bomb: Kill unwated processes
-- [X] :rocket: Automatic execution of user's predefined command
-- [x] :watch: Track wasted reservation time (idle)
- - [ ] Remind user when his reservation starts and ends
- - [ ] Send e-mail if idle for too long
-#### Dashboard
+ - [ ] :bomb: Kill unwanted processes
+- [X] :rocket: Task nursery and scheduling
+ - [x] :old_key: Execute any command in the name of a user
+ - [x] :alarm_clock: Schedule spawn and termination time
+ - [x] :repeat: Synchronize process status
+ - [x] :factory: Use `screen` command as backend - user can easily attach to running task
+ - [x] :skull: Remote process interruption, termination or kill
+ - [x] :floppy_disk: Save stdout to disk
+ - [ ] :page_facing_up: Capture stderr
+- [x] :watch: Track wasted (idle) time during reservation
+ - [x] :hocho: Gather and calculate average gpu and mem utilization
+ - [ ] :loudspeaker: Remind user when his reservation starts and ends
+ - [ ] :incoming_envelope: Send e-mail if idle for too long
+#### Web
 - [x] :chart_with_downwards_trend: Configurable charts view
- - [x] GPU metrics and active processes
- - [ ] CPU, RAM, HDD metrics
+ - [x] Metrics and active processes
+ - [ ] Detailed host specification
 - [x] :calendar: Calendar view
 - [x] Allow making reservations for selected GPUs
 - [x] Edit reservations
 - [x] Cancel reservations
+ - [ ] Attach jobs to reservation
+- [x] :baby_symbol: Task nursery
+ - [x] Create parametrized tasks and assign to hosts, automatically set `CUDA_VISIBLE_DEVICES`
+ - [x] Buttons for task spawning/scheduling/termination/killing actions
+ - [x] Fetch log produced by running task
+ - [x] Group actions (spawn, schedule, terminate, kill selected)
 - [ ] :scroll: Detailed hardware specification view
 - [ ] :penguin: Admin panel
 - [ ] User banning
 - [ ] Accept/reject reservation requests
+ - [ ] Modify rules on-the-fly (without restarting)
+ - [ ] Show popups to users (something like message of the day - `motd`)
 #### API
 - [x] OpenAPI 2.0 specification with Swagger UI

From 6b5a875b88244b1a891803acb0280d265d8197b3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Martyniak?=
Date: Fri, 7 Jun 2019 22:01:16 +0200
Subject: [PATCH 05/37] Update README.md

---
 README.md | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index fcbaf532..5cb06820 100644
--- a/README.md
+++ b/README.md
@@ -3,10 +3,13 @@ TensorHive
 ![](https://img.shields.io/badge/release-v0.3-brightgreen.svg?style=popout-square)
 ![](https://img.shields.io/badge/pypi-v0.3-brightgreen.svg?style=popout-square)
+![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square)
 ![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square)
 ![](https://img.shields.io/badge/python-3.5%20|%203.6%20|%203.7-blue.svg?style=popout-square)
 ![](https://img.shields.io/badge/license-Apache%202.0-blue.svg?style=popout-square)
+TODO Issue template to file in repo
+
 TensorHive is an open source system for monitoring and managing computing resources across multiple hosts.
@@ -14,10 +17,24 @@ It solves the most common problems and nightmares about accessing and sharing yo
 It's designed with __simplicty, flexibility and configuration-friendliness__ in mind.
-
-
-Top features
+About project
 ----------------------
+Our goal is to provide solutions for painful problems that ML engineers often have to struggle with when working with remote machines in order to run neural network trainings.
+
+#### You should really consider using TensorHive if anything described in profiles below matches you:
+1. You're and **admin**, who is responsible for managing a cluster (or multiple servers) with powerful GPUs installed.
+- :angry: There are more users than resources, so they have to compete for it, but you don't know how to deal with that chaos
+- :ocean: Other popular tools are simply an overkill, have different purpose or require a lot of time to spend on reading documentation, installation and configuration (Graphana, Kubernetes, Slurm)
+- :penguin: People that are using your infrastructure expect only one interface for all the things related to training models (besides terminal): monitoring, reservation calendar and scheduling distributed jobs
+- :collision: Can't risk messing up sensitive configuration by installing software on each individual machine, prefering centralized solution which can be managed from one place.
+
+2. Standalone user who has access to beefy GPUs scatterd across multiple machines
+- :part_alternation_mark: You want to be able to determine if batch size is too small or if there's a bottleneck when moving data from memory to GPU - `gpu_util`, `mem_util`, `mem_used` metrics are great for this purpose
+- :date: Visualizing names of training experiments using calendar helps you track how you're progressing on the project
+- Launching distributed trainings is essential for you, no matter what the framework is
+- :dizzy_face: Managing a list of training commands for all your distributed training experiments drives you nuts (because you wrap them into bash scripts)
+- :zzz: Remembering to manually launch the training before going sleep is no fun anymore
+
 :one: Users can make GPU reservations for specific time range in advance via **reservation mechanism***
@@ -110,6 +127,7 @@ Features
 - [x] :hocho: Gather and calculate average gpu and mem utilization
 - [ ] :loudspeaker: Remind user when his reservation starts and ends
 - [ ] :incoming_envelope: Send e-mail if idle for too long
+
 #### Web
 - [x] :chart_with_downwards_trend: Configurable charts view
 - [x] Metrics and active processes
@@ -130,6 +148,10 @@ Features
 - [ ] Accept/reject reservation requests
 - [ ] Modify rules on-the-fly (without restarting)
 - [ ] Show popups to users (something like message of the day - `motd`)
+
+#### CLI
+- [ ] Implement command-line app that communicates with core via API
+- [ ] Migrate all features from web app that don't require GUI (so no charts)
 #### API
 - [x] OpenAPI 2.0 specification with Swagger UI

From b9d013a10211185dec6e962fbc1362d325a38555 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Martyniak?=
Date: Fri, 7 Jun 2019 22:25:48 +0200
Subject: [PATCH 06/37] Update README.md

---
 README.md | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 5cb06820..d39d9397 100644
--- a/README.md
+++ b/README.md
@@ -22,27 +22,29 @@ About project
 Our goal is to provide solutions for painful problems that ML engineers often have to struggle with when working with remote machines in order to run neural network trainings.
 #### You should really consider using TensorHive if anything described in profiles below matches you:
-1. You're and **admin**, who is responsible for managing a cluster (or multiple servers) with powerful GPUs installed.
+1. You're an **admin**, who is responsible for managing a cluster (or multiple servers) with powerful GPUs installed.
 - :angry: There are more users than resources, so they have to compete for it, but you don't know how to deal with that chaos
 - :ocean: Other popular tools are simply an overkill, have different purpose or require a lot of time to spend on reading documentation, installation and configuration (Graphana, Kubernetes, Slurm)
 - :penguin: People that are using your infrastructure expect only one interface for all the things related to training models (besides terminal): monitoring, reservation calendar and scheduling distributed jobs
 - :collision: Can't risk messing up sensitive configuration by installing software on each individual machine, prefering centralized solution which can be managed from one place.
-2. Standalone user who has access to beefy GPUs scatterd across multiple machines
-- :part_alternation_mark: You want to be able to determine if batch size is too small or if there's a bottleneck when moving data from memory to GPU - `gpu_util`, `mem_util`, `mem_used` metrics are great for this purpose
+2. You're a **standalone user** who has access to beefy GPUs scatterd across multiple machines
+- :part_alternation_mark: You want to be able to determine if batch size is too small or if there's a bottleneck when moving data from memory to GPU - charts with `gpu_util`, `mem_util`, `mem_used` metrics are great for this purpose
 - :date: Visualizing names of training experiments using calendar helps you track how you're progressing on the project
-- Launching distributed trainings is essential for you, no matter what the framework is
-- :dizzy_face: Managing a list of training commands for all your distributed training experiments drives you nuts (because you wrap them into bash scripts)
+- :snake: Launching distributed trainings is essential for you, no matter what the framework is
+- :dizzy_face: Managing a list of training commands for all your distributed training experiments drives you nuts (because you didn't wrap them into bash scripts)
 - :zzz: Remembering to manually launch the training before going sleep is no fun anymore
+#### What TensorHive has to offer*
+:zero: Dead-simple one-machine installation and configuration, no `sudo` requirements
-:one: Users can make GPU reservations for specific time range in advance via **reservation mechanism***
+:one: Users can make GPU reservations for specific time range in advance via **reservation mechanism**
      :arrow_right: no more frustration caused by rules: **"first come, first served"** or **"the law of the jungle"**.
 :two: Users can prepare and schedule custom tasks (commands) to be run on selected GPUs and hosts
-     :arrow_right: automate and simplify **distributed trainings** - **"one button to rule them all"***
+     :arrow_right: automate and simplify **distributed trainings** - **"one button to rule them all"**
 :three: Gather all useful GPU metrics, from all configured hosts **in one dashboard**
@@ -117,10 +119,10 @@ Features
 - [ ] :bomb: Kill unwanted processes
 - [X] :rocket: Task nursery and scheduling
 - [x] :old_key: Execute any command in the name of a user
- - [x] :alarm_clock: Schedule spawn and termination time
+ - [x] :alarm_clock: Schedule spawn and termination
 - [x] :repeat: Synchronize process status
 - [x] :factory: Use `screen` command as backend - user can easily attach to running task
- - [x] :skull: Remote process interruption, termination or kill
+ - [x] :skull: Remote process interruption, termination and kill
 - [x] :floppy_disk: Save stdout to disk
 - [ ] :page_facing_up: Capture stderr
 - [x] :watch: Track wasted (idle) time during reservation
@@ -131,7 +133,7 @@ Features
 #### Web
 - [x] :chart_with_downwards_trend: Configurable charts view
 - [x] Metrics and active processes
- - [ ] Detailed host specification
+ - [ ] Detailed harware specification (CPU clock speed, RAM, etc.)
 - [x] :calendar: Calendar view
 - [x] Allow making reservations for selected GPUs
 - [x] Edit reservations
@@ -182,14 +184,15 @@ Currently TensorHive is being used on production in these 4 environments:
 Application examples and benchmarks
 --------
-Along with TensorHive, we are developing a set of [**sample deep neural network training applications**](https://github.com/roscisz/TensorHive/tree/master/examples) in Distributed TensorFlow which will be used as test applications for the system. They can also serve as benchmarks for various GPU, distributed multiGPU and distributed multinode architectures. For each example, a full set of instructions to reproduce is given.
+Along with TensorHive, we are developing a set of [**sample deep neural network training applications**](https://github.com/roscisz/TensorHive/tree/master/examples) in Distributed TensorFlow which will be used as test applications for the system. They can also serve as benchmarks for single GPU, distributed multi-GPU and distributed multi-node architectures. For each example, a full set of instructions to reproduce is provided.
 Contibution and feedback
 ------------------------
-We'd :heart: to collect your observations, issues and pull requests.
+We'd :heart: to collect your observations, issues and pull requests!
+TODO Add issue template to repo. Put link to Issues here
 You can do this by making use of our [**issue template**](https://gist.github.com/micmarty/396c649bf693688245731f35854bf971).
 Credits

From b7d06795dc301b0e79ab8991fe561617d3f87031 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Martyniak?=
Date: Fri, 7 Jun 2019 22:33:33 +0200
Subject: [PATCH 07/37] Update README.md

---
 README.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index d39d9397..985058bd 100644
--- a/README.md
+++ b/README.md
@@ -8,8 +8,6 @@ TensorHive
 ![](https://img.shields.io/badge/python-3.5%20|%203.6%20|%203.7-blue.svg?style=popout-square)
 ![](https://img.shields.io/badge/license-Apache%202.0-blue.svg?style=popout-square)
-TODO Issue template to file in repo
-
 TensorHive is an open source system for monitoring and managing computing resources across multiple hosts.
@@ -35,7 +33,8 @@ Our goal is to provide solutions for painful problems that ML engineers often ha
 - :dizzy_face: Managing a list of training commands for all your distributed training experiments drives you nuts (because you didn't wrap them into bash scripts)
 - :zzz: Remembering to manually launch the training before going sleep is no fun anymore
-#### What TensorHive has to offer*
+What TensorHive has to offer
+-----------------------------
 :zero: Dead-simple one-machine installation and configuration, no `sudo` requirements
 :one: Users can make GPU reservations for specific time range in advance via **reservation mechanism**
@@ -50,7 +49,7 @@ Our goal is to provide solutions for painful problems that ML engineers often ha
      :arrow_right: no more manually logging in to each individual machine in order to check if GPU is currently taken or not
-**\*** For more details, check out the full list of [features](#features)
+For more details, check out the [full list of features](#features)
 Getting started
 ---------------

From 3a85a79e55778ab90b15db7140b5316af24665e5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Martyniak?=
Date: Fri, 7 Jun 2019 22:46:55 +0200
Subject: [PATCH 08/37] Update README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 985058bd..4c701fa9 100644
--- a/README.md
+++ b/README.md
@@ -20,14 +20,14 @@ About project
 Our goal is to provide solutions for painful problems that ML engineers often have to struggle with when working with remote machines in order to run neural network trainings.
 #### You should really consider using TensorHive if anything described in profiles below matches you:
-1. You're an **admin**, who is responsible for managing a cluster (or multiple servers) with powerful GPUs installed.
+1. You're an **admin**, who is responsible for managing a cluster (or multiple servers) with powerful GPUs installed.
 - :angry: There are more users than resources, so they have to compete for it, but you don't know how to deal with that chaos
 - :ocean: Other popular tools are simply an overkill, have different purpose or require a lot of time to spend on reading documentation, installation and configuration (Graphana, Kubernetes, Slurm)
 - :penguin: People that are using your infrastructure expect only one interface for all the things related to training models (besides terminal): monitoring, reservation calendar and scheduling distributed jobs
 - :collision: Can't risk messing up sensitive configuration by installing software on each individual machine, prefering centralized solution which can be managed from one place.
-2. You're a **standalone user** who has access to beefy GPUs scatterd across multiple machines
-- :part_alternation_mark: You want to be able to determine if batch size is too small or if there's a bottleneck when moving data from memory to GPU - charts with `gpu_util`, `mem_util`, `mem_used` metrics are great for this purpose
+2. You're a **standalone user** who has access to beefy GPUs scatterd across multiple machines.
+- :part_alternation_mark: You want to be able to determine if batch size is too small or if there's a bottleneck when moving data from memory to GPU - charts with metrics such as `gpu_util`, `mem_util`, `mem_used` are great for this purpose
 - :date: Visualizing names of training experiments using calendar helps you track how you're progressing on the project
 - :snake: Launching distributed trainings is essential for you, no matter what the framework is
 - :dizzy_face: Managing a list of training commands for all your distributed training experiments drives you nuts (because you didn't wrap them into bash scripts)

From ea10c52f33fd22ebb9bd88c3ec14a8c2d7adbd4e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Martyniak?=
Date: Fri, 7 Jun 2019 23:16:44 +0200
Subject: [PATCH 09/37] More descriptive and user-friendly example

---
 tensorhive/hosts_config.ini | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/tensorhive/hosts_config.ini b/tensorhive/hosts_config.ini
index 49a219e4..67d4ac12 100644
--- a/tensorhive/hosts_config.ini
+++ b/tensorhive/hosts_config.ini
@@ -1,12 +1,24 @@
-# Example:
-; [example_hostname]
-; user = example_username
+# Example SSH configuration
+# 1. Hosts must be accessible without password (key-based authentication)
+# You can check if you're if good to go with this command: `ssh username@hostname`
+# If command above does not not work, make sure to check out our detailed instructions in `README.md`
+# 2. Hostname aliases defined in ~/.ssh/config are not respected, you must provide full hostname or address
+# 3. 22 is default port value, so in most cases including this line is optional
+# 4. Uncomment lines you need (starting with ;)
+; [example_hostname_0]
+; user = example_username_0
 ; port = 22
-# Here you can enable proxy for all defined hosts
-# ParallelSSHClient -> Proxy host -> Target hosts
+; [example_hostname_1]
+; user = example_username_1
+
+; [example_hostname_2]
+; user = example_username_2
+
+# Here you can configure proxy server for connecting to all hosts defined above
+# TensorHive -> Proxy host -> Target hosts
 ; [proxy_tunneling]
-; enabled = no
+; enabled = yes
 ; proxy_host = example_proxy_hostname
 ; proxy_user = example_proxy_username
 ; proxy_port = 22

From 1b1d125c31508467a4fe42c0a14626ce9a8622e6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Micha=C5=82=20Martyniak?=
Date: Fri, 7 Jun 2019 23:17:50 +0200
Subject: [PATCH 10/37] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 4c701fa9..1a6da1a4 100644
--- a/README.md
+++ b/README.md
@@ -77,7 +77,7 @@ conda install tensorhive
 git clone https://github.com/roscisz/TensorHive.git && cd TensorHive
 make dev
 ```
-TensorHive is already shipped with newest web app build, but in case you modify the source, you can can build it with `make app`. For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/TensorHive/Makefile).
+TensorHive is already shipped with newest web app build, but in case you modify the source, you can can build it with `make app`. For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/tensorhive/Makefile).
Basic usage ----- @@ -162,7 +162,7 @@ Deployment in production (for admins) ----- #### Advanced configuration You can fully customize TensorHive behaviour from `~/.config/TensorHive/main_config.ini` -[(see example)](https://github.com/roscisz/TensorHive/blob/master/TensorHive/main_config.ini) +[(see example)](https://github.com/roscisz/TensorHive/blob/master/tensorhive/main_config.ini) #### Database migration TODO From dbdaa4b4b9c4632e0698b272a24b38b9570dc362 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 7 Jun 2019 23:51:14 +0200 Subject: [PATCH 11/37] Update README.md --- README.md | 28 ++++++++++++++++++++++------ 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 1a6da1a4..4e70f840 100644 --- a/README.md +++ b/README.md @@ -54,7 +54,7 @@ For more details, check out the [full list of features](#features) Getting started --------------- ### Prerequisites -* All hosts must be accessible via SSH, without password, using SSH Key-Based Authentication ([TODO OUR OWN LINK set up SSH keys](https://www.ssh.com/ssh/keygen/)) +* All hosts must be accessible via SSH, without password, using SSH Key-Based Authentication ([How to set up SSH keys](https://www.shellhacks.com/ssh-login-without-password/) * Only NVIDIA GPUs are supported (relying on ```nvidia-smi``` command) ### Installation @@ -70,7 +70,7 @@ conda install tensorhive ``` #### From source -(optional) For development purposes we encourage separation from your current python packages using e.g. Miniconda (TODO) +(optional) For development purposes we encourage separation from your current python packages using e.g. [Miniconda](https://docs.conda.io/en/latest/miniconda.html) `conda create --name th_env python=3.5 pip; activate th_env` ```shell @@ -98,13 +98,28 @@ Web application and API Documentation can be accessed via URLs highlighted in gr Accessible infrastructure can be monitored in the Nodes overview tab. 
Sample screenshot: -Here you can add new watches, configure displayed metrics, monitor running GPU processes and its' owners -TODO Update screenshot +Here you can add new watches, configure displayed metrics, monitor running GPU processes and its' owners. + + +TODO Updated screenshot #### GPU Reservation calendar -TODO Update screenshot -TODO Write new usage instructions +Each column represents all reservation events for a GPU on a given day. +In order to make a new reservation simply click and drag with your mouse, select GPU(s), add some meaningful title, optionally adjust time range. + +If there are many hosts and GPUs in our infrastructure, you can use our simplified, horizontal calendar to quickly identify empty time slots and filter out already reserved GPUs. + + +TODO Updated screenshot + +From now on, **only your processes are eligible to run on reserved GPU(s)**. TensorHive periodically checks if some other user has violated it. He will be spammed with warnings on all his PTYs, emailed every once in a while, additionally admin will also be notified (it all depends on the configuration). + +#### Task nursery + +Here you can prepare commands +Simple but powerful command templating mechanism allows for **TODO** +Feel free to experiment **TODO** Features ---------------------- @@ -192,6 +207,7 @@ Contibution and feedback We'd :heart: to collect your observations, issues and pull requests! TODO Add issue template to repo. Put link to Issues here + You can do this by making use of our [**issue template**](https://gist.github.com/micmarty/396c649bf693688245731f35854bf971). 
Credits From 0ca165b3137f9e6b253028db9060a77202615a38 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Sat, 8 Jun 2019 12:32:26 +0200 Subject: [PATCH 12/37] Update README.md --- README.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 4e70f840..37c79b57 100644 --- a/README.md +++ b/README.md @@ -147,7 +147,7 @@ Features #### Web - [x] :chart_with_downwards_trend: Configurable charts view - [x] Metrics and active processes - - [ ] Detailed harware specification (CPU clock speed, RAM, etc.) + - [ ] Detailed harware specification - [x] :calendar: Calendar view - [x] Allow making reservations for selected GPUs - [x] Edit reservations @@ -158,7 +158,7 @@ Features - [x] Buttons for task spawning/scheduling/termination/killing actions - [x] Fetch log produced by running task - [x] Group actions (spawn, schedule, terminate, kill selected) -- [ ] :scroll: Detailed hardware specification view +- [ ] :straight_ruler: Detailed hardware specification panel (CPU clock speed, RAM, etc.) 
- [ ] :penguin: Admin panel - [ ] User banning - [ ] Accept/reject reservation requests @@ -216,9 +216,13 @@ Project created and maintained by: - Paweł Rościszewski [(@roscisz)](https://github.com/roscisz) - ![](https://avatars2.githubusercontent.com/u/12485656?s=22&v=4) [Michał Martyniak (@micmarty)](http://martyniak.me) - Filip Schodowski [(@filschod)](https://github.com/filschod) + + Top contributions: - Tomasz Menet [(@tomenet)](https://github.com/tomenet) +- Dariusz Piotrowski [(@PiotrowskiD)](https://github.com/PiotrowskiD) - Karol Draszawka [(@szarakawka)](https://github.com/szarakawka) + License ------- [Apache License 2.0](https://github.com/roscisz/TensorHive/blob/master/LICENSE) From f74b78efdf5cb09818ef18bb512a99475814d42c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Sat, 8 Jun 2019 13:55:38 +0200 Subject: [PATCH 13/37] Update README.md --- README.md | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 37c79b57..7be0ad12 100644 --- a/README.md +++ b/README.md @@ -89,7 +89,7 @@ tensorhive #### Required configuration As you see, you must configure TensorHive so it knows how to establish SSH connections to hosts you want to work with. -You can do this by editing `~/.config/TensorHive/hosts_config.ini` after first `tensorhive` launch [(see example)](https://github.com/roscisz/TensorHive/blob/master/TensorHive/hosts_config.ini). To configure more hosts, just add a new section for each. +You can do this by editing `~/.config/TensorHive/hosts_config.ini` after first `tensorhive` launch [(see example)](https://github.com/roscisz/TensorHive/blob/master/TensorHive/hosts_config.ini). In order to configure more hosts, just add a new section for each. 
Web application and API Documentation can be accessed via URLs highlighted in green (Ctrl + click to open in browser) @@ -179,7 +179,7 @@ Deployment in production (for admins) You can fully customize TensorHive behaviour from `~/.config/TensorHive/main_config.ini` [(see example)](https://github.com/roscisz/TensorHive/blob/master/tensorhive/main_config.ini) -#### Database migration +#### Mailbot TODO #### Web @@ -187,14 +187,18 @@ The last step is to launch TensorHive to the public so it can be accessed by use In order to do this you must open `~/.config/TensorHive/main_config.ini` and fill in `host` and `port` under `[web_app.server]` section (`host` field can be either a hostname or IP) +#### Database migrations +TODO + + Currently TensorHive is being used on production in these 4 environments: | Where | Hardware | No. users | | ------ | -------- | --------- | -| [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB | TODO | -| [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | TODO | -| [Gradient PG](http://gradient.eti.pg.gda.pl/en/) | TITAN X 12GB | TODO | -| [VoiceLab - Conversational Intelligence](voicelab.ai) | TODO | TODO +| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB | TODO | +| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | TODO | +| ![](http://martyniak.tech/images/gradient_logo_small-628ed211.png)[Gradient PG](http://gradient.eti.pg.gda.pl/en/) | TITAN X 12GB | TODO | +| ![](https://res-4.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_20,w_20,f_auto,q_auto:eco/v1444894092/jeuh0l6opc159e1ltzky.png) [VoiceLab - 
Conversational Intelligence](voicelab.ai) | TODO | TODO Application examples and benchmarks -------- From d7092e546758b26a8224765b3fc38a4a60d5ad0c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Sat, 8 Jun 2019 13:56:16 +0200 Subject: [PATCH 14/37] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 7be0ad12..8d4f6c0a 100644 --- a/README.md +++ b/README.md @@ -193,7 +193,7 @@ TODO Currently TensorHive is being used on production in these 4 environments: -| Where | Hardware | No. users | +| Organization | Hardware | No. users | | ------ | -------- | --------- | | ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB | TODO | | ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | TODO | From 161d65679a40d0cc7866ef6b8e7da07151998aef Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Sat, 8 Jun 2019 15:06:34 +0200 Subject: [PATCH 15/37] Add architecture diagram --- README.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/README.md b/README.md index 8d4f6c0a..f492ec43 100644 --- a/README.md +++ b/README.md @@ -206,6 +206,14 @@ Along with TensorHive, we are developing a set of [**sample deep neural network
+TensorHive architecture (simplified) +----------------------- + +This diagram will help you grasp the rough concept of the system. + +![TensorHive_diagram _final](https://user-images.githubusercontent.com/12485656/59147556-7853cd80-89fd-11e9-80bc-5848e95c7574.png) + + Contibution and feedback ------------------------ We'd :heart: to collect your observations, issues and pull requests! From 7cd1b839c0fe7425290e3fe1078a63ed048b712f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Pawe=C5=82=20Ro=C5=9Bciszewski?= Date: Mon, 10 Jun 2019 17:20:49 +0200 Subject: [PATCH 16/37] add project info --- README.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/README.md b/README.md index f492ec43..f63ce23b 100644 --- a/README.md +++ b/README.md @@ -224,6 +224,11 @@ You can do this by making use of our [**issue template**](https://gist.github.co Credits ------- + +TensorHive has been created within a joint project between [**VoiceLab.ai**](https://voicelab.ai) and +[**Gdańsk University of Technology**](https://pg.edu.pl/) titled: "Exploration and selection of methods +for parallelization of neural network training using multiple GPUs". + Project created and maintained by: - Paweł Rościszewski [(@roscisz)](https://github.com/roscisz) - ![](https://avatars2.githubusercontent.com/u/12485656?s=22&v=4) [Michał Martyniak (@micmarty)](http://martyniak.me) From 495e3215be8969ae9085cdd4d07f9972f1d4dbea Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Mon, 10 Jun 2019 18:48:08 +0200 Subject: [PATCH 17/37] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Paweł Rościszewski --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f63ce23b..42e2d1ea 100644 --- a/README.md +++ b/README.md @@ -23,7 +23,7 @@ Our goal is to provide solutions for painful problems that ML engineers often ha 1. 
You're an **admin**, who is responsible for managing a cluster (or multiple servers) with powerful GPUs installed. - :angry: There are more users than resources, so they have to compete for it, but you don't know how to deal with that chaos - :ocean: Other popular tools are simply an overkill, have different purpose or require a lot of time to spend on reading documentation, installation and configuration (Graphana, Kubernetes, Slurm) -- :penguin: People that are using your infrastructure expect only one interface for all the things related to training models (besides terminal): monitoring, reservation calendar and scheduling distributed jobs +- :penguin: People using your infrastructure expect only one interface for all the things related to training models (besides terminal): monitoring, reservation calendar and scheduling distributed jobs - :collision: Can't risk messing up sensitive configuration by installing software on each individual machine, prefering centralized solution which can be managed from one place. 2. You're a **standalone user** who has access to beefy GPUs scatterd across multiple machines. From 768294b2335465d612f7e24a84e8b508147dc875 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Mon, 10 Jun 2019 18:57:25 +0200 Subject: [PATCH 18/37] Update tensorhive/hosts_config.ini MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Paweł Rościszewski --- tensorhive/hosts_config.ini | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tensorhive/hosts_config.ini b/tensorhive/hosts_config.ini index 67d4ac12..376529ac 100644 --- a/tensorhive/hosts_config.ini +++ b/tensorhive/hosts_config.ini @@ -1,6 +1,6 @@ # Example SSH configuration # 1. 
Hosts must be accessible without password (key-based authentication) -# You can check if you're if good to go with this command: `ssh username@hostname` +# You can check if you're good to go with this command: `ssh username@hostname` # If command above does not not work, make sure to check out our detailed instructions in `README.md` # 2. Hostname aliases defined in ~/.ssh/config are not respected, you must provide full hostname or address # 3. 22 is default port value, so in most cases including this line is optional From 527fe52346442e531cddc913548b3ca8b2d2defe Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Mon, 10 Jun 2019 18:57:48 +0200 Subject: [PATCH 19/37] Update tensorhive/hosts_config.ini MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Paweł Rościszewski --- tensorhive/hosts_config.ini | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tensorhive/hosts_config.ini b/tensorhive/hosts_config.ini index 376529ac..65a8b6a4 100644 --- a/tensorhive/hosts_config.ini +++ b/tensorhive/hosts_config.ini @@ -1,7 +1,7 @@ # Example SSH configuration # 1. Hosts must be accessible without password (key-based authentication) # You can check if you're good to go with this command: `ssh username@hostname` -# If command above does not not work, make sure to check out our detailed instructions in `README.md` +# If the command above does not not work, make sure to check out our detailed instructions in `README.md` # 2. Hostname aliases defined in ~/.ssh/config are not respected, you must provide full hostname or address # 3. 22 is default port value, so in most cases including this line is optional # 4. 
Uncomment lines you need (starting with ;) From b51b9bc408ba9feb9f06bc11bad3ee30eecdbac7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Mon, 15 Jul 2019 16:12:26 +0200 Subject: [PATCH 20/37] Add Nvidia badge --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 42e2d1ea..176e2d42 100644 --- a/README.md +++ b/README.md @@ -5,6 +5,7 @@ TensorHive ![](https://img.shields.io/badge/pypi-v0.3-brightgreen.svg?style=popout-square) ![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square) ![](https://img.shields.io/badge/platform-Linux-blue.svg?style=popout-square) +![](https://img.shields.io/badge/hardware-Nvidia-green.svg?style=popout-square) ![](https://img.shields.io/badge/python-3.5%20|%203.6%20|%203.7-blue.svg?style=popout-square) ![](https://img.shields.io/badge/license-Apache%202.0-blue.svg?style=popout-square) From 35c281ee40eb4414d4bd4fbcbcf4153a7bf08edd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 19 Jul 2019 09:12:41 +0200 Subject: [PATCH 21/37] Update README.md --- README.md | 27 +++++++++++++++------------ 1 file changed, 15 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index 176e2d42..5752be1c 100644 --- a/README.md +++ b/README.md @@ -60,38 +60,41 @@ Getting started ### Installation -#### via pip +#### via pip (not updated yet) ```shell pip install tensorhive ``` -#### via conda +#### via conda (not updated yet) ```shell conda install tensorhive ``` -#### From source +#### From source (recommended) (optional) For development purposes we encourage separation from your current python packages using e.g. [Miniconda](https://docs.conda.io/en/latest/miniconda.html) `conda create --name th_env python=3.5 pip; activate th_env` ```shell git clone https://github.com/roscisz/TensorHive.git && cd TensorHive -make dev +git checkout fixes/voicelab +pip install --editable . 
``` TensorHive is already shipped with newest web app build, but in case you modify the source, you can can build it with `make app`. For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/tensorhive/Makefile). Basic usage ----- -#### Run TensorHive -```shell -tensorhive +#### Quickstart +Each command will guide you through basic configuration process: +``` +tensorhive init +tensorhive key +tensorhive test ``` -#### Required configuration -As you see, you must configure TensorHive so it knows how to establish SSH connections to hosts you want to work with. - -You can do this by editing `~/.config/TensorHive/hosts_config.ini` after first `tensorhive` launch [(see example)](https://github.com/roscisz/TensorHive/blob/master/TensorHive/hosts_config.ini). In order to configure more hosts, just add a new section for each. - +Now you should be ready to finally launch a TensorHive instance +``` +tensorhive +``` Web application and API Documentation can be accessed via URLs highlighted in green (Ctrl + click to open in browser) From 1ebba1c766e467c18fe793c52720023c793259a9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 19 Jul 2019 09:44:47 +0200 Subject: [PATCH 22/37] Update README.md --- README.md | 54 ++++++++++++++++++++++++++---------------------------- 1 file changed, 26 insertions(+), 28 deletions(-) diff --git a/README.md b/README.md index 5752be1c..dbe04146 100644 --- a/README.md +++ b/README.md @@ -52,6 +52,9 @@ What TensorHive has to offer For more details, check out the [full list of features](#features) + + + Getting started --------------- ### Prerequisites @@ -79,7 +82,8 @@ git clone https://github.com/roscisz/TensorHive.git && cd TensorHive git checkout fixes/voicelab pip install --editable . ``` -TensorHive is already shipped with newest web app build, but in case you modify the source, you can can build it with `make app`. 
For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/tensorhive/Makefile). + +TensorHive is already shipped with newest web app build, but in case you modify the source, you can can build it with `make app` (currently on `master` branch). For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/tensorhive/Makefile). Basic usage ----- @@ -98,14 +102,21 @@ tensorhive Web application and API Documentation can be accessed via URLs highlighted in green (Ctrl + click to open in browser) -#### Infrastructure monitoring dashboard +#### Advanced configuration +You can fully customize TensorHive behaviours via INI configuration (which will be created automatically after `tensorhive init` +``` +~/.config/TensorHive/main_config.ini +~/.config/TensorHive/mailbot_config.ini +~/.config/TensorHive/hosts_config.ini +``` +[(see example)](https://github.com/roscisz/TensorHive/blob/master/tensorhive/main_config.ini) +#### Infrastructure monitoring dashboard Accessible infrastructure can be monitored in the Nodes overview tab. Sample screenshot: Here you can add new watches, configure displayed metrics, monitor running GPU processes and its' owners. - -TODO Updated screenshot +![image](https://user-images.githubusercontent.com/12485656/61517685-188f1000-aa08-11e9-9f7c-2a4b10ce0dd3.png) #### GPU Reservation calendar @@ -114,8 +125,8 @@ In order to make a new reservation simply click and drag with your mouse, select If there are many hosts and GPUs in our infrastructure, you can use our simplified, horizontal calendar to quickly identify empty time slots and filter out already reserved GPUs. - -TODO Updated screenshot +(UI prototype: redesign is coming) +![image](https://user-images.githubusercontent.com/12485656/61517527-bb935a00-aa07-11e9-8ea3-9db4a1529e24.png) From now on, **only your processes are eligible to run on reserved GPU(s)**. TensorHive periodically checks if some other user has violated it. 
He will be spammed with warnings on all his PTYs, emailed every once in a while, additionally admin will also be notified (it all depends on the configuration). @@ -125,6 +136,9 @@ Here you can prepare commands Simple but powerful command templating mechanism allows for **TODO** Feel free to experiment **TODO** +![image](https://user-images.githubusercontent.com/12485656/61518173-4163d500-aa09-11e9-9916-59c907c1590c.png) +![image](https://user-images.githubusercontent.com/12485656/61518418-bcc58680-aa09-11e9-8943-88bddc964417.png) + Features ---------------------- #### Core @@ -177,32 +191,16 @@ Features - [x] OpenAPI 2.0 specification with Swagger UI - [x] User authentication via JWT -Deployment in production (for admins) ------ -#### Advanced configuration -You can fully customize TensorHive behaviour from `~/.config/TensorHive/main_config.ini` -[(see example)](https://github.com/roscisz/TensorHive/blob/master/tensorhive/main_config.ini) - -#### Mailbot -TODO -#### Web -The last step is to launch TensorHive to the public so it can be accessed by users. -In order to do this you must open `~/.config/TensorHive/main_config.ini` and fill in `host` and `port` under `[web_app.server]` section (`host` field can be either a hostname or IP) - - -#### Database migrations -TODO - - -Currently TensorHive is being used on production in these 4 environments: +TensorHive is currently being used on production in the following environments: +----- | Organization | Hardware | No. 
users | | ------ | -------- | --------- | -| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB | TODO | -| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | TODO | -| ![](http://martyniak.tech/images/gradient_logo_small-628ed211.png)[Gradient PG](http://gradient.eti.pg.gda.pl/en/) | TITAN X 12GB | TODO | -| ![](https://res-4.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_20,w_20,f_auto,q_auto:eco/v1444894092/jeuh0l6opc159e1ltzky.png) [VoiceLab - Conversational Intelligence](voicelab.ai) | TODO | TODO +| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB | 30+ | +| ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | 20+ | +| ![](http://martyniak.tech/images/gradient_logo_small-628ed211.png)[Gradient PG](http://gradient.eti.pg.gda.pl/en/) | TITAN X 12GB | 10+ | +| ![](https://res-4.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_20,w_20,f_auto,q_auto:eco/v1444894092/jeuh0l6opc159e1ltzky.png) [VoiceLab - Conversational Intelligence](voicelab.ai) | 30+ GTX and RTX cards | 10+ Application examples and benchmarks -------- From 8d2684cd82a6bf94be6edad9af81d9cb210e7860 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 19 Jul 2019 10:04:19 +0200 Subject: [PATCH 23/37] Update README.md --- README.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index 
dbe04146..6105288e 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,5 @@ TensorHive === - ![](https://img.shields.io/badge/release-v0.3-brightgreen.svg?style=popout-square) ![](https://img.shields.io/badge/pypi-v0.3-brightgreen.svg?style=popout-square) ![](https://img.shields.io/badge/Issues%20and%20PRs-welcome-yellow.svg?style=popout-square) @@ -52,14 +51,12 @@ What TensorHive has to offer For more details, check out the [full list of features](#features) - - - Getting started --------------- ### Prerequisites -* All hosts must be accessible via SSH, without password, using SSH Key-Based Authentication ([How to set up SSH keys](https://www.shellhacks.com/ssh-login-without-password/) +* All nodes must be accessible via SSH, without password, using SSH Key-Based Authentication ([How to set up SSH keys](https://www.shellhacks.com/ssh-login-without-password/) - explained in [Quickstart section](#basic-usage) * Only NVIDIA GPUs are supported (relying on ```nvidia-smi``` command) +* Currently TensorHive assumes that all users who want to register into the system must have identical UNIX usernames on all nodes configured by TensorHive administrator (not relevant to standalone developers) ### Installation @@ -113,8 +110,7 @@ You can fully customize TensorHive behaviours via INI configuration (which will #### Infrastructure monitoring dashboard Accessible infrastructure can be monitored in the Nodes overview tab. Sample screenshot: - -Here you can add new watches, configure displayed metrics, monitor running GPU processes and its' owners. +Here you can add new watches, select metrics and monitor ongoing GPU processes and its' owners. ![image](https://user-images.githubusercontent.com/12485656/61517685-188f1000-aa08-11e9-9f7c-2a4b10ce0dd3.png) @@ -132,9 +128,11 @@ From now on, **only your processes are eligible to run on reserved GPU(s)**. 
Ten #### Task nursery -Here you can prepare commands -Simple but powerful command templating mechanism allows for **TODO** +Here you can define commands for tasks you want to run on any configured nodes. You can manage them manually or set spawn/terminate date. +Commands are run within `screen` session, so attaching to it while they are running is a piece of cake. +It provides quite simple but flexible command templating mechanism that allows for **TODO** Feel free to experiment **TODO** +(In order for this to work, individual users must copy public TensorHive key into `~/.ssh/authorized_keys` ![image](https://user-images.githubusercontent.com/12485656/61518173-4163d500-aa09-11e9-9916-59c907c1590c.png) ![image](https://user-images.githubusercontent.com/12485656/61518418-bcc58680-aa09-11e9-8943-88bddc964417.png) From 4ad17a31825a1f8e08c2bda64aff461649368c0f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 19 Jul 2019 10:27:48 +0200 Subject: [PATCH 24/37] Update README.md --- README.md | 17 ++++++++++++----- 1 file changed, 12 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 6105288e..69b4c881 100644 --- a/README.md +++ b/README.md @@ -112,7 +112,7 @@ You can fully customize TensorHive behaviours via INI configuration (which will Accessible infrastructure can be monitored in the Nodes overview tab. Sample screenshot: Here you can add new watches, select metrics and monitor ongoing GPU processes and its' owners. -![image](https://user-images.githubusercontent.com/12485656/61517685-188f1000-aa08-11e9-9f7c-2a4b10ce0dd3.png) +![image](https://user-images.githubusercontent.com/12485656/61520152-d963bd80-aa0d-11e9-9caa-1f7203cc6b42.png) #### GPU Reservation calendar @@ -126,15 +126,22 @@ If there are many hosts and GPUs in our infrastructure, you can use our simplifi From now on, **only your processes are eligible to run on reserved GPU(s)**. TensorHive periodically checks if some other user has violated it. 
He will be spammed with warnings on all his PTYs, emailed every once in a while, additionally admin will also be notified (it all depends on the configuration). +Terminal warning | Email warning +:-------------------------:|:-------------------------: +![image](https://user-images.githubusercontent.com/12485656/61520488-99e9a100-aa0e-11e9-8f35-b02c2e7de9ce.png) | ![image](https://user-images.githubusercontent.com/12485656/61520956-85f26f00-aa0f-11e9-8342-09023c93275b.png) + +#### What admin sees: + +![image](https://user-images.githubusercontent.com/12485656/61520807-4a57a500-aa0f-11e9-8a52-cb87208d6c71.png) + #### Task nursery Here you can define commands for tasks you want to run on any configured nodes. You can manage them manually or set spawn/terminate date. Commands are run within `screen` session, so attaching to it while they are running is a piece of cake. -It provides quite simple but flexible command templating mechanism that allows for **TODO** -Feel free to experiment **TODO** -(In order for this to work, individual users must copy public TensorHive key into `~/.ssh/authorized_keys` - ![image](https://user-images.githubusercontent.com/12485656/61518173-4163d500-aa09-11e9-9916-59c907c1590c.png) + +It provides quite simple, but flexible (**framework-agnostic**) command templating mechanism that will help you automate multi-node trainings. +TensorHive requires that users who want to use this feature must append TensorHive's public key to their `~/.ssh/authorized_keys` on all nodes they want to connect to. 
![image](https://user-images.githubusercontent.com/12485656/61518418-bcc58680-aa09-11e9-8943-88bddc964417.png) Features From 775b0a3431f00f97721c945b9a985d0afae058fd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 19 Jul 2019 10:33:33 +0200 Subject: [PATCH 25/37] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 69b4c881..10979dee 100644 --- a/README.md +++ b/README.md @@ -77,7 +77,7 @@ conda install tensorhive ```shell git clone https://github.com/roscisz/TensorHive.git && cd TensorHive git checkout fixes/voicelab -pip install --editable . +pip install . ``` TensorHive is already shipped with newest web app build, but in case you modify the source, you can can build it with `make app` (currently on `master` branch). For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/tensorhive/Makefile). From 0aaf3af980c54f17877926a78c424fa465a0f2bc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 19 Jul 2019 11:08:09 +0200 Subject: [PATCH 26/37] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 10979dee..e275da39 100644 --- a/README.md +++ b/README.md @@ -81,6 +81,7 @@ pip install . ``` TensorHive is already shipped with newest web app build, but in case you modify the source, you can can build it with `make app` (currently on `master` branch). For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/tensorhive/Makefile). 
+Build tested with `Node v11.14.0` and `npm 6.7.0` Basic usage ----- From 22785d03ff0aa59a04ee99baffa1427e8ca3760c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 19 Jul 2019 11:32:34 +0200 Subject: [PATCH 27/37] Update README.md --- README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index e275da39..4b9fbc58 100644 --- a/README.md +++ b/README.md @@ -224,11 +224,12 @@ This diagram will help you grasp the rough concept of the system. Contibution and feedback ------------------------ -We'd :heart: to collect your observations, issues and pull requests! +**The project is still in an early beta version**, so expect some inconveniences; please be patient and keep an eye on upcoming updates. -TODO Add issue template to repo. Put link to Issues here +We'd :heart: to collect your observations, issues and pull requests! -You can do this by making use of our [**issue template**](https://gist.github.com/micmarty/396c649bf693688245731f35854bf971). +Feel free to **report any configuration problems; we will help you**. +We plan to redesign the UI/UX and improve the reliability of the system by September 2019 :shipit:, so stay tuned! 
Credits ------- From d2fee988e89c716ac561b1e8ff640d2b5f4aaa3f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 19 Jul 2019 11:37:17 +0200 Subject: [PATCH 28/37] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 4b9fbc58..8646a156 100644 --- a/README.md +++ b/README.md @@ -206,7 +206,7 @@ TensorHive is currently being used on production in the following environments: | ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Gdansk University of Technology](https://eti.pg.edu.pl/en) | NVIDIA DGX Station (4x Tesla V100 16GB | 30+ | | ![](https://cdn.pg.edu.pl/ekontakt-updated-theme/images/favicon/favicon-16x16.png?v=jw6lLb8YQ4) [Lab at GUT](https://eti.pg.edu.pl/katedra-architektury-systemow-komputerowych/main) | 18x machines with GTX 1060 6GB | 20+ | | ![](http://martyniak.tech/images/gradient_logo_small-628ed211.png)[Gradient PG](http://gradient.eti.pg.gda.pl/en/) | TITAN X 12GB | 10+ | -| ![](https://res-4.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_20,w_20,f_auto,q_auto:eco/v1444894092/jeuh0l6opc159e1ltzky.png) [VoiceLab - Conversational Intelligence](voicelab.ai) | 30+ GTX and RTX cards | 10+ +| ![](https://res-4.cloudinary.com/crunchbase-production/image/upload/c_lpad,h_20,w_20,f_auto,q_auto:eco/v1444894092/jeuh0l6opc159e1ltzky.png) [VoiceLab - Conversational Intelligence](https://www.voicelab.ai) | 30+ GTX and RTX cards | 10+ Application examples and benchmarks -------- From d312674653f2ba92530a0b85be31b373d86862d4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 19 Jul 2019 11:38:30 +0200 Subject: [PATCH 29/37] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 8646a156..26ba2510 100644 --- a/README.md +++ b/README.md @@ -243,7 +243,7 @@ Project created and maintained by: - 
![](https://avatars2.githubusercontent.com/u/12485656?s=22&v=4) [Michał Martyniak (@micmarty)](http://martyniak.me) - Filip Schodowski [(@filschod)](https://github.com/filschod) - Top contributions: + Recent contributions: - Tomasz Menet [(@tomenet)](https://github.com/tomenet) - Dariusz Piotrowski [(@PiotrowskiD)](https://github.com/PiotrowskiD) - Karol Draszawka [(@szarakawka)](https://github.com/szarakawka) From cc72f318487037a003622b650314c026d5452217 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Mon, 22 Jul 2019 15:22:06 +0200 Subject: [PATCH 30/37] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 26ba2510..8ded6404 100644 --- a/README.md +++ b/README.md @@ -77,7 +77,7 @@ conda install tensorhive ```shell git clone https://github.com/roscisz/TensorHive.git && cd TensorHive git checkout fixes/voicelab -pip install . +pip install -e . ``` TensorHive is already shipped with newest web app build, but in case you modify the source, you can can build it with `make app` (currently on `master` branch). For more useful commands see our [Makefile](https://github.com/roscisz/TensorHive/blob/master/tensorhive/Makefile). 
From de2485baad83fd398378ba59d9d466ce4d61f353 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Pawe=C5=82=20Ro=C5=9Bciszewski?= Date: Thu, 25 Jul 2019 16:18:42 +0200 Subject: [PATCH 31/37] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 8ded6404..fed1f539 100644 --- a/README.md +++ b/README.md @@ -47,7 +47,7 @@ What TensorHive has to offer :three: Gather all useful GPU metrics, from all configured hosts **in one dashboard** -     :arrow_right: no more manually logging in to each individual machine in order to check if GPU is currently taken or not +     :arrow_right: no more manually logging in to each individual machine in order to check if GPU is currently in use or not For more details, check out the [full list of features](#features) From ca435514d555c5540b347b2845cc14a3641732f5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 26 Jul 2019 10:30:09 +0200 Subject: [PATCH 32/37] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Paweł Rościszewski --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index fed1f539..b62e3694 100644 --- a/README.md +++ b/README.md @@ -217,7 +217,7 @@ Along with TensorHive, we are developing a set of [**sample deep neural network TensorHive architecture (simplified) ----------------------- -This diagram will help you grasp the rough concept of the system. +This diagram will help you to grasp the rough concept of the system. 
![TensorHive_diagram _final](https://user-images.githubusercontent.com/12485656/59147556-7853cd80-89fd-11e9-80bc-5848e95c7574.png) From 2accff59a38f3826dd810c40e61ba35a8db215de Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 26 Jul 2019 10:30:55 +0200 Subject: [PATCH 33/37] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Co-Authored-By: Paweł Rościszewski --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b62e3694..2c5b9a96 100644 --- a/README.md +++ b/README.md @@ -176,7 +176,7 @@ Features - [x] Allow making reservations for selected GPUs - [x] Edit reservations - [x] Cancel reservations - - [ ] Attach jobs to reservation + - [x] Attach jobs to reservation - [x] :baby_symbol: Task nursery - [x] Create parametrized tasks and assign to hosts, automatically set `CUDA_VISIBLE_DEVICES` - [x] Buttons for task spawning/scheduling/termination/killing actions From 89248a8bed34340a818dd9bbe85c8489a737b54e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Micha=C5=82=20Martyniak?= Date: Fri, 26 Jul 2019 10:36:19 +0200 Subject: [PATCH 34/37] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 2c5b9a96..5d8ef05b 100644 --- a/README.md +++ b/README.md @@ -24,13 +24,13 @@ Our goal is to provide solutions for painful problems that ML engineers often ha - :angry: There are more users than resources, so they have to compete for it, but you don't know how to deal with that chaos - :ocean: Other popular tools are simply an overkill, have different purpose or require a lot of time to spend on reading documentation, installation and configuration (Graphana, Kubernetes, Slurm) - :penguin: People using your infrastructure expect only one interface for all the things related to training models (besides terminal): monitoring, reservation calendar and scheduling distributed jobs 
-- :collision: Can't risk messing up sensitive configuration by installing software on each individual machine, prefering centralized solution which can be managed from one place. +- :collision: Can't risk messing up sensitive configuration by installing software on each individual machine, prefering centralized solution which can be managed from one place 2. You're a **standalone user** who has access to beefy GPUs scatterd across multiple machines. - :part_alternation_mark: You want to be able to determine if batch size is too small or if there's a bottleneck when moving data from memory to GPU - charts with metrics such as `gpu_util`, `mem_util`, `mem_used` are great for this purpose - :date: Visualizing names of training experiments using calendar helps you track how you're progressing on the project - :snake: Launching distributed trainings is essential for you, no matter what the framework is -- :dizzy_face: Managing a list of training commands for all your distributed training experiments drives you nuts (because you didn't wrap them into bash scripts) +- :dizzy_face: Managing a list of training commands for all your distributed training experiments drives you nuts - :zzz: Remembering to manually launch the training before going sleep is no fun anymore What TensorHive has to offer From c30f8b1b5ba1fcd7d19b40ad7f591ef3837bcedb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Pawe=C5=82=20Ro=C5=9Bciszewski?= Date: Fri, 26 Jul 2019 10:47:06 +0200 Subject: [PATCH 35/37] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 5d8ef05b..3f949374 100644 --- a/README.md +++ b/README.md @@ -31,7 +31,7 @@ Our goal is to provide solutions for painful problems that ML engineers often ha - :date: Visualizing names of training experiments using calendar helps you track how you're progressing on the project - :snake: Launching distributed trainings is essential for you, no matter what the framework is - 
:dizzy_face: Managing a list of training commands for all your distributed training experiments drives you nuts -- :zzz: Remembering to manually launch the training before going sleep is no fun anymore +- :zzz: Remembering to manually launch the training before going sleep is no fun anymore. What TensorHive has to offer ----------------------------- From 2b7e59d195976beeff26d6e5f3c1aa1619d58e7e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Pawe=C5=82=20Ro=C5=9Bciszewski?= Date: Fri, 26 Jul 2019 10:55:43 +0200 Subject: [PATCH 36/37] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 3f949374..c702f64d 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,7 @@ It solves the most common problems and nightmares about accessing and sharing yo It's designed with __simplicty, flexibility and configuration-friendliness__ in mind. -About project +Use cases ---------------------- Our goal is to provide solutions for painful problems that ML engineers often have to struggle with when working with remote machines in order to run neural network trainings. From 2315921fa97231d363c99e4f4b4b0316910597aa Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Pawe=C5=82=20Ro=C5=9Bciszewski?= Date: Fri, 26 Jul 2019 10:55:51 +0200 Subject: [PATCH 37/37] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c702f64d..2ba66342 100644 --- a/README.md +++ b/README.md @@ -198,7 +198,7 @@ Features - [x] User authentication via JWT -TensorHive is currently being used on production in the following environments: +TensorHive is currently being used in production in the following environments: ----- | Organization | Hardware | No. users |