v1.0.0 release
add GPU table page.
add timestamp format.

update readme for v1.0.0.

fix minor bug.
a-maumau authored Mar 24, 2019
2 parents aef2065 + a6e01d4 commit 7b37002
Showing 18 changed files with 543 additions and 258 deletions.
90 changes: 59 additions & 31 deletions README.md
@@ -6,9 +6,15 @@ This script is depending on `Python3`, and `nvidia-smi`, `awk`, `ps` commands.
pip install -r requirements.txt
```
If a package is missing, please install it yourself using pip.
also you need setup vesta/settings.py for your environment.
You might also need to set up some configuration for your own environment.

# Configuration
Both `gpu_status_server.py` and `gpu_info_sender.py` accept many command-line options, and you can also use a .yaml file to overwrite the arguments.
Use `-h` to see all arguments, and use the `--local_settings_yaml_path` option to point to the .yaml file that overwrites them.
An example is in `examples/local_settings.yaml`.
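
The override step can be sketched as follows. This is a minimal illustration, not the project's exact code: the `apply_overrides` helper is a hypothetical name, only two of the real arguments are declared, and a plain dict stands in for the contents of the .yaml file. It assumes that keys missing from the parsed arguments are simply ignored.

```python
import argparse


def apply_overrides(settings: argparse.Namespace, overrides: dict) -> argparse.Namespace:
    """Copy values from `overrides` onto `settings`, ignoring unknown keys."""
    for key, value in overrides.items():
        # Only keys that already exist as parsed arguments are overwritten.
        if key in settings:
            setattr(settings, key, value)
    return settings


parser = argparse.ArgumentParser()
parser.add_argument("--port_num", dest="PORT_NUM", type=int, default=8080)
parser.add_argument("--sort_by", dest="SORT_BY", type=str, default="ip", choices=["ip", "name"])
settings = parser.parse_args([])  # no command-line arguments: defaults only

# Stand-in for the dict that would be loaded from the .yaml file.
overrides = {"PORT_NUM": 9090, "UNKNOWN_KEY": "ignored"}
settings = apply_overrides(settings, overrides)
print(settings.PORT_NUM, settings.SORT_BY)
```

Because the keys are matched against the `dest` names, every key in the .yaml file has to be written in capital letters.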

# Usage
You can use simple wrapper,
for Server
```
python gpu_status_server.py
@@ -19,17 +25,27 @@ for Nodes
python gpu_info_sender.py
```

For automatical process, using systemd and crontab will do the works.
For automation, using systemd and crontab will do the work.

## from Terminal
To get GPU information from a terminal, use curl to access `http://<server_address>/?term=true`.
The output looks like this:
```
$ curl "http://0.0.0.0:8080/?term=true"
+------------------------------------------------------------------------------+
| vesta ver. 1.0.0 gpu info. |
+------------------+------------------------+-----------------+--------+-------+
| host | gpu | memory usage | volat. | temp. |
+------------------+------------------------+-----------------+--------+-------+
|host1 | 0:GeForce GTX 1080 Ti | 235 / 11169 | 0 %| 36 °C|
|mau_local | 0:GeForce GTX 1080 Ti | 8018 / 11169 | 0 %| 80 °C|
| | 1:GeForce GTX 1080 Ti | 2 / 11172 | 0 %| 38 °C|
+------------------+------------------------+-----------------+--------+-------+
|mau_local_11a7c5eb| 0:GeForce GTX 1080 Ti | 8018 / 11169 | 92 %| 80 °C|
| | 1:GeForce GTX 1080 Ti | 2 / 11172 | 0 %| 38 °C|
+------------------+------------------------+-----------------+--------+-------+
|mau_local_ac993634| 0:GeForce GTX 1080 Ti | 8018 / 11169 | 92 %| 80 °C|
| | 1:GeForce GTX 1080 Ti | 2 / 11172 | 0 %| 38 °C|
| | 1:GeForce GTX 1080 Ti | 2 / 11172 | 0 %| 38 °C|
| | 1:GeForce GTX 1080 Ti | 2 / 11172 | 0 %| 38 °C|
+------------------+------------------------+-----------------+--------+-------+
```
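
The same request can be made from Python. The sketch below uses only the standard library; `build_term_url` and `fetch_term_view` are hypothetical helper names, and it assumes the server answers with UTF-8 text. The `term` and `detail` query parameters are the ones this README describes.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_term_url(server_address: str, detail: bool = False) -> str:
    """Build the terminal-view URL, optionally asking for the detailed view."""
    params = {"term": "true"}
    if detail:
        params["detail"] = "true"
    return f"http://{server_address}/?{urlencode(params)}"


def fetch_term_view(server_address: str, detail: bool = False, timeout: float = 5.0) -> str:
    """Fetch the plain-text GPU table from a running server."""
    with urlopen(build_term_url(server_address, detail), timeout=timeout) as response:
        # Assumption: the response body is UTF-8 encoded text.
        return response.read().decode("utf-8")


print(build_term_url("0.0.0.0:8080"))
print(build_term_url("0.0.0.0:8080", detail=True))
```

Calling `fetch_term_view("0.0.0.0:8080")` against a running server should return the same text that the curl command prints.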
@@ -38,25 +54,25 @@ If you want to see detail information you can use `detail` option like `http://<
You will get like
```
$ curl "http://0.0.0.0:8080/?term=true&detail=true"
vesta ver. 1.0.0
### host1 :: 127.0.0.1 #########################################################
last update: 2018/12/03 23:16:59
#### mau_local_19e5d26c :: 127.0.0.1 ###########################################
last update: 24/03/2019 20:27:10
--------------------------------------------------------------------------------
┌[ gpu:0 GeForce GTX 1080 Ti 2018/12/01 14:32:37.140 ]─────────────────────┐
┌[ gpu:0 GeForce GTX 1080 Ti 2019/03/24 20:00:00.000 ]─────────────────────┐
│ memory used memory available gpu volatile temperature │
235 / 11169MiB 10934MiB 0% 36°C │
8018 / 11169MiB 3151MiB 92% 80°C │
│ │
│ mem [/ ] 2% │
│ ├── /usr/bin/X 148MiB
│ └── compiz 84MiB
│ mem [/////////////////////////////////////////// ] 71% │
│ ├── train1 6400MiB root
│ └── train2 1618MiB user1
└──────────────────────────────────────────────────────────────────────────┘
┌[ gpu:1 GeForce GTX 1080 Ti 2018/12/01 14:32:37.141 ]─────────────────────┐
┌[ gpu:1 GeForce GTX 1080 Ti 2019/03/24 20:00:00.000 ]─────────────────────┐
│ memory used memory available gpu volatile temperature │
│ 2 / 11172MiB 11170MiB 0% 38°C │
│ │
│ mem [ ] 0% │
│ └── /usr/bin/X 148MiB │
└──────────────────────────────────────────────────────────────────────────┘
________________________________________________________________________________
@@ -72,7 +88,7 @@ Just access `http://<server_address>/`
You will get like
![sample web browser image](imgs/browser_sample_resized.png "sample")

# Response
# API Response
Users can get the GPU information by accessing `http://<server_address>/states/`.
The JSON response looks like
```
@@ -85,28 +101,29 @@ Json response is like
{ # each GPU will be denoted by "gpu:<device_num>"
'gpu_data':{
'gpu:0':{'available_memory': '10934',
'device_num': '0',
'device_num': 0,
'gpu_name': 'GeForce GTX 1080 Ti',
'gpu_volatile': '0',
'gpu_volatile': 92,
'processes': [
{
'name': '/usr/bin/X',
'pid': '1963',
'used_memory': '148',
{
'name': 'train1',
'pid': "31415",
'used_memory': 6400,
'user': 'root'
},
{
'name': 'compiz',
'pid': '3437',
'used_memory': '84',
},
{
'name': 'train2',
'pid': "27182",
'used_memory': 1618,
'user': 'user1'
}
}
],
'temperature': '36',
'timestamp': '2018/11/30 23:29:47.115',
'total_memory': '11169',
'used_memory': '235',
'uuid': 'GPU-...'},
'temperature': 80,
'timestamp': '2019/03/24 20:00:00.000',
'total_memory': 11169,
'used_memory': 8018,
'uuid': 'GPU-...'
},
'gpu:1':{
'available_memory': '11170',
'device_num': '1',
@@ -128,6 +145,17 @@ Json response is like
}
```
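
A client might summarize one host entry of this response as sketched below. The `summarize_gpus` helper and the `sample_host` dict are hypothetical, and the sketch assumes each host entry carries a `gpu_data` mapping as in the excerpt above. Numeric fields appear both as strings and as integers in the samples, so every value is passed through `int()` first. The percentage is rounded down, which is an assumption of this sketch rather than a documented rule.

```python
def summarize_gpus(host_entry: dict) -> list:
    """Return one summary row per GPU found in a host entry."""
    rows = []
    for gpu_key, gpu in sorted(host_entry["gpu_data"].items()):
        # Values may arrive as '8018' or 8018, so normalize them.
        used = int(gpu["used_memory"])
        total = int(gpu["total_memory"])
        rows.append({
            "gpu": gpu_key,
            "name": gpu["gpu_name"],
            "used_mib": used,
            "total_mib": total,
            "used_percent": 100 * used // total,  # rounded down
            "process_count": len(gpu.get("processes", [])),
        })
    return rows


# Hypothetical host entry shaped like the README excerpt.
sample_host = {
    "gpu_data": {
        "gpu:0": {
            "gpu_name": "GeForce GTX 1080 Ti",
            "used_memory": 8018,
            "total_memory": 11169,
            "processes": [
                {"name": "train1", "pid": "31415", "used_memory": 6400, "user": "root"},
                {"name": "train2", "pid": "27182", "used_memory": 1618, "user": "user1"},
            ],
        },
        "gpu:1": {
            "gpu_name": "GeForce GTX 1080 Ti",
            "used_memory": "2",
            "total_memory": "11172",
        },
    }
}

for row in summarize_gpus(sample_host):
    print(row["gpu"], row["used_percent"], row["process_count"])
```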

# Slack Notification
If you configure a Slack webhook and bot, you can receive notifications via Slack.
## up and down
![sample notification image](imgs/noti_up_down_sample_resized.png "notificate_up_and_down")

## interact with bot
![sample interact image](imgs/bot_interact_sample_resized.png "bot_interact")

To specify the Slack settings, pass `--slack_webhook`, `--slack_bot_token`, and `--slack_bot_post_channel` to `gpu_status_server.py`.
Alternatively, you can use a .yaml file; see `examples/local_settings.yaml`.
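
As a rough sketch of what an up/down notification involves, the snippet below fills a message template with a host name and wraps it in the JSON body that Slack incoming webhooks accept. The `build_slack_request` helper and the webhook URL are placeholders, and this is not necessarily how vesta sends its messages. The template is the default value of `--register_uplink_msg`.

```python
import json
from urllib.request import Request

# Default template of --register_uplink_msg; {} is filled with the host name.
REGISTER_UPLINK_MSG = "⬆︎⬆︎⬆︎ `Uplink` Detected - New uplink from `{}`. Hello!"


def build_slack_request(webhook_url: str, template: str, host_name: str) -> Request:
    """Prepare (but do not send) a POST request for a Slack incoming webhook."""
    payload = {"text": template.format(host_name)}
    return Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: replace it with your own webhook before sending.
request = build_slack_request(
    "https://hooks.slack.com/services/T000/B000/XXXX", REGISTER_UPLINK_MSG, "host1"
)
print(request.get_method(), json.loads(request.data)["text"])
```

Passing the prepared request to `urllib.request.urlopen` would actually post the message.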

# Topology
The topology is a simple master (server) and slave (each local machine) arrangement, but it is ad hoc.
The server only waits for the slaves to post their GPU information.
@@ -161,4 +189,4 @@ Table field is
| timestamp_n | data_n |

`timestamp` is based on server time zone and the style is "YYYYMMDDhhmmss".
`data` is a Python dict object while it is serialized and compressed by Python bz2.
`data` is a Python dict object that is serialized with Python's pickle and then compressed with bz2.
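
A minimal sketch of how such a row could be produced and read back, assuming the dict is pickled first and the resulting bytes are then compressed with bz2, as the sentence above suggests. `make_row` and `read_row` are hypothetical helpers, and the database write itself is left out.

```python
import bz2
import pickle
from datetime import datetime


def make_row(data: dict, now: datetime) -> tuple:
    """Return a (timestamp, blob) pair in the format described above."""
    timestamp = now.strftime("%Y%m%d%H%M%S")  # "YYYYMMDDhhmmss"
    blob = bz2.compress(pickle.dumps(data))
    return timestamp, blob


def read_row(blob: bytes) -> dict:
    """Reverse the steps: decompress, then unpickle."""
    return pickle.loads(bz2.decompress(blob))


timestamp, blob = make_row({"gpu:0": {"used_memory": 8018}}, datetime(2019, 3, 24, 20, 0, 0))
print(timestamp)
print(read_row(blob))
```

Note that `pickle.loads` can execute arbitrary code, so blobs should only be read from a database you trust.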
23 changes: 8 additions & 15 deletions examples/local_settings.yaml
@@ -1,15 +1,4 @@
# in sec.
WS_RECEIVE_TIMEOUT: 1

# in sec.
SLACK_BOT_SLEEP_TIME: 1

# at least interval time for saving data (in sec.)
# if you want to save all data, set this to 0
SAVE_INTERVAL: 60

# sort type, "ip" or "name"
SORT_BY: "ip"
# every key must be in capital letters

# this ip address is the server address for the client which send the gpu information
IP: "127.0.0.1"
@@ -23,8 +12,12 @@ TOKEN: '0000'
# how many information to read in each page
PAGE_PER_HOST_NUM: 8

PAGE_TITLE: "AWSOME GPUs"
PAGE_DESCRIPTION: "awsome description"
MAIN_PAGE_TITLE: "AWESOME GPUs"
MAIN_PAGE_DESCRIPTION: "awesome description"
TABLE_PAGE_TITLE: "AWESOME Table"
TABLE_PAGE_DESCRIPTION: "awesome description"

TIMESTAMP_FORMAT: "DMY"

# you can send notifications to slack
SLACK_WEBHOOK: "https://hooks.slack.com/services/<your web hook>"
@@ -37,7 +30,7 @@ SLACK_BOT_POST_CHANNEL: "your channel"
# it will be used in re.search, so you can use regular expression
VALID_NETWORK: "127.0.0.1"

# you can use python schedule module to schedule the announcement of somethin
# you can use python schedule module to schedule the announcement of something
# in this case, it will call the function self.send_hosts_statuses every day at 00:00
# must be an array
SCHEDULE_FUNCTION:
2 changes: 1 addition & 1 deletion gpu_info_sender.py
@@ -25,7 +25,7 @@
if settings.local_settings_yaml_path is not None:
    try:
        with open(settings.local_settings_yaml_path, "r") as yaml_file:
            yaml_data = yaml.load(yaml_file)
            # the Loader argument must be a loader class, not the safe_load function
            yaml_data = yaml.load(yaml_file, Loader=yaml.SafeLoader)
    except Exception as e:
        print(e)
        yaml_data = []
57 changes: 27 additions & 30 deletions gpu_status_server.py
@@ -13,17 +13,30 @@
help='yaml file path which overwrite the contents args.')

# args for server
parser.add_argument('--db_name', dest='DB_NAME', type=str, default="gpu_states.db", help='database name.')
parser.add_argument('--db_dir', dest='DB_DIR', type=str, default="data", help='dir of database.')
parser.add_argument('--ip', dest='IP', type=str, default="127.0.0.1",
help='this ip address is the server address for the client which send the gpu information.\nit is mainly for a machine which is sending the data to server.')
parser.add_argument('--port_num', dest='PORT_NUM', type=int, default=8080, help="server's open port.")
parser.add_argument('--token', dest='TOKEN', type=str, default="0000",
help="url parameter token for posting data.\nwhatever you want, actually it's doing nothing now. it is only for preventing accidental posting.")
parser.add_argument('--server_name', dest='SERVER_NAME', type=str, default="gpu_monitor", help='')
parser.add_argument('--bind_host', dest='BIND_HOST', type=str, default="0.0.0.0",
help='bind host IP address.\nthis should be 0.0.0.0.\nif you want to filter IP addresses, use `--valid_network`.')

# ssl settings certfile, keyfile=None, password
parser.add_argument('--ssl_cert', dest='SSL_CERT', type=str, default=None, help='path of ssl certificate file.')
parser.add_argument('--ssl_key', dest='SSL_KEY', type=str, default=None, help='path for ssl key file.')
parser.add_argument('--db_name', dest='DB_NAME', type=str, default="gpu_states.db", help='database name.')
parser.add_argument('--db_dir', dest='DB_DIR', type=str, default="data", help='dir of database.')
parser.add_argument('--timestamp_format', dest='TIMESTAMP_FORMAT', type=str, default="MDY", choices=['YMD', 'MDY', 'DMY'],
help='timestamp format. default is `MM/DD/YYYY`. choose from `YMD`, `MDY` or `DMY`.')

parser.add_argument('--page_per_host_num', dest='PAGE_PER_HOST_NUM', type=int, default=8,
help='how many information to read in each page.\nit is controlling the view of html page.')
parser.add_argument('--main_page_title', dest='MAIN_PAGE_TITLE', type=str, default="GPU info", help='page title of main page.')
parser.add_argument('--main_page_description', dest='MAIN_PAGE_DESCRIPTION', type=str, default="", help='page description of main page.')
parser.add_argument('--table_page_title', dest='TABLE_PAGE_TITLE', type=str, default="GPU Table", help='page title of gpu table page.')
parser.add_argument('--table_page_description', dest='TABLE_PAGE_DESCRIPTION', type=str, default="", help='page description of gpu table page.')

parser.add_argument('--term_width', dest='TERM_WIDTH', type=int, default=80, help='width of terminal printing.')
parser.add_argument('--sort_by', dest='SORT_BY', type=str, default="ip", choices=['ip', 'name'],
help='sort type of machine arrangement. choice from `ip` or `name`.')

# args for watching part
parser.add_argument('--server_sleep_time', dest='SERVER_SLEEP_TIME', type=int, default=5, help='server sleeping time in sec.')
@@ -34,20 +47,6 @@
parser.add_argument('--slack_bot_sleep_time', dest='SLACK_BOT_SLEEP_TIME', type=int, default=1, help="slack bot's waiting time (response time) in sec.")
parser.add_argument('--save_interval', dest='SAVE_INTERVAL', type=int, default=60,
help='at least interval time for saving data in sec.\nthis is for controlling the data which is will save in database. if you want to save all data, set this to 0.')

parser.add_argument('--sort_by', dest='SORT_BY', type=str, default="ip", choices=['ip', 'name'],
help='sort type of machine arrangement. choice from `ip` or `name`.')

parser.add_argument('--ip', dest='IP', type=str, default="127.0.0.1",
help='this ip address is the server address for the client which send the gpu information.\nit is mainly for a machine which is sending the data to server.')
parser.add_argument('--port_num', dest='PORT_NUM', type=int, default=8080, help="server's open port.")
parser.add_argument('--token', dest='TOKEN', type=str, default="0000",
help="url parameter token for posting data.\nwhatever you want, actually it's doing nothing now. it is only for preventing accidental posting.")

parser.add_argument('--page_per_host_num', dest='PAGE_PER_HOST_NUM', type=int, default=8,
help='how many information to read in each page.\nit is controlling the view of html page.')
parser.add_argument('--page_title', dest='PAGE_TITLE', type=str, default="GPUs", help='page title of html page.')
parser.add_argument('--page_description', dest='PAGE_DESCRIPTION', type=str, default="", help='page description of html page.')

parser.add_argument('--slack_webhook', dest='SLACK_WEBHOOK', type=str, default="",
help='for slack notification. set a webhook url.\nit will send a up/down notification to this webhook.')
@@ -62,7 +61,7 @@
parser.add_argument('--shedule_function', dest='SCHEDULE_FUNCTION', type=str, nargs='*', default=[],
help="if you want to send a scheduled status report, use this function.\nyou can use python schedule module to schedule the announcement of something like `'schedule.every().day.at('00:00').do(self.send_hosts_statuses, 'SCHEDULED_STATUS_REPORT')'`. this will go through `exec()`, so be careful.")

# notification message ########################################
# notification message
parser.add_argument('--register_uplink_msg', dest='REGISTER_UPLINK_MSG', type=str, default="⬆︎⬆︎⬆︎ `Uplink` Detected - New uplink from `{}`. Hello!",
help='notification message of new host came.\nif you use {} it will be filled with `host name`')
parser.add_argument('--re_uplink_msg', dest='RE_UPLINK_MSG', type=str, default="⬆︎⬆︎⬆︎ ` Up ` Detected - Uplink from `{}`. Welcome back!",
@@ -85,17 +84,18 @@

parser.add_argument('-quiet', dest='QUIET', action="store_true", default=False, help='show only critical error message.')

settings = parser.parse_args()
# ssl settings certfile, keyfile=None, password
"""
parser.add_argument('--ssl_cert', dest='SSL_CERT', type=str, default=None, help='path of ssl certificate file.')
parser.add_argument('--ssl_key', dest='SSL_KEY', type=str, default=None, help='path for ssl key file.')
"""

ssl_context = None
if settings.SSL_CERT is not None:
ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLSv1_2)
ssl_context.load_cert_chain(settings.SSL_CERT, settings.SSL_KEY)
settings = parser.parse_args()

if settings.local_settings_yaml_path is not None:
    try:
        with open(settings.local_settings_yaml_path, "r") as yaml_file:
            yaml_data = yaml.load(yaml_file)
            yaml_data = yaml.load(yaml_file, yaml.FullLoader)
    except Exception as e:
        print(e)
        yaml_data = []
@@ -104,8 +104,5 @@
if arg_key in settings:
setattr(settings, arg_key, yaml_data[arg_key])

print(settings.SCHEDULE_FUNCTION)

server = HTTPServer(settings)
server.start(ssl_context=ssl_context)
server.watch_and_sleep()
server.start()
Binary file added imgs/bot_interact_sample_resized.png
Binary file modified imgs/browser_sample_resized.png
Binary file added imgs/noti_up_down_sample_resized.png
2 changes: 1 addition & 1 deletion vesta/__version__.py
@@ -1,4 +1,4 @@
__title__ = 'vesta'
__description__ = 'simple gpu monitoring script'
__url__ = 'https://github.com/a-maumau/vesta'
__version__ = '0.5.3'
__version__ = '1.0.0'