ML-Logger, A Simple and Scalable Logging Utility With a Beautiful Visualization Dashboard That Is Super Fast
============================================================================================================
|Downloads|
ML-Logger makes it easy to:
- save data locally and remotely, as **binary**, in a transparent
  ``pickle`` file, with the same API and zero configuration.
- write from 500+ worker containers to a single instrumentation server
- visualize ``matplotlib.pyplot`` figures from a remote server locally
  with ``logger.savefig('my_figure.png?raw=true')``
And ml-logger does all of this with *minimal configuration*: you can
use the same logging code both locally and remotely with no code change.
ML-Logger is highly performant: remote writes are asynchronous, so
logging doesn't slow down your training even with 100+ metric keys.
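The non-blocking behavior can be sketched with a toy future-based
writer. ``AsyncWriter`` below is a hypothetical illustration of the
pattern, not ml-logger's actual client: each write is handed to a
worker thread, so the training loop never waits on I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class AsyncWriter:
    """Toy sketch: hand each write to a background thread so the
    caller returns immediately instead of blocking on the send."""

    def __init__(self):
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.futures = []

    def write(self, payload):
        # returns immediately; the slow send happens in the background
        self.futures.append(self.pool.submit(self._send, payload))

    def _send(self, payload):
        time.sleep(0.01)  # stand-in for a network round-trip
        return payload

    def flush(self):
        # block until every pending write has landed
        return [f.result() for f in self.futures]


writer = AsyncWriter()
start = time.perf_counter()
for step in range(10):
    writer.write({"step": step, "loss": 0.1})
elapsed = time.perf_counter() - start  # far below 10 blocking round-trips
results = writer.flush()
```

With blocking sends the loop would take at least 0.1 s; here `elapsed`
stays near zero because only `flush()` waits.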
Why did we build this, you might ask? Because we want to make it easy
for people in ML to use the same logging code in all of their projects,
so that it is easy to get started with someone else's baseline.
Getting Started!
----------------
To **install** ``ml_logger``, do:
.. code-block:: bash
pip install ml-logger
..
The landscape of Python modules is a lot messier than that of
JavaScript. The most up-to-date ``graphene`` requires the following
versions:
::
pip install sanic==20.9.0
pip install sanic-cors==0.10.0.post3
pip install sanic-graphql==1.1.0
yes | pip install graphene==2.1.3
yes | pip install graphql-core==2.1
yes | pip install graphql-relay==0.4.5
yes | pip install graphql-server-core==1.1.1
Add the following to your ``~/.zshrc``
.. code-block:: bash
export ML_LOGGER_ROOT=<your-logging-server-address>
export ML_LOGGER_USER=<your-handle> # make sure it is the same on the cluster.
Now you can rock!
.. code-block:: python
from ml_logger import logger
print(logger)
# ~> logging data to /tmp/ml-logger-debug
Log key/value pairs, and metrics:
.. code-block:: python
for i in range(1):
logger.log(metrics={'some_val/smooth': 10, 'status': f"step ({i})"}, reward=20, timestep=i)
### flush the data, otherwise the value would be overwritten with new values in the next iteration.
logger.flush()
# outputs ~>
# ╒════════════════════╤════════════════════════════╕
# │ reward │ 20 │
# ├────────────────────┼────────────────────────────┤
...
Simple Logging Example
----------------------
In your project files, do:
.. code-block:: python
from params_proto import ParamsProto, Proto, Flag
class Args(ParamsProto):
""" CLI Args for the program
Try:
python3 example.py --help
And it should print out the help strings
"""
seed = Proto(1, help="random seed")
D_lr = 5e-4
G_lr = 1e-4
Q_lr = 1e-4
T_lr = 1e-4
plot_interval = 10
verbose = Flag("the verbose flag")
if __name__ == '__main__':
import scipy.misc  # `import scipy` alone does not expose `scipy.misc`
import numpy as np
from ml_logger import logger, LOGGER_USER
# Put in your ~/.bashrc
# export ML_LOGGER_ROOT = "http://<your-logging-server>:8081"
# export ML_LOGGER_USER = $USER
logger.configure(prefix=f"{LOGGER_USER}/scratch/your-experiment-prefix!")
logger.log_params(Args=vars(Args))
logger.upload_file(__file__)
for epoch in range(10):
logger.log(step=epoch, D_loss=0.2, G_loss=0.1, mutual_information=0.01)
logger.log_key_value(epoch, 'some string key', 0.0012)
# when the step index updates, logger flushes all of the key-value pairs to file system/logging server
logger.flush()
# Images
face = scipy.misc.face()
face_bw = scipy.misc.face(gray=True)
logger.save_image(face, "figures/face_rgb.png?raw=true")
logger.save_image(face_bw, "figures/face_bw.png?raw=true")
image_bw = np.zeros((64, 64, 1))
image_bw_2 = scipy.misc.face(gray=True)[::4, ::4]
logger.save_video([face] * 5, "videos/face.mp4")
And it should print out
.. code-block:: bash
✓ created a new logging client
Dashboard: http://app.dash.ml/user-name/your-experiment-prefix!
Log_directory: http://<your-server-ip>:8080
══════════════════════════════════════════
Args
────────────────────┬─────────────────────
seed │ 1
D_lr │ 0.0005
G_lr │ 0.0001
Q_lr │ 0.0001
T_lr │ 0.0001
plot_interval │ 10
════════════════════╧═════════════════════
╒════════════════════╤════════════════════╕
│ step │ 9 │
├────────────────────┼────────────────────┤
│ D loss │ 0.200 │
├────────────────────┼────────────────────┤
│ G loss │ 0.100 │
├────────────────────┼────────────────────┤
│ mutual information │ 0.010 │
╘════════════════════╧════════════════════╛
Process finished with exit code 0
To View the Results
~~~~~~~~~~~~~~~~~~~
|ML-Logger Dashboard|
Note, the ``jwt`` based access control has not been implemented yet. So
you should leave the token field empty.
Logging to a Server
~~~~~~~~~~~~~~~~~~~
**Skip this if you just want to log locally.** When training in
parallel, you want to kick-start a logging server (Instrument Server).
To do so, run:
.. code-block:: bash
python -m ml_logger.server
Use an ssh tunnel if you are running on a managed cluster (with SLURM,
for instance). **Important**: to allow remote logging, you need to pass
in ``--host=0.0.0.0`` so that the server accepts non-localhost
connections.
.. code-block:: bash
python -m ml_logger.server --host=0.0.0.0
Asynchronously log the summary of lots of training metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A common scenario is that you only want to upload averaged statistics
of your metrics. A pattern that @jachiam uses is the following:
``store_metrics()``, ``peek_stored_metrics()``, and
``log_metrics_summary()``
.. code-block:: python
# You log lots of metrics during training.
for i in range(100):
logger.store_metrics(metrics={'some_val/smooth': 10}, some=20, timestep=i)
# you can peek at what's inside the cache and print out a table like this:
logger.peek_stored_metrics(len=4)
# outputs ~>
# some | timestep |some_val/smooth
# ━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━
# 20 | 0 | 10
# 20 | 1 | 10
# 20 | 2 | 10
# 20 | 3 | 10
# The metrics are stored in-memory. Now we need to actually log the summaries:
logger.log_metrics_summary(silent=True)
# outputs ~> . (data is now logged to the server)
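Conceptually, the store-then-summarize pattern amounts to accumulating
values per key in memory and then emitting a single aggregated row. A
minimal sketch of the idea (``MetricCache`` is a made-up name for
illustration, not ml-logger's implementation):

```python
from collections import defaultdict
from statistics import mean


class MetricCache:
    """Toy sketch of store-then-summarize: accumulate values in
    memory, then report one averaged row and reset."""

    def __init__(self):
        self.store = defaultdict(list)

    def store_metrics(self, **key_values):
        # append each value under its key; nothing is sent yet
        for key, value in key_values.items():
            self.store[key].append(value)

    def summary(self):
        # collapse each key's list into a single mean, then clear
        out = {f"{k}/mean": mean(vs) for k, vs in self.store.items()}
        self.store.clear()
        return out


cache = MetricCache()
for i in range(100):
    cache.store_metrics(some=20, timestep=i)
summary = cache.summary()
```

One hundred stored rows collapse into a single `{'some/mean': 20,
'timestep/mean': 49.5}` payload, which is what keeps the upload cheap.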
Table of Contents
-----------------
- logging ``matplotlib.pyplot`` figures on a headless server
- [documentation under construction]
How to Develop
--------------
First clone the repo, install the dev dependencies, and install the
module in editable (development) mode.
.. code-block:: bash
git clone https://github.com/episodeyang/ml_logger.git
cd ml_logger && cd ml_logger && pip install -r requirements-dev.txt
pip install -e .
Testing local-mode (without a server)
-------------------------------------
You should be inside ml_logger/ml_logger folder
.. code-block:: bash
pwd # ~> ml_logger/ml_logger
make test
Testing with a server (You need to do both for a PR)
-----------------------------------------------------
To test with a live server, first run (in a separate console)
::
python -m ml_logger.server --data-dir /tmp/ml-logger-debug
or do:
.. code-block:: bash
make start-test-server
Then run this test script with the option:
.. code-block:: bash
python -m pytest tests --capture=no --data-dir http://0.0.0.0:8081
or do
.. code-block:: bash
make test-with-server
Your PR should have both of these tests passing. ToDo: add CI to
this repo.
To Publish
~~~~~~~~~~
You need ``twine``, ``rst-lint`` etc, which are included in the
``requirements-dev.txt`` file.
--------------
Logging Matplotlib pyplots
~~~~~~~~~~~~~~~~~~~~~~~~~~
Configuring The Experiment Folder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
from ml_logger import logger, ML_Logger
from datetime import datetime
now = datetime.now()
logger.configure("/tmp/ml-logger-demo", "deep_Q_learning", f"{now:%Y%m%d-%H%M%S}")
This is a singleton pattern similar to ``matplotlib.pyplot``. However,
you could also use the logger constructor
.. code-block:: python
logger = ML_Logger(root_dir="/tmp/ml-logger-demo", prefix=f"deep_Q_learning/{now:%Y%m%d-%H%M%S}")
Logging Text, and Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
from ml_logger import logger, Color, percent
logger.log({"some_var/smooth": 10}, some=Color(0.85, 'yellow', percent), step=3)
colored output: (where the values are yellow)
.. code-block:: text
╒════════════════════╤════════════════════╕
│ some var/smooth │ 10 │
├────────────────────┼────────────────────┤
│ some │ 85.0% │
╘════════════════════╧════════════════════╛
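The ``percent`` formatter above behaves roughly like a
value-to-string function applied at print time. A minimal sketch of
that behavior, where ``render`` is a hypothetical helper (the real
``Color`` wrapper also carries the color):

```python
# Illustrative sketch of a value + formatter pair, not ml-logger's
# actual Color implementation.
def percent(v):
    """Format a fraction as a percentage string, e.g. 0.85 -> '85.0%'."""
    return f"{v * 100:.1f}%"


def render(value, formatter=None):
    # apply the formatter when given, otherwise fall back to str()
    return formatter(value) if formatter else str(value)


print(render(0.85, percent))  # -> 85.0%
print(render(10))             # -> 10
```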
Logging Matplotlib Figures
~~~~~~~~~~~~~~~~~~~~~~~~~~
We have optimized ML-Logger so that it supports any format that
``pyplot`` supports. To save a figure locally or remotely:
.. code-block:: python
import matplotlib.pyplot as plt
import numpy as np
xs = np.linspace(-5, 5)
plt.plot(xs, np.cos(xs), label='Cosine Func')
logger.savefig('cosine_function.pdf')
Logging Videos
~~~~~~~~~~~~~~
It is especially hard to visualize RL training sessions on a remote
computer. With ML-Logger this is easy, and super fast. We optimized the
serialization and transport process, so that a large stack of video
tensors gets compressed by ``ffmpeg`` before getting sent over the
wire.
The compression rate (and speed boost) can be 2000:1.
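As a back-of-envelope check of that ratio (the encoded size below is
an assumed, illustrative number, not a measured one):

```python
# 1000 frames of 200x200 float64 pixels, sent raw vs. encoded.
frames, height, width, bytes_per_px = 1000, 200, 200, 8
raw_bytes = frames * height * width * bytes_per_px  # 320 MB uncompressed
encoded_bytes = 160_000  # assume a ~160 KB mp4 (illustrative figure)
ratio = raw_bytes / encoded_bytes
print(ratio)  # -> 2000.0
```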
.. code-block:: python
import numpy as np
def im(x, y):
canvas = np.zeros((200, 200))
for i in range(200):
for j in range(200):
if x - 5 < i < x + 5 and y - 5 < j < y + 5:
canvas[i, j] = 1
return canvas
frames = [im(100 + i, 80) for i in range(20)]
logger.log_video(frames, "test_video.mp4")
Saving PyTorch Modules
~~~~~~~~~~~~~~~~~~~~~~
PyTorch has a very nice module saving and loading API that has inspired
the one in ``Keras``. We make it easy to save this state dictionary
(``state_dict``) to a server, and load it. This way you can load from
100+ of your previous experiments, without having to download those
weights to your code repository.
.. code-block:: python
# save a module
logger.save_module(cnn, "FastCNN.pkl")
# load a module
logger.load_module(cnn, "FastCNN.pkl")
Saving Tensorflow Models
~~~~~~~~~~~~~~~~~~~~~~~~
The format tensorflow uses to save the models is opaque. I prefer to
save model weights in ``pickle`` as a dictionary. This way the weight
files are transparent. ML_Logger offers easy helper functions to save
and load from checkpoints saved in this format:
.. code-block:: python
## To save checkpoint
from ml_logger import logger
import tensorflow as tf
logger.configure(log_directory="/tmp/ml-logger-demos")
x = tf.get_variable('x', shape=[], initializer=tf.constant_initializer(0.0))
y = tf.get_variable('y', shape=[], initializer=tf.constant_initializer(10.0))
c = tf.Variable(1000)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
trainables = tf.trainable_variables()
logger.save_variables(trainables, path="variables.pkl", namespace="checkpoints")
which creates a file ``checkpoints/variables.pkl`` under
``/tmp/ml-logger-demos``.
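The appeal of this format is that the checkpoint is just a pickled
dictionary, so any Python process can read it back without TensorFlow
installed. A minimal sketch of the round-trip (using an in-memory
buffer in place of the ``variables.pkl`` file):

```python
import io
import pickle

# a "checkpoint" as a plain dict of named weights, mirroring the
# x/y/c variables in the TensorFlow example above
weights = {"x": 0.0, "y": 10.0, "c": 1000}

# serialize, as if writing checkpoints/variables.pkl
buf = io.BytesIO()
pickle.dump(weights, buf)

# deserialize: no framework needed, the file is transparent
buf.seek(0)
restored = pickle.load(buf)
print(restored)  # -> {'x': 0.0, 'y': 10.0, 'c': 1000}
```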
Visualization
-------------
An ideal visualization dashboard would be:

1. **Fast, instantaneous.** On an AWS headless server? View the plots
   as if they are on your local computer.
2. **Searchable, performantly.** So that you don't have to remember
   where last week's experiment lives.
3. **Answer Questions, from 100+ Experiments.** We make available
   Google's internal hyperparameter visualization tool, on your own
   computer.
Searching for Hyper Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Experiments are identified by the ``metrics.pkl`` file. You can log
multiple times to the same ``metrics.pkl`` file, and later parameter
values overwrite earlier ones with the same key. We enforce namespaces
in this file, so each key/value argument you pass into the
``logger.log_parameters`` function call has to be a dictionary.
.. code-block:: python
Args = dict(
learning_rate=10,
hidden_size=200
)
logger.log_parameters(Args=Args)
How to launch the Vis App
~~~~~~~~~~~~~~~~~~~~~~~~~
**This requires a node.js and yarn dev environment** at the moment. We
will streamline this process to remove these requirements soon.

0. Download this repository.
1. Go to the ``ml-vis-app`` folder.
2. Install the dev dependencies:

   1. install node: `Installation <https://nodejs.org/en/download/>`__
   2. install yarn:
      `Installation <https://yarnpkg.com/lang/en/docs/install/#mac-stable>`__

3. In that folder, install the dependencies of this visualization app
   with ``yarn install``.

**The IP address of the server is currently hard-coded**\ `here <https://github.com/episodeyang/ml_logger/blob/master/ml-vis-app/src/App.js#L11>`__\ **.**
To use this with your own instrumentation server, overwrite this line.
I'm planning on making this configuration more accessible.
Full Logging API
----------------
.. code-block:: python
from ml_logger import logger, Color, percent
logger.log_params(G=dict(some_config="hey"))
logger.log(some=Color(0.1, 'yellow'), step=0)
logger.log(some=Color(0.28571, 'yellow', lambda v: "{:.5f}%".format(v * 100)), step=1)
logger.log(some=Color(0.85, 'yellow', percent), step=2)
logger.log({"some_var/smooth": 10}, some=Color(0.85, 'yellow', percent), step=3)
logger.log(some=Color(10, 'yellow'), step=4)
colored output: (where the values are yellow)
.. code-block:: text
╒════════════════════╤════════════════════╕
│ some │ 0.1 │
╘════════════════════╧════════════════════╛
╒════════════════════╤════════════════════╕
│ some │ 28.57100% │
╘════════════════════╧════════════════════╛
╒════════════════════╤════════════════════╕
│ some │ 85.0% │
╘════════════════════╧════════════════════╛
╒════════════════════╤════════════════════╕
│ some var/smooth │ 10 │
├────────────────────┼────────────────────┤
│ some │ 85.0% │
╘════════════════════╧════════════════════╛
This version of the logger also prints a tabular summary of the data
you are logging to your ``stdout``. It:

- can silence ``stdout`` per key (per ``logger.log`` call)
- can print with color: ``logger.log(timestep, some_key=green(some_data))``
- can print with custom formatting:
  ``logger.log(timestep, some_key=green(some_data, percent))``, where
  ``percent`` is a formatting function
- uses the correct unicode table characters (please stop using ``|``
  and ``+``. **Use ``│``, ``┼`` instead**)
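For illustration, a row and rule in that box-drawing style can be
produced with a couple of format strings (a toy sketch, not the
logger's own table code; ``table_row`` and ``table_rule`` are made-up
helpers):

```python
def table_row(key, value, width=20):
    """One table row with unicode verticals and centered cells."""
    return f"│{key:^{width}}│{value:^{width}}│"


def table_rule(width=20):
    """A horizontal rule using ─ and ┼ instead of - and +."""
    return "├" + "─" * width + "┼" + "─" * width + "┤"


print(table_row("timestep", 1999))
print(table_rule())
print(table_row("episode", 10))
```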
A typical printout of this logger looks like the following:
.. code-block:: python
from datetime import datetime
from ml_logger import ML_Logger
logger = ML_Logger(root_dir=f"/mnt/bucket/deep_Q_learning/{datetime.now():%Y%m%d-%H%M%S.%f}")
logger.log_params(G=vars(G), RUN=vars(RUN), Reporting=vars(Reporting))
outputs the following
.. code-block:: text
═════════════════════════════════════════════════════
G
───────────────────────────────┬─────────────────────
env_name │ MountainCar-v0
seed │ None
stochastic_action │ True
conv_params │ None
value_params │ (64,)
use_layer_norm │ True
buffer_size │ 50000
replay_batch_size │ 32
prioritized_replay │ True
alpha │ 0.6
beta_start │ 0.4
beta_end │ 1.0
prioritized_replay_eps │ 1e-06
grad_norm_clipping │ 10
double_q │ True
use_dueling │ False
exploration_fraction │ 0.1
final_eps │ 0.1
n_timesteps │ 100000
learning_rate │ 0.001
gamma │ 1.0
learning_start │ 1000
learn_interval │ 1
target_network_update_interval │ 500
═══════════════════════════════╧═════════════════════
RUN
───────────────────────────────┬─────────────────────
log_directory │ /mnt/slab/krypton/machine_learning/ge_dqn/2017-11-20/162048.353909-MountainCar-v0-prioritized_replay(True)
checkpoint │ checkpoint.cp
log_file │ output.log
═══════════════════════════════╧═════════════════════
Reporting
───────────────────────────────┬─────────────────────
checkpoint_interval │ 10000
reward_average │ 100
print_interval │ 10
═══════════════════════════════╧═════════════════════
╒════════════════════╤════════════════════╕
│ timestep │ 1999 │
├────────────────────┼────────────────────┤
│ episode │ 10 │
├────────────────────┼────────────────────┤
│ total reward │ -200.0 │
├────────────────────┼────────────────────┤
│ total reward/mean │ -200.0 │
├────────────────────┼────────────────────┤
│ total reward/max │ -200.0 │
├────────────────────┼────────────────────┤
│time spent exploring│ 82.0% │
├────────────────────┼────────────────────┤
│ replay beta │ 0.41 │
╘════════════════════╧════════════════════╛
TODO:
-----
Visualization (Preview):boom:
-----------------------------
In addition, ml-logger also comes with a powerful visualization
dashboard that beats tensorboard in every aspect.
.. figure:: https://github.com/episodeyang/ml_logger/blob/master/figures/ml_visualization_dashboard_preview.png?raw=true
:alt: ml visualization dashboard
ml visualization dashboard
An Example Log from ML-Logger
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A common pain that comes after launching ML training jobs on AWS is
the lack of a good way to manage and visualize your data. So far, a
common practice is to upload your experiment data to AWS S3 or Google
Cloud buckets. Then one quickly realizes that downloading data from S3
can be slow. S3 does not offer differential sync the way gcloud's
``gsutil rsync`` does. This makes it hard to sync a large collection
of data that is constantly appended to.
So far the best way we have found for organizing experimental data is
to have a centralized instrumentation server. Compared with managing
your data on S3, a centralized instrumentation server makes it much
easier to move experiments around, run analysis that is co-located
with your data, and host visualization dashboards on the same machine.
To download data locally, you can use ``sshfs``, ``samba``, ``rsync``
or a variety of remote disks. All faster than S3.
ML-Logger is the logging utility that allows you to do this. To make
ml-logger easy to use, we made it so that you can use it with zero
configuration, logging to your local hard drive by default. When the
logging directory field in
``logger.configure(log_directory=<your directory>)`` is an HTTP
endpoint, the logger will instantiate a fast, future-based logging
client that launches HTTP requests in a separate thread. We optimized
the client so that it won't slow down your training code.
API wise, ML-logger makes it easy for you to log textual printouts,
simple scalars, numpy tensors, image tensors, and ``pyplot`` figures.
Because you might also want to read data from the instrumentation
server, we also made it possible to load numpy, pickle, text and binary
files remotely.
In the future, we will start building an integrated dashboard with fast
search, live figure update and markdown-based reporting/dashboarding to
go with ml-logger.
Now give this a try, and profit!
.. |Downloads| image:: http://pepy.tech/badge/ml-logger
:target: http://pepy.tech/project/ml-logger
.. |ML-Logger Dashboard| image:: figures/app-dash-ml_setup.png?raw=true
:target: https://app.dash.ml/