Provide more information to the user #358

StarCycle · 2024-08-12T15:52:10Z

remove abbreviations and use "longer names" in logs
enable progress bar for evaluation by default
enable asynchronous envs by default
add readme intro for resuming training

Add introduction about resuming training

alexander-soare

Thanks for these suggestions. I've left some comments.

README.md

lerobot/configs/default.yaml

lerobot/scripts/eval.py

lerobot/scripts/train.py

StarCycle · 2024-08-13T04:33:38Z

Perhaps you can also print the config at the beginning of training/eval.

You currently use Hydra, where different config files can override each other. A user may not remember a setting in another config file, or may not realize that their config is overridden by another config file.

Simply printing the full config at the beginning of a training/eval run solves this problem, as mmengine does.
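
A minimal sketch of this suggestion, assuming the merged config is already available as a plain nested dict (with Hydra, `OmegaConf.to_container(cfg)` would produce one). The names `format_config` and `log_config` are hypothetical and not part of LeRobot:

```python
import logging
from pprint import pformat


def format_config(cfg: dict) -> str:
    """Return a readable rendering of a nested config dict."""
    # pformat sorts dict keys and wraps nested dicts across lines,
    # which keeps long configs easy to scan.
    return pformat(cfg, indent=1, width=80)


def log_config(cfg: dict) -> None:
    """Log the full merged config once, before training or evaluation starts."""
    logging.info(format_config(cfg))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Hypothetical values standing in for a merged config.
    merged = {
        "policy": {"name": "diffusion"},
        "env": {"name": "pusht"},
        "eval": {"n_episodes": 50},
    }
    log_config(merged)
```

Logging the config once at startup means the effective values end up in the same log file as the metrics, so an unexpected override is visible without re-reading every YAML file.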

lerobot/configs/default.yaml

lerobot/configs/env/aloha.yaml

lerobot/configs/env/dora_aloha_real.yaml

lerobot/scripts/eval.py

lerobot/scripts/train.py

Cadene

Thanks for your contribution! We will improve onboarding and ease of use.

Left a few comments

lerobot/scripts/train.py

lerobot/configs/default.yaml

README.md

lerobot/common/utils/utils.py

StarCycle · 2024-08-14T10:20:57Z

@Cadene:

Re progress bar: you are right, I will not make it an option. I still suggest enabling both progress bars (i.e., the bar for episodes and the bar for steps within an episode). Users can then easily spot evaluation problems, such as a step taking too long or there being too many episodes.

StarCycle · 2024-08-14T13:17:26Z

As a side note, it now logs this at the beginning of training, which is very easy to read:

INFO 2024-08-14 13:15:24       <stdin>:1 {'dataset_repo_id': 'lerobot/pusht',
 'device': 'cuda',
 'env': {'action_dim': 2,
         'episode_length': 300,
         'fps': '${fps}',
         'gym': {'obs_type': 'pixels_agent_pos',
                 'render_mode': 'rgb_array',
                 'visualization_height': 384,
                 'visualization_width': 384},
         'image_size': 96,
         'name': 'pusht',
         'state_dim': 2,
         'task': 'PushT-v0'},
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': True},
 'fps': 10,
 'override_dataset_stats': {'action': {'max': [511.0, 511.0],
                                       'min': [12.0, 25.0]},
                            'observation.image': {'mean': [[[0.5]],
                                                           [[0.5]],
                                                           [[0.5]]],
                                                  'std': [[[0.5]],
                                                          [[0.5]],
                                                          [[0.5]]]},
                            'observation.state': {'max': [496.14618, 510.9579],
                                                  'min': [13.456424,
                                                          32.938293]}},
 'policy': {'beta_end': 0.02,
            'beta_schedule': 'squaredcos_cap_v2',
            'beta_start': 0.0001,
            'clip_sample': True,
            'clip_sample_range': 1.0,
            'crop_is_random': True,
            'crop_shape': [84, 84],
            'diffusion_step_embed_dim': 128,
            'do_mask_loss_for_padding': False,
            'down_dims': [512, 1024, 2048],
            'horizon': 16,
            'input_normalization_modes': {'observation.image': 'mean_std',
                                          'observation.state': 'min_max'},
            'input_shapes': {'observation.image': [3, 96, 96],
                             'observation.state': ['${env.state_dim}']},
            'kernel_size': 5,
            'n_action_steps': 8,
            'n_groups': 8,
            'n_obs_steps': 2,
            'name': 'diffusion',
            'noise_scheduler_type': 'DDPM',
            'num_inference_steps': None,
            'num_train_timesteps': 100,
            'output_normalization_modes': {'action': 'min_max'},
            'output_shapes': {'action': ['${env.action_dim}']},
            'prediction_type': 'epsilon',
            'pretrained_backbone_weights': None,
            'spatial_softmax_num_keypoints': 32,
            'use_film_scale_modulation': True,
            'use_group_norm': True,
            'vision_backbone': 'resnet18'},
 'resume': False,
 'seed': 100000,
 'training': {'adam_betas': [0.95, 0.999],
              'adam_eps': 1e-08,
              'adam_weight_decay': 1e-06,
              'batch_size': 64,
              'delta_timestamps': {'action': '[i / ${fps} for i in range(1 - '
                                             '${policy.n_obs_steps}, 1 - '
                                             '${policy.n_obs_steps} + '
                                             '${policy.horizon})]',
                                   'observation.image': '[i / ${fps} for i in '
                                                        'range(1 - '
                                                        '${policy.n_obs_steps}, '
                                                        '1)]',
                                   'observation.state': '[i / ${fps} for i in '
                                                        'range(1 - '
                                                        '${policy.n_obs_steps}, '
                                                        '1)]'},
              'do_online_rollout_async': False,
              'drop_n_last_frames': 7,
              'eval_freq': 100,
              'grad_clip_norm': 10,
              'image_transforms': {'brightness': {'min_max': [0.8, 1.2],
                                                  'weight': 1},
                                   'contrast': {'min_max': [0.8, 1.2],
                                                'weight': 1},
                                   'enable': False,
                                   'hue': {'min_max': [-0.05, 0.05],
                                           'weight': 1},
                                   'max_num_transforms': 3,
                                   'random_order': False,
                                   'saturation': {'min_max': [0.5, 1.5],
                                                  'weight': 1},
                                   'sharpness': {'min_max': [0.8, 1.2],
                                                 'weight': 1}},
              'log_freq': 200,
              'lr': 0.0001,
              'lr_scheduler': 'cosine',
              'lr_warmup_steps': 500,
              'num_workers': 4,
              'offline_steps': 200000,
              'online_buffer_capacity': None,
              'online_buffer_seed_size': 0,
              'online_env_seed': None,
              'online_rollout_batch_size': 1,
              'online_rollout_n_episodes': 1,
              'online_sampling_ratio': 0.5,
              'online_steps': 0,
              'online_steps_between_rollouts': 1,
              'save_checkpoint': True,
              'save_freq': 100},
 'use_amp': False,
 'video_backend': 'pyav',
 'wandb': {'disable_artifact': False,
           'enable': False,
           'notes': '',
           'project': 'lerobot'}}

alexander-soare · 2024-08-14T13:31:28Z

Thanks for revising this, @StarCycle. My status is now "approving". I will also wait for @Cadene to approve, as he has become involved.

Btw, looks like style tests are not passing. Have you seen CONTRIBUTING.md for instructions on how to set up the pre-commit hook?

Cadene · 2024-08-15T18:48:14Z

@StarCycle By any chance, could you provide code to try this PR?

I feel that at least the third of the sections we advise adding is missing from the PR description:

What the PR adds:
How it was tested
How to checkout & try? (for the reviewer) <--- example code

See this PR description for instance: #281

Thanks!

StarCycle · 2024-08-16T03:39:57Z

@Cadene

You are right! I explain it here:

What this does

Enable the progress bar for evaluation by default, except under Slurm.
Enable asynchronous envs by default, except for aloha environments.
Add a README intro for resuming training.
Add a tutorial intro explaining the abbreviations of the metrics in the logs.
Print the full configuration at the beginning of a training process.
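
The "except under Slurm" rule in the first item could be sketched as follows. This assumes a Slurm job is detected through the `SLURM_JOB_ID` environment variable, which Slurm sets for its jobs. The function names are illustrative and may not match the helpers in this PR:

```python
import os
from typing import Mapping, Optional


def inside_slurm(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if the process appears to run inside a Slurm job.

    Assumption: the presence of SLURM_JOB_ID is used as the signal.
    """
    env = os.environ if environ is None else environ
    return "SLURM_JOB_ID" in env


def progress_bar_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Show progress bars interactively, but not in Slurm job logs."""
    return not inside_slurm(environ)


if __name__ == "__main__":
    print("progress bars enabled:", progress_bar_enabled())
```

Progress bars redraw a single terminal line, which tends to fill batch-job log files with noise. That is the motivation for switching them off there.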

How it was tested?

The changes are small relative to the original code. I just ran python lerobot/scripts/train.py policy=diffusion env=pusht

How to checkout and try?

Just run python lerobot/scripts/train.py policy=diffusion env=pusht

Cadene

I am very thankful for your time and suggestions.
I left some comments. I think some parts of this PR require a bit more work.

On our side, we should better convey the purpose of the two sections of the PR description. They are actually quite important for making faster progress on LeRobot. I would have edited the PR description with this info:

How it was tested?

Ran diffusion training and evaluation on pusht. Configs are displayed and look like this:

Ran diffusion training and evaluation on pusht. Progress bars are displayed and look like this:

TODO:

Run training and evaluation on aloha to validate use_async_envs: false
Run training and evaluation on xarm to validate use_async_envs: false

How to checkout and try?

To checkout the code:

git remote add starcycle git@github.com:StarCycle/easier_lerobot.git
git fetch starcycle
git checkout starcycle/main

To run diffusion training and evaluation on pusht:

python lerobot/scripts/train.py policy=diffusion env=pusht \
eval.batch_size=1 eval.use_async_envs=false training.eval_freq=2

To run diffusion training and evaluation on aloha:
TODO
To run diffusion training and evaluation on xarm:
TODO

Thanks for your help!!!!

lerobot/common/utils/utils.py

lerobot/scripts/train.py

lerobot/configs/env/xarm.yaml

lerobot/configs/env/aloha.yaml

When cfg.eval.batch_size > cfg.eval.n_episodes, raise an error instead of modifying cfg.eval.batch_size silently
Co-authored-by: Remi <[email protected]>
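
A minimal sketch of the check described in this commit message, written as a standalone function. The name `check_eval_batch_size` and the error wording are hypothetical; the actual code in the PR may differ:

```python
def check_eval_batch_size(batch_size: int, n_episodes: int) -> None:
    """Fail loudly when the eval batch size exceeds the number of episodes.

    Raising here, rather than clamping batch_size silently, makes the
    user fix the config explicitly.
    """
    if batch_size > n_episodes:
        raise ValueError(
            f"eval.batch_size ({batch_size}) is greater than "
            f"eval.n_episodes ({n_episodes}). Lower eval.batch_size or "
            "raise eval.n_episodes so that every environment in the batch "
            "is actually needed."
        )


if __name__ == "__main__":
    check_eval_batch_size(batch_size=50, n_episodes=50)  # passes silently
    try:
        check_eval_batch_size(batch_size=64, n_episodes=50)
    except ValueError as err:
        print(err)
```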

StarCycle · 2024-08-17T07:08:58Z

Just out of curiosity, does LeRobot support multi-GPU training now? (You just mentioned Slurm ^^)

Cadene · 2024-08-17T13:19:34Z

@StarCycle Yes we are working on a PR using accelerate: #317

StarCycle · 2024-08-17T14:13:12Z

> @StarCycle Yes we are working on a PR using accelerate: #317

Nice!

StarCycle · 2024-08-21T02:13:07Z

Hi @Cadene,

Are there other things that I need to complete to merge this PR?

(ﾉ"◑ڡ◑)ﾉ

Cadene

One last update is required, then this is ready to merge!

lerobot/common/utils/utils.py

Co-authored-by: Remi <[email protected]>

Co-authored-by: Alexander Soare <[email protected]> Co-authored-by: Remi <[email protected]>

StarCycle added 4 commits August 12, 2024 15:42

provide more log information to beginner users

8f0a50e

Update README.md

a513214

Add introduction about resuming training

Merge branch 'main' into main

a1d5671

Update train.py

e6d0580

alexander-soare suggested changes Aug 12, 2024

README.md

lerobot/configs/default.yaml

lerobot/scripts/eval.py

lerobot/scripts/train.py

alexander-soare self-assigned this Aug 12, 2024

StarCycle added 7 commits August 13, 2024 13:02

Add link to the resumption example

d86344c

set async env by default by synv env for aloha

10d76a6

enalble eval progbars from default.yaml

faa89cc

always print current config before training/eval starts

a62c83d

let enable_progbar and enable_inner_progbar be false because we enabl…

9b1b0ff

…e it in default.yaml

add logic to fix eval.batch_size > eval.n_episodes

2cbb109

also set async=false in xarm env

242d9f9

StarCycle requested a review from alexander-soare August 14, 2024 06:21