Provide more information to the user #358

Merged (29 commits) on Aug 23, 2024


StarCycle
Contributor

  1. Remove abbreviations and use longer, descriptive names in logs
  2. Enable the progress bar for evaluation by default
  3. Enable asynchronous environments by default
  4. Add a README introduction to resuming training

Contributor

@alexander-soare alexander-soare left a comment

Thanks for these suggestions. I've left some comments.

@alexander-soare alexander-soare self-assigned this Aug 12, 2024
@StarCycle
Contributor Author

Perhaps you can also print the config at the beginning of training/eval.

Currently you are using Hydra, and different config files may override each other. A user may not remember a setting in another config file, or may not know that their config is overridden by another config file.

Simply printing the full config at the beginning of a training/eval process can solve this problem, as mmengine does.
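
A minimal sketch of this suggestion, not the code from this PR: once the merged config is available as a plain dictionary, it can be pretty-printed through the logger at startup. The `log_config` helper and the sample values are hypothetical. With Hydra, the dictionary would typically come from `OmegaConf.to_container(cfg)`, which is an assumption here rather than something shown in this thread.

```python
import logging
from pprint import pformat


def log_config(cfg: dict, logger: logging.Logger) -> str:
    """Log the full merged config once at startup and return the logged text."""
    # pformat sorts dictionary keys by default, so the output is stable.
    text = pformat(cfg)
    logger.info(text)
    return text


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Hypothetical merged config, used only for illustration.
    example_cfg = {
        "device": "cuda",
        "eval": {"batch_size": 50, "n_episodes": 50},
        "seed": 100000,
    }
    log_config(example_cfg, logging.getLogger(__name__))
```

Logging the merged result, rather than each file, shows the values that actually take effect after all overrides.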

Collaborator

@Cadene Cadene left a comment

Thanks for your contribution! We will improve onboarding and ease of use.

Left a few comments

@StarCycle
Contributor Author

StarCycle commented Aug 14, 2024

@Cadene:

Re progress bar: you are right, I will not make it an option. I still suggest enabling both progress bars (i.e., the bar for episodes and the bar for steps within an episode). Users can easily locate evaluation problems if a step takes too long or there are too many episodes.

@StarCycle
Contributor Author

As a side note, it now logs this at the beginning of training, which is very easy to read:

INFO 2024-08-14 13:15:24       <stdin>:1 {'dataset_repo_id': 'lerobot/pusht',
 'device': 'cuda',
 'env': {'action_dim': 2,
         'episode_length': 300,
         'fps': '${fps}',
         'gym': {'obs_type': 'pixels_agent_pos',
                 'render_mode': 'rgb_array',
                 'visualization_height': 384,
                 'visualization_width': 384},
         'image_size': 96,
         'name': 'pusht',
         'state_dim': 2,
         'task': 'PushT-v0'},
 'eval': {'batch_size': 50, 'n_episodes': 50, 'use_async_envs': True},
 'fps': 10,
 'override_dataset_stats': {'action': {'max': [511.0, 511.0],
                                       'min': [12.0, 25.0]},
                            'observation.image': {'mean': [[[0.5]],
                                                           [[0.5]],
                                                           [[0.5]]],
                                                  'std': [[[0.5]],
                                                          [[0.5]],
                                                          [[0.5]]]},
                            'observation.state': {'max': [496.14618, 510.9579],
                                                  'min': [13.456424,
                                                          32.938293]}},
 'policy': {'beta_end': 0.02,
            'beta_schedule': 'squaredcos_cap_v2',
            'beta_start': 0.0001,
            'clip_sample': True,
            'clip_sample_range': 1.0,
            'crop_is_random': True,
            'crop_shape': [84, 84],
            'diffusion_step_embed_dim': 128,
            'do_mask_loss_for_padding': False,
            'down_dims': [512, 1024, 2048],
            'horizon': 16,
            'input_normalization_modes': {'observation.image': 'mean_std',
                                          'observation.state': 'min_max'},
            'input_shapes': {'observation.image': [3, 96, 96],
                             'observation.state': ['${env.state_dim}']},
            'kernel_size': 5,
            'n_action_steps': 8,
            'n_groups': 8,
            'n_obs_steps': 2,
            'name': 'diffusion',
            'noise_scheduler_type': 'DDPM',
            'num_inference_steps': None,
            'num_train_timesteps': 100,
            'output_normalization_modes': {'action': 'min_max'},
            'output_shapes': {'action': ['${env.action_dim}']},
            'prediction_type': 'epsilon',
            'pretrained_backbone_weights': None,
            'spatial_softmax_num_keypoints': 32,
            'use_film_scale_modulation': True,
            'use_group_norm': True,
            'vision_backbone': 'resnet18'},
 'resume': False,
 'seed': 100000,
 'training': {'adam_betas': [0.95, 0.999],
              'adam_eps': 1e-08,
              'adam_weight_decay': 1e-06,
              'batch_size': 64,
              'delta_timestamps': {'action': '[i / ${fps} for i in range(1 - '
                                             '${policy.n_obs_steps}, 1 - '
                                             '${policy.n_obs_steps} + '
                                             '${policy.horizon})]',
                                   'observation.image': '[i / ${fps} for i in '
                                                        'range(1 - '
                                                        '${policy.n_obs_steps}, '
                                                        '1)]',
                                   'observation.state': '[i / ${fps} for i in '
                                                        'range(1 - '
                                                        '${policy.n_obs_steps}, '
                                                        '1)]'},
              'do_online_rollout_async': False,
              'drop_n_last_frames': 7,
              'eval_freq': 100,
              'grad_clip_norm': 10,
              'image_transforms': {'brightness': {'min_max': [0.8, 1.2],
                                                  'weight': 1},
                                   'contrast': {'min_max': [0.8, 1.2],
                                                'weight': 1},
                                   'enable': False,
                                   'hue': {'min_max': [-0.05, 0.05],
                                           'weight': 1},
                                   'max_num_transforms': 3,
                                   'random_order': False,
                                   'saturation': {'min_max': [0.5, 1.5],
                                                  'weight': 1},
                                   'sharpness': {'min_max': [0.8, 1.2],
                                                 'weight': 1}},
              'log_freq': 200,
              'lr': 0.0001,
              'lr_scheduler': 'cosine',
              'lr_warmup_steps': 500,
              'num_workers': 4,
              'offline_steps': 200000,
              'online_buffer_capacity': None,
              'online_buffer_seed_size': 0,
              'online_env_seed': None,
              'online_rollout_batch_size': 1,
              'online_rollout_n_episodes': 1,
              'online_sampling_ratio': 0.5,
              'online_steps': 0,
              'online_steps_between_rollouts': 1,
              'save_checkpoint': True,
              'save_freq': 100},
 'use_amp': False,
 'video_backend': 'pyav',
 'wandb': {'disable_artifact': False,
           'enable': False,
           'notes': '',
           'project': 'lerobot'}}
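
The `delta_timestamps` entries in the log above are printed as unresolved expressions. As a worked example, the sketch below evaluates them by hand with the values from the same log (`fps` = 10, `policy.n_obs_steps` = 2, `policy.horizon` = 16). The helper name is hypothetical, and how LeRobot itself evaluates these strings is not shown in this thread.

```python
def delta_timestamps(fps: int, start: int, stop: int) -> list[float]:
    """Return frame offsets in seconds for frame indices in range(start, stop)."""
    return [i / fps for i in range(start, stop)]


# Values taken from the logged config above.
fps, n_obs_steps, horizon = 10, 2, 16

# 'action': [i / fps for i in range(1 - n_obs_steps, 1 - n_obs_steps + horizon)]
action = delta_timestamps(fps, 1 - n_obs_steps, 1 - n_obs_steps + horizon)

# 'observation.image' and 'observation.state':
# [i / fps for i in range(1 - n_obs_steps, 1)]
observation = delta_timestamps(fps, 1 - n_obs_steps, 1)

print(len(action), action[0], action[-1])
print(observation)
```

With these values, the action offsets cover 16 frames starting one frame in the past, and the observation offsets cover the previous and the current frame.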

@alexander-soare
Contributor

Thanks for revising this, @StarCycle. My status is now "approving". I will also wait for @Cadene to approve, as he has become involved.

Btw, looks like style tests are not passing. Have you seen CONTRIBUTING.md for instructions on how to set up the pre-commit hook?

@Cadene
Collaborator

Cadene commented Aug 15, 2024

@StarCycle By any chance, could you provide code to try this PR?

I feel like at least the third of the sections we advise adding is missing from the PR description:

  1. What the PR adds:
  2. How it was tested
  3. How to checkout & try? (for the reviewer) <--- example code

See this PR description for instance: #281

Thanks!

@StarCycle
Contributor Author

StarCycle commented Aug 16, 2024

@Cadene

You are right! I explain it here:

What this does

  1. Enable the progress bar for evaluation by default, except in Slurm
  2. Enable asynchronous environments by default, except for aloha environments
  3. Add a README introduction to resuming training
  4. Add a tutorial introduction explaining the abbreviations of the metrics in the logs
  5. Print the full configuration at the beginning of a training process
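
Item 1 above could be sketched as follows, assuming Slurm jobs are detected through the `SLURM_JOB_ID` environment variable, which Slurm sets for running jobs. The helper names are hypothetical and are not taken from this PR.

```python
import os
from collections.abc import Mapping
from typing import Optional


def inside_slurm(environ: Optional[Mapping] = None) -> bool:
    """Return True when the process appears to run inside a Slurm job."""
    env = os.environ if environ is None else environ
    # Assumption: the presence of SLURM_JOB_ID is a sufficient signal.
    return "SLURM_JOB_ID" in env


def show_progress_bar(environ: Optional[Mapping] = None) -> bool:
    """Progress bars are on by default and off inside Slurm jobs."""
    return not inside_slurm(environ)


if __name__ == "__main__":
    print("progress bar enabled:", show_progress_bar())
```

Disabling the bar in batch jobs keeps log files free of repeated carriage-return updates, while interactive runs still get visual feedback.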

How it was tested?

There is not much difference from the original code. Just run python lerobot/scripts/train.py policy=diffusion env=pusht

How to checkout and try?

Just run python lerobot/scripts/train.py policy=diffusion env=pusht

@StarCycle StarCycle requested a review from Cadene August 16, 2024 09:16
Collaborator

@Cadene Cadene left a comment

I am very thankful for your time and suggestions.
I left some comments. I think some parts of this PR require a bit more work.

On our side, we should better convey the purpose of the two sections of the PR description. They are actually quite important for making faster progress on LeRobot. I would have edited the PR description with this information:

How it was tested?

  • Ran diffusion training and evaluation on pusht. Configs are displayed and look like this:
    [two screenshots, 2024-08-16]
  • Ran diffusion training and evaluation on pusht. Progress bars are displayed and look like this:
    [screenshot, 2024-08-16]

TODO:

  • Run training and evaluation on aloha to validate use_async_envs: false
  • Run training and evaluation on xarm to validate use_async_envs: false

How to checkout and try?

  • To checkout the code:
git remote add starcycle git@github.com:StarCycle/easier_lerobot.git
git fetch starcycle
git checkout starcycle/main
  • To run diffusion training and evaluation on pusht:
python lerobot/scripts/train.py policy=diffusion env=pusht \
eval.batch_size=1 eval.use_async_envs=false training.eval_freq=2
  • To run diffusion training and evaluation on aloha:
    TODO

  • To run diffusion training and evaluation on xarm:
    TODO

Thanks for your help!!!!

StarCycle and others added 4 commits August 17, 2024 09:02
When cfg.eval.batch_size > cfg.eval.n_episodes, raise an error instead of modifying cfg.eval.batch_size silently

Co-authored-by: Remi <[email protected]>
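
The behavior described in this commit message can be sketched as a small validation step. This is a hedged illustration, not the committed code: the function name and the error wording are hypothetical.

```python
def check_eval_batch_size(batch_size: int, n_episodes: int) -> None:
    """Fail loudly instead of silently shrinking the evaluation batch size."""
    if batch_size > n_episodes:
        raise ValueError(
            f"eval.batch_size ({batch_size}) is greater than "
            f"eval.n_episodes ({n_episodes}). Lower eval.batch_size or "
            "raise eval.n_episodes."
        )


if __name__ == "__main__":
    # Matches the logged config earlier in this thread: 50 and 50 is valid.
    check_eval_batch_size(batch_size=50, n_episodes=50)
    print("eval settings are consistent")
```

Raising an error makes the mismatch visible to the user, whereas a silent adjustment would leave the printed config out of sync with what actually runs.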
@StarCycle StarCycle requested a review from Cadene August 17, 2024 05:05
@StarCycle
Contributor Author

Just out of curiosity, does LeRobot support multi-GPU training now? (You just mentioned Slurm ^^)

@Cadene
Collaborator

Cadene commented Aug 17, 2024

@StarCycle Yes we are working on a PR using accelerate: #317

@StarCycle
Contributor Author

> @StarCycle Yes we are working on a PR using accelerate: #317

Nice!

@StarCycle
Contributor Author

StarCycle commented Aug 21, 2024

Hi @Cadene,

Are there other things that I need to complete to merge this PR?

(ノ"◑ڡ◑)ノ

Collaborator

@Cadene Cadene left a comment

One last update is required, and then this is ready to merge!

@alexander-soare alexander-soare merged commit a2592a5 into huggingface:main Aug 23, 2024
6 checks passed
amandip7 pushed a commit to amandip7/lerobot that referenced this pull request Oct 10, 2024
menhguin pushed a commit to menhguin/lerobot that referenced this pull request Feb 9, 2025