Training results are very poor. #4

Open
xungeer29 opened this issue Nov 6, 2023 · 7 comments

Comments

@xungeer29

xungeer29 commented Nov 6, 2023

Thank you very much for making your training code public.
I trained the model with your default config file, changing only batch_size to 256. The final results are above 81 mm for MPJPE and above 88 mm for MPVPE.
I don't know why the results are so bad.

[11/04 02:22:49] Training INFO: [Epoch 49/50][Batch 300/357][lr 0.000000][loss_seg: 0.0541][loss_dense: 0.0002][loss_lovasz: 0.0125][loss_joint_left_uv_0: 0.0055][loss_joint_right_uv_0: 0.0054][loss_mesh_left_uv_0: 0.0071][loss_mesh_right_uv_0: 0.0075][loss_joint_left_xyz_0: 0.0066][loss_joint_right_xyz_0: 0.0064][loss_mesh_left_xyz_0: 0.0080][loss_mesh_right_xyz_0: 0.0082][loss_edge_left_0: 0.0135][loss_edge_right_0: 0.0138][loss_normal_left_0: 0.0294][loss_normal_right_0: 0.0300][loss_offset_0: 0.0033][loss_joint_left_uv_1: 0.0052][loss_joint_right_uv_1: 0.0052][loss_mesh_left_uv_1: 0.0069][loss_mesh_right_uv_1: 0.0068][loss_joint_left_xyz_1: 0.0065][loss_joint_right_xyz_1: 0.0063][loss_mesh_left_xyz_1: 0.0079][loss_mesh_right_xyz_1: 0.0078][loss_edge_left_1: 0.0135][loss_edge_right_1: 0.0137][loss_normal_left_1: 0.0292][loss_normal_right_1: 0.0291][loss_offset_1: 0.0031][loss_joint_left_uv_2: 0.0051][loss_joint_right_uv_2: 0.0050][loss_mesh_left_uv_2: 0.0068][loss_mesh_right_uv_2: 0.0067][loss_joint_left_xyz_2: 0.0065][loss_joint_right_xyz_2: 0.0063][loss_mesh_left_xyz_2: 0.0079][loss_mesh_right_xyz_2: 0.0078][loss_edge_left_2: 0.0135][loss_edge_right_2: 0.0136][loss_normal_left_2: 0.0291][loss_normal_right_2: 0.0291][loss_offset_2: 0.0031]
[11/04 02:24:36] Training INFO: Save checkpoint to ./checkpoints/DIR/checkpoint/latest.pth
[11/04 02:30:25] Training INFO: MPJPE_0: left 80.77075399604499 mm, right 81.86647319326214 mm, AVG 81.31861359465356 mm
[11/04 02:30:25] Training INFO: MPVPE_0: left 87.17533539907605 mm, right 89.05994439241933 mm, AVG 88.11763989574769 mm
[11/04 02:30:25] Training INFO: MPJPE_1: left 80.8181789283659 mm, right 82.71075034258412 mm, AVG 81.76446463547501 mm
[11/04 02:30:25] Training INFO: MPVPE_1: left 87.29970373359382 mm, right 89.37457580776776 mm, AVG 88.33713977068078 mm
[11/04 02:30:25] Training INFO: MPJPE_2: left 81.22921638629016 mm, right 83.13988862084408 mm, AVG 82.18455250356712 mm
[11/04 02:30:25] Training INFO: MPVPE_2: left 87.71276979469785 mm, right 89.84211043399922 mm, AVG 88.77744011434854 mm
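
For readers unfamiliar with the metrics in the log above: MPJPE (mean per-joint position error) and MPVPE (mean per-vertex position error) are averages of Euclidean distances between predicted and ground-truth 3D points. Below is a minimal sketch in plain Python, assuming nested sequences shaped (samples, points, 3) in millimetres. The function name is hypothetical, and the repository's own evaluation may apply extra steps such as root alignment that are not shown here.

```python
import math
from statistics import fmean
from typing import Sequence

Point = Sequence[float]


def mean_position_error(
    pred: Sequence[Sequence[Point]], gt: Sequence[Sequence[Point]]
) -> float:
    """Mean Euclidean distance over all samples and points.

    Pass joints to get an MPJPE-style number, mesh vertices for MPVPE.
    Hypothetical helper for illustration, not the repository's code.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of samples")
    distances = []
    for pred_sample, gt_sample in zip(pred, gt):
        if len(pred_sample) != len(gt_sample):
            raise ValueError("each sample must contain the same number of points")
        for p, g in zip(pred_sample, gt_sample):
            # Euclidean distance between one predicted and one true point.
            distances.append(math.dist(p, g))
    return fmean(distances)


# Toy data: one sample with 21 joints, every prediction offset by (3, 4, 0) mm.
gt_joints = [[(0.0, 0.0, 0.0)] * 21]
pred_joints = [[(3.0, 4.0, 0.0)] * 21]
print(mean_position_error(pred_joints, gt_joints))
```

With an error defined this way, values above 80 mm mean the predicted points are on average more than 8 cm from the ground truth.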
@walsvid
Collaborator

walsvid commented Nov 14, 2023

Hi @xungeer29. Could you reproduce the metrics mentioned in the readme using the released checkpoint? My suggestion is to first ensure that the results of the official model inference can be reproduced before making any modifications. Additionally, according to the linear scaling rule, when the batch size changes, the learning rate also needs to be adjusted accordingly to achieve similar performance.
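
The linear scaling rule referred to here multiplies the learning rate by the same factor as the batch size. A minimal sketch follows; the base learning rate and base batch size are placeholder values chosen for illustration, not the defaults from this repository's config, so substitute the real ones before relying on the result.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: the learning rate grows in proportion to the batch size."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Placeholder values for illustration only (assumptions, not the repo defaults).
base_lr = 1e-4
base_batch_size = 64

new_lr = scale_learning_rate(base_lr, base_batch_size, new_batch_size=256)
print(f"scaled lr: {new_lr:.2e}")
```

The rule is a heuristic and is commonly combined with a learning-rate warmup at the start of training; whether it is sufficient here depends on the optimizer and schedule in the config.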

@xungeer29
Author

I can reproduce the metrics mentioned in the readme using the released checkpoint, as shown below:

joint mean error:
    left: 10.745436884462833 mm, right: 9.604906663298607 mm
    all: 10.17517177388072 mm
vert mean error:
    left: 10.490822605788708 mm, right: 9.404349140822887 mm
    all: 9.947585873305798 mm
pixel joint mean error:
    left: 6.331959247589111 mm, right: 5.808093070983887 mm
    all: 6.070026397705078 mm
pixel vert mean error:
    left: 6.235781669616699 mm, right: 5.725203037261963 mm
    all: 5.98049259185791 mm
root error: 28.982944786548615 mm

The learning rate should have only a small impact on the results; on its own it cannot keep the network from converging at all.
I also tried increasing the lr linearly with the batch size, but that had no effect either.

@luckyday2022

Have you solved this problem?

@xungeer29
Author

No.

@luckyday2022

Is there something wrong with the test code?

@xungeer29
Author

xungeer29 commented Dec 4, 2023

The test results with the released checkpoint are correct. Have you run into the same problem as well?

@luckyday2022

"Just modify the batch_size to 256. The final results achieved 81+ for MPJPE and 88+ for MPVPE.
I don't know why the results are so bad."

Have you trained the original model? What's the result?

3 participants