Errors in fine-tuning training with the large model OC_10M.pb #3104

ZKNB · 2024-01-03T09:12:46Z

ZKNB
Jan 3, 2024

When I use deepmd_v2.2.7 to fine-tune the large DPA model, it reports the following error.

 dp train input.json --finetune OC_10M.pb

OMP: Info #254: KMP_AFFINITY: pid 53485 tid 54101 thread 18 bound to OS proc set 18
OMP: Info #254: KMP_AFFINITY: pid 53485 tid 54100 thread 17 bound to OS proc set 17

Intel MKL ERROR: Parameter 6 was incorrect on entry to DGELSD.
Traceback (most recent call last):
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/deepmd_cli/main.py", line 63in main
    deepmd_main(args)
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py",ne 74, in main
    train_dp(**dict_args)
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py"ine 168, in train
    _do_work(jdata, run_opt, is_compress)
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py"ine 280, in _do_work
    model.build(train_data, stop_batch, origin_type_map=origin_type_map)
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/deepmd/train/trainer.py", li290, in build
    self._init_from_pretrained_model(
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/deepmd/train/trainer.py", li1137, in _init_from_pretrained_model
    self._change_energy_bias(
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/deepmd/train/trainer.py", li1145, in _change_energy_bias
    self.model.change_energy_bias(
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/deepmd/model/ener.py", line , in change_energy_bias
    self.fitting.change_energy_bias(
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/deepmd/fit/ener.py", line 85in change_energy_bias
    delta_bias = np.linalg.lstsq(type_numbs, bias_diff, rcond=None)[0]
  File "<__array_function__ internals>", line 180, in lstsq
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/numpy/linalg/linalg.py", lin292, in lstsq
    x, resids, rank, s = gufunc(a, b, rcond, signature=signature, extobj=extobj)
  File "/export/home/liluotonggpu2/anaconda3/envs/dpmd/lib/python3.10/site-packages/numpy/linalg/linalg.py", lin00, in _raise_linalgerror_lstsq
    raise LinAlgError("SVD did not converge in Linear Least Squares")
numpy.linalg.LinAlgError: SVD did not converge in Linear Least Squares

wanghan-iapcm · 2024-01-04T04:52:05Z

wanghan-iapcm
Jan 4, 2024
Maintainer

Would you get the same error if you trained from scratch?

7 replies

njzjz Jan 4, 2024
Maintainer

Does it change the energy bias to NaN? The difference between the MKL and OpenBLAS backends might be that MKL throws an error, while OpenBLAS gives NaN.

ZKNB Jan 4, 2024
Author

I don't know how to check this energy bias. When I train on the same dataset with 'se_e2_r', the 'lcurve.out' looks normal. How can I fix this MKL and OpenBLAS error? I have been plagued by this problem for a long time. (TAT)

njzjz Jan 4, 2024
Maintainer

It should be shown on the screen. It is not a program issue, but a mathematical issue, meaning the least-squares solution cannot be calculated using the current data.
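
To illustrate the point about the least-squares step: the sketch below is not DeePMD-kit's code. It only reuses the names `type_numbs` and `bias_diff` from the traceback and wraps the same `np.linalg.lstsq` call in a finiteness check. The helper name `solve_bias_shift` and the assumed array shapes are hypothetical. The idea is that a NaN or inf in either input is reported explicitly, instead of surfacing as an MKL error or a NaN result depending on the backend.

```python
import numpy as np


def solve_bias_shift(type_numbs, bias_diff):
    """Least-squares fit of per-type energy shifts with an explicit input check.

    type_numbs: (n_frames, n_types) atom counts per type (assumed shape).
    bias_diff:  (n_frames,) label energy minus predicted energy (assumed shape).
    """
    type_numbs = np.asarray(type_numbs, dtype=float)
    bias_diff = np.asarray(bias_diff, dtype=float)

    # Flag every frame whose inputs contain NaN or inf before calling LAPACK.
    bad = ~np.isfinite(bias_diff) | ~np.isfinite(type_numbs).all(axis=1)
    if bad.any():
        frames = np.flatnonzero(bad).tolist()
        raise ValueError(f"non-finite values in frames {frames}")

    # Same call as in the traceback; index 0 is the least-squares solution.
    return np.linalg.lstsq(type_numbs, bias_diff, rcond=None)[0]
```

With finite inputs the function returns the fitted shifts. With a NaN in `bias_diff` it raises a `ValueError` naming the offending frames.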

njzjz Jan 4, 2024
Maintainer

Perhaps you can use this model to evaluate your data, and see if there is anything strange.

Answer selected by ZKNB

ZKNB Jan 4, 2024
Author

I just tested my data with the large model, and every dataset reports force and energy errors. I also found that the energy bias of some datasets is 'nan'. I deleted those datasets and reran the training, and lcurve.out no longer contains 'nan'. Thank you for your reply and suggestion!
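
A related check can be sketched in a few lines. The thread does not say whether the 'nan' values came from the reference labels or from the model's predictions, so this sketch only covers the first case. It assumes the data is stored in the DeePMD-kit NumPy layout, with energies in `set.*/energy.npy` under each system directory. The function name `find_nonfinite_energy_sets` is hypothetical.

```python
from pathlib import Path

import numpy as np


def find_nonfinite_energy_sets(root):
    """Map each set.*/energy.npy under root to the frame indices that are NaN or inf."""
    bad = {}
    for path in sorted(Path(root).rglob("set.*/energy.npy")):
        energy = np.load(path).reshape(-1)
        idx = np.flatnonzero(~np.isfinite(energy))
        if idx.size:
            bad[str(path)] = idx.tolist()
    return bad
```

An empty dictionary means all energy labels are finite. In that case the 'nan' values would have to come from the model evaluation rather than from the labels.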
