-
Hi! Do you know how to solve this problem?
-
You can locate the old submission script from the error message (the log prints its submission_hash and remote_root), then delete that script. In addition, you need to delete the entry for the failed step from the record.dpgen text file, along with the corresponding files in the iter.0000x folder, then rerun dpgen.
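A minimal sketch of the record.dpgen edit, in case it helps. It assumes each line of record.dpgen holds two integers, the iteration index and the step index (for example `0 1`). The helper name `drop_records_from` is made up for this illustration and is not part of dpgen. Back up the file before changing it.

```python
from pathlib import Path
import shutil


def drop_records_from(record_text: str, iter_index: int, step_index: int) -> str:
    """Keep only the records that come strictly before (iter_index, step_index).

    Assumption: every record is a line with two integers, "<iter> <step>".
    """
    kept = []
    for line in record_text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        record = (int(fields[0]), int(fields[1]))
        if record < (iter_index, step_index):
            kept.append(f"{record[0]} {record[1]}")
    return "".join(entry + "\n" for entry in kept)


if __name__ == "__main__":
    record_file = Path("record.dpgen")
    if record_file.exists():
        # Keep a copy so the original record can be restored if needed.
        shutil.copy(record_file, "record.dpgen.bak")
        trimmed = drop_records_from(record_file.read_text(), 0, 1)
        record_file.write_text(trimmed)
```

For the failure in this thread (iter.000000/00.train), the call would be something like `drop_records_from(text, 0, 1)`, assuming step 1 is the training run in your dpgen version. Check your own record.dpgen before trimming it.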
Good luck!
-
2023-09-22 19:58:20,835 - INFO : Find old submission; recover submission from json file;submission.submission_hash:94d399c52e275fe437f6efdc2cde28256d05a671; machine.context.remote_root:/home/jiang/work/dpgen_example/run/nnwork/94d399c52e275fe437f6efdc2cde28256d05a671; submission.work_base:iter.000000/00.train;
2023-09-22 19:58:21,057 - INFO : info:check_all_finished: False
2023-09-22 19:58:21,061 - INFO : job: f89cc153020a441fee171b740fd5807310295c46 2977 terminated;fail_cout is 10; resubmitting job
2023-09-22 19:58:21,068 - INFO : job:f89cc153020a441fee171b740fd5807310295c46 re-submit after terminated; new job_id is 3423
2023-09-22 19:58:21,323 - INFO : job:f89cc153020a441fee171b740fd5807310295c46 job_id:3423 after re-submitting; the state now is <JobStatus.terminated: 4>
2023-09-22 19:58:21,324 - INFO : job: f89cc153020a441fee171b740fd5807310295c46 3423 terminated;fail_cout is 11; resubmitting job
2023-09-22 19:58:21,334 - INFO : job:f89cc153020a441fee171b740fd5807310295c46 re-submit after terminated; new job_id is 3433
2023-09-22 19:58:21,579 - INFO : job:f89cc153020a441fee171b740fd5807310295c46 job_id:3433 after re-submitting; the state now is <JobStatus.terminated: 4>
2023-09-22 19:58:21,580 - INFO : job: f89cc153020a441fee171b740fd5807310295c46 3433 terminated;fail_cout is 12; resubmitting job
Traceback (most recent call last):
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 352, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 861, in handle_unexpected_job_state
self.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 861, in handle_unexpected_job_state
self.handle_unexpected_job_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 846, in handle_unexpected_job_state
raise RuntimeError(
RuntimeError: job:f89cc153020a441fee171b740fd5807310295c46 3433 failed 12 times.job_detail:{'f89cc153020a441fee171b740fd5807310295c46': {'job_task_list': [{'command': "/bin/sh -c '{ if [ ! -f model.ckpt.index ]; then dp train input.json; else dp train input.json --restart model.ckpt; fi }'&&dp freeze", 'task_work_path': '002', 'forward_files': ['input.json'], 'backward_files': ['frozen_model.pb', 'lcurve.out', 'train.log', 'model.ckpt.meta', 'model.ckpt.index', 'model.ckpt.data-00000-of-00001', 'checkpoint'], 'outlog': 'train.log', 'errlog': 'train.log'}], 'resources': {'number_node': 1, 'cpu_per_node': 4, 'gpu_per_node': 0, 'queue_name': '', 'group_size': 1, 'custom_flags': [], 'strategy': {'if_cuda_multi_devices': False, 'ratio_unfinished': 0.0}, 'para_deg': 1, 'module_purge': False, 'module_unload_list': [], 'module_list': [], 'source_list': [], 'envs': {}, 'prepend_script': [], 'append_script': [], 'wait_time': 0, 'kwargs': {}}, 'job_state': <JobStatus.terminated: 4>, 'job_id': 3433, 'fail_count': 12}}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jiang/.local/bin/dpgen", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/main.py", line 233, in main
args.func(args)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 5109, in gen_run
run_iter(args.PARAM, args.MACHINE)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 4440, in run_iter
run_train(ii, jdata, mdata)
File "/home/jiang/.local/lib/python3.11/site-packages/dpgen/generator/run.py", line 776, in run_train
submission.run_submission()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 229, in run_submission
self.handle_unexpected_submission_state()
File "/home/jiang/.local/lib/python3.11/site-packages/dpdispatcher/submission.py", line 355, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/jiang/work/dpgen_example/run/nnwork/94d399c52e275fe437f6efdc2cde28256d05a671.
Debug information: submission_hash==94d399c52e275fe437f6efdc2cde28256d05a671.
Please check the dirs and scripts in remote_root. The job information mentioned above may help.
Following the hint above, I checked the train.log file and found the message 'dp: no vocab file specified'.
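For anyone else hitting this error, here is a small sketch for pulling the two "Debug information" values out of the error text so you know which directory to inspect. The function name `parse_debug_info` is hypothetical. The guess that train.log sits under remote_root/002 is an assumption based on the 'task_work_path' and 'errlog' fields in the job_detail above; it may differ on your machine.

```python
import re
from pathlib import PurePosixPath


def parse_debug_info(error_text: str) -> dict:
    """Collect the key==value pairs from the "Debug information" lines."""
    pattern = re.compile(
        r"Debug information:\s*(\w+)==(\S+?)\.?\s*$",  # drop the trailing period
        re.MULTILINE,
    )
    return {key: value for key, value in pattern.findall(error_text)}


# Error text copied from the traceback in this thread.
error_text = """\
Debug information: remote_root==/home/jiang/work/dpgen_example/run/nnwork/94d399c52e275fe437f6efdc2cde28256d05a671.
Debug information: submission_hash==94d399c52e275fe437f6efdc2cde28256d05a671.
"""

info = parse_debug_info(error_text)

# Assumption: the task's log file lives at <remote_root>/<task_work_path>/train.log.
log_path = PurePosixPath(info["remote_root"]) / "002" / "train.log"
print(info["submission_hash"])
print(log_path)
```

Reading that train.log directly is usually quicker than digging through the traceback, since the traceback only reports that the job failed 12 times, not why.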