Why does it show “No matching checkpoint file found”? #125

Open
Mango1218 opened this issue Sep 7, 2022 · 13 comments
Comments

@Mango1218

When I run "run_training.py", it shows "No matching checkpoint file found":
Restarting training from last epoch ...
No matching checkpoint file found
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
  File "/ai/lu/TransT-main/ltr/../ltr/trainers/base_trainer.py", line 70, in train
    self.train_epoch()
  File "/ai/lu/TransT-main/ltr/../ltr/trainers/ltr_trainer.py", line 79, in train_epoch
    self.cycle_dataset(loader)
  File "/ai/lu/TransT-main/ltr/../ltr/trainers/ltr_trainer.py", line 52, in cycle_dataset
    for i, data in enumerate(loader, 1):
  File "/root/anaconda3/envs/transt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/transt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/root/anaconda3/envs/transt/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/root/anaconda3/envs/transt/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/transt/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/anaconda3/envs/transt/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/transt/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/ai/lu/TransT-main/ltr/../ltr/data/sampler.py", line 92, in __getitem__
    dataset = random.choices(self.datasets, self.p_datasets)[0]
  File "/root/anaconda3/envs/transt/lib/python3.7/random.py", line 361, in choices
    raise ValueError('The number of weights does not match the population')
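
The last frame of the traceback shows where the crash actually comes from: `random.choices(self.datasets, self.p_datasets)` raises `ValueError` when the list of weights and the list of datasets differ in length. The "No matching checkpoint file found" line is printed before that and appears to be informational only. Below is a minimal sketch that reproduces the error outside the trainer and adds an explicit length check. The helper `check_sampling_weights` and the dataset names are hypothetical and not part of TransT.

```python
import random


def check_sampling_weights(datasets, p_datasets):
    """Raise a descriptive error if the weights cannot be paired with the datasets.

    Hypothetical helper for illustration; it is not part of the TransT code base.
    """
    if len(datasets) != len(p_datasets):
        raise ValueError(
            f"{len(datasets)} dataset(s) but {len(p_datasets)} weight(s); "
            "the two lists must have the same length"
        )


# Hypothetical stand-ins for the real dataset objects.
datasets = ["lasot", "got10k"]
p_datasets = [1, 1, 1, 1]  # four weights for only two datasets

try:
    # Same call shape as the failing line in sampler.py.
    random.choices(datasets, p_datasets)
except ValueError as err:
    mismatch_message = str(err)
    print(mismatch_message)
```

Calling a check like this before building the sampler would surface the mismatch at start-up rather than inside a DataLoader worker.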

@xiaofengBian

If you have never trained before, this checkpoint file will not exist; it stores the parameters from the previous training run.
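
To illustrate the point above, a trainer can look for a previous checkpoint and simply start from scratch when none exists. The following is a rough sketch, not TransT's own loader: the function `find_latest_checkpoint`, the `*.pth.tar` pattern, the `checkpoints` directory, and the assumption that file names carry zero-padded epoch numbers are all hypothetical.

```python
import glob
import os
from typing import Optional


def find_latest_checkpoint(checkpoint_dir: str, pattern: str = "*.pth.tar") -> Optional[str]:
    """Return the last checkpoint path by file name, or None if there is none.

    Assumes (hypothetically) zero-padded epoch numbers in the file names,
    so that sorting by name also sorts by epoch.
    """
    candidates = sorted(glob.glob(os.path.join(checkpoint_dir, pattern)))
    if not candidates:
        return None
    return candidates[-1]


checkpoint = find_latest_checkpoint("checkpoints")
if checkpoint is None:
    # Expected on a first run: nothing has been saved yet.
    print("No matching checkpoint file found; starting from epoch 1")
else:
    print(f"Resuming from {checkpoint}")
```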

@Mango1218
Author

I've fixed it, thanks!

@xiaofengBian

Can you train now?

@Mango1218
Author

> Can you train now?

Yes, it works now.

@xiaofengBian

Why do I get NaN as soon as I start training? Did you download all of the datasets?

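@Mango1218
Author

> Why do I get NaN as soon as I start training? Did you download all of the datasets?

I didn't use all of the datasets. They are too large, so I only used some of them.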
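
The thread does not record what the actual fix was. Given the `random.choices` error in the original post, one thing to check when training on only some of the datasets is that the sampling weights are trimmed together with the dataset list. A sketch under that assumption follows; `select_datasets` and the dataset labels are illustrative names only.

```python
def select_datasets(available, weights, keep):
    """Return the datasets named in `keep` together with their matching weights.

    Hypothetical helper: filtering names and weights as pairs guarantees that
    the two returned lists always have the same length.
    """
    pairs = [(name, w) for name, w in zip(available, weights) if name in keep]
    names = [name for name, _ in pairs]
    kept_weights = [w for _, w in pairs]
    return names, kept_weights


# Illustrative labels and weights, not taken from the thread.
available = ["lasot", "got10k", "coco", "trackingnet"]
weights = [1, 1, 1, 1]

names, kept_weights = select_datasets(available, weights, keep={"lasot", "got10k"})
print(names, kept_weights)
```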

@xiaofengBian

xiaofengBian commented Sep 17, 2022 via email

@xiaofengBian

Hi, could you send me yours? I'd like to try it, because mine keeps running into problems. Also, what hardware did you train on?

@xiaofengBian

My email: [email protected]

@Nirvana9808

> Can you train now?
>
> Yes, it works now.

Hi, could you send me your modified training code? I still can't get the training part to run. Thanks. My email: [email protected]

@wsasdsda

Hi, could you explain how you fixed it?

@Nuyoah1018

> Can you train now?
>
> Yes, it works now.

Hi, could you please explain how this problem was finally solved? My email: [email protected]

@universefall

> I've fixed it, thanks!

Hi, may I ask how you fixed it? Thanks.
