
What is the difference between the batch size and the chunk size set in the paper? #74

Open
plz583585760 opened this issue Jan 12, 2022 · 3 comments

Comments

@plz583585760

Thanks for the code. I am on Windows 10. I can set the batch size to 16 as in the paper, but the chunk size only runs when set to 1, and I have not found a fix yet. This is the only setting that differs, and it makes the model's accuracy far worse than reported. Could you explain what the chunk size does and how it affects accuracy, and how to get chunk_sizes = [16] running on Windows? Otherwise I will have to try a virtual machine. Thanks for your effort!

@eeyrw

eeyrw commented Jan 17, 2022

Do you run into an error like this?

training start...
  0%|                                   | 0/500000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File ".\train.py", line 233, in <module>
    train(training_dbs, validation_db, args.start_iter, args.freeze) # 0
  File ".\train.py", line 158, in train
    = nnet.train(iteration, save, viz_split, **training)
  File "C:\Users\yuan\Desktop\LSTR\nnet\py_factory.py", line 115, in train
    ys)
  File "C:\Users\yuan\Desktop\LSTR\myvenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\yuan\Desktop\LSTR\models\py_utils\data_parallel.py", line 66, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids, self.chunk_sizes)
  File "C:\Users\yuan\Desktop\LSTR\models\py_utils\data_parallel.py", line 77, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim, chunk_sizes=self.chunk_sizes)
  File "C:\Users\yuan\Desktop\LSTR\models\py_utils\scatter_gather.py", line 30, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim, chunk_sizes) if inputs else []
  File "C:\Users\yuan\Desktop\LSTR\models\py_utils\scatter_gather.py", line 25, in scatter
    return scatter_map(inputs)
  File "C:\Users\yuan\Desktop\LSTR\models\py_utils\scatter_gather.py", line 18, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "C:\Users\yuan\Desktop\LSTR\models\py_utils\scatter_gather.py", line 20, in scatter_map
    return list(map(list, zip(*map(scatter_map, obj))))
  File "C:\Users\yuan\Desktop\LSTR\models\py_utils\scatter_gather.py", line 15, in scatter_map
    return Scatter.apply(target_gpus, chunk_sizes, dim, obj)
  File "C:\Users\yuan\Desktop\LSTR\myvenv\lib\site-packages\torch\nn\parallel\_functions.py", line 96, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "C:\Users\yuan\Desktop\LSTR\myvenv\lib\site-packages\torch\nn\parallel\comm.py", line 189, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: start (0) + length (16) exceeds dimension size (1).
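To clarify the two settings: as I understand this codebase (and similar CornerNet-style repos), `batch_size` is the total number of samples per iteration, while `chunk_sizes` is a list saying how that batch is split across GPUs by the custom `DataParallel` scatter, so the entries should sum to `batch_size` (e.g. `[16]` for one GPU, `[8, 8]` for two). The `RuntimeError` above means the scatter asked for a chunk of 16 from a tensor whose batch dimension was only 1. A minimal pure-Python sketch of that splitting logic (`split_by_chunk_sizes` is a hypothetical stand-in, not the repo's actual function; a list stands in for the tensor's batch dimension):

```python
def split_by_chunk_sizes(batch, chunk_sizes):
    """Split `batch` into per-GPU chunks, mimicking the chunk_sizes
    logic of the custom DataParallel scatter. Raises when the chunks
    do not match the batch size (as in the RuntimeError above)."""
    if sum(chunk_sizes) != len(batch):
        raise RuntimeError(
            f"start (0) + length ({chunk_sizes[0]}) exceeds "
            f"dimension size ({len(batch)})."
        )
    chunks, start = [], 0
    for size in chunk_sizes:
        chunks.append(batch[start:start + size])
        start += size
    return chunks

# batch_size = 16 on one GPU: chunk_sizes must be [16]
print(split_by_chunk_sizes(list(range(16)), [16])[0][:3])  # prints [0, 1, 2]

# The Windows failure: the prefetch workers fell back to the default
# config and produced batches of size 1, while the main process still
# asked to scatter chunk_sizes = [16]:
# split_by_chunk_sizes([0], [16])  -> RuntimeError
```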

@Galaxy-Ding


Was this ever solved? I am running into the same problem.

@atubo1024

atubo1024 commented Dec 6, 2022

This happens because on Windows the child processes' system_configs differs from the main process's. On Linux, children created by fork share the parent's initialized state, but Windows (which uses spawn) does not, so inside the data-prefetching children system_configs still holds its default values, out of sync with the main process.
A fix: on Windows, pass the contents of system_configs to the child process as an argument, and have the child copy it back into system_configs during initialization, for example:

# train.py
import platform
# ...
def init_parallel_jobs(dbs, queue, fn):
    if platform.system().lower() == 'windows':
        # Windows (spawn): children do not inherit the parsed config,
        # so pass its contents along explicitly.
        tasks = [Process(target=prefetch_data, args=(db, queue, fn, system_configs._configs)) for db in dbs]
    else:
        # Linux (fork): children already share the parent's system_configs.
        tasks = [Process(target=prefetch_data, args=(db, queue, fn, None)) for db in dbs]
    # ...

def prefetch_data(db, queue, sample_data, system_configs_data):
    if platform.system().lower() == 'windows':
        # Restore the main process's config inside the spawned child.
        system_configs._configs = system_configs_data
    # ...
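The pattern can be sketched outside the LSTR codebase like this (here `SystemConfig` and the direct function call are stand-ins for the real config module and `Process`; actual multiprocessing is omitted to keep the sketch self-contained):

```python
import platform

class SystemConfig:
    """Stand-in for LSTR's system_configs module object."""
    def __init__(self):
        # Defaults a freshly imported (spawned) child would see.
        self._configs = {"batch_size": 1, "chunk_sizes": [1]}

# Main-process config, mutated after parsing the experiment file.
system_configs = SystemConfig()
system_configs._configs = {"batch_size": 16, "chunk_sizes": [16]}

def prefetch_data(db, queue, sample_data, system_configs_data):
    # A spawned Windows child re-imports the module, so its config is
    # back to defaults; restore it from the argument when provided.
    worker_configs = SystemConfig()  # what the child would start with
    if system_configs_data is not None:
        worker_configs._configs = system_configs_data
    return worker_configs._configs["chunk_sizes"]

# On Windows, pass the parsed config explicitly; fork would share it.
payload = (system_configs._configs
           if platform.system().lower() == "windows" else None)
# A direct call stands in for Process(target=prefetch_data, ...).
print(prefetch_data(None, None, None, system_configs._configs))  # prints [16]
```

Without the fourth argument the child keeps `chunk_sizes = [1]`, which is exactly the mismatch that produced the `RuntimeError` above.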
