Your GCP script #2

Open
IEWbgfnYDwHRoRRSKtkdyMDUzgdwuBYgDKtDJWd opened this issue Apr 24, 2020 · 4 comments

Comments

@IEWbgfnYDwHRoRRSKtkdyMDUzgdwuBYgDKtDJWd

First of all, many thanks and appreciation for your repo. Your script for GCP setup worked like a charm.

The only issue is that when I try to train a new model using a custom dataset, it errors out about 20 minutes after the first tick. The initial sample fake outputs also appear to be human faces (my dataset isn't faces). I'm unsure whether this is normal or whether I am doing something wrong.

@dvschultz
Owner

Can you post the error you're getting?

Seeing faces for your first fakes is correct. This is using transfer learning (you can look it up on my YouTube page and learn more about the technique there). Those faces get erased completely after a handful of ticks.

@ahmedshingaly

Thank you very much for the great tutorial.
My GPU is running out of memory and failing at the beginning. I would appreciate it if you could take a look at the error below.

Here is my error:
`2020-05-20 11:18:44.904340: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 36.00MiB (rounded to 37748736). Current allocation summary follows.
2020-05-20 11:18:44.914558: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **********************************************************************************************xx
2020-05-20 11:18:44.919358: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at transpose_op.cc:199 : Resource exhausted: OOM when allocating tensor with shape[3,3,512,4,512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1356, in _do_call
return fn(*args)
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,3,3,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node GPU0/G_loss/G/G_synthesis/128x128/Conv1/Square}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run_training.py", line 192, in <module>
main()
File "run_training.py", line 187, in main
run(**vars(args))
File "run_training.py", line 120, in run
dnnlib.submit_run(**kwargs)
File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\submit.py", line 343, in submit_run
return farm.submit(submit_config, host_run_dir)
File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\internal\local.py", line 22, in submit
return run_wrapper(submit_config)
File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\submit.py", line 280, in run_wrapper
run_func_obj(**submit_config.run_func_kwargs)
File "C:\Users\USER6459\Documents\python\stylegan2\training\training_loop.py", line 299, in training_loop
tflib.run(G_train_op, feed_dict)
File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\tflib\tfutil.py", line 31, in run
return tf.get_default_session().run(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1350, in _do_run
run_metadata)
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[4,3,3,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node GPU0/G_loss/G/G_synthesis/128x128/Conv1/Square (defined at C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py:104) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Errors may have originated from an input operation.
Input Source operations connected to node GPU0/G_loss/G/G_synthesis/128x128/Conv1/Square:
GPU0/G_loss/G/G_synthesis/128x128/Conv1/mul_3 (defined at C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py:100)

Original stack trace for 'GPU0/G_loss/G/G_synthesis/128x128/Conv1/Square':
File "run_training.py", line 192, in <module>
main()
File "run_training.py", line 187, in main
run(**vars(args))
File "run_training.py", line 120, in run
dnnlib.submit_run(**kwargs)
File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\submit.py", line 343, in submit_run
return farm.submit(submit_config, host_run_dir)
File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\internal\local.py", line 22, in submit
return run_wrapper(submit_config)
File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\submission\submit.py", line 280, in run_wrapper
run_func_obj(**submit_config.run_func_kwargs)
File "C:\Users\USER6459\Documents\python\stylegan2\training\training_loop.py", line 220, in training_loop
G_loss, G_reg = dnnlib.util.call_func_by_name(G=G_gpu, D=D_gpu, opt=G_opt, training_set=training_set, minibatch_size=minibatch_gpu_in, **G_loss_args)
File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\util.py", line 256, in call_func_by_name
return func_obj(*args, **kwargs)
File "C:\Users\USER6459\Documents\python\stylegan2\training\loss.py", line 152, in G_logistic_ns_pathreg
fake_images_out, fake_dlatents_out = G.get_output_for(latents, labels, is_training=True, return_dlatents=True)
File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\tflib\network.py", line 221, in get_output_for
out_expr = self._build_func(*final_inputs, **build_kwargs)
File "C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py", line 238, in G_main
images_out = components.synthesis.get_output_for(dlatents, is_training=is_training, force_clean_graph=is_template_graph, **kwargs)
File "C:\Users\USER6459\Documents\python\stylegan2\dnnlib\tflib\network.py", line 221, in get_output_for
out_expr = self._build_func(*final_inputs, **build_kwargs)
File "C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py", line 498, in G_synthesis_stylegan2
x = block(x, res)
File "C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py", line 470, in block
x = layer(x, layer_idx=res*2-4, fmaps=nf(res-1), kernel=3)
File "C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py", line 455, in layer
x = modulated_conv2d_layer(x, dlatents_in[:, layer_idx], fmaps=fmaps, kernel=kernel, up=up, resample_kernel=resample_kernel, fused_modconv=fused_modconv)
File "C:\Users\USER6459\Documents\python\stylegan2\training\networks_stylegan2.py", line 104, in modulated_conv2d_layer
d = tf.rsqrt(tf.reduce_sum(tf.square(ww), axis=[1,2,3]) + 1e-8) # [BO] Scaling factor.
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 10698, in square
"Square", x=x, name=name)
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 3616, in create_op
op_def=op_def)
File "C:\ProgramData\Anaconda3\envs\old_tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()`
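
For readers following the log above, the sizes of the two tensors named in the OOM messages can be checked with a little arithmetic. This is only a sketch: it assumes "type float" means 4-byte float32, and `tensor_bytes` is a hypothetical helper, not part of StyleGAN2 or TensorFlow.

```python
from math import prod

def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor (4 bytes per element assumes float32)."""
    return prod(shape) * bytes_per_element

# Shapes copied from the two OOM messages in the log.
transpose_tensor = tensor_bytes([3, 3, 512, 4, 512])
square_tensor = tensor_bytes([4, 3, 3, 128, 128])

print(transpose_tensor, transpose_tensor / 2**20)  # bytes, MiB
print(square_tensor, square_tensor / 2**20)        # bytes, MiB
```

The first shape works out to 37748736 bytes (36 MiB), which matches the allocator message. The second is only about 2.25 MiB. Both are small next to 11 GB of GPU memory, which suggests the card was already close to full when these requests arrived rather than any single tensor being too large.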

@dvschultz
Owner

Can you tell me what GPU you're using, the CUDA version you're running, and which NVIDIA driver is installed?
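
One way to collect most of those details is to query `nvidia-smi` from Python. This is a minimal sketch, assuming `nvidia-smi` is on the PATH. `parse_gpu_rows` and `query_gpus` are hypothetical helper names, and the CUDA toolkit version is not covered here (it is usually reported by `nvcc --version`).

```python
import shutil
import subprocess

# Fields requested from nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["name", "driver_version", "memory.total", "memory.used"]

def parse_gpu_rows(csv_text):
    """Turn nvidia-smi CSV output (no header) into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) == len(QUERY_FIELDS):
            rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows

def query_gpus():
    """Return GPU info dicts, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(QUERY_FIELDS),
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_rows(result.stdout)

for gpu in query_gpus():
    print(gpu)
```

Checking `memory.used` before training starts also shows whether another process is already holding GPU memory.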

@ahmedshingaly

ahmedshingaly commented May 29, 2020

I use a workstation with one RTX 2080 Ti (11 GB) and 128 GB of RAM.
The CUDA version is 10.0.
Even if I reduce the batch size, it still fails for lack of memory.
My dataset is 500 pictures at 1024x1024.
I tried bigger and smaller datasets, and tried both PNG and JPG.
I tried all available configs (a, b, c, d, e, f).
I tried running it on another computer with a GTX 1060, and CUDA fails there too.
I tried a 512x512 image dataset.
I tried a 256x256 image dataset.
< ERROR CUDA RUN OUT OF MEMORY >
Please note that I am trying to build my own models of building shapes, not human faces.
Thank you in advance.


3 participants