Confusion about ~/.nv/ComputeCache behavior with docker #272
First, I am not familiar with Docker and I do not use it personally.
It is related to CUDA, not LuaJIT.
So I guess that if you build Torch7 (cutorch and cunn) with binaries generated for the Volta architecture,
I guess this will work. waifu2x.udp.jp uses an AMI without Docker. (However, it takes about 30 seconds at the first execution.)
I was also using the AMI without Docker and things were working properly, but when I added Docker the initial execution took 10 minutes (as opposed to 30 seconds on the bare AMI), so it might just be a simple Docker integration issue. The specific hangup is that importing cudnn takes 10 minutes: there's a line in cudnn that tries to configure the GPUs and struggles with Volta. I'm also new to Docker, so the caching issue might be a red herring, but I'm still working through it; it seems plausible. You had mentioned that caching might have been the issue here (I hadn't realized it was you that pointed me here :) ): soumith/cudnn.torch#385
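One way to check whether the slow start is driver-side PTX JIT compilation (rather than Docker itself) is to inspect which GPU architectures a CUDA library actually embeds; `cuobjdump` from the CUDA toolkit can list the embedded cubins. A diagnostic sketch, assuming a CUDA toolkit is installed; the library path is illustrative:

```shell
# List the SASS (cubin) architectures embedded in a Torch CUDA library.
# The path below is illustrative; adjust it to your Torch install.
cuobjdump --list-elf /root/torch/install/lib/libTHC.so

# If no sm_70 entry appears and the GPU is a V100 (sm_70), the driver
# must JIT-compile PTX at first run, which is exactly what
# ~/.nv/ComputeCache caches.
```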
OK, I will try to build a cuda-torch:10.1 image and test it on a p3 instance.
I don't believe cudnn has CUDA 10 bindings; I was seeing this behavior with CUDA 9 and cudnn 7.1. Here's the issue, plus the Dockerfile showing how I was building it:
I have built a Docker image; I changed it to generate binaries for sm_70 (Volta) and sm_75 at docker build time.
Dockerfile for torch7: https://github.com/nagadomi/distro/blob/cuda10/Dockerfile
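Generating real SASS binaries for the target architectures at build time is what removes the first-run JIT delay: the driver only JIT-compiles PTX when no cubin matches the GPU. As a sketch of the underlying compiler flags (the exact mechanism in the linked Dockerfile may differ; for example, cutorch's CMake honors a `TORCH_CUDA_ARCH_LIST` environment variable rather than raw nvcc flags):

```shell
# Emit real SASS for Volta (sm_70) and Turing (sm_75), plus PTX for
# forward compatibility with newer GPUs. With a matching cubin embedded,
# the driver skips JIT compilation entirely on a V100.
nvcc -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_75,code=compute_75 \
     kernel.cu -o kernel
```

The trade-off is a larger fat binary and longer build times, paid once at image build rather than at every fresh container start.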
OK, it works.
Thanks for the incredibly quick response and guidance. I notice that the waifu2x Dockerfile skips the soumith cudnn install/make step that is seen in the From
cudnn.torch is installed as part of the torch7 installation.
Can confirm that this strategy worked. I realized that I was originally using the Amazon Linux Deep Learning AMI instead of the Ubuntu Deep Learning AMI. It's very possible that the Amazon Linux distro simply doesn't work properly with NVIDIA, CUDA, or nvidia-docker; there have been reports of similar issues. Thanks for taking the time to help me through this, @nagadomi
As the README.md says:
Does this mean that, when using Docker, waifu2x will always run very slowly the first time it is executed on a host volume, and subsequent executions on the same host volume will be faster? Is LuaJIT compiling the program the first time it is used and then executing the compiled version on subsequent runs?
My specific use-case is that I'm executing the Docker image in the cloud (AWS EC2, p3.2xlarge instances using the Volta architecture). This means that the host volume changes frequently. So, if I spin up a new EC2 instance from an AMI that has never executed waifu2x before, will the first execution of the Docker image always be slow (even if I pass the ComputeCache path to Docker)? If so, I would generate the AMI after executing waifu2x so that the binary is already in the ComputeCache when the server is started, but that step is nontrivial in practice.
Are there additional steps I need to take to "prime" the host container with precompiled binaries/libraries for the Volta architecture that would make subsequent docker executions run more quickly? Is it possible to simply build waifu2x ahead of time, instead of relying on JIT?
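One way to "prime" containers, assuming the slowdown is the driver's PTX JIT cache, is to persist the host's ComputeCache into every container run, so the JIT cost is paid once per host (or baked into the AMI) rather than once per container. A sketch; the image name `waifu2x` and the paths are illustrative, and the environment variables are the standard CUDA cache controls:

```shell
# Bind-mount the host's JIT cache into the container so compiled kernels
# survive across container runs (and can be pre-baked into an AMI).
docker run --runtime=nvidia \
  -v "$HOME/.nv/ComputeCache:/root/.nv/ComputeCache" \
  -e CUDA_CACHE_MAXSIZE=2147483648 \
  waifu2x th waifu2x.lua ...

# CUDA_CACHE_PATH relocates the cache directory if needed;
# CUDA_CACHE_MAXSIZE raises the default size limit so large compiled
# fat binaries are not silently evicted between runs.
```

The more robust alternative, as discussed above, is to avoid JIT entirely by building the image with SASS binaries for the target architecture.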