A complete installation and usage guide #1

Open
winston-li opened this issue Nov 1, 2021 · 23 comments

Comments

@winston-li

Hi,
I read your great paper and was excited to give it a try. I followed README.md and executed "hydra_train.sh". However, it fails with a ModuleNotFoundError: "No module named 'cyy_torch_cpp_extension.data_structure'". It looks like it needs the module "torch_cpp_extension", which in turn requires building CyyAlgorithmLib (repository cyyever/algorithm)? I got stuck and couldn't build it (fatal error: #include in algorithm/src/alphabet/alphabet.hpp). Did I misunderstand a step, or are the installation steps out of date?

Thanks.

@poppopbean0903

Hi,

Have you solved the above problem? I hit a similar error when running hydra_train.py. It fails with "ImportError: cannot import name 'SyncedTensorDict' from 'cyy_torch_algorithm.data_structure.synced_tensor_dict' (/home/amax/.local/lib/python3.11/site-packages/cyy_torch_algorithm/data_structure/synced_tensor_dict.py)".

Thanks.

@cyyever
Owner

cyyever commented Nov 27, 2023

@poppopbean0903 You need to build a PyTorch extension as follows:

git clone --recursive git@github.com:cyyever/torch_cpp_extension.git
cd torch_cpp_extension
mkdir build && cd build
cmake -DBUILD_SHARED_LIBS=on ..
sudo make install
env cmake_build_dir=build python3 setup.py install --user

@poppopbean0903

Sorry to bother you.

I updated CMake to version 3.28. When I run "cmake -DBUILD_SHARED_LIBS=on ..", it keeps reporting errors about missing packages such as fmt, doctest, and spdlog, and I have to install them one by one. Did I miss a step beyond the ones you listed above, which would explain the endless missing dependencies?

I know very little about CMake and can't find the exact cause, so I'm asking you for more technical guidance. Many thanks.

The exact error looks like this:
"CMake Error at python_binding/CMakeLists.txt:2 (find_package):
Could not find a package configuration file provided by "pybind11" with any
of the following names:

pybind11Config.cmake
pybind11-config.cmake

"

@cyyever
Owner

cyyever commented Nov 28, 2023

@poppopbean0903 I submitted some fixes to disable building tests by default. You can git pull the new code and re-build.

@poppopbean0903

The new version still reports errors like "Could not find a package configuration file provided by "spdlog" with any of the following names"; dependencies such as pybind11 are still missing.

The error seems to arise at "include(cmake/all.cmake)". Could these errors be related to my conda environment or to my CMake version? Many thanks!

The complete error is:
"
-- Could NOT find clang-tidy (missing: clang-tidy_BINARY)
-- Could NOT find run-clang-tidy (missing: run-clang-tidy_BINARY)
-- Could NOT find iwyu_tool (missing: iwyu_tool_BINARY)
CMake Warning at cmake/build_cache.cmake:9 (message):
no ccache found
Call Stack (most recent call first):
cmake/all.cmake:23 (include)
CMakeLists.txt:7 (include)

-- Caffe2: CUDA detected: 10.0
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 10.0
-- Found cuDNN: v7.6.5 (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn.so)
-- Autodetected CUDA architecture(s): 7.5 7.5 7.5
-- Added CUDA NVCC flags for: -gencode;arch=compute_75,code=sm_75
-- Build spdlog: 1.12.0
-- Build type: Debug
CMake Error at python_binding/CMakeLists.txt:2 (find_package):
Could not find a package configuration file provided by "pybind11" with any
of the following names:

pybind11Config.cmake
pybind11-config.cmake

Add the installation prefix of "pybind11" to CMAKE_PREFIX_PATH or set
"pybind11_DIR" to a directory containing one of the above files. If
"pybind11" provides a separate development package or SDK, be sure it has
been installed.

-- Configuring incomplete, errors occurred! "
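
The CMake message above suggests pointing "pybind11_DIR" at the directory that holds pybind11Config.cmake. As a minimal sketch, assuming pybind11 was installed with pip into the same Python environment (the thread does not confirm this), pybind11's own get_cmake_dir() helper can supply that path. The function names find_pybind11_cmake_dir and configure_command are hypothetical and used only for illustration; this is not the fix the maintainer later applied.

```python
from typing import List, Optional


def find_pybind11_cmake_dir() -> Optional[str]:
    """Return pybind11's CMake config directory, or None if pybind11 is not importable."""
    try:
        import pybind11  # assumption: installed via `pip install pybind11`
    except ImportError:
        return None
    return pybind11.get_cmake_dir()


def configure_command(source_dir: str, pybind11_dir: Optional[str] = None) -> List[str]:
    """Build the configure command from the thread, optionally adding pybind11_DIR."""
    cmd = ["cmake", "-DBUILD_SHARED_LIBS=on"]
    if pybind11_dir:
        # CMake's find_package(pybind11) honors this cache variable.
        cmd.append(f"-Dpybind11_DIR={pybind11_dir}")
    cmd.append(source_dir)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is built by accident.
    print(" ".join(configure_command("..", find_pybind11_cmake_dir())))
```

Run from inside the build directory, this prints a configure command that can be pasted into the shell. If pybind11 is not importable, the command is the same as the one in the original steps.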

@cyyever
Owner

cyyever commented Nov 28, 2023

@poppopbean0903 I see. I am fixing it.

@cyyever
Owner

cyyever commented Nov 28, 2023

@poppopbean0903 I added the missing pybind11 as a git sub-module. The easiest way to build is to remove the old package and follow the new steps:

git clone --recursive git@github.com:cyyever/torch_cpp_extension.git
cd torch_cpp_extension    
mkdir build && cd build    
cmake -DBUILD_SHARED_LIBS=on ..    
cmake --build . --config release    
cd ..    
env cmake_build_dir=build python3 setup.py install --user    

@poppopbean0903

Thanks a lot, the above issue is solved. But another problem has come up: CMake seems to create 'torch_library' twice, although I didn't create it explicitly and I have cleaned the build directory.
Many thanks!

"CMake Error at /home/pami/anaconda3/lib/python3.6/site-packages/torch/share/cmake/Caffe2/public/utils.cmake:40 (add_library):
add_library cannot create target "torch_library" because another target
with the same name already exists. The existing target is an interface
library created in source directory "/home/DiskA/torch_cpp_extension".
See documentation for policy CMP0002 for more details."

@cyyever
Owner

cyyever commented Nov 28, 2023

@poppopbean0903 You need Python 3.11 and torch >= 2.1 for this to work. This specific PyTorch error comes from an older version that I didn't test.
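
The version requirement above can be checked before starting a build. The following is a rough sketch, not part of the project: it assumes "python3.11" means Python 3.11 or newer (the comment does not say whether newer versions work), and the helper names parse_version and meets_requirements are hypothetical.

```python
import re
import sys
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.1.0+cu121'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_requirements(python_version: Tuple[int, int], torch_version: str) -> bool:
    """True when Python is at least 3.11 and torch is at least 2.1 (assumed lower bounds)."""
    return python_version >= (3, 11) and parse_version(torch_version) >= (2, 1)


if __name__ == "__main__":
    try:
        import torch
        torch_version = torch.__version__
    except ImportError:
        torch_version = "0.0"  # treat a missing torch as failing the check
    print(meets_requirements(sys.version_info[:2], torch_version))
```

Comparing integer tuples rather than strings avoids the classic mistake of ranking "2.10" below "2.9".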

@poppopbean0903

Thanks a lot, I think that may be the crux. I tried updating torch to 2.0 but failed because of my older CUDA version, 10.0. Sorry to bother you with such a technical problem, but is it possible to run your code with an older version? Updating CUDA on the server has caused serious problems before, so I would rather avoid it if there is another solution. Or is it possible to install torch >= 2.1 with CUDA 10.0? (As far as I know, it is not.) I'm very sorry to bother you with this kind of problem, but I really want to get your code working. Thank you very much.

@cyyever
Owner

cyyever commented Nov 30, 2023

@poppopbean0903 Why not try it in a CUDA Docker container? Indeed, the code relies heavily on new APIs in the latest PyTorch for better performance. I will build a Docker image for your convenience.

@poppopbean0903

Thanks a lot, I've installed python=3.11 and torch=2.1. Sorry to keep disturbing you, but I still run into the following problem. It seems some dependencies are missing, and it may be related to the CMake version. My CMake version is 3.24.1, which is higher than the required 3.20?

”CMake Error at cmake/all.cmake:1 (cmake_policy):
An attempt was made to set the policy version of CMake to "3.25.0" which is
greater than this version of CMake. This is not allowed because the
greater version may have new policies not known to this CMake. You may
need a newer CMake version to build this project.
Call Stack (most recent call first):
CMakeLists.txt:7 (include)

-- Could NOT find clang-tidy (missing: clang-tidy_BINARY)
-- Could NOT find clang-apply-replacements (missing: clang-apply-replacements_BINARY)
-- Could NOT find run-clang-tidy (missing: run-clang-tidy_BINARY)
-- Could NOT find iwyu_tool (missing: iwyu_tool_BINARY)
CMake Warning at cmake/build_cache.cmake:9 (message):
no ccache found
Call Stack (most recent call first):
cmake/all.cmake:23 (include)
CMakeLists.txt:7 (include)

CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.24/Modules/CMakeDetermineCUDACompiler.cmake:277 (message):
CMAKE_CUDA_ARCHITECTURES must be non-empty if set.
Call Stack (most recent call first):
/root/anaconda3/envs/hydra/lib/python3.11/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:47 (enable_language)
/root/anaconda3/envs/hydra/lib/python3.11/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:87 (include)
/root/anaconda3/envs/hydra/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:25 (find_package)
-- Configuring incomplete, errors occurred! ”

@cyyever
Owner

cyyever commented Nov 30, 2023

The log shows that Python 3.8 was used (the CMake in the trace comes from /usr/local/lib/python3.8/dist-packages). You can try to remove the CMake version requirement by locating it with

grep 'cmake_policy(VERSION 3.25.0)' -r third_party cmake

and removing the matching lines. But I think it is better to use the Docker image, which I will deliver soon.
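
The grep-and-remove step above can also be scripted. This is a minimal sketch under stated assumptions: it deletes whole lines containing the exact string from the grep command, the helper names strip_policy_pin and patch_tree are hypothetical, and it rewrites files in place, so it is safest on a clean git checkout where the changes can be reverted.

```python
from pathlib import Path
from typing import Iterable, List

# The exact string searched for by the grep command in the thread.
PIN = "cmake_policy(VERSION 3.25.0)"


def strip_policy_pin(text: str, pin: str = PIN) -> str:
    """Return text without the lines that contain the policy pin."""
    kept = [line for line in text.splitlines(keepends=True) if pin not in line]
    return "".join(kept)


def patch_tree(roots: Iterable[str], pin: str = PIN) -> List[Path]:
    """Rewrite every file under roots that contains the pin; return the changed paths."""
    changed = []
    for root in roots:
        for path in Path(root).rglob("*"):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            if pin in text:
                path.write_text(strip_policy_pin(text, pin), encoding="utf-8")
                changed.append(path)
    return changed
```

Calling patch_tree(["third_party", "cmake"]) from the repository root would cover the same two directories as the grep command.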

@poppopbean0903

Oh, that's great! Looking forward to your Docker image, many thanks!

@cyyever
Owner

cyyever commented Nov 30, 2023

@poppopbean0903
If you want to use CUDA, make sure that the host NVIDIA driver is >= 545.29.06 and that edge Docker is configured with the CUDA runtime; see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

docker pull cyyever/aaai_hydra:latest
sudo docker run --gpus all  -it --rm  aaai_hydra:latest bash

If setting up the CUDA runtime turns out to be too hard, it is fine to try CPU training and just use

docker pull cyyever/aaai_hydra:latest
sudo docker run -it --rm  aaai_hydra:latest bash

My code will detect that CUDA is unavailable and fall back to the CPU (this is practical if you have a powerful CPU).

Either way, once you are inside the Docker container, try

cd  /root/aaai_hydra
env PYTHONPATH=/root/opt/python/lib/python3.11/site-packages /root/opt/python/bin/python3   lean_hydra_train.py --config-name mnist.yaml
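
The CPU fallback mentioned above follows a common PyTorch pattern. The sketch below shows the general idea and is not taken from the Hydra code; detect_cuda and pick_device_name are hypothetical helper names, and torch is imported lazily so the snippet also runs where torch is not installed.

```python
def detect_cuda() -> bool:
    """Report whether torch can see a CUDA device; False if torch is not installed."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def pick_device_name(cuda_available: bool, index: int = 0) -> str:
    """Choose a device string suitable for torch.device()."""
    return f"cuda:{index}" if cuda_available else "cpu"


if __name__ == "__main__":
    print(pick_device_name(detect_cuda()))
```

Inside a container started without "--gpus all", torch.cuda.is_available() is expected to return False, so the chosen device would be "cpu".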

@poppopbean0903

poppopbean0903 commented Dec 2, 2023

Thank you very much! But after successfully pulling the image, I got this error from the docker run command:

"Unable to find image 'aaai_hydra:latest' locally
docker: Error response from daemon: pull access denied for aaai_hydra, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.

The first pull printed: Status: Downloaded newer image for cyyever/aaai_hydra:latest
docker.io/cyyever/aaai_hydra:latest.

After hitting the error, I checked that the image exists by running docker pull again, which printed:
"latest: Pulling from cyyever/aaai_hydra
Digest:sha256:033e84fb07e447eb8aec80092827f673b7c48df5760e9f5abe5acf58065d3a11
Status: Image is up to date for cyyever/aaai_hydra:latest
docker.io/cyyever/aaai_hydra:latest".

So it seems the image was pulled successfully. But when I try docker run, it looks like I have no permission to access it?

@cyyever
Owner

cyyever commented Dec 2, 2023

Use 'sudo docker image list' to find out the right image name (the pull output above shows cyyever/aaai_hydra:latest). I can't help much here; you should be familiar with Docker operations.

@poppopbean0903

poppopbean0903 commented Dec 4, 2023

Many thanks! I've successfully run the code. But there are a few errors that seem related to your library:

  1. When I run hydra_train.py with use hessian = True, an error occurs on line 263 of hydra_hook.py: 'dict' object has no attribute 'cpu'. The original code is test_gradient = test_gradient.cpu().

  2. After commenting out that line, it reports "'cyy_torch_cpp_extension.data_structure.SyncedTenso' object has no attribute 'tensor_dict'" at line 277. The original code is tensor_dict.tensor_dict.flush(True).

  3. It often warns "found inf in AMP, scale is tensor(65536., device='cuda:0')". What does "AMP" mean here?

  4. What does "lean" in lean_hydra_train stand for? Sorry, I can't find the corresponding part in the paper.
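
The first error in the list above says that test_gradient is a dict rather than a single tensor, so calling .cpu() on it directly fails. As an illustration only, and not the maintainer's actual fix, a dict of tensors can be moved to the CPU value by value. The helper name to_cpu is hypothetical, and the code is duck-typed: it works on any values that provide a .cpu() method, such as torch tensors.

```python
from typing import Any


def to_cpu(obj: Any) -> Any:
    """Recursively call .cpu() on tensor-like values stored in (nested) dicts."""
    if isinstance(obj, dict):
        return {key: to_cpu(value) for key, value in obj.items()}
    return obj.cpu()
```

With this helper, a line such as test_gradient = to_cpu(test_gradient) would accept either a tensor or a dict of tensors. Whether that matches the intended data layout in hydra_hook.py is an assumption.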

The following are some questions about how to use your code correctly. Any advice would be very helpful and would save me a lot of time ^-^, but feel free to ignore them.

First, I want to save the hypergradients of each sample. Does the tensor_dict variable at line 270 of hydra_hook.py hold all the hypergradients? It seems to be empty when I run hydra_train.py. Also, given its special type, is there a recommended way to save these hypergradients accurately: torch.save, joblib, or something else?

Second, I need to relate each hypergradient to its corresponding data for later use, rather than having only hypergradients with an index and no knowledge of the underlying sample. Is the index fixed every time I load the data? If so, will loading the data with the same steps be enough?

Thank you so much for your assistance and your time.

@cyyever
Owner

cyyever commented Dec 4, 2023

@poppopbean0903 1 and 2 are due to recent code refactoring and I will fix them soon. For 3, see https://pytorch.org/docs/stable/amp.html; AMP (automatic mixed precision) is a way to accelerate training. 4 is our optimization to speed up hyper-gradient computation; it was not mentioned in the paper.
I will check the results to ensure that the resulting dict contains the influence values of samples.
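
Some background on the "found inf in AMP" warning from question 3: mixed-precision training multiplies the loss by a scale factor so that small float16 gradients do not underflow. PyTorch's GradScaler skips the optimizer step and lowers the scale when a gradient contains inf or NaN, and raises it again after a run of clean steps. The value 65536 in the warning matches GradScaler's default initial scale. The toy class below only mimics that update rule with GradScaler's documented defaults; ToyGradScaler is a hypothetical name, and this is not the library's implementation or the Hydra code.

```python
from dataclasses import dataclass


@dataclass
class ToyGradScaler:
    scale: float = 65536.0        # 2**16, GradScaler's default init_scale
    growth_factor: float = 2.0    # multiply the scale by this after enough clean steps
    backoff_factor: float = 0.5   # multiply the scale by this when inf/NaN is found
    growth_interval: int = 2000   # clean steps required before the scale grows
    _good_steps: int = 0

    def update(self, found_inf: bool) -> bool:
        """Adjust the scale after one iteration; return True if the step is applied."""
        if found_inf:
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False  # the optimizer step is skipped for this iteration
        self._good_steps += 1
        if self._good_steps == self.growth_interval:
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True
```

Under this rule, an occasional warning early in training is usually harmless, because the scale settles at a value that no longer overflows. Warnings that persist as the scale keeps shrinking would point to a real numerical problem.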

@cyyever
Owner

cyyever commented Dec 5, 2023

@poppopbean0903 I pushed the latest image with all the fixes.

@poppopbean0903

Thanks! But when I run your code with MNIST, it reports "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!" at line 235 of cyy_torch_xai/hydra/hydra_hook.py during the second epoch. It is called from line 66 of hydra_sgd_hook.py. When I checked the devices of the instance gradient and the hypergradient, both turned out to be None.

@cyyever
Owner

cyyever commented Dec 8, 2023

No worries, I noticed the error and will push a new image immediately.

@cyyever
Owner

cyyever commented Dec 8, 2023

@poppopbean0903 Updated
