A complete installation and usage guide #1
Comments
Hi, have you solved the above problem? I met a similar error when running hydra_train.py; it reports "ImportError: cannot import name 'SyncedTensorDict' from 'cyy_torch_algorithm.data_structure.synced_tensor_dict' (/home/amax/.local/lib/python3.11/site-packages/cyy_torch_algorithm/data_structure/synced_tensor_dict.py)". Thanks. |
@poppopbean0903 You need to build a PyTorch extension as follows:
|
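The concrete build commands referenced above were not captured in this thread. As a quick sanity check after the extension is built (a minimal, hypothetical snippet, not the author's actual build steps), the import that originally failed can simply be retried:

```python
# Hypothetical post-build check: if the extension was built and installed
# correctly, this import should no longer raise ImportError.
from cyy_torch_algorithm.data_structure.synced_tensor_dict import SyncedTensorDict

print("SyncedTensorDict imported from", SyncedTensorDict.__module__)
```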
Sorry to bother you. I updated my CMake to version 3.28, but when I run "cmake -DBUILD_SHARED_LIBS=on ..", it keeps reporting errors about missing packages such as fmt, doctest, and spdlog, and I have to install them one by one. I'm wondering whether I missed any step beyond the ones you offered above, resulting in endless missing dependencies. I'm sorry, I have little knowledge of CMake and can't find the exact reason, so I'm asking you for more technical guidance. Many thanks. The exact error looks like:
" |
@poppopbean0903 I submitted some fixes to disable building tests by default. You can git pull the new code and re-build. |
The new version still reports errors like "Could not find a package configuration file provided by "spdlog" with any of the following names", and complains about missing dependencies such as pybind11. It seems the error arises at "include(cmake/all.cmake)". I'm wondering whether these errors are related to my conda environment, or to my CMake version? Many thanks! The complete error includes: -- Caffe2: CUDA detected: 10.0
... Add the installation prefix of "pybind11" to CMAKE_PREFIX_PATH or set ... -- Configuring incomplete, errors occurred! " |
@poppopbean0903 I see. I am fixing it. |
@poppopbean0903 I added the missing pybind11 as a git sub-module. The easiest way to build is to remove the old package and follow the new steps:
|
Thanks a lot, the above issue is solved. But another problem arose: it seems 'torch_library' is being created repeatedly, although I didn't create it explicitly and I have cleaned the build directory. "CMake Error at /home/pami/anaconda3/lib/python3.6/site-packages/torch/share/cmake/Caffe2/public/utils.cmake:40 (add_library): |
@poppopbean0903 You need Python 3.11 and torch >= 2.1 for this to work. This specific PyTorch error came from an older version that I didn't test. |
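A quick way to verify that an environment meets these requirements (a generic check using standard Python and torch introspection, not part of the project itself):

```python
import sys

import torch

# The thread states Python 3.11 and torch >= 2.1 are required.
assert sys.version_info >= (3, 11), f"Python 3.11+ required, found {sys.version.split()[0]}"
assert tuple(int(x) for x in torch.__version__.split(".")[:2]) >= (2, 1), (
    f"torch >= 2.1 required, found {torch.__version__}"
)
print("Environment looks compatible.")
```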
Thanks a lot, I think that may be the crux. I've tried updating my torch version to 2.0 but failed because of my older CUDA version, 10.0. Sorry to bother you with such a technical problem, but I'm wondering whether it is possible to run your code with an older version? Updating the CUDA version on the server caused some serious problems before, so I'd prefer not to ask for trouble if there is another solution. Or is it possible to install torch >= 2.1 with CUDA 10.0? (As far as I know, it is impossible.) I'm very sorry to bother you with this kind of problem, but I really want to work through your code. Thank you very much. |
@poppopbean0903 Why not try it in a CUDA Docker container? Indeed, the code relies heavily on new APIs in the latest PyTorch for better performance. I will build a Docker image for your convenience. |
Thanks a lot, I've installed Python 3.11 and torch 2.1. Sorry to keep disturbing you, but I still run into the following problem. It seems some dependencies are missing, and it may be related to the CMake version. My CMake version is 3.24.1, which is higher than the required 3.20? "CMake Error at cmake/all.cmake:1 (cmake_policy): -- Could NOT find clang-tidy (missing: clang-tidy_BINARY) CMake Error at /usr/local/lib/python3.8/dist-packages/cmake/data/share/cmake-3.24/Modules/CMakeDetermineCUDACompiler.cmake:277 (message): |
The log shows that python3.8 was used. You can try to remove the CMake requirement by
and remove related lines. But I think it is better to use the Docker image which I will deliver soon. |
Oh, that's great! Looking forward to your Docker image, many thanks! |
@poppopbean0903
If it is too hard to set up the CUDA runtime, it is fine to try CPU training and just use
My code will detect that CUDA is unavailable and fall back to the CPU (if you have a powerful CPU). Anyway, now that you are in the Docker container, try
|
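For reference, the CUDA-availability fallback described above typically amounts to something like the following (a generic sketch using standard torch APIs, not the project's actual code):

```python
import torch

# Pick CUDA when a usable GPU is visible, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on {device}")

# Model and tensors are then moved to the chosen device before training.
model = torch.nn.Linear(10, 2).to(device)
batch = torch.randn(4, 10, device=device)
print(model(batch).shape)
```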
Thank you very much!! But I got an error after I successfully pulled the image and ran the docker run command: "Unable to find image 'aaai_hydra:latest' locally". The first pull outputs: "Status: Downloaded newer image for cyyever/aaai_hydra:latest". After I met the error, I checked that the image exists by running docker pull again. It seems I've successfully pulled the image? But when I try docker run, it looks like I have no permission to access Docker? |
Use 'sudo docker image list' to find out the right image name. I can't help much here, you should be familiar with Docker operations. |
Many thanks!! I've successfully run the code, but there are a few errors that seem related to your library:
The following are some questions about how to use your code accurately; it would be great and helpful if you are willing to offer some advice. ^-^ (It would save me a lot of time.) But it's OK to ignore them, since they shouldn't have to bother you. First, I want to save the hypergradient of each sample. Does the tensor_dict variable at line 270 of hydra_hook.py store all the hypergradients? It seems to be empty when I run hydra_train.py. Also, is there any advice on suitably and accurately saving these hypergradients, given their special type: which is better, torch.save, joblib, or something else? Second, I should be able to relate each hypergradient to its corresponding data for future use, instead of only having hypergradients keyed by index without knowing the corresponding data. Is the index fixed every time I load the data? If so, will loading the data by similar steps be OK? Thank you so much for your assistance and your time. |
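On the saving question, a simple pattern (a generic illustration, not the project's own API) is to collect the per-sample hypergradients into a plain dict keyed by sample index and write it with torch.save; the index-to-sample correspondence stays fixed across runs as long as the dataset is loaded without shuffling:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real training set (purely illustrative).
dataset = TensorDataset(torch.randn(8, 3), torch.randint(0, 2, (8,)))

# shuffle=False keeps the sample index <-> data correspondence fixed between runs.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Pretend these are per-sample hypergradients keyed by dataset index.
hypergradients = {idx: torch.randn(3) for idx in range(len(dataset))}

torch.save(hypergradients, "hypergradients.pt")

# Later: reload and look up the sample that belongs to each hypergradient.
restored = torch.load("hypergradients.pt")
sample_x, sample_y = dataset[0]
print(restored[0].shape, sample_x.shape)
```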
@poppopbean0903 1 and 2 are due to recent code refactors and I will fix them soon. 3 is https://pytorch.org/docs/stable/amp.html, a way to accelerate training. 4 is our optimization to speed up hyper-gradient computation; it was not mentioned in the paper. |
@poppopbean0903 I pushed the latest image with all the fixes. |
Thanks! But when I run your code with MNIST, it reports "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu!" at line 235 of cyy_torch_xai/hydra/hydra_hook.py during the second epoch. It was called from line 66 of hydra_sgd_hook.py. When I checked the devices of the instance gradient and the hypergradient, both turned out to be None. |
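For context, this class of RuntimeError is usually resolved by moving all operands onto one device before combining them; a generic illustration (not the project's actual fix):

```python
import torch

# Use cuda:1 when a second GPU exists, otherwise stay on CPU.
device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

instance_gradient = torch.randn(5)               # e.g. computed on CPU
hyper_gradient = torch.randn(5, device=device)   # e.g. living on cuda:1 (or CPU)

# Mixing tensors on different devices raises the "Expected all tensors to be
# on the same device" error; aligning devices first avoids it.
instance_gradient = instance_gradient.to(hyper_gradient.device)
print((hyper_gradient + instance_gradient).device)
```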
No worries, I noticed the error and will push a new image immediately. |
@poppopbean0903 Updated |
Hi,
I read your great paper and was excited to give it a try. I followed README.md and executed "hydra_train.sh". However, it fails with ModuleNotFoundError: "No module named 'cyy_torch_cpp_extension.data_structure'". It looks like it needs the module "torch_cpp_extension", which in turn requires building another library, CyyAlgorithmLib (repository cyyever/algorithm)? I was stuck and couldn't build it successfully (fatal error on an #include in algorithm/src/alphabet/alphabet.hpp). I'm wondering whether I misunderstood some steps or the installation steps are out of date?
Thanks.