Running "Neural Network DMD on Slow Manifold" error #36

Closed
lk1983823 opened this issue Sep 14, 2023 · 1 comment

@lk1983823

I have lightning version 2.0.5 and pykoopman version 1.0.4 installed. When I run "dlk_regressor.fit(traj_list)" in the example "tutorial_koopman_nndmd_examples.ipynb", I get the following errors:

INFO: GPU available: True (cuda), used: True
[rank_zero.py:48 -                _info() ] GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
[rank_zero.py:48 -                _info() ] TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
[rank_zero.py:48 -                _info() ] IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
[rank_zero.py:48 -                _info() ] HPU available: False, using: 0 HPUs
INFO: Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[distributed.py:245 - _init_dist_connection() ] Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
INFO: Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[distributed.py:245 - _init_dist_connection() ] Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
INFO: ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

[rank_zero.py:48 -                _info() ] ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

2023-09-14 16:20:30.863437: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-14 16:20:31.016351: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-09-14 16:20:32.029631: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-09-14 16:20:32.029742: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.7/lib64
2023-09-14 16:20:32.029751: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
Cell In[6], line 1
----> 1 dlk_regressor.fit(traj_list)

File ~/anaconda3/envs/dpc/lib/python3.10/site-packages/pykoopman/regression/_nndmd.py:1187, in NNDMD.fit(self, x, y, dt)
   1184     raise ValueError("check `x` and `y` for `self.fit`")
   1186 # trainer starts to train
-> 1187 self.trainer.fit(self._regressor, self.dm)
   1189 # compute Koopman operator information
   1190 self._state_matrix_ = (
   1191     self._regressor._koopman_propagator.get_discrete_time_Koopman_Operator()
   1192     .detach()
   1193     .numpy()
   1194 )

File ~/anaconda3/envs/dpc/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:529, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    527 model = _maybe_unwrap_optimized(model)
    528 self.strategy._lightning_module = model
--> 529 call._call_and_handle_interrupt(
    530     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    531 )

File ~/anaconda3/envs/dpc/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:41, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     39 try:
     40     if trainer.strategy.launcher is not None:
---> 41         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     42     return trainer_fn(*args, **kwargs)
     44 except _TunerExitException:

File ~/anaconda3/envs/dpc/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/multiprocessing.py:124, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
    116 process_context = mp.start_processes(
    117     self._wrapping_function,
    118     args=process_args,
   (...)
    121     join=False,  # we will join ourselves to get the process references
    122 )
    123 self.procs = process_context.processes
--> 124 while not process_context.join():
    125     pass
    127 worker_output = return_queue.get()

File ~/anaconda3/envs/dpc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:160, in ProcessContext.join(self, timeout)
    158 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    159 msg += original_trace
--> 160 raise ProcessRaisedException(msg, error_index, failed_process.pid)

ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/lk/anaconda3/envs/dpc/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/lk/anaconda3/envs/dpc/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/multiprocessing.py", line 147, in _wrapping_function
    results = function(*args, **kwargs)
  File "/home/lk/anaconda3/envs/dpc/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 568, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/lk/anaconda3/envs/dpc/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 934, in _run
    call._call_setup_hook(self)  # allow user to setup lightning_module in accelerator environment
  File "/home/lk/anaconda3/envs/dpc/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 83, in _call_setup_hook
    _call_lightning_datamodule_hook(trainer, "setup", stage=fn)
  File "/home/lk/anaconda3/envs/dpc/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 164, in _call_lightning_datamodule_hook
    return fn(*args, **kwargs)
  File "/home/lk/anaconda3/envs/dpc/lib/python3.10/site-packages/pykoopman/regression/_nndmd.py", line 902, in setup
    self._tr_x, self._tr_yseq, self._tr_ys, self.normalization
AttributeError: 'SeqDataModule' object has no attribute '_tr_x' 

Can anyone help?
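
The traceback above shows Lightning starting two processes and the failure occurring in Process 1, inside the datamodule's setup hook. One possible reading, which nothing in this thread confirms, is that `_tr_x` is assigned in a hook that runs on a single process only; Lightning's documentation warns that state assigned in `prepare_data` is not available on other processes. The toy sketch below reproduces that pattern in plain Python. `ToyDataModule` and `run_rank` are hypothetical names made up for illustration and are not pykoopman or Lightning code.

```python
class ToyDataModule:
    """Hypothetical stand-in for a datamodule; not pykoopman's SeqDataModule."""

    def prepare_data(self):
        # State assigned here exists only in the process that ran this hook.
        self._tr_x = [0.0, 1.0, 2.0]

    def setup(self):
        # Every process runs setup and expects the attribute to be present.
        return len(self._tr_x)


def run_rank(rank):
    """Mimic one worker: only rank 0 calls prepare_data, all ranks call setup."""
    dm = ToyDataModule()
    if rank == 0:
        dm.prepare_data()
    try:
        return f"rank {rank}: setup ok, {dm.setup()} samples"
    except AttributeError as err:
        return f"rank {rank}: {err}"


for rank in range(2):
    print(run_rank(rank))
```

If that reading applies here, making only one GPU visible to the notebook (for example via the `CUDA_VISIBLE_DEVICES` environment variable) would be one thing to try, since a single process would then run both hooks.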

@pswpswpsw
Collaborator

pswpswpsw commented Sep 22, 2023

I just created a fresh conda env and ran that Jupyter notebook without any errors.

So you need to set up the conda env carefully:

  1. conda create --name pyk python=3.10
  2. conda activate pyk
  3. python -m pip install -r requirements-dev.txt

If these steps finish without errors, the environment should be good to go. Depending on your OS and the PyTorch build that gets installed, you may not have the GPU version of PyTorch, but the code will run anyway.
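
To compare a working environment with a failing one, it can help to print the versions that are actually installed. The following is a minimal sketch using only the standard library; `report_versions` is a hypothetical helper, and the distribution names passed to it are assumptions that may need adjusting to match `pip list`.

```python
from importlib import metadata


def report_versions(packages):
    """Return {package: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in the active environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to your environment.
    for name, version in report_versions(["pykoopman", "lightning", "torch"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both the fresh env and the original one gives a quick side-by-side of the versions mentioned in the report (lightning 2.0.5, pykoopman 1.0.4).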
