Training command Errors (TensorFlow/Python Incompatibility)? #1181

Closed
generic-beat-detector opened this issue May 28, 2024 · 4 comments

Comments

@generic-beat-detector

generic-beat-detector commented May 28, 2024

Hello!

FWIW, this is truly a wonderful project!

Unfortunately, with my limited skills I can't seem to get donkey train --tub data to work on Ubuntu 22.04 x86-64. The command also fails on an RPi 4B running bookworm, but somehow works on the robocarstore RPi 4B pre-built image @v5.0-dev3.

For the PC installs, I followed (variations of) the instructions here, there, and there.

I'm using the same exact dataset in all scenarios:

$ ls 
calibrate.py  config.py  data  logs  manage.py  models  myconfig.py  train.py

$ ls data/
catalog_0.catalog  catalog_0.catalog_manifest  images  manifest.json

$ ls data/images/
0_cam_image_array_.jpg   15_cam_image_array_.jpg  5_cam_image_array_.jpg
10_cam_image_array_.jpg  16_cam_image_array_.jpg  6_cam_image_array_.jpg
11_cam_image_array_.jpg  1_cam_image_array_.jpg   7_cam_image_array_.jpg
12_cam_image_array_.jpg  2_cam_image_array_.jpg   8_cam_image_array_.jpg
13_cam_image_array_.jpg  3_cam_image_array_.jpg   9_cam_image_array_.jpg
14_cam_image_array_.jpg  4_cam_image_array_.jpg

Ubuntu 22.04, x86-64 (w/ RTX 3070)

$  donkey --version
using donkey v5.0.0 ...
  • Python 3.10.12
$ donkey train --tub data
[...]
INFO:donkeycar.pipeline.training:Records # Training 13
INFO:donkeycar.pipeline.training:Records # Validation 4
[...]
INFO:donkeycar.parts.keras:////////// Starting training //////////
Epoch 1/100
2024-05-28 23:08:26.725746: W tensorflow/core/framework/op_kernel.cc:1733] INVALID_ARGUMENT: ValueError: Key image is not in available keys.
Traceback (most recent call last):

  File "~/miniconda3/envs/donkey/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
    ret = func(*args)
[...]
ValueError: Key image is not in available keys.
[...]
	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_1722]
2024-05-29 01:23:29.995776: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
	 [[{{node PyFunc}}]]
  • Python 3.9.19
$ donkey train --tub data
[...]
INFO:donkeycar.pipeline.types:Loading tubs from paths ['data']
INFO:donkeycar.pipeline.training:Records # Training 13
INFO:donkeycar.pipeline.training:Records # Validation 4
INFO:donkeycar.parts.tub_v2:Closing tub data
[...]
INFO:donkeycar.parts.keras:////////// Starting training //////////
Epoch 1/100
2024-05-29 00:38:05.846618: W tensorflow/core/framework/op_kernel.cc:1733] INVALID_ARGUMENT: ValueError: Key image is not in available keys.
Traceback (most recent call last):

  File "~/tmp-donkey/miniconda3/envs/donkey/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
    ret = func(*args)
[...]
ValueError: Key image is not in available keys.
[...]
	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_1722]
2024-05-29 01:23:29.995776: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
	 [[{{node PyFunc}}]]

RPi 4B

  • Bookworm, Python 3.11.2
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 12 (bookworm)
Release:	12
Codename:	bookworm


$ python --version
Python 3.11.2

$ donkey --version
using donkey v5.1.0 ...

$ donkey train --tub data
[...]
INFO:donkeycar.pipeline.types:Loading tubs from paths ['data']
INFO:donkeycar.pipeline.training:Records # Training 13
INFO:donkeycar.pipeline.training:Records # Validation 4
INFO:donkeycar.parts.tub_v2:Closing tub data
[...]
INFO:donkeycar.parts.keras:////////// Starting training //////////
Epoch 1/100
2024-05-29 01:17:03.203388: W tensorflow/core/framework/op_kernel.cc:1827] INVALID_ARGUMENT: ValueError: Key image is not in available keys.
Traceback (most recent call last):

  File "/home/pi/projects/donkeycar/env/lib/python3.11/site-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
    ret = func(*args)
          ^^^^^^^^^^^
  • However, on the robocarstore pre-built-image @v5.0-dev3, I can run donkey train --tub data on the RPi 4B without any problems (Python 3.9.2)
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 11 (bullseye)
Release:	11
Codename:	bullseye

$ python --version
Python 3.9.2

$ donkey --version
using donkey v5.0.dev3 ...

$ donkey train --tub data
[...]
INFO:donkeycar.pipeline.types:Loading tubs from paths ['data']
INFO:donkeycar.pipeline.training:Records # Training 13
INFO:donkeycar.pipeline.training:Records # Validation 4
INFO:donkeycar.parts.tub_v2:Closing tub data
INFO:donkeycar.parts.image_transformations:Creating ImageTransformations []
INFO:donkeycar.parts.image_transformations:Creating ImageTransformations []
INFO:donkeycar.parts.image_transformations:Creating ImageTransformations []
INFO:donkeycar.parts.image_transformations:Creating ImageTransformations []
INFO:donkeycar.pipeline.training:Train with image caching: True
INFO:donkeycar.parts.keras:////////// Starting training //////////
Epoch 1/100
1/1 [==============================] - ETA: 0s - loss: 0.4888 - n_outputs0_loss: 0.0301 - n_outputs1_loss: 0.4587
Epoch 1: val_loss improved from inf to 0.14395, saving model to /home/pi/mycar/models/pilot_24-05-29_0.savedmodel
[...]
1/1 [==============================] - 16s 16s/step - loss: 0.4888 - n_outputs0_loss: 0.0301 - n_outputs1_loss: 0.4587 - val_loss: 0.1440 - val_n_outputs0_loss: 0.0115 - val_n_outputs1_loss: 0.1324
Epoch 2/100
1/1 [==============================] - ETA: 0s - loss: 0.2921 - n_outputs0_loss: 0.0200 - n_outputs1_loss: 0.2721
Epoch 2: val_loss did not improve from 0.14395
1/1 [==============================] - 6s 6s/step - loss: 0.2921 - n_outputs0_loss: 0.0200 - n_outputs1_loss: 0.2721 - val_loss: 0.2925 - val_n_outputs0_loss: 0.0241 - val_n_outputs1_loss: 0.2683

What's going on here?

Regards.
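When comparing environments like the ones above, it can help to collect the interpreter and package versions in one place before each run. The following is only a minimal sketch: the helper names are made up for illustration, and the distribution names `donkeycar` and `tensorflow` are assumptions (some platforms install TensorFlow under a different distribution name).

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report(packages=("donkeycar", "tensorflow")):
    """Collect the Python version and the versions of the given packages."""
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    # Print one line per entry so reports from different machines can be diffed.
    for key, value in environment_report().items():
        print(f"{key}: {value if value is not None else 'not installed'}")
```

Running this inside each virtual environment (PC, RPi bookworm, pre-built image) gives a small report that can be pasted next to the corresponding `donkey train` output.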

@DocGarbanzo
Contributor

Can you please install the latest release, 5.1.0? Also, you have far too little data; try with around 1000 records rather than 14. My suspicion is that there is a problem when neither the training set nor the validation set contains even a single full-sized batch.
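To make the batch-size point concrete, here is a small sketch of the arithmetic. The record counts (13 training, 4 validation) come from the logs above; `BATCH_SIZE = 128` is an assumption for illustration, so check the value in your own myconfig.py. The helper name is hypothetical and this is not how Donkeycar itself computes its steps.

```python
import math


def batch_counts(num_records, batch_size):
    """Return (full_batches, steps_per_epoch) for one dataset split.

    steps_per_epoch assumes a trailing partial batch is kept, not dropped.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    full_batches = num_records // batch_size
    steps_per_epoch = math.ceil(num_records / batch_size)
    return full_batches, steps_per_epoch


BATCH_SIZE = 128  # assumption: verify against your own configuration

if __name__ == "__main__":
    for name, count in (("training", 13), ("validation", 4)):
        full, steps = batch_counts(count, BATCH_SIZE)
        print(f"{name}: {count} records, {full} full batches, {steps} step(s) per epoch")
```

With these numbers neither split fills a single batch, which matches the `1/1` progress lines in the successful runs: each epoch is one partial batch.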

@generic-beat-detector
Author

@DocGarbanzo

Yes sir. Following your recommendation, I installed donkeycar v5.1.0 (which requires Python >=3.11 and <=3.12), and the training apparently succeeds, even with my 17-image test dataset:

  • Installed miniconda virtual environment
	$ wget https://repo.anaconda.com/miniconda/Miniconda3-py39_23.3.1-0-Linux-x86_64.sh
	$ bash ./Miniconda3-py39_23.3.1-0-Linux-x86_64.sh

	$ eval "$(/home/USER/dev-donkey/miniconda3/bin/conda shell.bash hook)"
	$ conda create -n donkey python=3.11
	$ conda activate donkey
	$ python --version
	Python 3.11.9
	$ unzip donkeycar-5.1.0.zip
	$ cd donkeycar-5.1.0/
	$ pip install -e .[pc]
	$ cd ..
	$ donkey createcar --path $PWD/mycar
	$ cd mycar/
	$ ls
	calibrate.py  config.py  data  logs  manage.py  models  myconfig.py  train.py
  • Copied the very same 17 image dataset to data, then
	$ donkey train --tub data

	using donkey v5.1.0 ...
	[...]
	INFO:donkeycar.pipeline.types:Loading tubs from paths ['data']
	INFO:donkeycar.pipeline.training:Records # Training 13
	INFO:donkeycar.pipeline.training:Records # Validation 4
	[...]
	INFO:donkeycar.parts.keras:////////// Starting training //////////
	Epoch 1/100
	2024-06-05 00:48:34.244384: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape inlinear/dropout/dropout/SelectV2-2-TransposeNHWCToNCHW-LayoutOptimizer
	2024-06-05 00:48:34.517973: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
	2024-06-05 00:48:35.751589: I external/local_xla/xla/service/service.cc:168] XLA service 0x7d6ab00145b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
	2024-06-05 00:48:35.751618: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3070, Compute Capability 8.6
	2024-06-05 00:48:35.755479: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
	WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
	I0000 00:00:1717537715.808544  643092 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
	1/1 [==============================] - ETA: 0s - loss: 0.4434 - n_outputs0_loss: 0.0248 - n_outputs1_loss: 0.4187
	Epoch 1: val_loss improved from inf to 0.46197, saving model to /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel
	INFO:tensorflow:Assets written to: /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel/assets
	1/1 [==============================] - 6s 6s/step - loss: 0.4434 - n_outputs0_loss: 0.0248 - n_outputs1_loss: 0.4187 - val_loss: 0.4620 - val_n_outputs0_loss: 0.0269 - val_n_outputs1_loss: 0.4351
	Epoch 2/100
	1/1 [==============================] - ETA: 0s - loss: 0.3613 - n_outputs0_loss: 0.0248 - n_outputs1_loss: 0.3364
	Epoch 2: val_loss improved from 0.46197 to 0.31188, saving model to /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel
	INFO:tensorflow:Assets written to: /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel/assets
	1/1 [==============================] - 1s 794ms/step - loss: 0.3613 - n_outputs0_loss: 0.0248 - n_outputs1_loss: 0.3364 - val_loss: 0.3119 - val_n_outputs0_loss: 0.0225 - val_n_outputs1_loss: 0.2893
	Epoch 3/100
	1/1 [==============================] - ETA: 0s - loss: 0.2427 - n_outputs0_loss: 0.0228 - n_outputs1_loss: 0.2200
	2024-06-05 00:48:40.575992: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: KeyError: 109
	Traceback (most recent call last):

		File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
		  ret = func(*args)
		        ^^^^^^^^^^^

		File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
		  return func(*args, **kwargs)
		         ^^^^^^^^^^^^^^^^^^^^^

		File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/data/ops/from_generator_op.py", line 290, in finalize_py_func
		  generator_state.iterator_completed(iterator_id)

		File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 870, in iterator_completed
		  del self._iterators[self._normalize_id(iterator_id)]
		      ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	KeyError: 109


	2024-06-05 00:48:40.576057: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: UNKNOWN: KeyError: 109
	Traceback (most recent call last):

		File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
		  ret = func(*args)
		        ^^^^^^^^^^^

		File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
		  return func(*args, **kwargs)
		         ^^^^^^^^^^^^^^^^^^^^^

		File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/data/ops/from_generator_op.py", line 290, in finalize_py_func
		  generator_state.iterator_completed(iterator_id)

		File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 870, in iterator_completed
		  del self._iterators[self._normalize_id(iterator_id)]
		      ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	KeyError: 109


		 [[{{node PyFunc}}]]

	Epoch 3: val_loss improved from 0.31188 to 0.08692, saving model to /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel
	INFO:tensorflow:Assets written to: /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel/assets
	1/1 [==============================] - 1s 790ms/step - loss: 0.2427 - n_outputs0_loss: 0.0228 - n_outputs1_loss: 0.2200 - val_loss: 0.0869 - val_n_outputs0_loss: 0.0067 - val_n_outputs1_loss: 0.0802
	Epoch 4/100
	1/1 [==============================] - ETA: 0s - loss: 0.1607 - n_outputs0_loss: 0.0319 - n_outputs1_loss: 0.1288
	Epoch 4: val_loss did not improve from 0.08692
	1/1 [==============================] - 0s 145ms/step - loss: 0.1607 - n_outputs0_loss: 0.0319 - n_outputs1_loss: 0.1288 - val_loss: 0.0976 - val_n_outputs0_loss: 0.0024 - val_n_outputs1_loss: 0.0952
	Epoch 5/100
	1/1 [==============================] - ETA: 0s - loss: 0.1222 - n_outputs0_loss: 0.0187 - n_outputs1_loss: 0.1035
	Epoch 5: val_loss did not improve from 0.08692
	1/1 [==============================] - 0s 160ms/step - loss: 0.1222 - n_outputs0_loss: 0.0187 - n_outputs1_loss: 0.1035 - val_loss: 0.1443 - val_n_outputs0_loss: 0.0027 - val_n_outputs1_loss: 0.1415
	Epoch 6/100
	1/1 [==============================] - ETA: 0s - loss: 0.1274 - n_outputs0_loss: 0.0132 - n_outputs1_loss: 0.1142
	Epoch 6: val_loss did not improve from 0.08692
	1/1 [==============================] - 0s 140ms/step - loss: 0.1274 - n_outputs0_loss: 0.0132 - n_outputs1_loss: 0.1142 - val_loss: 0.1452 - val_n_outputs0_loss: 0.0022 - val_n_outputs1_loss: 0.1431
	Epoch 7/100
	1/1 [==============================] - ETA: 0s - loss: 0.1173 - n_outputs0_loss: 0.0111 - n_outputs1_loss: 0.1062
	Epoch 7: val_loss did not improve from 0.08692
	1/1 [==============================] - 0s 139ms/step - loss: 0.1173 - n_outputs0_loss: 0.0111 - n_outputs1_loss: 0.1062 - val_loss: 0.1188 - val_n_outputs0_loss: 0.0017 - val_n_outputs1_loss: 0.1170
	Epoch 8/100
	1/1 [==============================] - ETA: 0s - loss: 0.1270 - n_outputs0_loss: 0.0182 - n_outputs1_loss: 0.1089
	Epoch 8: val_loss did not improve from 0.08692
	1/1 [==============================] - 0s 140ms/step - loss: 0.1270 - n_outputs0_loss: 0.0182 - n_outputs1_loss: 0.1089 - val_loss: 0.0971 - val_n_outputs0_loss: 0.0020 - val_n_outputs1_loss: 0.0951
	INFO:donkeycar.parts.keras:////////// Finished training in: 0:00:08.441838 //////////
	[...]
	2024-06-05 00:48:45.071498: I tensorflow/cc/saved_model/loader.cc:316] SavedModel load for tags { serve }; Status: success: OK. Took 66754 microseconds.
	[...]
	INFO:donkeycar.parts.interpreter:TFLite conversion done.
	INFO:donkeycar.pipeline.database:Writing database file: /home/USER/dev-donkey/mycar/models/database.json

... a few errors, but the training process completed (early, due to "no improvement in validation loss", thanks to my bogus dataset ;), and everything looks A-okay. I'll have to test with a real dataset of course, but at least the (compatibility) issues seem to have been fixed!

Thank you sir.

@generic-beat-detector
Author

@DocGarbanzo,

Hi!
So far, so good. I've just trained a model on a

$ ls -l data/images/ | wc -l
13960

image dataset, and it's quite lovely. The autopilot has completed several runs like a champ!

I could swear I previously ran into an issue (seemingly Python v3.11 related) with donkey ui (a "recursion depth exceeded" type error), but I mysteriously cannot reproduce it. In any case, it is not a priority right now. I will let you know of any problems in another thread. Thanks once again.
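A side note on the count above: `ls -l` prints a leading `total` line, so piping it to `wc -l` usually reports one more line than there are directory entries. A minimal sketch for counting the image files directly is shown below; the function name and the `*.jpg` pattern are assumptions based on the file names listed earlier in this thread.

```python
from pathlib import Path


def count_images(image_dir, pattern="*.jpg"):
    """Count regular files in image_dir whose names match pattern."""
    return sum(1 for path in Path(image_dir).glob(pattern) if path.is_file())


if __name__ == "__main__":
    # Path taken from the tub layout shown in this thread.
    print(count_images("data/images"))
```

This counts only files matching the pattern, so stray non-image files in the directory are not included.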

@DocGarbanzo
Contributor

Ok, great. Thanks for confirming. The TF key error is still a bit concerning. We'll keep an eye on it in case it shows up again.
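For readers unfamiliar with the exception itself: the traceback earlier in the thread ends in a `del` on a dictionary entry, and in plain Python that raises `KeyError` whenever the key is no longer present. The toy class below only illustrates that behaviour. It is a hypothetical stand-in, not TensorFlow's code, and it does not claim to explain why the key was missing in the run above.

```python
_MISSING = object()  # sentinel to tell "absent key" apart from any stored value


class IteratorRegistry:
    """Toy registry keyed by integer ids (hypothetical, not TensorFlow code)."""

    def __init__(self):
        self._iterators = {}
        self._next_id = 0

    def register(self, iterator):
        """Store an iterator and return the id it was stored under."""
        iterator_id = self._next_id
        self._next_id += 1
        self._iterators[iterator_id] = iterator
        return iterator_id

    def iterator_completed(self, iterator_id):
        """Strict removal: raises KeyError if the id is not registered."""
        del self._iterators[iterator_id]

    def iterator_completed_tolerant(self, iterator_id):
        """Lenient removal: returns False instead of raising on a missing id."""
        return self._iterators.pop(iterator_id, _MISSING) is not _MISSING


if __name__ == "__main__":
    registry = IteratorRegistry()
    first = registry.register(iter([1, 2, 3]))
    registry.iterator_completed(first)       # first removal succeeds
    try:
        registry.iterator_completed(first)   # second removal of the same id
    except KeyError as err:
        print(f"KeyError: {err}")
```

In the toy version, finalizing the same id twice is what triggers the error; whether something similar happened in the training run is not established by the logs in this thread.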
