How to train it on another dataset? How can I handle the checkpoint? #18

Open
changlinzhang opened this issue Jan 18, 2018 · 7 comments

@changlinzhang

Hi, kwotsin! Thanks for your work.
I want to train it on another dataset (30 classes instead of 12). I thought I had changed all the related code, but I get this error:
2018-01-11 17:23:22.187077: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Input to reshape is a tensor with 172800 values, but the requested shape has 4320000
I think it may be caused by the checkpoint. How can I deal with this problem?

The complete output is as follows:
========= Median Frequency Balancing Class Weights =========
[6.397542327061094e-05, 6.7097626201794152e-05, 0.024400273767542283, 0.041269401614453756, 5.5506352412896832e-05, 0.076635711324892844, 0.069381256179271614, 3.472654196521944e-05, 0.00042760164428717635, 0.00012440287198120186, 0.090233329139976615, 0.12489918060211183, 0.0013708685331902757, 6.0827765291491662e-05, 0.073240128809290553, 0.35775514055273316, 0.64257341685305103, 0.90968868010977944, 0.37688909228806228, 0.44248634385452756, 0.00042529101230680852, 0.30566376891079095, 0.28941152643298945, 3.9464190165066867e-05, 0.26421036878629223, 0.42250536299160169, 0.5089356784417215, 0.00024742224929701886, 0.47265314480960613, 0.0]
2018-01-11 17:22:23.528595: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-01-11 17:22:23.528689: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-01-11 17:22:23.528720: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-01-11 17:22:29.254935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:02:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2018-01-11 17:22:29.503633: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x1e106f80 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2018-01-11 17:22:29.504523: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-01-11 17:22:29.505315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: Tesla K40c
major: 3 minor: 5 memoryClockRate (GHz) 0.745
pciBusID 0000:84:00.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2018-01-11 17:22:29.505448: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 0 and 1
2018-01-11 17:22:29.505491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:832] Peer access not supported between device ordinals 1 and 0
2018-01-11 17:22:29.505540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 1
2018-01-11 17:22:29.505685: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y N
2018-01-11 17:22:29.505705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 1: N Y
2018-01-11 17:22:29.505740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40c, pci bus id: 0000:02:00.0)
2018-01-11 17:22:29.505779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40c, pci bus id: 0000:84:00.0)
2018-01-11 17:22:34.391659: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 1368 get requests, put_count=1100 evicted_count=1000 eviction_rate=0.909091 and unsatisfied allocation rate=1
2018-01-11 17:22:34.391731: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:Saving checkpoint to path ./log/original/model.ckpt
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Epoch 1/300
INFO:tensorflow:Current Learning Rate: [0.00050000002]
INFO:tensorflow:global step 1: loss: 0.3121 (4.79 sec/step) Current Streaming Accuracy: 0.0000 Current Mean IOU: 0.0000
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0000 Validation Mean IOU: 0.0000 (2.24 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0209 Validation Mean IOU: 0.0030 (1.10 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0207 Validation Mean IOU: 0.0028 (1.26 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0227 Validation Mean IOU: 0.0033 (1.23 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0220 Validation Mean IOU: 0.0035 (1.24 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0208 Validation Mean IOU: 0.0033 (1.28 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0201 Validation Mean IOU: 0.0033 (1.22 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0198 Validation Mean IOU: 0.0032 (1.25 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0197 Validation Mean IOU: 0.0032 (1.24 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0196 Validation Mean IOU: 0.0031 (1.21 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0197 Validation Mean IOU: 0.0031 (1.18 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0196 Validation Mean IOU: 0.0031 (1.21 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0196 Validation Mean IOU: 0.0031 (1.39 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0196 Validation Mean IOU: 0.0032 (1.23 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0196 Validation Mean IOU: 0.0032 (1.18 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0193 Validation Mean IOU: 0.0032 (1.16 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0191 Validation Mean IOU: 0.0031 (1.41 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0193 Validation Mean IOU: 0.0031 (1.26 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0195 Validation Mean IOU: 0.0032 (1.43 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0197 Validation Mean IOU: 0.0032 (1.32 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0202 Validation Mean IOU: 0.0033 (1.34 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0204 Validation Mean IOU: 0.0034 (1.33 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0203 Validation Mean IOU: 0.0034 (1.21 sec/step)
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0206 Validation Mean IOU: 0.0034 (1.36 sec/step)
2018-01-11 17:23:21.808311: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: assertion failed: [all dims of 'image.shape' must be > 0.]
[[Node: assert_positive_11/assert_less/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:localhost/replica:0/task:0/cpu:0"](assert_positive_11/assert_less/All/_5795, assert_positive_11/assert_less/Assert/Assert/data_0)]]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, assertion failed: [all dims of 'image.shape' must be > 0.]
[[Node: assert_positive_11/assert_less/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:localhost/replica:0/task:0/cpu:0"](assert_positive_11/assert_less/All/_5795, assert_positive_11/assert_less/Assert/Assert/data_0)]]
INFO:tensorflow:---VALIDATION--- Validation Accuracy: 0.0207 Validation Mean IOU: 0.0035 (1.19 sec/step)
2018-01-11 17:23:22.187077: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Input to reshape is a tensor with 172800 values, but the requested shape has 4320000
[[Node: Reshape_5 = Reshape[T=DT_UINT8, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](batch_1/_5971, Reshape_5/shape)]]
2018-01-11 17:23:22.197319: W tensorflow/core/framework/op_kernel.cc:1158] Invalid argument: Input to reshape is a tensor with 172800 values, but the requested shape has 4320000
[[Node: Reshape_5 = Reshape[T=DT_UINT8, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](batch_1/_5971, Reshape_5/shape)]]
Traceback (most recent call last):
File "train_enet.py", line 340, in <module>
run()
File "train_enet.py", line 337, in run
plt.savefig(photo_dir+"/image
" + str(i))
File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
enqueue_callable()
File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1063, in _single_operation_run
target_list_as_strings, status, None)
File "/usr/lib64/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [all dims of 'image.shape' must be > 0.]
[[Node: assert_positive_11/assert_less/Assert/Assert = Assert[T=[DT_STRING], summarize=3, _device="/job:localhost/replica:0/task:0/cpu:0"](assert_positive_11/assert_less/All/_5795, assert_positive_11/assert_less/Assert/Assert/data_0)]]
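
One way to read the reshape error above is to compare the two sizes it reports. The sketch below only does that arithmetic. The 360x480 single-channel image size is an assumption that happens to fit the numbers; it is not confirmed anywhere in this thread, and `explain_reshape_mismatch` is a hypothetical helper name.

```python
def explain_reshape_mismatch(actual, requested):
    """Return how many times `actual` fits into `requested`,
    or None if it does not divide evenly."""
    if actual <= 0 or requested % actual != 0:
        return None
    return requested // actual


actual_values = 172800      # "a tensor with 172800 values" (from the error)
requested_values = 4320000  # "the requested shape has 4320000" (from the error)

ratio = explain_reshape_mismatch(actual_values, requested_values)
print(ratio)  # 25

# Assumption: single-channel 360x480 annotations, since 360 * 480 == 172800.
assumed_height, assumed_width = 360, 480
print(assumed_height * assumed_width == actual_values)  # True
```

If this reading is right, the reshape received a single image where it expected a batch of 25. That would fit the earlier `all dims of 'image.shape' must be > 0` assertion stopping the input queue, so checking the dataset for empty or unreadable image files may be a more useful first step than looking at the checkpoint.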

@ghost

ghost commented Apr 25, 2018

Have you figured out how it works? I trained on my own dataset as well, but the accuracy is very low.

@kangyang94

Hello @changlinzhang @kwotsin, could you tell me how to use the files in the checkpoint folder as a pretrained model to train on my own dataset?

@RobinHan24

Hello, everyone. How should we prepare our own dataset for training? Thank you.

@RobinHan24

> Have you figured out how it works? I trained on my own dataset as well, but the accuracy is very low.

I made my own dataset, but I get the errors below:
InvalidArgumentError (see above for traceback): assertion failed: [labels out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (mean_iou/ToInt64_1:0) = ] [2]
[[Node: mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch/_5481, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1/_5483, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_2/_5485)]]

Traceback (most recent call last):
File "train_enet.py", line 337, in <module>
run()
File "train_enet.py", line 293, in run
loss, training_accuracy, training_mean_IOU = train_step(sess, train_op, sv.global_step, metrics_op=metrics_op)
File "train_enet.py", line 202, in train_step
total_loss, global_step_count, accuracy_val, mean_IOU_val, _ = sess.run([train_op, global_step, accuracy, mean_IOU, metrics_op])
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 877, in run
run_metadata_ptr)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1100, in _run
feed_dict_tensor, options, run_metadata)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1272, in _do_run
run_metadata)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1291, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [labels out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (mean_iou/ToInt64_1:0) = ] [2]
[[Node: mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch/_5481, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1/_5483, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_2/_5485)]]

Caused by op u'mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert', defined at:
File "train_enet.py", line 337, in <module>
run()
File "train_enet.py", line 192, in run
mean_IOU, mean_IOU_update = tf.contrib.metrics.streaming_mean_iou(predictions=predictions, labels=annotations, num_classes=num_classes)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/contrib/metrics/python/ops/metric_ops.py", line 3528, in streaming_mean_iou
name=name)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/metrics_impl.py", line 1128, in mean_iou
num_classes, weights)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/metrics_impl.py", line 298, in _streaming_confusion_matrix
labels, predictions, num_classes, weights=weights, dtype=dtypes.float64)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/confusion_matrix.py", line 171, in confusion_matrix
labels, num_classes_int64, message='labels out of bound')],
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/check_ops.py", line 559, in assert_less
return control_flow_ops.Assert(condition, data, summarize=summarize)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/util/tf_should_use.py", line 118, in wrapped
return _add_should_use_warning(fn(*args, **kwargs))
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 157, in Assert
guarded_assert = cond(condition, no_op, true_assert, name="AssertGuard")
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2057, in cond
orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1895, in BuildCondBranch
original_result = fn()
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 155, in true_assert
condition, data, summarize, name="Assert")
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 51, in _assert
name=name)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 454, in new_func
return func(*args, **kwargs)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3155, in create_op
op_def=op_def)
File "/home/bayes/anaconda2/envs/py2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1717, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): assertion failed: [labels out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (mean_iou/ToInt64_1:0) = ] [2]
[[Node: mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch/_5481, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1/_5483, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_2/_5485)]]

Could you help me, please?

@c13proto

I faced the same problem.
In my case, I remade the annotation images so that they no longer contain the value 255, and it works.
DrSleep/tensorflow-deeplab-resnet#107 (comment)
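
A minimal sketch of that kind of cleanup, assuming 255 marks unlabeled pixels in the annotation images. `replace_label_value` is a hypothetical helper, not part of this repo, and folding 255 into class 0 is only an illustrative choice. Plain nested lists stand in for the image; with real files the same idea would be applied to the decoded pixel array.

```python
def replace_label_value(label_rows, old_value, new_value):
    """Return a copy of a 2-D label image with every `old_value`
    pixel set to `new_value`. The input is left unchanged."""
    return [[new_value if pixel == old_value else pixel for pixel in row]
            for row in label_rows]


# Toy 2x3 annotation; 255 is assumed to mean "unlabeled" here.
annotation = [
    [0, 1, 255],
    [1, 255, 0],
]

# Hypothetical choice: fold unlabeled pixels into class 0.
cleaned = replace_label_value(annotation, old_value=255, new_value=0)
print(cleaned)  # [[0, 1, 0], [1, 0, 0]]
```

Which class the 255 pixels should be mapped to depends on the dataset. Mapping them to a dedicated extra class and raising `num_classes` by one is another option.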

@x7hkvip

x7hkvip commented Apr 29, 2019

@RobinHan24 I met the same problem. I have 10 classes, so I set the pixels of my label images to values from 0 to 9 to match, and then the problem was fixed. I don't know whether this is helpful for you.
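
The remapping described above could look roughly like this. It is a sketch under assumptions: `build_label_map` and `remap_labels` are hypothetical helpers, and the pixel values 0, 50 and 200 are invented for the example. Nested lists stand in for a decoded label image.

```python
def build_label_map(label_rows):
    """Map each distinct pixel value to an index in 0..N-1,
    in ascending order of the original value."""
    distinct = sorted({pixel for row in label_rows for pixel in row})
    return {value: index for index, value in enumerate(distinct)}


def remap_labels(label_rows, label_map):
    """Return a copy of the label image with every pixel replaced
    by its contiguous class index."""
    return [[label_map[pixel] for pixel in row] for row in label_rows]


# Toy annotation whose classes are stored as arbitrary gray levels.
annotation = [
    [0, 50, 200],
    [200, 50, 0],
]

label_map = build_label_map(annotation)
remapped = remap_labels(annotation, label_map)
print(label_map)  # {0: 0, 50: 1, 200: 2}
print(remapped)   # [[0, 1, 2], [2, 1, 0]]
```

For a real dataset the map would need to be built once from all annotation images, not per image, so that the same original value always gets the same class index.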

@jayashreek3

Thanks for this useful repo.
Hi everyone, could anyone help me solve this issue?

  1. The current code works for the CamVid dataset.
  2. I am having difficulty training this ENet model on the Cityscapes dataset.
    I processed it with https://github.com/mcordts/cityscapesScripts and got the training data.
    Now I would like to import this data into this code, but it reports a dimension mismatch. Could you please help me fix how the grayscale label images are loaded? After preparing the data I have these label types (color.png, instance.png, labeld.png, json.png, trainid.png). How do I choose one from this folder and import it into this model?
    I tried a single type of image and got this error:

InvalidArgumentError (see above for traceback): assertion failed: [labels out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency:0) = ] [0 0 0...] [y (mean_iou/ToInt64_1:0) = ] [12]
[[node mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert (defined at /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/metrics/python/ops/metric_ops.py:3561) = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_1, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less/Assert/AssertGuard/Assert/Switch_2)]]

I am a beginner in this field, so I am hoping for suggestions to resolve this error.
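
The `[12]` at the end of the assertion message is the bound the metric compares labels against: the tracebacks earlier in this thread show `assert_less(labels, num_classes_int64, message='labels out of bound')`. A quick way to see which pixel values in a label file break that bound is sketched below. `out_of_bound_labels` is a hypothetical helper and the sample values are invented; nested lists stand in for a decoded label image.

```python
def out_of_bound_labels(label_rows, num_classes):
    """Return the sorted distinct pixel values that break
    the condition 0 <= label < num_classes."""
    return sorted({pixel
                   for row in label_rows
                   for pixel in row
                   if not 0 <= pixel < num_classes})


# Toy annotation with two values that a 12-class metric would reject.
annotation = [
    [0, 7, 11],
    [18, 255, 3],
]

bad = out_of_bound_labels(annotation, num_classes=12)
print(bad)  # [18, 255]
```

An empty result for every label image would suggest the chosen label type fits the configured number of classes. A non-empty result points to either a different label type or a different `num_classes`.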
