
[deeplab] Unable to Evaluate on ADE20k #4089

Closed
pedropgusmao opened this issue Apr 25, 2018 · 8 comments
Labels: stat:awaiting model gardener (Waiting on input from TensorFlow model gardener)

pedropgusmao commented Apr 25, 2018

System information

  • What is the top-level directory of the model you are using: ?
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): The stock example is not explicitly available
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.7.0
  • Bazel version (if compiling from source): 0.11
  • CUDA/cuDNN version: 7.1.3
  • GPU model and memory: NVIDIA Titan V 12GB
  • Exact command to reproduce:
python deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size=513 \
    --eval_crop_size=513 \
    --dataset="ade20k" \
    --checkpoint_dir='deeplab/datasets/ADE20K/exp/train_on_train_set/train' \
    --eval_logdir='deeplab/datasets/ADE20K/exp/train_on_train_set/eval' \
    --dataset_dir='deeplab/datasets/ADE20K/tfrecord' 

Describe the problem

It seems that the evaluation script is not correctly cropping (or padding) the images before they are fed to the model.

Source code / logs

2018-04-25 17:26:15.635672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:09:00.0 totalMemory: 11.78GiB freeMemory: 11.36GiB
2018-04-25 17:26:15.635730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-25 17:26:16.432618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-25 17:26:16.432667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-25 17:26:16.432678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-25 17:26:16.433049: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10989 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:09:00.0, compute capability: 7.0)
INFO:tensorflow:Restoring parameters from deeplab/datasets/ADE20K/exp/train_on_train_set/train/model.ckpt-50000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2018-04-25 17:26:19.373963: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at queue_ops.cc:105 : Invalid argument: Shape mismatch in tuple component 1. Expected [513,513,3], got [513,683,3]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Shape mismatch in tuple component 1. Expected [513,513,3], got [513,683,3]
	 [[Node: batch/padding_fifo_queue_enqueue = QueueEnqueueV2[Tcomponents=[DT_INT64, DT_FLOAT, DT_STRING, DT_INT32, DT_UINT8, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch/padding_fifo_queue, ParseSingleExample/ParseSingleExample:3, add_2/_4681, ParseSingleExample/ParseSingleExample:1, add_3/_4683, batch/packed, ParseSingleExample/ParseSingleExample:6)]]
INFO:tensorflow:Starting evaluation at 2018-04-25-17:26:20
INFO:tensorflow:Finished evaluation at 2018-04-25-17:26:20
miou_1.0[0]
Traceback (most recent call last):
  File "deeplab/eval.py", line 176, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "deeplab/eval.py", line 169, in main
    eval_interval_secs=FLAGS.eval_interval_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 301, in evaluation_loop
    timeout=timeout)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/training/python/training/evaluation.py", line 455, in evaluate_repeatedly
    '%Y-%m-%d-%H:%M:%S', time.gmtime()))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 658, in __exit__
    self._close_internal(exception_type)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 695, in _close_internal
    self._sess.close()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 943, in close
    self._sess.close()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1087, in close
    ignore_live_threads=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 252, in _run
    enqueue_callable()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1249, in _single_operation_run
    self._call_tf_sessionrun(None, {}, [], target_list, None)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape mismatch in tuple component 1. Expected [513,513,3], got [513,683,3]
	 [[Node: batch/padding_fifo_queue_enqueue = QueueEnqueueV2[Tcomponents=[DT_INT64, DT_FLOAT, DT_STRING, DT_INT32, DT_UINT8, DT_INT64], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](batch/padding_fifo_queue, ParseSingleExample/ParseSingleExample:3, add_2/_4681, ParseSingleExample/ParseSingleExample:1, add_3/_4683, batch/packed, ParseSingleExample/ParseSingleExample:6)]]

k-w-w assigned YknZhu and unassigned k-w-w Apr 26, 2018
k-w-w added the stat:awaiting model gardener (Waiting on input from TensorFlow model gardener) label Apr 26, 2018
RomRoc commented Apr 30, 2018

See here for the solution: #3886

In the ADE20K validation dataset the largest image is 2100 x 2100, so I set eval_crop_size=2113.

Anyway, when I run eval.py in Google Colab I get this error:


tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [`predictions` out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [0 0 0...] [y (mean_iou/ToInt64_2:0) = ] [150]
	 [[Node: mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert = Assert[T=[DT_STRING, DT_STRING, DT_STRING, DT_INT64, DT_STRING, DT_INT64], summarize=3, _device="/job:localhost/replica:0/task:0/device:CPU:0"](mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch/_4751, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_0, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_1, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_2, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_1/_4753, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/data_4, mean_iou/confusion_matrix/assert_less_1/Assert/AssertGuard/Assert/Switch_2/_4755)]]
	 [[Node: mean_iou/confusion_matrix/SparseTensorDenseAdd/_4769 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_2155_mean_iou/confusion_matrix/SparseTensorDenseAdd", tensor_type=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
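
For what it is worth, the assertion says the mean-IoU computation only accepts class ids strictly below 150, so it is worth checking what range of ids the ground-truth annotations actually contain. A minimal sketch for doing that (assuming Pillow and NumPy are installed and the zip has been extracted to ./ADEChallengeData2016; this is not part of the DeepLab scripts):

import glob
import numpy as np
from PIL import Image

# Hypothetical local path; point it at the extracted ADEChallengeData2016 annotations.
ANNOTATION_GLOB = './ADEChallengeData2016/annotations/validation/*.png'

min_label, max_label = 255, 0
for path in glob.glob(ANNOTATION_GLOB):
    labels = np.array(Image.open(path))
    min_label = min(min_label, int(labels.min()))
    max_label = max(max_label, int(labels.max()))

# The scene-parsing annotations use 0 for unlabeled pixels and 1..150 for the
# 150 classes, so a metric configured for exactly 150 classes can see id 150.
print('label ids range from', min_label, 'to', max_label)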


pedropgusmao (Author) commented

@RomRoc, sorry, but it is still not clear to me what needs to be done. Should I use different values for --eval_crop_size? I tried increasing the value of k and setting the same value for both --eval_crop_size flags.
When I reach --eval_crop_size=961 (k = 60), I start to get

InvalidArgumentError (see above for traceback): assertion failed: [predictions out of bound] [Condition x < y did not hold element-wise:] [x (mean_iou/confusion_matrix/control_dependency_1:0) = ] [0 0 0...] [y (mean_iou/ToInt64_2:0) = ] [150]
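
For anyone else hitting this: the crop sizes used in this thread (513, 961, 2113) all have the form 16 * k + 1, and the crop also has to be at least as large as the biggest image in the split, otherwise the padding FIFO queue rejects the example as in the first traceback. A minimal sketch of that arithmetic (my reading of the convention, not code taken from deeplab/eval.py):

def valid_eval_crop_size(max_image_dim, output_stride=16):
    """Smallest value of the form k * output_stride + 1 that covers max_image_dim."""
    k = -(-max_image_dim // output_stride)  # ceiling division
    return k * output_stride + 1

# 2100 -> 2113, which matches the value suggested for ADE20K above.
print(valid_eval_crop_size(2100))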

RomRoc commented Apr 30, 2018

The biggest image in the ADE20K validation dataset is 2100 x 2100, so I run with these parameters:

python deeplab/eval.py \
    --logtostderr \
    --eval_split="val" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --eval_crop_size=2113 \
    --eval_crop_size=2113 \
    --dataset="ade20k" \
    --checkpoint_dir=${TRAIN_LOGDIR} \
    --eval_logdir=${EVAL_LOGDIR} \
    --dataset_dir=${ADE20K_DATASET}

But I get the error specified above.

shivpatri commented

I too am stuck on the same issue. Could anyone help?

haichaoyu (Contributor) commented

Hello, @RomRoc

Where did you download the ADE20K dataset whose largest validation image is 2100 x 2100? I also checked the sizes but got results different from yours. With one copy of the dataset, the largest image in both the training and validation splits is 3504 x 3888; with another, the largest training image is 2100 x 2100 and the largest validation image is 1600 x 1600.

Could you please provide some details? Thanks.

Haichao

RomRoc commented May 6, 2018

Hello @haichaoyu,
I used the download script provided here, which fetches http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip.
The 2100 x 2100 image is ADE_train_00006921.jpg, in the training dataset.
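
In case it helps settle the size question, here is a minimal sketch for finding the largest image in each split of that archive (assuming Pillow is installed and the zip is extracted to ./ADEChallengeData2016; not part of the DeepLab tooling):

import glob
from PIL import Image

for split in ('training', 'validation'):
    largest, largest_path = (0, 0), None
    for path in glob.glob('./ADEChallengeData2016/images/%s/*.jpg' % split):
        with Image.open(path) as img:
            w, h = img.size  # reads the header only, no pixel decode needed
        if w * h > largest[0] * largest[1]:
            largest, largest_path = (w, h), path
    print(split, largest, largest_path)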

kkahatapitiya commented May 8, 2018

@ShivakshiT try DrSleep's solution here; it worked for me for training on MS COCO-Stuff.
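
I cannot see exactly what the linked solution does, but a workaround commonly used in other ADE20K pipelines for this kind of out-of-bound assertion is to remap the raw annotations so that the 150 classes occupy ids 0..149 and the original 0 (unlabeled) becomes an explicit ignore value, with the dataset's ignore_label set accordingly. A rough sketch of the remapping step only (hypothetical preprocessing, not DrSleep's or DeepLab's actual code):

import numpy as np
from PIL import Image

IGNORE_LABEL = 255  # the eval config would need to treat this value as ignore_label

def remap_ade20k_annotation(src_path, dst_path):
    """Shift labels from {0: unlabeled, 1..150: classes} to {255: ignore, 0..149: classes}."""
    labels = np.array(Image.open(src_path), dtype=np.int32)
    labels = labels - 1              # classes 1..150 -> 0..149, unlabeled 0 -> -1
    labels[labels < 0] = IGNORE_LABEL
    Image.fromarray(labels.astype(np.uint8)).save(dst_path)

Adjusting num_classes in the dataset description is the other obvious route; I have not verified which of the two the linked fix takes.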

wt-huang commented Nov 3, 2018

Closing as this is resolved.

wt-huang closed this as completed Nov 3, 2018