Problem on MSeg, took 10 minutes for 1 image and only produced the gray scaled image? #56
Comments
My setup:
It sounds like you are running on the CPU and not on the GPU. I am running on a 24 GB GPU, and with the "default_config_1080_ms.yaml" config one full-HD input image took a few seconds. Check which GPU your config specifies under test_gpu: [0]. Also make sure PyTorch is picking the NVIDIA GPU as its device and not the CPU; a badly installed environment is a common cause. The log lines should tell you, or you can write a very simple PyTorch script that prints the device name (how to check PyTorch is using the GPU). If it is running on the CPU, I would create a new conda environment and install PyTorch with GPU/CUDA support (the pip install command may be a bit different for GPU support). As for the grayscale output, it is probably what you want. One channel (8-bit gray) is enough to store 256 classes, and MSeg produces fewer than that. Storing the labels this way is also disk-efficient: a 4K image is barely 112 KB or so.
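A minimal sketch of the "simple PyTorch script" mentioned above, assuming PyTorch is importable in the same environment you use for MSeg. The helper name `describe_torch_device` is made up for illustration; the `torch.cuda` calls are standard PyTorch.

```python
def describe_torch_device():
    """Return a short description of the device PyTorch would use."""
    try:
        import torch  # assumed to be installed in the active environment
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # Index and name of the CUDA device PyTorch selects by default.
        index = torch.cuda.current_device()
        return f"cuda:{index} ({torch.cuda.get_device_name(index)})"
    # No usable CUDA device: inference would fall back to the CPU.
    return "cpu"


if __name__ == "__main__":
    print(describe_torch_device())
```

If this prints "cpu" on a machine that has an NVIDIA GPU, the installed PyTorch build most likely lacks CUDA support, which would fit the slow inference described in this issue.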
Ah, you are using two GPUs (2x 24 GB). That may well be the cause. The GPU usage does indicate that PyTorch is using the GPUs. I have not tested the code (EPE training / MSeg generation) in a multi-GPU setup yet, but I remember from past PyTorch projects that if something was not designed well for multi-GPU, some steps (communication, combining results, even parts of the gradient computation, copying things back and forth, some merge ops on the CPU) took a long time and could be less efficient than on one GPU. What I would try in your case:
Also, if your KNN generation step is slow, that means faiss is using the CPU. On the GPU faiss is blazing fast; everything took 1-2 seconds for 500k samples in my case.
MSeg and EPE are different networks; you probably mean conda envs. Everything should run fine in the same environment (it works for me), but you can use different conda environments; it does not matter at all. PS: what is your input image size for the MSeg step?
Okay, so the MSeg script I used is the universal demo inference, or the universal demo for batch, so the command is like: The input file 01_images is a folder containing 2500 images from the Playing for Data dataset. Before that, I tried universal_demo.py, and I interrupted the run at the first image; I think these are why it takes so long just to infer 1 image. But I have not yet investigated or debugged it line by line any further. Because MSeg took too long for me, I now have MSeg BW images for 1.5-2 folders of PfD: images_01 and images_02. I am still debugging line by line to find where this num_samples = 0 comes from. Thank you so much @czero69 for your help, by the way :)
Ah, I am only using universal_demo.py.
You have at least two errors. One says some file is missing. The other, num_samples=0, happens when the dataloader sees no data at all, usually because of wrong paths, a wrong input file structure, or a wrong input file path. Preparing the EPE data requires extra care; go through all the preparation steps carefully. I recommend printing the results of every step below (values, means) to check that the tensors look OK (no NaNs, no infs, etc.). These are my scripts, where the images are 4K (hence 2160 3840 and -c 60).
Take note of the correct order for /path/real.txt (images, msegs). To verify a bit, it is good to check whether the matched crops look OK.
Also, make sure all your input color images are RGB, 3-channel (not RGBA). Robust maps (MSeg) and stencils (GT masks) are 8-bit. Your NPZs should have the same structure as the fake NPZs ('data' key in the NumPy dict, float16). If they have a channel dimension other than 32 (32 == total number of G-buffer channels), modify the code accordingly; it should be in one place.
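For the RGB vs. RGBA check, here is a rough sketch that reads only the PNG header, so it needs nothing beyond the standard library. It assumes your color images are PNG files; the function names `png_format` and `read_png_format` are made up for illustration and are not part of the EPE code.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Color type codes defined by the PNG format.
COLOR_TYPES = {0: "grayscale", 2: "RGB", 3: "palette",
               4: "grayscale+alpha", 6: "RGBA"}


def png_format(header: bytes):
    """Return (width, height, bit_depth, color_type) from the first 26 bytes of a PNG."""
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    # IHDR payload starts at byte 16: width, height (big-endian uint32),
    # then bit depth and color type (one byte each).
    width, height, bit_depth, color_type = struct.unpack(">IIBB", header[16:26])
    return width, height, bit_depth, COLOR_TYPES.get(color_type, f"unknown({color_type})")


def read_png_format(path):
    """Read the header of the PNG file at `path` and describe its format."""
    with open(path, "rb") as f:
        return png_format(f.read(26))
```

A color input should report a bit depth of 8 and "RGB" (24 bits per pixel); "RGBA" means the alpha channel has to be dropped before training. MSeg maps and stencils should report 8-bit "grayscale".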
After solving some trivial issues, everything is training fine for me. The results are, in short, breathtaking. I may rewrite the entire training pipeline for the latest PyTorch, in a style similar to how I work nowadays, so that it is easier for me to modify the EPE baseline architecture further, support batches > 1, add logging, etc.
Yes, these skipped entries are found in the script epe.datasets.utils.py, in the function read_filelist. real.txt: So I think I did the text files correctly, and I already checked for NaNs. The other inputs, like RGB 3-channel (not RGBA) and how many bits, I have not checked yet. I will report back as soon as I have tried all of your suggestions.
In compute_weights.py, note that the argument is in H, W order (not W, H), so e.g. full HD is 1080 1920; that was the reason for my NaNs.
It should be 24 bits; check one random image for fake and real. Not everywhere in the code is there a [:,:3,:,:], so RGBA will raise some 4 != 3 mismatch in tensor sizes.
The paths are almost certainly wrong. Take one file from each of your .txt and .csv files and run stat on it in the terminal.
At least the order looks OK. I have this order too: ["screenshot", "msegs/gray4k", "NPZs", "gray_stencils"]. The stat is just to make sure the .txt/.csv files have correct paths. The one you printed is correct, indicating the file does exist on disk. Rendered.txt looks OK too at first sight. But num_samples == 0 indicates the EPE pipeline does not see something. Check all 4 paths in some random row of rendered.txt. Check the paths in your citysample_ie2.config (or whatever config name you are using for EPE training). Also check what is going on with the missing files; probably some paths point to non-existent files.
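Instead of running stat on a single row, the same check can be scripted over a whole filelist. This is only a sketch: it assumes one sample per row with comma-separated paths (adjust `delimiter` if your files differ), and `find_missing_paths` is a hypothetical helper, not something from the EPE repository.

```python
import csv
import os


def find_missing_paths(filelist_path, delimiter=","):
    """Return (row_number, path) pairs for every listed file that does not exist."""
    missing = []
    with open(filelist_path, newline="") as f:
        for row_number, row in enumerate(csv.reader(f, delimiter=delimiter), start=1):
            for entry in row:
                path = entry.strip()
                # Skip empty fields; report anything that is not a regular file.
                if path and not os.path.isfile(path):
                    missing.append((row_number, path))
    return missing
```

An empty result means every path in the list points to an existing file. If the number of rows with missing paths matches the number of skipped entries, the paths are the likely cause of num_samples == 0.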
Ahh okay, thank you @czero69 for the stat tip to check the location of the images. Yes, when I run the training, num_samples = 0 comes from the skipped entries. For the config I use train_pfd2cs.yaml from GitHub; I only modified the basics like paths, etc. I even kept the same names, pfd and cs, just to avoid unnecessary errors. The 355 skipped entries are for validation and the 1066 are for training; val.txt has 355 lines/images and train.txt has 1066 images.
Hello @luda1013,
Hi @vace17, not yet, I am still running into some problems. Now I have run into this problem: In your case: the script always sets the device to cuda. You can also check while training by opening your terminal and running nvidia-smi. Could you please help me with the training? So far I cannot get it to train. Update: I have another issue now, and maybe it is the same as yours, @vace17:
@luda1013 I have a different issue: I don't encounter this specific error, and the training process runs, but it takes a lot of time.
@czero69 can I ask what your specific setup is?
I checked the current GPU usage from the terminal with nvidia-smi while the training process was running. It seems the GPU is being used, but its utilization keeps fluctuating between low values (5-10%) and 50-60%.
hey, I have tried two set-ups so far:
My entire epoch would be around 1M steps (batch == 1), 196x196 single crop. The authors mentioned somewhere here in the issues that for them it was also around 200k steps/day, and they were using 1x 3090.
Hello Kamil, nice to see you again. @czero69
Hello pros,
I am currently working with the enhancing image enhancement paper and algorithm and trying to implement it. In the process, we need to use MSeg segmentation for the real and rendered images/datasets. I have around 50-60k images.
The dependencies MSeg-api and MSeg_semantic are already installed. I tried the Google Colab first and then copied the commands so I could run the script on my Linux machine as well. The command looks like this:
python -u mseg_semantic/tool/universal_demo.py \
  --config="default_config_360.yaml" \
  model_name mseg-3m \
  model_path mseg-3m.pth \
  input_file /home/luda1013/PfD/image/try_images
The weights I used were downloaded from the Google Colab, i.e. mseg-3m-1080.pth.
But for me it took about 10 minutes for 1 image, and what I get in temp_files is just a grayscale image of it.
Could someone help me solve this problem? Thank you :)