This repository was archived by the owner on Jan 7, 2025. It is now read-only.

How to use GPU correctly #133

Closed
tjb-tech opened this issue Mar 18, 2021 · 17 comments

Comments

@tjb-tech

Hello, could you please tell me how to enable the GPU for this project? I tried passing '--use_gpu' on the command line, but the GPU was not used. I hope you can answer my question as soon as you see this. Thank you very much.

@corneliusboehm
Contributor

Hi @tjb-tech! First of all, is your GPU readily set up with CUDA? You can verify that by running nvidia-smi and checking the reported CUDA version.

If you enable the --use_gpu option on the train_classifier.py script, it will automatically use the GPU at index 0 for training and nvidia-smi should list a new python process.
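For reference, this is roughly the pattern such a flag usually maps to in a PyTorch training script. This is only a sketch of the common idiom, not the actual code in train_classifier.py:

import argparse

import torch
import torch.nn as nn

# Sketch only: map a --use_gpu flag to a torch device (index 0 by default).
parser = argparse.ArgumentParser()
parser.add_argument('--use_gpu', action='store_true')
args = parser.parse_args()

device = torch.device('cuda:0' if args.use_gpu and torch.cuda.is_available() else 'cpu')

model = nn.Linear(16, 4).to(device)        # stand-in model moved to the selected device
batch = torch.randn(8, 16, device=device)  # inputs must live on the same device
print(device, model(batch).shape)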

@tjb-tech
Author


First of all, thank you very much for your timely reply. Using your method, we checked that nvidia-smi does show a new Python process. However, I found that the CPU utilization was close to 100%, while the GPU utilization was close to 1%. Could you tell me why? I hope to get your professional reply. Thank you very much.

@corneliusboehm
Contributor

The problem with video datasets is that loading and decoding the videos can get expensive. So it can happen that the model's update step on the GPU finishes faster than the preparation of the next batch, which leads to the CPU being utilized more than the GPU.
However, a utilization of 1% is really low. Could you verify with the following command during training whether the utilization is constantly that low or whether there are at least some periodic spikes?

watch -n 1 nvidia-smi

And could you send over your CPU and GPU specs?
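In case CPU-side data loading turns out to be the bottleneck, increasing the number of DataLoader workers and enabling pinned memory often helps. The snippet below is a generic PyTorch sketch with dummy tensors standing in for a real video dataset, not the loader setup used by this repository; note that on Windows, DataLoader workers additionally need to be created under an if __name__ == '__main__': guard:

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == '__main__':
    # Dummy clips standing in for decoded video: (samples, channels, frames, H, W)
    clips = torch.randn(64, 3, 16, 112, 112)
    labels = torch.randint(0, 5, (64,))
    dataset = TensorDataset(clips, labels)

    loader = DataLoader(
        dataset,
        batch_size=8,
        shuffle=True,
        num_workers=4,    # prepare several batches in parallel on the CPU
        pin_memory=True,  # speeds up host-to-GPU copies
    )

    for batch_clips, batch_labels in loader:
        if torch.cuda.is_available():
            batch_clips = batch_clips.cuda(non_blocking=True)  # overlap transfer with compute
        break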

@tjb-tech
Author

tjb-tech commented Mar 22, 2021


First of all, thank you very much for taking time out of your busy schedule to answer my questions. My CPU is an i5-9300H and my GPU is a GTX 1650. Screenshots of nvidia-smi before and after running, as well as the CPU and GPU usage, are attached below. Thank you again for your prompt and enthusiastic reply. Looking forward to hearing from you soon.
[screenshots: nvidia-smi before and after training, CPU and GPU usage]

@corneliusboehm
Contributor

Thanks for the info! It looks like the Python process is allocating some memory on the GPU, which is a good sign. Do you see any output on the console indicating that epochs are being completed? And do you get a resulting checkpoint?

Generally, does training in PyTorch work for you in other projects?
One more thing you could check is the following:

import torch
torch.cuda.is_available()

I must admit that we haven't tested our code on Windows in a while, so there might also be a platform-related issue.
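A slightly fuller version of that check prints the PyTorch build, the CUDA version it was compiled against, and the detected device; these are all standard torch calls, nothing project-specific:

import torch

print(torch.__version__)          # PyTorch build
print(torch.version.cuda)         # CUDA version PyTorch was built against (None on CPU-only builds)
print(torch.cuda.is_available())  # True if a usable GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the GTX 1650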

@corneliusboehm
Contributor

Hey @tjb-tech, have you been able to resolve your issue?

@tjb-tech
Author

tjb-tech commented Apr 6, 2021


Thank you very much for your concern, and I'm sorry for not replying to you sooner. We tried torch.cuda.is_available(), and the return value is True, but the GPU and CPU usage are still the same as before, so I think it may be a system compatibility issue, which you could look into further. By the way, my system is Windows 10.

@corneliusboehm
Contributor

Thanks for the update. Do you still get a checkpoint after training and how long does it take? Because if that generally works, I would go ahead and close this issue for now.

@tjb-tech
Author

tjb-tech commented Apr 6, 2021

Thanks for your reply. That's all for the GPU problem for the time being. I'm still trying to run your sense_studio code, but I encountered a first error:

* Serving Flask app "sense_studio" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: on
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 105-309-328
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

I tried the following approach:

from gevent import pywsgi

if __name__ == '__main__':
    # `app` is the Flask application object defined in sense_studio
    server = pywsgi.WSGIServer(('0.0.0.0', 5000), app)
    server.serve_forever()

It started up, but then I encountered the following problem:

[2021-04-06 13:19:30,225] ERROR in app: Exception on / [GET]
Traceback (most recent call last):
  File "D:\Anaconda\envs\sense\lib\site-packages\flask\app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "D:\Anaconda\envs\sense\lib\site-packages\flask\app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "D:\Anaconda\envs\sense\lib\site-packages\flask\app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "D:\Anaconda\envs\sense\lib\site-packages\flask\_compat.py", line 39, in reraise
    raise value
  File "D:\Anaconda\envs\sense\lib\site-packages\flask\app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "D:\Anaconda\envs\sense\lib\site-packages\flask\app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "D:/MyDocuments/Service Outsourcing/sense/tools/sense_studio/sense_studio.py", line 46, in projects_overview
    project['exists'] = os.path.exists(project['path'])
TypeError: 'bool' object is not subscriptable
127.0.0.1 - - [2021-04-06 13:19:30] "GET / HTTP/1.1" 500 490 0.004986
127.0.0.1 - - [2021-04-06 13:19:30] "GET /favicon.ico HTTP/1.1" 404 420 0.000999
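Judging from the traceback, projects_overview iterates over the entries of projects_config.json and one entry is a bare boolean rather than a dictionary with a 'path' key. A hypothetical guard that would skip such malformed entries (an illustration based only on the traceback and the assumption that the config maps project names to per-project dictionaries, not the actual sense_studio.py code) could look like this:

import os

def filter_valid_projects(projects):
    # Keep only entries that are dictionaries with a 'path' key; malformed
    # entries (e.g. a bare boolean) are skipped instead of raising TypeError.
    valid = {}
    for name, project in projects.items():
        if isinstance(project, dict) and 'path' in project:
            project['exists'] = os.path.exists(project['path'])
            valid[name] = project
    return valid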

@corneliusboehm
Contributor

* Serving Flask app "sense_studio" (lazy loading)
* Environment: production
  WARNING: This is a development server. Do not use it in a production deployment.
  Use a production WSGI server instead.
* Debug mode: on
* Restarting with stat
* Debugger is active!
* Debugger PIN: 105-309-328
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

This is regular output and is not relevant when running the application locally. You don't need to worry about setting up a WSGI server.

If the second error persists, I will have to take a look at that though 😕 Can you already send me the contents of your sense/tools/sense_studio/projects_config.json?

@tjb-tech
Author

tjb-tech commented Apr 6, 2021


Of course, my file looks like this:
[screenshot of projects_config.json contents]

@corneliusboehm
Contributor

Very interesting. This is either a very outdated format or an error occurred. Anyway, I would recommend deleting this file and trying again. Also you might want to pull our latest master branch, as we've recently added a few improvements.

@tjb-tech
Author

tjb-tech commented Apr 7, 2021


Thank you so much for your timely help. We have opened Sense Studio and created our own project. We have also uploaded our own data, but we can't click the Training button; the browser only shows javascript:void(0);, as in the figure below:
[screenshot of Sense Studio with the disabled Training button]

I would appreciate it very much if you could answer my questions

@corneliusboehm
Contributor

Yes, the training module was only added a few days ago, so after pulling our latest updates this feature should be enabled for you.

@tjb-tech
Author

I am very sorry that I have not been able to continue discussing this project with you recently because I have been busy. Your suggestion last time was very effective, and I admire it very much. Over the past two days, I reviewed your project again and carefully read the blog on the 20BN official website. I noticed the following test screen in your demo video, which was very impressive.
[screenshot of the test screen from the demo video]
I want to achieve this effect on my computer. Could you please tell me how this test page was built? Could you share this part? My heartfelt thanks in advance! Once again, I would like to express my admiration for your open source spirit.

@guillaumebrg
Contributor

Hey @tjb-tech, thank you for the kind words!

That specific demo which you found on our website is kind of old and wasn't obtained using sense. We haven't released this exact model with this exact set of classes. However, we've recently been working on providing a gesture control demo within sense which might do what you need. It's still work in progress (model weights haven't been released yet) but you can already have a look here: #149.

@corneliusboehm
Contributor

It looks like the original issue has been solved, so I'm going to close this thread now.
We're happy to keep supporting you, if more questions come up!
