
Some problem while running on GPU #17

Open · onlyoh opened this issue Sep 7, 2020 · 10 comments

@onlyoh commented Sep 7, 2020

I want to test the performance of conv layer C9 of YOLO after FlexTensor's optimization, but there seem to be some problems when running optimize_conv2d.py on the GPU.

$ python optimize_conv2d.py --shapes yolo --from 8 --to 9 --parallel 16 --target cuda
......
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394505.223908] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394508.009939] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394510.781969] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
Fail to find valid schedule, too many errors
warm up [1599394513.576313] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
warm up [1599394516.424372] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
Warning: No valid schedule found in warm up process, please use more trials
Now automatically use more trials, increase 16
......

I have seen a previous issue, and the current code already uses 'spawn' for multiprocessing.
The run seems to keep going forever because it never finds a valid schedule.
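For reference, the 'spawn' fix from that earlier issue boils down to something like the sketch below (illustrative only; FlexTensor may set this elsewhere). CUDA contexts cannot survive fork(), so worker processes that touch the GPU must be started with 'spawn':

import multiprocessing

if __name__ == "__main__":
    # 'spawn' starts fresh interpreter processes instead of forking,
    # which is required when child processes initialize CUDA.
    multiprocessing.set_start_method("spawn")
    # ... launch the measurement workers here ...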

@KnowingNothing (Collaborator)

Please check your nvcc by typing nvcc --version in your terminal. If nvcc is not available, TVM's codegen will fail.
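For a quick sanity check beyond the version string, a trivial CUDA build through TVM fails immediately if nvcc or the codegen path is broken. A minimal sketch, assuming a TVM v0.7-style te API:

import shutil
import tvm
from tvm import te

# 1) Is nvcc on PATH at all?
assert shutil.which("nvcc") is not None, "nvcc not found on PATH"

# 2) Can TVM generate and compile a trivial CUDA kernel?
n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
s = te.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))
mod = tvm.build(s, [A, B], target="cuda")  # raises here if nvcc is unusable
print("CUDA build OK")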

@onlyoh (Author) commented Sep 10, 2020

The result of nvcc --version is:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

And executing print(torch.version.cuda) in a Python interpreter outputs:

10.0.130

@KnowingNothing (Collaborator)

How about setting a larger timeout? Just try adding --timeout 20 to your command.
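For example, appended to the original command:

$ python optimize_conv2d.py --shapes yolo --from 8 --to 9 --parallel 16 --timeout 20 --target cuda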

@onlyoh (Author) commented Sep 10, 2020

This method does not seem to work...

@KnowingNothing (Collaborator)

Then I'd suggest uncommenting the two #print(msg) lines in scheduler.py and telling me the error message, if any.
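For anyone following along: the pattern being referred to is an except-block that swallows build errors. Roughly like the sketch below (names and the exact location in scheduler.py are approximate, not the literal source):

try:
    func = tvm.build(s, bufs, target)  # build one candidate schedule
except Exception as e:
    msg = "op build fail:" + str(e)
    # print(msg)  # uncomment to see why each candidate schedule failed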

@onlyoh (Author) commented Sep 10, 2020

It outputs these messages:

Optimize yolo convolution layer 9 shape (1, 512, 28, 28, 512, 512, 1, 1, 1, 1, 0, 1, 1)
graph space size 2
op 0 space size: 25344000
[Warning] Directory lib is not empty, but reusing it
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
warm up [1599727687.361282] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
op build fail:module 'tvm.tir' has no attribute 'ir_pass'
......

@KnowingNothing (Collaborator)

I see. TVM is under rapid development and its API keeps changing. To use FlexTensor, you can try TVM at commit 89da63e228eae2b0b4fe39770031a042858c52a7.
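For example, checking out that commit (repository URL as of 2020; adjust if it has since moved):

$ git clone --recursive https://github.com/apache/incubator-tvm.git tvm
$ cd tvm
$ git checkout 89da63e228eae2b0b4fe39770031a042858c52a7
$ git submodule update --init --recursive

then rebuild TVM and reinstall its Python package before rerunning FlexTensor.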

@onlyoh (Author) commented Sep 10, 2020

Thanks, I will try it!

@hecmay commented Sep 16, 2020

A follow-up to this issue: I got the following errors when running the same example. I am using TVM v0.7 (not exactly the commit you recommended). What could be the reason for these empty error messages?

$ python optimize_conv2d.py --shapes yolo --from 8 --to 9 --parallel 16 --target cuda
Optimize yolo convolution layer 9 shape (1, 512, 28, 28, 512, 512, 1, 1, 1, 1, 0, 1, 1)
graph space size 2
op 0 space size: 25344000
[Warning] Directory lib is not empty, but reusing it
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
op build fail:
warm up [1600224234.227920] [ inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf ]

@KnowingNothing (Collaborator)

Did you check your nvcc?
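For empty messages like the ones above, str(e) of the caught exception is simply empty, so printing the full traceback is more informative. A hypothetical tweak to the same except-block in scheduler.py (exact location may differ):

import traceback

try:
    func = tvm.build(s, bufs, target)
except Exception:
    # traceback.format_exc() shows the real failure even when str(e) is empty
    print("op build fail:\n" + traceback.format_exc())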
