Timeout when jobs are waiting #8

Open
ViviYe opened this issue Aug 27, 2019 · 6 comments

Comments

@ViviYe
Contributor

ViviYe commented Aug 27, 2019

When I run 4 hw jobs in parallel, half of them end up with timeout errors. I suspect this is because the AFI generation stage cannot run in parallel, so the waiting time adds up until it reaches the timeout for the AFI stage.
In these two, the timeout happens at the beginning of the AFI stage:
http://ec2-54-234-195-6.compute-1.amazonaws.com:5000/jobs/2sfeH7kJqyg.html
http://ec2-54-234-195-6.compute-1.amazonaws.com:5000/jobs/w57woT8LUMs.html
However, some of the jobs time out at the beginning or in the middle of the make stage:
http://ec2-54-234-195-6.compute-1.amazonaws.com:5000/jobs/wpc3A_L4UDI.html
http://ec2-54-234-195-6.compute-1.amazonaws.com:5000/jobs/fjMiNTK1b2E/log.txt
And the synthesis timeout is already 20000. I also suspect that when jobs are waiting to enter the make stage, the waiting time counts towards the timeout?

@sampsyo
Contributor

sampsyo commented Aug 27, 2019

Hmm… the time should not "count" when a job is waiting to enter a stage. Timeouts are only on individual command executions, not on entire stages, so jobs can wait indefinitely until it's their turn.
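
To make that distinction concrete, here is a minimal sketch of the idea, not the project's actual code. The names `stage_lock`, `run_command`, and `run_stage` are hypothetical, and the lock is only an assumed stand-in for however jobs are really queued. The timeout clock starts when a command is launched, while waiting to enter the stage is not timed at all:

```python
import subprocess
import sys
import threading

# Hypothetical lock standing in for "only one job may be in this stage at a time".
stage_lock = threading.Lock()


def run_command(cmd, timeout):
    """Run one command. The timeout clock starts only when the command starts."""
    return subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=True
    )


def run_stage(commands, timeout):
    """Run every command of a stage, each with its own timeout."""
    # Acquiring the lock has no timeout, so a queued job can wait indefinitely
    # without that wait counting against any command's time limit.
    with stage_lock:
        return [run_command(cmd, timeout).stdout for cmd in commands]


if __name__ == "__main__":
    print(run_stage([[sys.executable, "-c", "print('ok')"]], timeout=30))
```

Under this model, a job that waits an hour for the lock and then runs a 10-second command does not time out, but a single command that itself blocks for longer than its limit does.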

The weirdest thing is this command timing out:

2019-08-27T12:57:35.534734 $ 'cd $AWS_FPGA_REPO_DIR ; source ./sdaccel_setup.sh > /dev/null ; echo $AWS_PLATFORM'
2019-08-27T12:58:40.674941 timeout after 60 seconds

It might be worth manually running that setup script a couple times to see if it waits for something to happen at any point. Not that this solves the mystery, but maybe there's a more direct way to obtain $AWS_PLATFORM that doesn't involve running the whole setup script…

@ViviYe
Contributor Author

ViviYe commented Aug 27, 2019

From the log, it seems like the command was only given 60s before timing out?

@ViviYe
Contributor Author

ViviYe commented Aug 27, 2019

Ah! I see now that all the processes have their own timeouts! I will try running the script multiple times! I will also try increasing the timeout for the specific processes that are timing out!!

@ViviYe
Contributor Author

ViviYe commented Aug 27, 2019

Adding on to this:
http://ec2-54-234-195-6.compute-1.amazonaws.com:5000/jobs/Z847gWO7OEQ.html
Multiple jobs also stopped at the make stage: they didn't fail, but they stopped running for some reason.

@sampsyo
Contributor

sampsyo commented Aug 28, 2019

Oh no, that's bad! I really don't know why that would be happening… it really seems like the timeout should work every time. 😱

I don't exactly know what's going wrong and we'll have to dig deeper. But I do notice that we're not cleaning up timed-out commands, as we probably should be according to the Python docs.
https://docs.python.org/3/library/subprocess.html#subprocess.Popen.communicate

But that probably couldn't cause what we're seeing here?
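
For reference, the cleanup pattern described in the `Popen.communicate` docs looks roughly like this. This is a minimal sketch rather than the project's code; `run_with_cleanup` is a hypothetical helper name:

```python
import subprocess
import sys


def run_with_cleanup(cmd, timeout):
    """Run cmd; on timeout, kill the child and reap it.

    Returns (returncode, stdout, stderr, timed_out).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
        return proc.returncode, stdout, stderr, False
    except subprocess.TimeoutExpired:
        # communicate() does not kill the child when the timeout expires,
        # so kill it explicitly and call communicate() again to reap it
        # and collect whatever output it produced.
        proc.kill()
        stdout, stderr = proc.communicate()
        return proc.returncode, stdout, stderr, True


if __name__ == "__main__":
    print(run_with_cleanup([sys.executable, "-c", "print('hi')"], timeout=30))
```

One caveat: if the command is launched through a shell, `kill()` only reaches the shell process itself, so anything the shell started may keep running and would need separate handling.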

@rachitnigam
Member

This was fixed. The issue was that a make stage would block on the license acquired by a different make stage on the same machine. With the new AWS deployment, this is not a problem.
