[k8s] Job gets stuck in pending state #1391
This part in the master pod seems to be correct:
The worker requested a range, and the master sent the range. Based on this, the issue seems to be related to how the worker handled the range 0-1. The worker requested another range at the same second (13:04:58), so I suspect that the handling of the range 0-1 caused an exception. If you don't see any other log entries in lithops/lithops/serverless/backends/k8s/entry_point.py (lines 102 to 109 in c690601), maybe the hard-coded timeout is the problem. |
I have added a log line and this error message appears:
|
So it seems the hard-coded timeout is too low. I think we can safely set it to 1 second here, since it is only a timeout and won't affect performance. Feel free to submit a PR with the new log line and the new timeout if you test it and it works properly. |
Adding the following log line causes the log to be flooded (around 40-50 lines) with the same error message when the master pod is not ready (at the beginning of the execution or after a restart of the master pod). Is it a matter of concern?

    while call_ids_range is None:
        try:
            server = f'http://{master_ip}:{config.MASTER_PORT}'
            url = f'{server}/get-range/{job_key}/{total_calls}/{chunksize}'
            res = requests.get(url, timeout=1)
            call_ids_range = res.text  # for example: 0-5
        except Exception as e:
            logger.debug(f"Error getting range: {e}")  # <-- added log line
            time.sleep(0.1)
|
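If the message should stay enabled rather than be commented out, one way to address the flooding would be to throttle it so the error is only logged every Nth retry. This is only a sketch of that idea, not lithops code; the surrounding variables are the ones from the loop above, and LOG_EVERY_N is an assumption:

    # Sketch: log the connection error only once every LOG_EVERY_N retries to
    # avoid flooding the worker log while the master pod is still starting up.
    LOG_EVERY_N = 50  # assumed value
    attempt = 0
    while call_ids_range is None:
        try:
            server = f'http://{master_ip}:{config.MASTER_PORT}'
            url = f'{server}/get-range/{job_key}/{total_calls}/{chunksize}'
            res = requests.get(url, timeout=1)
            call_ids_range = res.text  # for example: 0-5
        except Exception as e:
            if attempt % LOG_EVERY_N == 0:
                logger.debug(f"Error getting range (attempt {attempt}): {e}")
            attempt += 1
            time.sleep(0.1)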
I see, that's probably why we didn't keep a log message here. You can leave it commented out for now if these logs are annoying: # logger.debug(f"Error getting range: {e}"). Btw, is the issue you were experiencing fixed by setting a higher timeout? |
Setting the timeout to 1 second doesn't seem to fix the issue. In the following execution, a mapper is stuck.
Lithops client:
Mapper:
Master:
|
So this is ok:
The question is: who is receiving this range? What are the logs of the reducer?
|
I suspect the 1 second timeout does fix the case of a stuck reducer, but not that of a stuck mapper. |
The missing one is 2-3
|
Based on the master logs, the range has been requested and provided successfully to the worker with IP address 10.244.235.194:
very weird... |
I modified my log scraping script to print the IP addresses of the pods (see the sketch after this comment). Here is an example execution. As you can see, the logs match the worker IP addresses.
|
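The script itself is not included in the thread; a minimal sketch of collecting each pod's IP together with its logs, assuming kubectl access to the namespace where lithops runs (the namespace name is an assumption), could look like this:

    # Sketch: print each pod's name and IP followed by its logs, so log lines
    # can be correlated with worker IP addresses. Assumes kubectl is configured;
    # the namespace 'default' is an assumption.
    import json
    import subprocess

    NAMESPACE = 'default'

    pods_json = subprocess.run(
        ['kubectl', 'get', 'pods', '-n', NAMESPACE, '-o', 'json'],
        capture_output=True, text=True, check=True,
    ).stdout

    for pod in json.loads(pods_json)['items']:
        name = pod['metadata']['name']
        pod_ip = pod['status'].get('podIP', '<no IP yet>')
        print(f'=== {name} ({pod_ip}) ===')
        logs = subprocess.run(
            ['kubectl', 'logs', '-n', NAMESPACE, name],
            capture_output=True, text=True,
        ).stdout
        print(logs)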
And if you activate the log again (the logger.debug(f"Error getting range: {e}") line), do you see any error? |
The log line is activated and the timeout is set to 1 second. However, there is no error in the logs.

    while call_ids_range is None:
        try:
            server = f'http://{master_ip}:{config.MASTER_PORT}'
            url = f'{server}/get-range/{job_key}/{total_calls}/{chunksize}'
            res = requests.get(url, timeout=1)
            call_ids_range = res.text  # for example: 0-5
        except Exception as e:
            logger.debug(f"Error getting range: {e}")
            time.sleep(0.1)
|
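Since no exception is logged but a worker still ends up stuck, one way to narrow the problem down would be to also log the successful response, so a stuck worker's log shows whether it ever received a range at all. This is a diagnostic sketch only, not a proposed fix; the variables are the ones from the loop above:

    # Sketch: log the success path as well, so a stuck worker's log shows
    # whether a call-IDs range was ever received from the master.
    while call_ids_range is None:
        try:
            server = f'http://{master_ip}:{config.MASTER_PORT}'
            url = f'{server}/get-range/{job_key}/{total_calls}/{chunksize}'
            res = requests.get(url, timeout=1)
            call_ids_range = res.text  # for example: 0-5
            logger.debug(f"Got call ids range from master: {call_ids_range}")
        except Exception as e:
            logger.debug(f"Error getting range: {e}")
            time.sleep(0.1)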
@bystepii Any update on this? |
@JosepSampe The issue is still present. Jobs still get stuck in the "pending" state randomly, including both mappers and reducers. |
Unfortunately, I'm not able to reproduce the issue using an AWS EKS cluster.
I have the following code that I am trying to run on a k8s cluster using lithops:
At some random iteration, a job gets stuck in the pending state and never gets executed. It can be a mapper as well as a reducer.
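The snippet referenced above is not reproduced in this thread. As an illustration only of the pattern described (repeated map_reduce rounds, any of which may hang), with all function names, data and the iteration count being hypothetical, such a driver could look like:

    # Illustration only: a repeated lithops map_reduce loop of the kind
    # described in this issue; functions and data are hypothetical.
    import lithops

    def map_function(x):
        return x * x

    def reduce_function(results):
        return sum(results)

    fexec = lithops.FunctionExecutor(backend='k8s')

    for i in range(100):
        fexec.map_reduce(map_function, range(8), reduce_function)
        result = fexec.get_result()  # hangs when a mapper/reducer job stays pending
        print(f'iteration {i}: {result}')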
Config
Logs
In the example execution, the job R027 is stuck in the pending state.
Lithops logs:
K8s logs for the lithops master (pod lithops-master-2304d41a58-s7jsx):
K8s logs for the job R027 (pod lithops-7046b1-0-r027-mjz9c):
K8s logs for the mapper jobs M027:
pod lithops-7046b1-0-m027-57tn9:
pod lithops-7046b1-0-m027-g7k9k:
pod lithops-7046b1-0-m027-kc4pr:
pod lithops-7046b1-0-m027-pvcfh: