Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] Job gets stuck in pending state #1391

Open
bystepii opened this issue Jul 31, 2024 · 16 comments
Open

[k8s] Job gets stuck in pending state #1391

bystepii opened this issue Jul 31, 2024 · 16 comments

Comments

@bystepii
Copy link
Contributor

I have the following code that I am trying to run on a k8s cluster using lithops:

from lithops import FunctionExecutor
from time import sleep

def map_fn(i):
    sleep(3)

def red_fn(i):
    sleep(2)

fexec = FunctionExecutor()
for i in range(1000):
    print(f"i = {i}")
    fexec.map_reduce(map_fn, [1,2,3,4], red_fn)
    res = fexec.get_result()

At some random iteration, a job gets stuck in the pending state and never gets executed. It can be a mapper as well as a reducer.

Config

lithops:
 backend: k8s
 storage: minio
 log_level: DEBUG

k8s:
 docker_server: docker.io
 docker_user: stepii
 docker_password: ******
 runtime_memory: 10240
 runtime_cpu: 6
 worker_processes: 1
 runtime_timeout: 900
 namespace: indexing

minio:
 storage_bucket: stepan
 endpoint: http://192.168.5.24:9000
 access_key_id: ******
 secret_access_key: ******

Logs

In the example execution, the job R027 is stuck in the pending state.

Lithops logs:

[...]
i = 27
2024-07-31 13:04:49,542 [INFO] invokers.py:119 -- ExecutorID 7046b1-0 | JobID M027 - Selected Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - 10240MB
2024-07-31 13:04:49,542 [DEBUG] storage.py:472 -- Runtime metadata found in local memory cache
2024-07-31 13:04:49,542 [DEBUG] job.py:242 -- ExecutorID 7046b1-0 | JobID M027 - Serializing function and data
2024-07-31 13:04:49,544 [DEBUG] serialize.py:75 -- Referenced Modules: __main__, time
2024-07-31 13:04:49,544 [DEBUG] module_dependency.py:172 -- Module 'time' is built-in. Skipping.
2024-07-31 13:04:49,544 [DEBUG] serialize.py:120 -- Modules to transmit: None
2024-07-31 13:04:49,544 [DEBUG] job.py:284 -- ExecutorID 7046b1-0 | JobID M027 - Function and modules found in local cache
2024-07-31 13:04:49,544 [DEBUG] job.py:302 -- ExecutorID 7046b1-0 | JobID M027 - Uploading data to the storage backend
2024-07-31 13:04:49,631 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/7046b1-0-M027/aggdata.pickle - Size: 84.0B - OK
2024-07-31 13:04:49,631 [INFO] invokers.py:186 -- ExecutorID 7046b1-0 | JobID M027 - Starting function invocation: map_fn() - Total: 4 activations
2024-07-31 13:04:49,631 [DEBUG] invokers.py:212 -- ExecutorID 7046b1-0 | JobID M027 - Worker processes: 1 - Chunksize: 1
2024-07-31 13:04:49,645 [DEBUG] k8s.py:632 -- ExecutorID 7046b1-0 | JobID M027 - Required Workers: 4
2024-07-31 13:04:49,653 [DEBUG] k8s.py:661 -- ExecutorID 7046b1-0 | JobID M027 - Going to run 4 activations in 4 workers
2024-07-31 13:04:49,668 [DEBUG] invokers.py:271 -- ExecutorID 7046b1-0 | JobID M027 - Job invoked (0.037s) - Activation ID: lithops-7046b1-0-m027
2024-07-31 13:04:49,668 [INFO] invokers.py:225 -- ExecutorID 7046b1-0 | JobID M027 - View execution logs at /tmp/lithops-stepii/logs/7046b1-0-M027.log
2024-07-31 13:04:49,669 [DEBUG] monitor.py:429 -- ExecutorID 7046b1-0 - Starting Storage job monitor
2024-07-31 13:04:49,669 [INFO] wait.py:105 -- ExecutorID 7046b1-0 - Waiting for 20% of 4 function activations to complete
2024-07-31 13:04:51,708 [DEBUG] monitor.py:144 -- ExecutorID 7046b1-0 - Pending: 4 - Running: 0 - Done: 0
2024-07-31 13:04:52,141 [DEBUG] monitor.py:144 -- ExecutorID 7046b1-0 - Pending: 3 - Running: 1 - Done: 0
2024-07-31 13:04:52,574 [DEBUG] monitor.py:144 -- ExecutorID 7046b1-0 - Pending: 0 - Running: 4 - Done: 0
2024-07-31 13:04:55,681 [DEBUG] future.py:229 -- ExecutorID 7046b1-0 | JobID M027 - Got status from call 00000 - Activation ID: ce49b1360754 - Time: 3.20 seconds
2024-07-31 13:04:55,682 [DEBUG] future.py:229 -- ExecutorID 7046b1-0 | JobID M027 - Got status from call 00001 - Activation ID: 5e9ab076c1c4 - Time: 3.15 seconds
2024-07-31 13:04:55,682 [DEBUG] future.py:229 -- ExecutorID 7046b1-0 | JobID M027 - Got status from call 00003 - Activation ID: d603959a2909 - Time: 3.17 seconds
2024-07-31 13:04:55,683 [DEBUG] executors.py:366 -- ExecutorID 7046b1-0 | JobID M027 - 20% of map activations done. Spawning reduce stage
2024-07-31 13:04:55,683 [INFO] invokers.py:119 -- ExecutorID 7046b1-0 | JobID R027 - Selected Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - 10240MB
2024-07-31 13:04:55,683 [DEBUG] storage.py:472 -- Runtime metadata found in local memory cache
2024-07-31 13:04:55,683 [DEBUG] job.py:185 -- Extra environment vars {'__LITHOPS_REDUCE_JOB': 'True'}
2024-07-31 13:04:55,683 [DEBUG] job.py:242 -- ExecutorID 7046b1-0 | JobID R027 - Serializing function and data
2024-07-31 13:04:55,684 [DEBUG] serialize.py:75 -- Referenced Modules: __main__, time
2024-07-31 13:04:55,684 [DEBUG] module_dependency.py:172 -- Module 'time' is built-in. Skipping.
2024-07-31 13:04:55,684 [DEBUG] serialize.py:120 -- Modules to transmit: None
2024-07-31 13:04:55,685 [DEBUG] job.py:284 -- ExecutorID 7046b1-0 | JobID R027 - Function and modules found in local cache
2024-07-31 13:04:55,685 [DEBUG] job.py:302 -- ExecutorID 7046b1-0 | JobID R027 - Uploading data to the storage backend
2024-07-31 13:04:55,741 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/7046b1-0-R027/aggdata.pickle - Size: 13.6KiB - OK
2024-07-31 13:04:55,741 [INFO] invokers.py:186 -- ExecutorID 7046b1-0 | JobID R027 - Starting function invocation: red_fn() - Total: 1 activations
2024-07-31 13:04:55,741 [DEBUG] invokers.py:212 -- ExecutorID 7046b1-0 | JobID R027 - Worker processes: 1 - Chunksize: 1
2024-07-31 13:04:55,758 [DEBUG] k8s.py:632 -- ExecutorID 7046b1-0 | JobID R027 - Required Workers: 1
2024-07-31 13:04:55,766 [DEBUG] k8s.py:661 -- ExecutorID 7046b1-0 | JobID R027 - Going to run 1 activations in 1 workers
2024-07-31 13:04:55,782 [DEBUG] invokers.py:271 -- ExecutorID 7046b1-0 | JobID R027 - Job invoked (0.041s) - Activation ID: lithops-7046b1-0-r027
2024-07-31 13:04:55,782 [INFO] invokers.py:225 -- ExecutorID 7046b1-0 | JobID R027 - View execution logs at /tmp/lithops-stepii/logs/7046b1-0-R027.log
2024-07-31 13:04:55,782 [INFO] executors.py:494 -- ExecutorID 7046b1-0 - Getting results from 28 function activations
2024-07-31 13:04:55,783 [INFO] wait.py:101 -- ExecutorID 7046b1-0 - Waiting for 2 function activations to complete
2024-07-31 13:04:55,918 [DEBUG] monitor.py:144 -- ExecutorID 7046b1-0 - Pending: 1 - Running: 0 - Done: 4
2024-07-31 13:04:56,786 [DEBUG] future.py:229 -- ExecutorID 7046b1-0 | JobID M027 - Got status from call 00002 - Activation ID: d48d456eee35 - Time: 3.20 seconds
2024-07-31 13:05:02,704 [DEBUG] monitor.py:144 -- ExecutorID 7046b1-0 - Pending: 1 - Running: 0 - Done: 4
2024-07-31 13:05:09,433 [DEBUG] monitor.py:144 -- ExecutorID 7046b1-0 - Pending: 1 - Running: 0 - Done: 4
2024-07-31 13:05:16,152 [DEBUG] monitor.py:144 -- ExecutorID 7046b1-0 - Pending: 1 - Running: 0 - Done: 4
2024-07-31 13:05:22,877 [DEBUG] monitor.py:144 -- ExecutorID 7046b1-0 - Pending: 1 - Running: 0 - Done: 4

K8s logs for lithops master (pod lithops-master-2304d41a58-s7jsx):

[...]
10.244.140.136 - - [31/Jul/2024 13:04:51] "GET /get-range/7046b1-0-M027/4/1 HTTP/1.1" 200 -
[2024-07-31 13:04:51,829] INFO in lithopsentry: Sending range "1-2" to Host 10.244.235.216
10.244.235.216 - - [31/Jul/2024 13:04:51] "GET /get-range/7046b1-0-M027/4/1 HTTP/1.1" 200 -
[2024-07-31 13:04:51,911] INFO in lithopsentry: Sending range "2-3" to Host 10.244.140.172
10.244.140.172 - - [31/Jul/2024 13:04:51] "GET /get-range/7046b1-0-M027/4/1 HTTP/1.1" 200 -
[2024-07-31 13:04:51,930] INFO in lithopsentry: Sending range "3-4" to Host 10.244.235.194
10.244.235.194 - - [31/Jul/2024 13:04:51] "GET /get-range/7046b1-0-M027/4/1 HTTP/1.1" 200 -
[2024-07-31 13:04:51,935] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.190
10.244.140.190 - - [31/Jul/2024 13:04:51] "GET /get-range/7046b1-0-M026/4/1 HTTP/1.1" 200 -
[2024-07-31 13:04:52,053] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.228
10.244.235.228 - - [31/Jul/2024 13:04:52] "GET /get-range/7046b1-0-M026/4/1 HTTP/1.1" 200 -
[2024-07-31 13:04:52,112] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.242
10.244.235.242 - - [31/Jul/2024 13:04:52] "GET /get-range/7046b1-0-M026/4/1 HTTP/1.1" 200 -
[2024-07-31 13:04:52,171] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.143
10.244.140.143 - - [31/Jul/2024 13:04:52] "GET /get-range/7046b1-0-M026/4/1 HTTP/1.1" 200 -
[2024-07-31 13:04:57,983] INFO in lithopsentry: Sending range "0-1" to Host 10.244.140.142
10.244.140.142 - - [31/Jul/2024 13:04:58] "GET /get-range/7046b1-0-R027/1/1 HTTP/1.1" 200 -
[2024-07-31 13:04:58,184] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.142
10.244.140.142 - - [31/Jul/2024 13:04:58] "GET /get-range/7046b1-0-R027/1/1 HTTP/1.1" 200 -
[2024-07-31 13:04:59,130] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.171
10.244.140.171 - - [31/Jul/2024 13:04:59] "GET /get-range/7046b1-0-R026/1/1 HTTP/1.1" 200 -
[2024-07-31 13:05:05,414] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.136
10.244.140.136 - - [31/Jul/2024 13:05:05] "GET /get-range/7046b1-0-M027/4/1 HTTP/1.1" 200 -
[2024-07-31 13:05:06,656] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.172
10.244.140.172 - - [31/Jul/2024 13:05:06] "GET /get-range/7046b1-0-M027/4/1 HTTP/1.1" 200 -
[2024-07-31 13:05:06,712] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.194
10.244.235.194 - - [31/Jul/2024 13:05:06] "GET /get-range/7046b1-0-M027/4/1 HTTP/1.1" 200 -
[2024-07-31 13:05:06,714] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.216
10.244.235.216 - - [31/Jul/2024 13:05:06] "GET /get-range/7046b1-0-M027/4/1 HTTP/1.1" 200 -

K8s logs for the job R027 (pod lithops-7046b1-0-r027-mjz9c):

2024-07-31 13:04:57,977 [INFO] lithopsentry.py:84 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-07-31 13:04:58,187 [INFO] lithopsentry.py:111 -- Received range: -1
2024-07-31 13:04:58,187 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution

K8s logs for the mapper jobs M027:

pod lithops-7046b1-0-m027-57tn9:

2024-07-31 13:04:51,922 [INFO] lithopsentry.py:84 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-07-31 13:04:51,929 [INFO] lithopsentry.py:111 -- Received range: 3-4
2024-07-31 13:04:52,086 [DEBUG] minio.py:38 -- Creating MinIO client
2024-07-31 13:04:52,086 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-07-31 13:04:52,260 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-07-31 13:04:52,268 [INFO] utils.py:50 -- Getting function and modules
2024-07-31 13:04:52,268 [INFO] utils.py:62 -- Loading lithops.jobs/7046b1-0/c9195dee50e04507b9a5f2d1019afc17.func.pickle from storage
2024-07-31 13:04:52,288 [INFO] utils.py:107 -- Loading function data parameters from storage
2024-07-31 13:04:52,297 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-07-31 13:04:52,297 [INFO] handler.py:117 -- Worker process 0 started
2024-07-31 13:04:52,299 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-07-31 13:04:52,299 [INFO] handler.py:178 -- Execution ID: 7046b1-0-M027/00003
2024-07-31 13:04:52,299 [DEBUG] minio.py:38 -- Creating MinIO client
2024-07-31 13:04:52,299 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-07-31 13:04:52,306 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-07-31 13:04:52,312 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-07-31 13:04:52,413 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/7046b1-0-M027/00003/d603959a2909.init - Size: 0.0B - OK
2024-07-31 13:04:52,414 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-07-31 13:04:52,432 [DEBUG] jobrunner.py:204 -- Process started
2024-07-31 13:04:52,434 [INFO] jobrunner.py:234 -- Going to execute 'map_fn()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2024-07-31 13:04:55,437 [INFO] jobrunner.py:240 -- Success function execution
2024-07-31 13:04:55,439 [INFO] jobrunner.py:312 -- Process finished
2024-07-31 13:04:55,467 [DEBUG] handler.py:217 -- JobRunner process finished
2024-07-31 13:04:55,468 [INFO] status.py:88 -- Storing execution stats - Size: 2.1KiB
2024-07-31 13:05:06,705 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/7046b1-0-M027/00003/status.json - Size: 2.1KiB - OK
2024-07-31 13:05:06,705 [INFO] handler.py:292 -- Finished
2024-07-31 13:05:06,705 [INFO] handler.py:139 -- Worker process 0 finished
2024-07-31 13:05:06,712 [INFO] lithopsentry.py:111 -- Received range: -1
2024-07-31 13:05:06,712 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution

pod lithops-7046b1-0-m027-g7k9k:

2024-07-31 13:04:51,693 [INFO] lithopsentry.py:84 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-07-31 13:04:51,700 [INFO] lithopsentry.py:111 -- Received range: 0-1
2024-07-31 13:04:51,817 [DEBUG] minio.py:38 -- Creating MinIO client
2024-07-31 13:04:51,817 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-07-31 13:04:51,950 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-07-31 13:04:51,958 [INFO] utils.py:50 -- Getting function and modules
2024-07-31 13:04:51,958 [INFO] utils.py:69 -- Loading lithops.jobs/7046b1-0/c9195dee50e04507b9a5f2d1019afc17.func.pickle from storage
2024-07-31 13:04:51,978 [INFO] utils.py:114 -- Loading function data parameters from storage
2024-07-31 13:04:52,049 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-07-31 13:04:52,049 [INFO] handler.py:117 -- Worker process 0 started
2024-07-31 13:04:52,051 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-07-31 13:04:52,051 [INFO] handler.py:178 -- Execution ID: 7046b1-0-M027/00000
2024-07-31 13:04:52,051 [DEBUG] minio.py:38 -- Creating MinIO client
2024-07-31 13:04:52,051 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-07-31 13:04:52,056 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-07-31 13:04:52,061 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-07-31 13:04:52,191 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/7046b1-0-M027/00000/ce49b1360754.init - Size: 0.0B - OK
2024-07-31 13:04:52,192 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-07-31 13:04:52,213 [DEBUG] jobrunner.py:204 -- Process started
2024-07-31 13:04:52,215 [INFO] jobrunner.py:234 -- Going to execute 'map_fn()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2024-07-31 13:04:55,219 [INFO] jobrunner.py:240 -- Success function execution
2024-07-31 13:04:55,221 [INFO] jobrunner.py:312 -- Process finished
2024-07-31 13:04:55,248 [DEBUG] handler.py:217 -- JobRunner process finished
2024-07-31 13:04:55,250 [INFO] status.py:88 -- Storing execution stats - Size: 2.1KiB
2024-07-31 13:05:05,409 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/7046b1-0-M027/00000/status.json - Size: 2.1KiB - OK
2024-07-31 13:05:05,409 [INFO] handler.py:292 -- Finished
2024-07-31 13:05:05,410 [INFO] handler.py:139 -- Worker process 0 finished
2024-07-31 13:05:05,417 [INFO] lithopsentry.py:111 -- Received range: -1
2024-07-31 13:05:05,417 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution

pod lithops-7046b1-0-m027-kc4pr:

2024-07-31 13:04:51,821 [INFO] lithopsentry.py:84 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-07-31 13:04:51,829 [INFO] lithopsentry.py:111 -- Received range: 1-2
2024-07-31 13:04:51,975 [DEBUG] minio.py:38 -- Creating MinIO client
2024-07-31 13:04:51,976 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-07-31 13:04:52,148 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-07-31 13:04:52,157 [INFO] utils.py:50 -- Getting function and modules
2024-07-31 13:04:52,157 [INFO] utils.py:62 -- Loading lithops.jobs/7046b1-0/c9195dee50e04507b9a5f2d1019afc17.func.pickle from storage
2024-07-31 13:04:52,188 [INFO] utils.py:107 -- Loading function data parameters from storage
2024-07-31 13:04:52,199 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-07-31 13:04:52,199 [INFO] handler.py:117 -- Worker process 0 started
2024-07-31 13:04:52,201 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-07-31 13:04:52,201 [INFO] handler.py:178 -- Execution ID: 7046b1-0-M027/00001
2024-07-31 13:04:52,201 [DEBUG] minio.py:38 -- Creating MinIO client
2024-07-31 13:04:52,201 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-07-31 13:04:52,209 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-07-31 13:04:52,216 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-07-31 13:04:52,289 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/7046b1-0-M027/00001/5e9ab076c1c4.init - Size: 0.0B - OK
2024-07-31 13:04:52,290 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-07-31 13:04:52,309 [DEBUG] jobrunner.py:204 -- Process started
2024-07-31 13:04:52,311 [INFO] jobrunner.py:234 -- Going to execute 'map_fn()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2024-07-31 13:04:55,315 [INFO] jobrunner.py:240 -- Success function execution
2024-07-31 13:04:55,317 [INFO] jobrunner.py:312 -- Process finished
2024-07-31 13:04:55,346 [DEBUG] handler.py:217 -- JobRunner process finished
2024-07-31 13:04:55,347 [INFO] status.py:88 -- Storing execution stats - Size: 2.1KiB
2024-07-31 13:05:06,705 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/7046b1-0-M027/00001/status.json - Size: 2.1KiB - OK
2024-07-31 13:05:06,705 [INFO] handler.py:292 -- Finished
2024-07-31 13:05:06,705 [INFO] handler.py:139 -- Worker process 0 finished
2024-07-31 13:05:06,714 [INFO] lithopsentry.py:111 -- Received range: -1
2024-07-31 13:05:06,715 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution

pod lithops-7046b1-0-m027-pvcfh:

2024-07-31 13:04:51,905 [INFO] lithopsentry.py:84 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-07-31 13:04:51,913 [INFO] lithopsentry.py:111 -- Received range: 2-3
2024-07-31 13:04:52,082 [DEBUG] minio.py:38 -- Creating MinIO client
2024-07-31 13:04:52,082 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-07-31 13:04:52,254 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-07-31 13:04:52,263 [INFO] utils.py:50 -- Getting function and modules
2024-07-31 13:04:52,263 [INFO] utils.py:69 -- Loading lithops.jobs/7046b1-0/c9195dee50e04507b9a5f2d1019afc17.func.pickle from storage
2024-07-31 13:04:52,290 [INFO] utils.py:114 -- Loading function data parameters from storage
2024-07-31 13:04:52,299 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-07-31 13:04:52,299 [INFO] handler.py:117 -- Worker process 0 started
2024-07-31 13:04:52,301 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-07-31 13:04:52,301 [INFO] handler.py:178 -- Execution ID: 7046b1-0-M027/00002
2024-07-31 13:04:52,302 [DEBUG] minio.py:38 -- Creating MinIO client
2024-07-31 13:04:52,302 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-07-31 13:04:52,308 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-07-31 13:04:52,315 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-07-31 13:04:52,440 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/7046b1-0-M027/00002/d48d456eee35.init - Size: 0.0B - OK
2024-07-31 13:04:52,441 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-07-31 13:04:52,458 [DEBUG] jobrunner.py:204 -- Process started
2024-07-31 13:04:52,461 [INFO] jobrunner.py:234 -- Going to execute 'map_fn()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2024-07-31 13:04:55,464 [INFO] jobrunner.py:240 -- Success function execution
2024-07-31 13:04:55,466 [INFO] jobrunner.py:312 -- Process finished
2024-07-31 13:04:55,496 [DEBUG] handler.py:217 -- JobRunner process finished
2024-07-31 13:04:55,497 [INFO] status.py:88 -- Storing execution stats - Size: 2.1KiB
2024-07-31 13:05:06,652 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/7046b1-0-M027/00002/status.json - Size: 2.1KiB - OK
2024-07-31 13:05:06,652 [INFO] handler.py:292 -- Finished
2024-07-31 13:05:06,652 [INFO] handler.py:139 -- Worker process 0 finished
2024-07-31 13:05:06,659 [INFO] lithopsentry.py:111 -- Received range: -1
2024-07-31 13:05:06,659 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution
@JosepSampe
Copy link
Member

This part in the master pod seems to be correct:

[2024-07-31 13:04:57,983] INFO in lithopsentry: Sending range "0-1" to Host 10.244.140.142
10.244.140.142 - - [31/Jul/2024 13:04:58] "GET /get-range/7046b1-0-R027/1/1 HTTP/1.1" 200 -
[2024-07-31 13:04:58,184] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.142
10.244.140.142 - - [31/Jul/2024 13:04:58] "GET /get-range/7046b1-0-R027/1/1 HTTP/1.1" 200 -

The worker requested a range, the master sent the range 0-1 to the worker, the worker processed it, the worker requested another range, the master sent -1, the worker finished. This flow is ok.

Based on this, the issue seems to be related to how the worker handled the range 0-1. The worker requested another range at the same second (13:04:58), so I suspect that the handling of the range 0-1 caused an exception.

If you don't see any other log entries in lithops-7046b1-0-r027-mjz9c, it likely means the exception occurred here:

while call_ids_range is None:
try:
server = f'http://{master_ip}:{config.MASTER_PORT}'
url = f'{server}/get-range/{job_key}/{total_calls}/{chunksize}'
res = requests.get(url, timeout=0.1)
call_ids_range = res.text # for example: 0-5
except Exception:
time.sleep(0.1)

maybe the hard-coded timeout is too low? or something happened accessing res.text?
Probably it is a good idea to add a log line inside the except Exception: before the time.sleep()

@bystepii
Copy link
Contributor Author

bystepii commented Aug 2, 2024

I have added a log line and this error message appears:

2024-08-02 13:34:18,389 [INFO] lithopsentry.py:84 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-02 13:34:18,494 [DEBUG] lithopsentry.py:109 -- Error getting range: HTTPConnectionPool(host='10.244.140.143', port=8080): Read timed out. (read timeout=0.1)
2024-08-02 13:34:18,696 [DEBUG] lithopsentry.py:109 -- Error getting range: HTTPConnectionPool(host='10.244.140.143', port=8080): Read timed out. (read timeout=0.1)
2024-08-02 13:34:18,833 [INFO] lithopsentry.py:112 -- Received range: -1
2024-08-02 13:34:18,833 [INFO] lithopsentry.py:123 -- Finishing kubernetes execution

@JosepSampe
Copy link
Member

So it seems the hard-coded timeout is too low. I think we can safely set it to 1 second here, since it is only a timeout, and it won't affect the performance. Feel free to submit a PR with the new log line and the new timeout if you test it and it works properly

@bystepii
Copy link
Contributor Author

bystepii commented Aug 2, 2024

Adding the following log line causes the log to be flooded (like 40 - 50 lines) with the same error message when the master pod is not ready (at the beginning of the execution or after a restart of the master pod). Is it a matter of concern?

        while call_ids_range is None:
            try:
                server = f'http://{master_ip}:{config.MASTER_PORT}'
                url = f'{server}/get-range/{job_key}/{total_calls}/{chunksize}'
                res = requests.get(url, timeout=1)
                call_ids_range = res.text  # for example: 0-5
            except Exception as e:
           ---> logger.debug(f"Error getting range: {e}")
                time.sleep(0.1)
2024-08-02 14:08:45,400 [DEBUG] lithopsentry.py:109 -- Error getting range: HTTPConnectionPool(host='10.244.235.196', port=8080): Max retries exceeded with url: /get-range/f0207a-0-M045/4/1 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1b201ab100>: Failed to establish a new connection: [Errno 111] Connection refused'))

@JosepSampe
Copy link
Member

I see, probably that's why we didn't keep a log message here. You can put it commented for now if these logs are anoying:

# logger.debug(f"Error getting range: {e}")

btw, is the issue you were experiencing fixed by setting a higher timeout?

@bystepii
Copy link
Contributor Author

bystepii commented Aug 2, 2024

Setting timeout to 1s doesn't seem to fix the issue.

In the following execution, a mapper is stuck:

Lithops client:

[...]
i = 27
2024-08-02 15:30:37,546 [INFO] invokers.py:119 -- ExecutorID 5eb4c8-0 | JobID M027 - Selected Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - 10240MB
2024-08-02 15:30:37,546 [DEBUG] storage.py:472 -- Runtime metadata found in local memory cache
2024-08-02 15:30:37,546 [DEBUG] job.py:242 -- ExecutorID 5eb4c8-0 | JobID M027 - Serializing function and data
2024-08-02 15:30:37,548 [DEBUG] serialize.py:75 -- Referenced Modules: time, __main__
2024-08-02 15:30:37,548 [DEBUG] module_dependency.py:172 -- Module 'time' is built-in. Skipping.
2024-08-02 15:30:37,548 [DEBUG] serialize.py:120 -- Modules to transmit: None
2024-08-02 15:30:37,548 [DEBUG] job.py:284 -- ExecutorID 5eb4c8-0 | JobID M027 - Function and modules found in local cache
2024-08-02 15:30:37,548 [DEBUG] job.py:302 -- ExecutorID 5eb4c8-0 | JobID M027 - Uploading data to the storage backend
2024-08-02 15:30:37,625 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/5eb4c8-0-M027/aggdata.pickle - Size: 84.0B - OK
2024-08-02 15:30:37,625 [INFO] invokers.py:186 -- ExecutorID 5eb4c8-0 | JobID M027 - Starting function invocation: map_fn() - Total: 4 activations
2024-08-02 15:30:37,625 [DEBUG] invokers.py:212 -- ExecutorID 5eb4c8-0 | JobID M027 - Worker processes: 1 - Chunksize: 1
2024-08-02 15:30:37,641 [DEBUG] k8s.py:632 -- ExecutorID 5eb4c8-0 | JobID M027 - Required Workers: 4
2024-08-02 15:30:37,650 [DEBUG] k8s.py:661 -- ExecutorID 5eb4c8-0 | JobID M027 - Going to run 4 activations in 4 workers
2024-08-02 15:30:37,673 [DEBUG] invokers.py:271 -- ExecutorID 5eb4c8-0 | JobID M027 - Job invoked (0.047s) - Activation ID: lithops-5eb4c8-0-m027
2024-08-02 15:30:37,673 [INFO] invokers.py:225 -- ExecutorID 5eb4c8-0 | JobID M027 - View execution logs at /tmp/lithops-stepii/logs/5eb4c8-0-M027.log
2024-08-02 15:30:37,673 [DEBUG] monitor.py:429 -- ExecutorID 5eb4c8-0 - Starting Storage job monitor
2024-08-02 15:30:37,673 [INFO] wait.py:105 -- ExecutorID 5eb4c8-0 - Waiting for 20% of 4 function activations to complete
2024-08-02 15:30:39,711 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 4 - Running: 0 - Done: 0
2024-08-02 15:30:40,575 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 3 - Done: 0
2024-08-02 15:30:43,589 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 1 - Done: 2
2024-08-02 15:30:43,684 [DEBUG] future.py:229 -- ExecutorID 5eb4c8-0 | JobID M027 - Got status from call 00000 - Activation ID: f68a132ca64d - Time: 3.13 seconds
2024-08-02 15:30:43,685 [DEBUG] future.py:229 -- ExecutorID 5eb4c8-0 | JobID M027 - Got status from call 00001 - Activation ID: 1849dd89b0ca - Time: 3.15 seconds
2024-08-02 15:30:43,685 [DEBUG] executors.py:366 -- ExecutorID 5eb4c8-0 | JobID M027 - 20% of map activations done. Spawning reduce stage
2024-08-02 15:30:43,685 [INFO] invokers.py:119 -- ExecutorID 5eb4c8-0 | JobID R027 - Selected Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - 10240MB
2024-08-02 15:30:43,686 [DEBUG] storage.py:472 -- Runtime metadata found in local memory cache
2024-08-02 15:30:43,686 [DEBUG] job.py:185 -- Extra environment vars {'__LITHOPS_REDUCE_JOB': 'True'}
2024-08-02 15:30:43,686 [DEBUG] job.py:242 -- ExecutorID 5eb4c8-0 | JobID R027 - Serializing function and data
2024-08-02 15:30:43,687 [DEBUG] serialize.py:75 -- Referenced Modules: time, __main__
2024-08-02 15:30:43,687 [DEBUG] module_dependency.py:172 -- Module 'time' is built-in. Skipping.
2024-08-02 15:30:43,687 [DEBUG] serialize.py:120 -- Modules to transmit: None
2024-08-02 15:30:43,687 [DEBUG] job.py:284 -- ExecutorID 5eb4c8-0 | JobID R027 - Function and modules found in local cache
2024-08-02 15:30:43,687 [DEBUG] job.py:302 -- ExecutorID 5eb4c8-0 | JobID R027 - Uploading data to the storage backend
2024-08-02 15:30:43,793 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/5eb4c8-0-R027/aggdata.pickle - Size: 9.7KiB - OK
2024-08-02 15:30:43,793 [INFO] invokers.py:186 -- ExecutorID 5eb4c8-0 | JobID R027 - Starting function invocation: red_fn() - Total: 1 activations
2024-08-02 15:30:43,793 [DEBUG] invokers.py:212 -- ExecutorID 5eb4c8-0 | JobID R027 - Worker processes: 1 - Chunksize: 1
2024-08-02 15:30:43,811 [DEBUG] k8s.py:632 -- ExecutorID 5eb4c8-0 | JobID R027 - Required Workers: 1
2024-08-02 15:30:43,819 [DEBUG] k8s.py:661 -- ExecutorID 5eb4c8-0 | JobID R027 - Going to run 1 activations in 1 workers
2024-08-02 15:30:43,835 [DEBUG] invokers.py:271 -- ExecutorID 5eb4c8-0 | JobID R027 - Job invoked (0.041s) - Activation ID: lithops-5eb4c8-0-r027
2024-08-02 15:30:43,835 [INFO] invokers.py:225 -- ExecutorID 5eb4c8-0 | JobID R027 - View execution logs at /tmp/lithops-stepii/logs/5eb4c8-0-R027.log
2024-08-02 15:30:43,835 [INFO] executors.py:494 -- ExecutorID 5eb4c8-0 - Getting results from 28 function activations
2024-08-02 15:30:43,835 [INFO] wait.py:101 -- ExecutorID 5eb4c8-0 - Waiting for 3 function activations to complete
2024-08-02 15:30:44,018 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 2 - Running: 0 - Done: 3
2024-08-02 15:30:44,838 [DEBUG] future.py:229 -- ExecutorID 5eb4c8-0 | JobID M027 - Got status from call 00003 - Activation ID: a6600c24322a - Time: 3.16 seconds
2024-08-02 15:30:46,572 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 1 - Done: 3
2024-08-02 15:30:53,373 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 1 - Done: 3
2024-08-02 15:31:00,129 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 1 - Done: 3
2024-08-02 15:31:06,874 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 1 - Done: 3
2024-08-02 15:31:13,620 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 1 - Done: 3
2024-08-02 15:31:20,361 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 1 - Done: 3
2024-08-02 15:31:27,107 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 1 - Done: 3

Mapper:

2024-08-02 15:30:39,910 [INFO] lithopsentry.py:84 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-02 15:30:40,213 [INFO] lithopsentry.py:111 -- Received range: -1
2024-08-02 15:30:40,213 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution

Master:

[...]
10.244.235.255 - - [02/Aug/2024 15:30:39] "GET /get-range/5eb4c8-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-02 15:30:39,896] INFO in lithopsentry: Sending range "1-2" to Host 10.244.140.132
10.244.140.132 - - [02/Aug/2024 15:30:39] "GET /get-range/5eb4c8-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-02 15:30:39,924] INFO in lithopsentry: Sending range "2-3" to Host 10.244.235.194
10.244.235.194 - - [02/Aug/2024 15:30:40] "GET /get-range/5eb4c8-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-02 15:30:40,133] INFO in lithopsentry: Sending range "3-4" to Host 10.244.140.131
10.244.140.131 - - [02/Aug/2024 15:30:40] "GET /get-range/5eb4c8-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-02 15:30:40,135] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.194
10.244.235.194 - - [02/Aug/2024 15:30:40] "GET /get-range/5eb4c8-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-02 15:30:45,998] INFO in lithopsentry: Sending range "0-1" to Host 10.244.140.186
10.244.140.186 - - [02/Aug/2024 15:30:45] "GET /get-range/5eb4c8-0-R027/1/1 HTTP/1.1" 200 -
[2024-08-02 15:30:46,599] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.250
10.244.235.250 - - [02/Aug/2024 15:30:46] "GET /get-range/5eb4c8-0-R026/1/1 HTTP/1.1" 200 -
[2024-08-02 15:30:53,783] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.255
10.244.235.255 - - [02/Aug/2024 15:30:53] "GET /get-range/5eb4c8-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-02 15:30:53,941] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.132
10.244.140.132 - - [02/Aug/2024 15:30:53] "GET /get-range/5eb4c8-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-02 15:30:54,632] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.131
10.244.140.131 - - [02/Aug/2024 15:30:54] "GET /get-range/5eb4c8-0-M027/4/1 HTTP/1.1" 200 -

@JosepSampe
Copy link
Member

So this is ok:

[2024-08-02 15:30:45,998] INFO in lithopsentry: Sending range "0-1" to Host 10.244.140.186
10.244.140.186 - - [02/Aug/2024 15:30:45] "GET /get-range/5eb4c8-0-R027/1/1 HTTP/1.1" 200 -

The question is who is receiving this range?

What are the logs of the reducer lithops-5eb4c8-0-r027? Do you get the same error message as before with 1 second timeout?

2024-08-02 13:34:18,696 [DEBUG] lithopsentry.py:109 -- Error getting range: HTTPConnectionPool(host='10.244.140.143', port=8080): Read timed out. (read timeout=0.1)

@bystepii
Copy link
Contributor Author

bystepii commented Aug 2, 2024

What are the logs of the reducer lithops-5eb4c8-0-r027? Do you get the same error message as before with 1 second timeout?

2024-08-02 15:30:45,992 [INFO] lithopsentry.py:83 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-02 15:30:46,000 [INFO] lithopsentry.py:111 -- Received range: 0-1
2024-08-02 15:30:46,144 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-02 15:30:46,144 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-02 15:30:46,312 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-02 15:30:46,321 [INFO] utils.py:50 -- Getting function and modules
2024-08-02 15:30:46,321 [INFO] utils.py:62 -- Loading lithops.jobs/5eb4c8-0/af1bed0332dc25016a9eeb006e1d024a.func.pickle from storage
2024-08-02 15:30:46,330 [INFO] utils.py:107 -- Loading function data parameters from storage
2024-08-02 15:30:46,339 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-08-02 15:30:46,339 [INFO] handler.py:117 -- Worker process 0 started
2024-08-02 15:30:46,342 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-08-02 15:30:46,342 [INFO] handler.py:178 -- Execution ID: 5eb4c8-0-R027/00000
2024-08-02 15:30:46,342 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-02 15:30:46,342 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-02 15:30:46,348 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-02 15:30:46,355 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-08-02 15:30:46,435 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/5eb4c8-0-R027/00000/bce461f7f41d.init - Size: 0.0B - OK
2024-08-02 15:30:46,436 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-08-02 15:30:46,454 [DEBUG] jobrunner.py:204 -- Process started
2024-08-02 15:30:46,456 [INFO] jobrunner.py:120 -- Reduce function: waiting for map results
2024-08-02 15:30:46,456 [INFO] wait.py:101 -- ExecutorID 5eb4c8-0 - Waiting for 4 function activations to complete
2024-08-02 15:30:46,462 [DEBUG] monitor.py:429 -- ExecutorID 5eb4c8-0 - Starting Storage job monitor
2024-08-02 15:30:48,551 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 0 - Done: 3
2024-08-02 15:30:49,468 [DEBUG] future.py:229 -- ExecutorID 5eb4c8-0 | JobID M027 - Got status from call 00003 - Activation ID: a6600c24322a - Time: 3.16 seconds
2024-08-02 15:30:55,313 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 0 - Done: 3
2024-08-02 15:31:02,047 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 0 - Done: 3
2024-08-02 15:31:08,776 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 0 - Done: 3
2024-08-02 15:31:15,504 [DEBUG] monitor.py:144 -- ExecutorID 5eb4c8-0 - Pending: 1 - Running: 0 - Done: 3

I suspect the timeout of 1s does fix the case of a stuck reducer, but not of a stuck mapper.

@bystepii
Copy link
Contributor Author

bystepii commented Aug 2, 2024

The question is who is receiving this range?

The missing one is 2-3

$ cat logs/lithops-5eb4c8-0-m027-6b6f9.log 
2024-08-02 15:30:39,910 [INFO] lithopsentry.py:84 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-02 15:30:40,213 [INFO] lithopsentry.py:111 -- Received range: -1
2024-08-02 15:30:40,213 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution
$ cat logs/lithops-5eb4c8-0-m027-fr4st.log 
2024-08-02 15:30:39,819 [INFO] lithopsentry.py:84 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-02 15:30:39,827 [INFO] lithopsentry.py:111 -- Received range: 0-1
2024-08-02 15:30:39,977 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-02 15:30:39,977 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-02 15:30:40,147 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-02 15:30:40,156 [INFO] utils.py:50 -- Getting function and modules
2024-08-02 15:30:40,156 [INFO] utils.py:62 -- Loading lithops.jobs/5eb4c8-0/c9195dee50e04507b9a5f2d1019afc17.func.pickle from storage
2024-08-02 15:30:40,184 [INFO] utils.py:107 -- Loading function data parameters from storage
2024-08-02 15:30:40,193 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-08-02 15:30:40,193 [INFO] handler.py:117 -- Worker process 0 started
2024-08-02 15:30:40,195 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-08-02 15:30:40,196 [INFO] handler.py:178 -- Execution ID: 5eb4c8-0-M027/00000
2024-08-02 15:30:40,196 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-02 15:30:40,196 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-02 15:30:40,202 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-02 15:30:40,208 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-08-02 15:30:40,276 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/5eb4c8-0-M027/00000/f68a132ca64d.init - Size: 0.0B - OK
2024-08-02 15:30:40,276 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-08-02 15:30:40,294 [DEBUG] jobrunner.py:204 -- Process started
2024-08-02 15:30:40,296 [INFO] jobrunner.py:234 -- Going to execute 'map_fn()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2024-08-02 15:30:43,300 [INFO] jobrunner.py:240 -- Success function execution
2024-08-02 15:30:43,302 [INFO] jobrunner.py:312 -- Process finished
2024-08-02 15:30:43,328 [DEBUG] handler.py:217 -- JobRunner process finished
2024-08-02 15:30:43,329 [INFO] status.py:88 -- Storing execution stats - Size: 2.1KiB
2024-08-02 15:30:53,770 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/5eb4c8-0-M027/00000/status.json - Size: 2.1KiB - OK
2024-08-02 15:30:53,770 [INFO] handler.py:292 -- Finished
2024-08-02 15:30:53,771 [INFO] handler.py:139 -- Worker process 0 finished
2024-08-02 15:30:53,777 [INFO] lithopsentry.py:111 -- Received range: -1
2024-08-02 15:30:53,777 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution
$ cat logs/lithops-5eb4c8-0-m027-pzwtm.log 
2024-08-02 15:30:39,890 [INFO] lithopsentry.py:83 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-02 15:30:39,898 [INFO] lithopsentry.py:111 -- Received range: 1-2
2024-08-02 15:30:40,060 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-02 15:30:40,060 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-02 15:30:40,231 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-02 15:30:40,241 [INFO] utils.py:50 -- Getting function and modules
2024-08-02 15:30:40,241 [INFO] utils.py:62 -- Loading lithops.jobs/5eb4c8-0/c9195dee50e04507b9a5f2d1019afc17.func.pickle from storage
2024-08-02 15:30:40,261 [INFO] utils.py:107 -- Loading function data parameters from storage
2024-08-02 15:30:40,286 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-08-02 15:30:40,287 [INFO] handler.py:117 -- Worker process 0 started
2024-08-02 15:30:40,289 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-08-02 15:30:40,289 [INFO] handler.py:178 -- Execution ID: 5eb4c8-0-M027/00001
2024-08-02 15:30:40,289 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-02 15:30:40,289 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-02 15:30:40,298 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-02 15:30:40,305 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-08-02 15:30:40,376 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/5eb4c8-0-M027/00001/1849dd89b0ca.init - Size: 0.0B - OK
2024-08-02 15:30:40,376 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-08-02 15:30:40,395 [DEBUG] jobrunner.py:204 -- Process started
2024-08-02 15:30:40,397 [INFO] jobrunner.py:234 -- Going to execute 'map_fn()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2024-08-02 15:30:43,401 [INFO] jobrunner.py:240 -- Success function execution
2024-08-02 15:30:43,403 [INFO] jobrunner.py:312 -- Process finished
2024-08-02 15:30:43,435 [DEBUG] handler.py:217 -- JobRunner process finished
2024-08-02 15:30:43,436 [INFO] status.py:88 -- Storing execution stats - Size: 2.1KiB
2024-08-02 15:30:53,936 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/5eb4c8-0-M027/00001/status.json - Size: 2.1KiB - OK
2024-08-02 15:30:53,937 [INFO] handler.py:292 -- Finished
2024-08-02 15:30:53,937 [INFO] handler.py:139 -- Worker process 0 finished
2024-08-02 15:30:53,944 [INFO] lithopsentry.py:111 -- Received range: -1
2024-08-02 15:30:53,944 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution
$ cat logs/lithops-5eb4c8-0-m027-wrqg5.log 
2024-08-02 15:30:39,997 [INFO] lithopsentry.py:83 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-02 15:30:40,134 [INFO] lithopsentry.py:111 -- Received range: 3-4
2024-08-02 15:30:40,293 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-02 15:30:40,293 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-02 15:30:40,459 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-02 15:30:40,469 [INFO] utils.py:50 -- Getting function and modules
2024-08-02 15:30:40,469 [INFO] utils.py:62 -- Loading lithops.jobs/5eb4c8-0/c9195dee50e04507b9a5f2d1019afc17.func.pickle from storage
2024-08-02 15:30:40,480 [INFO] utils.py:107 -- Loading function data parameters from storage
2024-08-02 15:30:40,490 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-08-02 15:30:40,490 [INFO] handler.py:117 -- Worker process 0 started
2024-08-02 15:30:40,493 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-08-02 15:30:40,493 [INFO] handler.py:178 -- Execution ID: 5eb4c8-0-M027/00003
2024-08-02 15:30:40,493 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-02 15:30:40,493 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-02 15:30:40,503 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-02 15:30:40,510 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-08-02 15:30:40,584 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/5eb4c8-0-M027/00003/a6600c24322a.init - Size: 0.0B - OK
2024-08-02 15:30:40,585 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-08-02 15:30:40,606 [DEBUG] jobrunner.py:204 -- Process started
2024-08-02 15:30:40,609 [INFO] jobrunner.py:234 -- Going to execute 'map_fn()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2024-08-02 15:30:43,612 [INFO] jobrunner.py:240 -- Success function execution
2024-08-02 15:30:43,615 [INFO] jobrunner.py:312 -- Process finished
2024-08-02 15:30:43,647 [DEBUG] handler.py:217 -- JobRunner process finished
2024-08-02 15:30:43,649 [INFO] status.py:88 -- Storing execution stats - Size: 2.1KiB
2024-08-02 15:30:54,628 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/5eb4c8-0-M027/00003/status.json - Size: 2.1KiB - OK
2024-08-02 15:30:54,628 [INFO] handler.py:292 -- Finished
2024-08-02 15:30:54,628 [INFO] handler.py:139 -- Worker process 0 finished
2024-08-02 15:30:54,635 [INFO] lithopsentry.py:111 -- Received range: -1
2024-08-02 15:30:54,635 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution

@JosepSampe
Copy link
Member

Based on the master logs the range has been requested and provided successfully to the worker with IP address 10.244.235.194 :

[2024-08-02 15:30:39,924] INFO in lithopsentry: Sending range "2-3" to Host 10.244.235.194
10.244.235.194 - - [02/Aug/2024 15:30:40] "GET /get-range/5eb4c8-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-02 15:30:40,135] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.194
10.244.235.194 - - [02/Aug/2024 15:30:40] "GET /get-range/5eb4c8-0-M027/4/1 HTTP/1.1" 200 -

very weird...
Is it possible to verify that the logs you get match with the worker with this IP address?

@bystepii
Copy link
Contributor Author

bystepii commented Aug 5, 2024

I modified my log scrapping script to print the IP addresses of the pods. Here is an example execution. As you can see, the logs match with the worker IP addresses.

[...]
[2024-08-05 13:42:56,740] INFO in lithopsentry: Sending range "0-1" to Host 10.244.235.213
10.244.235.213 - - [05/Aug/2024 13:42:56] "GET /get-range/4a92ae-0-M028/4/1 HTTP/1.1" 200 -
[2024-08-05 13:42:56,929] INFO in lithopsentry: Sending range "1-2" to Host 10.244.140.143
10.244.140.143 - - [05/Aug/2024 13:42:56] "GET /get-range/4a92ae-0-M028/4/1 HTTP/1.1" 200 -
[2024-08-05 13:42:56,932] INFO in lithopsentry: Sending range "2-3" to Host 10.244.140.139
10.244.140.139 - - [05/Aug/2024 13:42:56] "GET /get-range/4a92ae-0-M028/4/1 HTTP/1.1" 200 -
[2024-08-05 13:42:57,021] INFO in lithopsentry: Sending range "3-4" to Host 10.244.140.172
10.244.140.172 - - [05/Aug/2024 13:42:57] "GET /get-range/4a92ae-0-M028/4/1 HTTP/1.1" 200 -
[2024-08-05 13:42:57,023] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.213
10.244.235.213 - - [05/Aug/2024 13:42:57] "GET /get-range/4a92ae-0-M028/4/1 HTTP/1.1" 200 -
[2024-08-05 13:42:59,327] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.192
10.244.235.192 - - [05/Aug/2024 13:42:59] "GET /get-range/4a92ae-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-05 13:42:59,484] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.191
10.244.140.191 - - [05/Aug/2024 13:42:59] "GET /get-range/4a92ae-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-05 13:42:59,552] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.175
10.244.140.175 - - [05/Aug/2024 13:42:59] "GET /get-range/4a92ae-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-05 13:42:59,668] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.199
10.244.235.199 - - [05/Aug/2024 13:42:59] "GET /get-range/4a92ae-0-M027/4/1 HTTP/1.1" 200 -
[2024-08-05 13:43:02,915] INFO in lithopsentry: Sending range "0-1" to Host 10.244.235.202
10.244.235.202 - - [05/Aug/2024 13:43:02] "GET /get-range/4a92ae-0-R028/1/1 HTTP/1.1" 200 -
[2024-08-05 13:43:04,953] INFO in lithopsentry: Sending range "-1" to Host 10.244.235.204
10.244.235.204 - - [05/Aug/2024 13:43:04] "GET /get-range/4a92ae-0-R027/1/1 HTTP/1.1" 200 -
[2024-08-05 13:43:11,070] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.139
10.244.140.139 - - [05/Aug/2024 13:43:11] "GET /get-range/4a92ae-0-M028/4/1 HTTP/1.1" 200 -
[2024-08-05 13:43:11,137] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.143
10.244.140.143 - - [05/Aug/2024 13:43:11] "GET /get-range/4a92ae-0-M028/4/1 HTTP/1.1" 200 -
[2024-08-05 13:43:11,195] INFO in lithopsentry: Sending range "-1" to Host 10.244.140.172
10.244.140.172 - - [05/Aug/2024 13:43:11] "GET /get-range/4a92ae-0-M028/4/1 HTTP/1.1" 200 -
Pod lithops-4a92ae-0-m028-vpmbh started with IP 10.244.235.213
2024-08-05 13:42:56,734 [INFO] lithopsentry.py:84 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-05 13:42:57,026 [INFO] lithopsentry.py:111 -- Received range: -1
2024-08-05 13:42:57,026 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution
Pod lithops-4a92ae-0-m028-rhfjt started with IP 10.244.140.143
2024-08-05 13:42:56,811 [INFO] lithopsentry.py:83 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-05 13:42:56,932 [INFO] lithopsentry.py:111 -- Received range: 1-2
2024-08-05 13:42:57,082 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-05 13:42:57,082 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-05 13:42:57,250 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-05 13:42:57,259 [INFO] utils.py:50 -- Getting function and modules
2024-08-05 13:42:57,259 [INFO] utils.py:62 -- Loading lithops.jobs/4a92ae-0/c9195dee50e04507b9a5f2d1019afc17.func.pickle from storage
2024-08-05 13:42:57,286 [INFO] utils.py:107 -- Loading function data parameters from storage
2024-08-05 13:42:57,296 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-08-05 13:42:57,296 [INFO] handler.py:117 -- Worker process 0 started
2024-08-05 13:42:57,298 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-08-05 13:42:57,299 [INFO] handler.py:178 -- Execution ID: 4a92ae-0-M028/00001
2024-08-05 13:42:57,299 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-05 13:42:57,299 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-05 13:42:57,305 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-05 13:42:57,312 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-08-05 13:42:57,388 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/4a92ae-0-M028/00001/a6f0e1603f32.init - Size: 0.0B - OK
2024-08-05 13:42:57,389 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-08-05 13:42:57,408 [DEBUG] jobrunner.py:204 -- Process started
2024-08-05 13:42:57,410 [INFO] jobrunner.py:234 -- Going to execute 'map_fn()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2024-08-05 13:43:00,411 [INFO] jobrunner.py:240 -- Success function execution
2024-08-05 13:43:00,413 [INFO] jobrunner.py:312 -- Process finished
2024-08-05 13:43:00,440 [DEBUG] handler.py:217 -- JobRunner process finished
2024-08-05 13:43:00,441 [INFO] status.py:88 -- Storing execution stats - Size: 2.1KiB
2024-08-05 13:43:11,132 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/4a92ae-0-M028/00001/status.json - Size: 2.1KiB - OK
2024-08-05 13:43:11,133 [INFO] handler.py:292 -- Finished
2024-08-05 13:43:11,133 [INFO] handler.py:139 -- Worker process 0 finished
2024-08-05 13:43:11,139 [INFO] lithopsentry.py:111 -- Received range: -1
2024-08-05 13:43:11,140 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution
Pod lithops-4a92ae-0-m028-kftbk started with IP 10.244.140.139
2024-08-05 13:42:56,900 [INFO] lithopsentry.py:83 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-05 13:42:57,022 [INFO] lithopsentry.py:111 -- Received range: 2-3
2024-08-05 13:42:57,171 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-05 13:42:57,172 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-05 13:42:57,346 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-05 13:42:57,355 [INFO] utils.py:50 -- Getting function and modules
2024-08-05 13:42:57,355 [INFO] utils.py:62 -- Loading lithops.jobs/4a92ae-0/c9195dee50e04507b9a5f2d1019afc17.func.pickle from storage
2024-08-05 13:42:57,388 [INFO] utils.py:107 -- Loading function data parameters from storage
2024-08-05 13:42:57,396 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-08-05 13:42:57,396 [INFO] handler.py:117 -- Worker process 0 started
2024-08-05 13:42:57,398 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-08-05 13:42:57,398 [INFO] handler.py:178 -- Execution ID: 4a92ae-0-M028/00002
2024-08-05 13:42:57,399 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-05 13:42:57,399 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-05 13:42:57,406 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-05 13:42:57,413 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-08-05 13:42:57,522 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/4a92ae-0-M028/00002/13ffcaa67a87.init - Size: 0.0B - OK
2024-08-05 13:42:57,522 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-08-05 13:42:57,540 [DEBUG] jobrunner.py:204 -- Process started
2024-08-05 13:42:57,542 [INFO] jobrunner.py:234 -- Going to execute 'map_fn()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2024-08-05 13:43:00,546 [INFO] jobrunner.py:240 -- Success function execution
2024-08-05 13:43:00,548 [INFO] jobrunner.py:312 -- Process finished
2024-08-05 13:43:00,576 [DEBUG] handler.py:217 -- JobRunner process finished
2024-08-05 13:43:00,577 [INFO] status.py:88 -- Storing execution stats - Size: 2.1KiB
2024-08-05 13:43:11,066 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/4a92ae-0-M028/00002/status.json - Size: 2.1KiB - OK
2024-08-05 13:43:11,066 [INFO] handler.py:292 -- Finished
2024-08-05 13:43:11,066 [INFO] handler.py:139 -- Worker process 0 finished
2024-08-05 13:43:11,073 [INFO] lithopsentry.py:111 -- Received range: -1
2024-08-05 13:43:11,073 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution
Pod lithops-4a92ae-0-m028-lrck6 started with IP 10.244.140.172
2024-08-05 13:42:56,902 [INFO] lithopsentry.py:83 -- Lithops v3.4.2.dev0 - Starting kubernetes execution
2024-08-05 13:42:57,023 [INFO] lithopsentry.py:111 -- Received range: 3-4
2024-08-05 13:42:57,186 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-05 13:42:57,186 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-05 13:42:57,360 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-05 13:42:57,369 [INFO] utils.py:50 -- Getting function and modules
2024-08-05 13:42:57,370 [INFO] utils.py:62 -- Loading lithops.jobs/4a92ae-0/c9195dee50e04507b9a5f2d1019afc17.func.pickle from storage
2024-08-05 13:42:57,388 [INFO] utils.py:107 -- Loading function data parameters from storage
2024-08-05 13:42:57,397 [INFO] handler.py:75 -- Tasks received: 1 - Worker processes: 1
2024-08-05 13:42:57,397 [INFO] handler.py:117 -- Worker process 0 started
2024-08-05 13:42:57,399 [INFO] handler.py:177 -- Lithops v3.4.2.dev0 - Starting k8s execution
2024-08-05 13:42:57,399 [INFO] handler.py:178 -- Execution ID: 4a92ae-0-M028/00003
2024-08-05 13:42:57,399 [DEBUG] minio.py:38 -- Creating MinIO client
2024-08-05 13:42:57,399 [DEBUG] minio.py:43 -- Setting MinIO endpoint to http://192.168.5.24:9000
2024-08-05 13:42:57,406 [INFO] minio.py:62 -- MinIO client created - Endpoint: http://192.168.5.24:9000
2024-08-05 13:42:57,413 [DEBUG] handler.py:194 -- Runtime: docker.io/stepii/lithops-kubernetes-default-v310:3.4.2.dev0 - Memory: 10240MB - Timeout: 895 seconds
2024-08-05 13:42:57,547 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/4a92ae-0-M028/00003/350438c49045.init - Size: 0.0B - OK
2024-08-05 13:42:57,547 [DEBUG] handler.py:206 -- Starting JobRunner process
2024-08-05 13:42:57,568 [DEBUG] jobrunner.py:204 -- Process started
2024-08-05 13:42:57,571 [INFO] jobrunner.py:234 -- Going to execute 'map_fn()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2024-08-05 13:43:00,574 [INFO] jobrunner.py:240 -- Success function execution
2024-08-05 13:43:00,577 [INFO] jobrunner.py:312 -- Process finished
2024-08-05 13:43:00,608 [DEBUG] handler.py:217 -- JobRunner process finished
2024-08-05 13:43:00,609 [INFO] status.py:88 -- Storing execution stats - Size: 2.1KiB
2024-08-05 13:43:11,191 [DEBUG] minio.py:110 -- PUT Object lithops.jobs/4a92ae-0-M028/00003/status.json - Size: 2.1KiB - OK
2024-08-05 13:43:11,191 [INFO] handler.py:292 -- Finished
2024-08-05 13:43:11,191 [INFO] handler.py:139 -- Worker process 0 finished
2024-08-05 13:43:11,197 [INFO] lithopsentry.py:111 -- Received range: -1
2024-08-05 13:43:11,198 [INFO] lithopsentry.py:122 -- Finishing kubernetes execution

@JosepSampe
Copy link
Member

And if you activate again the log (logger.debug(f"Error getting range: {e}")) in the exception, what error do you see in the pod that fails and the timeout of the get() set to 1?

@bystepii
Copy link
Contributor Author

bystepii commented Aug 5, 2024

The log line is activated and the timeout is set to 1 second. However, there is no error in the logs.

        while call_ids_range is None:
            try:
                server = f'http://{master_ip}:{config.MASTER_PORT}'
                url = f'{server}/get-range/{job_key}/{total_calls}/{chunksize}'
                res = requests.get(url, timeout=1)
                call_ids_range = res.text  # for example: 0-5
            except Exception as e:
                logger.debug(f"Error getting range: {e}")
                time.sleep(0.1)

@JosepSampe
Copy link
Member

@bystepii Any update on this?

@bystepii
Copy link
Contributor Author

bystepii commented Oct 1, 2024

@JosepSampe The issue is still present. Jobs still get stuck in the "pending" state randomly, including both mappers and reducers.

@JosepSampe
Copy link
Member

Unfortunately, I'm not able to reproduce the issue using an AWS EKS cluster

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants