
Worker process dies with "This is an uncaught error in rrq, probably a bug!" message #59

Closed
yapus opened this issue Oct 15, 2021 · 4 comments

yapus (Contributor) commented Oct 15, 2021

Hello again. I have two workers processing a queue of tasks. After processing one task each, both workers die with the following error:

worker_1       | [2021-10-15 17:18:25] TASK_COMPLETE f94a41f63478ebddbd2ac156410f0daa                                                                                                                              
worker_1       | [2021-10-15 17:18:25] TASK_START 2cbf5520762384481743f061af6ae5bf                                                                                                                                 
worker_1       | [2021-10-15 17:18:25] HEARTBEAT stopping                                                                                                                                                          
worker_1       | [2021-10-15 17:18:25] STOP ERROR                                                                                                                                                                  
worker_1       | This is an uncaught error in rrq, probably a bug!                                                                                                                                                 
worker_1       | Error in self$location(hash, TRUE) : Some hashes were not found!                                                                                                                                  
worker_1       | Calls: rrq_worker_main ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous>                                                                                                                 
worker_1       | In addition: Warning messages:                                                                                                                                                                    
worker_1       | 1: In max(thisIndexValues) :                                                                                                                                                                      
worker_1       |   no non-missing arguments to max; returning -Inf                                                                                                                                                 
worker_1       | 2: In max(thisIndexValues) :                                                                                                                                                                      
worker_1       |   no non-missing arguments to max; returning -Inf                                                                                                                                                 
worker_1       | 3: In max(thisIndexValues) :                                                          
worker_1       |   no non-missing arguments to max; returning -Inf                                     
worker_1       | 4: In max(thisIndexValues) :                                                          
worker_1       |   no non-missing arguments to max; returning -Inf                                     
worker_1       | 5: In max(thisIndexValues) :                                                          
worker_1       |   no non-missing arguments to max; returning -Inf                                     
worker_1       | 6: In max(thisIndexValues) :                                                          
worker_1       |   no non-missing arguments to max; returning -Inf                                     
worker_1       | 7: In max(thisIndexValues) :       
worker_1       |   no non-missing arguments to max; returning -Inf                                       
worker_1       | 8: In max(thisIndexValues) :       
worker_1       |   no non-missing arguments to max; returning -Inf                                       
worker_1       | Execution halted 

I was playing around with the worker config and found that when I use heartbeat_period = 3 this error appears almost every time and nothing works. When I use heartbeat_period = 30, it works as expected.

Is it somehow related to high CPU load, i.e. the heartbeat cannot refresh its Redis key in time and the worker gets killed?
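For reference, the config change described above is just the heartbeat_period argument of the worker config (a sketch; the queue and config names are illustrative and match the snippet in the next comment):

r <- rrq::rrq_controller$new("myqueue")
# heartbeat_period = 3 almost always triggers the crash for me; 30 works fine
r$worker_config_save("localhost", heartbeat_period = 30)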

I also randomly catch another type of error in my HTTP controller:

api_1          | 2021-10-15 17:59:10.970794 rserve[122] INFO /result task_id = ad21417c5eec438f4d33f8789a6e5bc8
api_1          | 2021-10-15 17:59:10.972284 rserve[94] ERROR Error in redis_pipeline(ptr, cmds): Unknown type [redux bug -- please report]
api_1          |                                    
api_1          | 2021-10-15 17:59:10.973574 rserve[94] DEBUG Outgoing POST response (id 981310c4-2de1-11ec-8c9f-0242ac120006): 500 Internal Server Error
api_1          | {"timestamp":"2021-10-15 17:59:10.972620","level":"ERROR","name":"Application","pid":102,"msg":"","context":{"request_id":"987c44a4-2de1-11ec-8c9f-0242ac120006","message":{"error":"Failure commu
nicating with the Redis server","call":"redis_command(ptr, cmd)","traceback":["FUN(request, response)",".res$set_body(queue$task_status(task_id))","queue$task_status(task_id)","task_status(self$con, self$keys, t
ask_ids)","from_redis_hash(con, keys$task_status, task_ids, missing = TASK_MISSING)","con$HMGET(key, fields)","command(list(\"HMGET\", key, field))","redis_command(ptr, cmd)"]}}}

@richfitz (Member)

I think I need a bit more to get a reproducible example; this works for me:

r <- rrq::rrq_controller$new("myqueue")
r$worker_config_save("localhost", heartbeat_period = 3)
w <- rrq::worker_spawn(r, logdir = "logs")
t <- r$enqueue(1 + 1)
r$task_wait(t)
r$message_send_and_wait("PING") # worker looks alive
writeLines(readLines(file.path("logs", w))) # logs look fine

Does this happen every time? Given the timings above, it seems more likely that it's a race condition than load.

For the redux error:

redis_pipeline(ptr, cmds): Unknown type [redux bug -- please report]

that is really weird. It's coming from deep in the redis-reply-to-r code:

SEXP redis_reply_to_sexp(redisReply* reply, bool error_throw) {
  if (reply == NULL) {
    error("Failure communicating with the Redis server");
  }
  switch(reply->type) {
  case REDIS_REPLY_STATUS:
    return status_to_sexp(reply->str);
  case REDIS_REPLY_STRING:
    return raw_string_to_sexp(reply->str, reply->len);
  case REDIS_REPLY_INTEGER:
    return (reply->integer < INT_MAX ?
            ScalarInteger(reply->integer) :
            ScalarReal((double)reply->integer));
  case REDIS_REPLY_NIL:
    return R_NilValue;
  case REDIS_REPLY_ARRAY:
    return array_to_sexp(reply, error_throw);
  case REDIS_REPLY_ERROR:
    return reply_error(reply, error_throw);
  default:
    // In theory we should do something based on error action here but
    // it's also not triggerable.
    error("Unknown type [redux bug -- please report]"); // # nocov
  }
  return R_NilValue; // # nocov
}

So something you're doing is triggering a new response type. Most likely newer redis servers are returning something else? REDIS_REPLY_DOUBLE and REDIS_REPLY_BOOL look like they're not handled here but would be easy to trigger. What version of redis server do you have?
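If it helps, one quick way to check the server version from R with redux itself (a sketch assuming a default local connection; adjust host/port to match your setup):

con <- redux::hiredis()
info <- con$INFO()  # plain INFO returns one string of server fields
grep("^redis_version", strsplit(info, "\r\n")[[1]], value = TRUE)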

yapus (Contributor, Author) commented Oct 15, 2021

Hi, thanks for your reply! For the redux error: I just found richfitz/redux#50 and it looks relevant. I'm using a forking HTTP server, so I'll fix that now to reopen the Redis connection for every client connection. It would definitely be better to use a connection pool, but I couldn't find one quickly.

I'll fix the forking error with Redis and see if that helps with the other bug as well.
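Roughly what that change looks like in the request handler (a sketch; the handler and queue names are illustrative, not the actual application code):

handle_result <- function(task_id) {
  # Create the controller -- and therefore its Redis connection -- inside
  # the forked child instead of sharing the parent's connection
  queue <- rrq::rrq_controller$new("myqueue")
  queue$task_status(task_id)
}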

yapus (Contributor, Author) commented Oct 15, 2021

OK, I've fixed the Redis connection handling in forked processes and I can't see that error anymore. I think this can be closed for now; sorry for the noise.

It looks like I was somehow corrupting the data in the Redis DB by using a single Redis connection simultaneously in many forked processes (each handling an HTTP polling request) to get task statuses from rrq.

yapus closed this as completed Oct 15, 2021

@richfitz (Member)

Thanks, that makes sense as a cause if we have some corrupted client state there. If there's public code at any point that shows the full setup, I'd be curious. That said, creating a connection should cost ~1ms (roughly the overhead of a single task), so this may not be that big a deal in practice?
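A rough way to sanity-check that overhead locally (timings are machine-dependent):

# 100 connect-and-PING round trips; divide the elapsed time by 100
system.time(for (i in 1:100) redux::hiredis()$PING())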
