Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iRODS 4.3 iCAT server occasionally hangs on container startup #30

Open
mikkonie opened this issue Sep 19, 2024 · 3 comments
Open

iRODS 4.3 iCAT server occasionally hangs on container startup #30

mikkonie opened this issue Sep 19, 2024 · 3 comments
Assignees
Labels
bug Something isn't working
Milestone

Comments

@mikkonie
Copy link
Contributor

mikkonie commented Sep 19, 2024

After upgrading to iRODS 4.3, I'm seeing a problem where one or both of the two iRODS iCAT servers I run hang and time out when starting up the containers in docker compose. I did not experience this in the v4.2 versions of this image.

The 4.3 upgrade has been merged in the dev branch.

The docker-compose.yml used to launch the containers is here.

When this happens, the iRODS server is in an unresponsive state. Local iinit as well as external connection attempts from the docker host time out. Other servers in the network (postgres and redis) are working ok and reachable. Docker compose output displays the following:

irods-1       | Start iRODS
irods-1       | Test iinit
irods-1       | remote addresses: 172.20.0.5 ERROR: _rcConnect: connectToRhost timed out, retrying
irods-1       | WARNING: Error 25 disabling echo mode. Password will be displayed in plaintext.
irods-1       | Enter your current iRODS password:
irods-1       | remote addresses: 172.20.0.5 ERROR: _rcConnect: connectToRhost error, server on irods:1247 is probably down status = -115000 SYS_SOCK_READ_TIMEDOUT
irods-1       | remote addresses: 172.20.0.5 ERROR: Saved password, but failed to connect to server irods
irods-1       | iinit failed

During this, running /etc/init.d/irods status in the container does display running iRODS processes, but curiously no irodsDelayServer:

root@irods:/# /etc/init.d/irods status
irodsServer :
  Process 477
  Process 478
  Process 492

SSH:ing into the container and doing /etc/init.d/irods restart makes the server work normally. Most of the time, restarting the container does the same.

I have not yet found a way to consistently reproduce this, as it occurs seemingly randomly. This seem to happen most often after rebuilding the image and starting it for the first time, but not exclusively.

iRODS log says the following:

 {"log_category":"server","log_level":"info","log_message":"Initializing server ...","server_host":"irods","server_pid":477,"server_timestamp":"2024-09-19T10:41:46.139Z","server_type":"server","server_zone":"sodarZone"}
 {"log_category":"server","log_level":"info","log_message":"Setting up UNIX domain socket for agent factory ...","server_host":"irods","server_pid":477,"server_timestamp":"2024-09-19T10:41:46.141Z","server_type":"server","server_zone":"sodarZone"}
 {"log_category":"server","log_level":"info","log_message":"Forking agent factory ...","server_host":"irods","server_pid":477,"server_timestamp":"2024-09-19T10:41:46.141Z","server_type":"server","server_zone":"sodarZone"}
 {"log_category":"server","log_level":"info","log_message":"Connecting to agent factory [agent_factory_pid=478] ...","server_host":"irods","server_pid":477,"server_timestamp":"2024-09-19T10:41:46.142Z","server_type":"server","server_zone":"sodarZone"}
 {"log_category":"agent_factory","log_level":"info","log_message":"Initializing agent factory ...","server_host":"irods","server_pid":478,"server_timestamp":"2024-09-19T10:41:46.143Z","server_type":"agent_factory","server_zone":"sodarZone"}
 {"local_hostname":"localhost, irods","log_category":"server","log_level":"info","port":"1247","server_host":"irods","server_pid":477,"server_timestamp":"2024-09-19T10:41:46.190Z","server_type":"server","server_zone":"sodarZone"}
 {"log_category":"server","log_level":"info","server_host":"irods","server_pid":477,"server_timestamp":"2024-09-19T10:41:46.206Z","server_type":"server","server_zone":"sodarZone","zone_info.host":"localhost","zone_info.name":"sodarZone","zone_info.port":"1247","zone_info.type":"LOCAL_ICAT"}
 {"log_category":"server","log_level":"info","log_message":"rodsServer Release version rods4.3.1 - API Version d is up","server_host":"irods","server_pid":477,"server_timestamp":"2024-09-19T10:41:46.207Z","server_type":"server","server_zone":"sodarZone"}
 {"log_category":"legacy","log_level":"info","log_message":">>> control plane :: listening on port 1248\n","server_host":"irods","server_pid":477,"server_timestamp":"2024-09-19T10:41:46.208Z","server_type":"server","server_zone":"sodarZone"}
 {"log_category":"server","log_level":"error","log_message":"readWorkerTask - readStartupPack failed. -4000","server_host":"irods","server_pid":477,"server_timestamp":"2024-09-19T10:41:48.177Z","server_type":"server","server_zone":"sodarZone"}

The last readStartupPack failed bit looks of interest. However, that line also seems to appear on successful startups?

How to Reproduce

Start, stop and restart the docker compose environment 1-N times.

Oddly, this seems to happen on different intervals depending on the network I'm connected to when running the docker compose environment. However, I can't say for sure this isn't just a coincidence.

Also, I have not witnessed this happening when starting the image in GitHub Actions as part of SODAR server CI.

Further Steps for Debugging

  • Upgrade from 4.3.1 to 4.3.3
    • Also happens on 4.3.3
  • Enable debug logging in iRODS
  • Search for reports of similar problems and solutions, especially regarding irodsDelayServer
    • Could not find anything that seems relevant
  • If all else fails, contact iRODS support
@mikkonie mikkonie added the bug Something isn't working label Sep 19, 2024
@mikkonie mikkonie added this to the v4.3.x-1 milestone Sep 19, 2024
@mikkonie mikkonie self-assigned this Sep 19, 2024
@mikkonie mikkonie changed the title iRODS 4.3.1 server occasionally hangs on container startup iRODS 4.3 server occasionally hangs on container startup Sep 19, 2024
@mikkonie mikkonie changed the title iRODS 4.3 server occasionally hangs on container startup iRODS 4.3 iCAT server occasionally hangs on container startup Sep 19, 2024
@mikkonie mikkonie mentioned this issue Sep 19, 2024
17 tasks
@mikkonie
Copy link
Contributor Author

This hasn't happened lately in dev, except occasionally after rebuilding the image. I'll keep looking into it and see if it also occurs in staging/production, but if not, I won't be considering this a blocker for v4.3.3-1.

@mikkonie
Copy link
Contributor Author

mikkonie commented Nov 27, 2024

Sadly, I have observed this also happening in staging after rebuilding the server. I should try to debug the issue with iRODS support. First, I should most likely enable trace logging and trigger the freeze-up with a rebuild.

@mikkonie
Copy link
Contributor Author

Minimal example for reproducing this here. This seems to (mostly? exclusively?) affect rebuilding an existing image. Maybe some critical file is not correctly stored in volumes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants
@mikkonie and others