[bug]: lncli getinfo and LND in general, getting stuck at COMMIT when using Postgres #8009
Comments
It's hard to say, we've heard a similar report from someone else. So it would be really helpful if you could help us debug this. Since you're already on commit 9f4a883, can you please add the following to your config and restart:
And then while such a blocking request is going on (e.g. you're waiting for the response, but before aborting the command), please run the following three commands and upload the output (shouldn't contain any sensitive info, just some memory addresses):
Thanks! |
Ok, I'll come back to you in a moment |
Now that I've restarted it (the 4th time this morning), it started working perfectly... Before that, even after the 3rd restart, it wasn't responding. It will eventually start acting up again, it always does, and when it does I'll run those commands and post the output here. |
@FeatureSpitter thanks for the goroutine dump, that's very helpful indeed. @Roasbeef I think we might have another mutex locking issue here, possibly amplified by the single write lock of Postgres. Here's the full dump: goroutinedump.txt What looks suspicious:
Looking closer, I think this might actually be because goroutine
So I'm not sure whether things being locked on the server's main mutex is a problem in itself, or something that can only happen when just a single DB writer is possible. @FeatureSpitter can you please do the following:
|
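To illustrate the suspicion above, here is a minimal sketch of how a read-only call can appear stuck when another thread holds a shared mutex for the full duration of a slow commit. It is not lnd's code: `server_mu`, `slow_writer` and `get_info` are hypothetical names, and the commit is simulated with a sleep.

```python
import threading
import time

# Hypothetical stand-in for a server-wide mutex; not lnd's actual code.
server_mu = threading.Lock()


def slow_writer(commit_seconds: float, started: threading.Event) -> None:
    """Hold the shared mutex for the whole duration of a slow DB commit."""
    with server_mu:
        started.set()
        time.sleep(commit_seconds)  # simulated COMMIT that is slow to return


def get_info(timeout: float) -> bool:
    """Read-only call that still needs the shared mutex.

    Returns True if the lock was obtained within `timeout` seconds.
    """
    acquired = server_mu.acquire(timeout=timeout)
    if acquired:
        server_mu.release()
    return acquired


def demo() -> tuple[bool, bool]:
    started = threading.Event()
    writer = threading.Thread(target=slow_writer, args=(0.3, started))
    writer.start()
    started.wait()                   # the writer now holds the mutex
    during = get_info(timeout=0.05)  # blocked while the commit is in flight
    writer.join()
    after = get_info(timeout=0.05)   # succeeds once the commit has returned
    return during, after


if __name__ == "__main__":
    print(demo())
```

If the pattern in the goroutine dump is of this kind, the read-only call is not slow in itself; it is only waiting for whoever holds the lock during the database write.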
I'll try that. The next time it hangs (shouldn't be taking too much time) I'll fetch the curls again and the postgres logs. |
@guggero here it goes. I increased the postgres max connection limit to 200 (both in LND and postgres configs).
And here are the new debug files: https://www.dropbox.com/scl/fo/h9xfa3k2yt08xr2lwygd5/h?rlkey=bbd6f3fh70674v4kxazza9trs&dl=0 |
I just did another test: After waiting for While it was down, lnd complained, as expected, and as soon as postgres was up it stopped complaining and went back to the normal logs:
However, getinfo is still stuck. So whatever is clogging this seems to be isolated to the lnd process only. Rebooting postgres didn't make any difference besides these log messages. Only rebooting lnd (without rebooting postgres) unclogs it, until it clogs again after like ~10-20 minutes of runtime. This test probably rules out postgres as the potential point of failure in this issue. |
This does look like something's not properly working with Postgres... Why would a commit take that long? Is there a way to find out all queries of that transaction? So basically, what it is committing? |
I would guess that the |
I'm using DataGrip. I don't see any query hanging in DataGrip, so it could be some background query. I am trying to figure out if there's a way to see the commit statements without having to go through the whole query history. Maybe it would be easier to log this on LND's side. Edit: This Edit 2: I restarted DataGrip and the SHOW TRANSACTIONs are gone, so maybe your guess is right. |
I'm logging all the statements now to see if I find a pattern, meanwhile it seems not all query arguments are going as prepared statements (parent_id):
|
I'm not really that well versed in Postgres, but is there a way to find out what queries are in a transaction? If yes, then I'd try to inspect the transaction the next time a commit takes several minutes. Because all individual queries seem to be very quick. |
I don't know if there's a way to somehow filter per transaction. |
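One way to approximate a per-transaction view, sketched here under assumptions: with `log_statement = 'all'` and the default `log_line_prefix` of recent PostgreSQL versions (`'%m [%p] '`), every logged statement carries the backend PID. Since a backend runs one transaction at a time, statements can be grouped by PID and cut at BEGIN and COMMIT. The helper name `group_transactions` and the sample lines are hypothetical and not taken from the logs in this thread.

```python
import re

# Assumes log_statement = 'all' and log_line_prefix = '%m [%p] '.
# Adjust the pattern if your prefix differs.
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+ \S+) \[(?P<pid>\d+)\] LOG:\s+"
    r"(?:statement|execute [^:]*): (?P<sql>.*)$"
)


def group_transactions(lines):
    """Return [(pid, [sql, ...]), ...] for each finished BEGIN..COMMIT span."""
    open_tx = {}   # pid -> statements of the transaction in progress
    finished = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # continuation lines and non-statement log entries
        pid, sql = int(m["pid"]), m["sql"].strip()
        head = sql.rstrip(";").upper()
        if head.startswith(("BEGIN", "START TRANSACTION")):
            open_tx[pid] = [sql]
        elif pid in open_tx:
            open_tx[pid].append(sql)
            if head in ("COMMIT", "ROLLBACK", "END"):
                finished.append((pid, open_tx.pop(pid)))
    return finished


# Hypothetical sample lines in the assumed log format.
SAMPLE = [
    "2023-09-19 10:00:00.001 UTC [101] LOG:  statement: BEGIN",
    "2023-09-19 10:00:00.002 UTC [202] LOG:  statement: BEGIN",
    "2023-09-19 10:00:00.003 UTC [101] LOG:  execute <unnamed>: SELECT 1",
    "2023-09-19 10:00:00.004 UTC [101] LOG:  statement: COMMIT",
]

if __name__ == "__main__":
    for pid, statements in group_transactions(SAMPLE):
        print(pid, statements)
```

Sorting the result by the number of statements, or by the time between the first and last line of each span, would show whether the slow commits belong to unusually large transactions.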
There's not much to tell from the logs other than these SQL statements are constantly showing up: There's huge activity in walletdb. Look at the timestamps, it's spam. But postgres is not under load. I doubt this spam is causing much of a clog. The I/O seems normal: So I don't understand why that routine holds on to the mutex forever. |
By the way, I am compiling LND with these tags:
I am using this image: https://github.com/lightningnetwork/lnd/blob/master/Dockerfile Which means I changed line 27 to add the tags to the Probably not related to the issue but at this point I am out of ideas. |
I just tried without line 27 modification, i.e., https://github.com/lightningnetwork/lnd/blob/master/Dockerfile as it is. And the same happens. I have no more ideas. |
@FeatureSpitter btw you should remove the blocking and mutex profile args if you don't need to obtain a profile, as they add overhead in the background even when you aren't gathering a profile. |
Can you increase the |
@FeatureSpitter did you see this behavior before updating? Which version did you update from? |
I started using 0.17, I didn't update, so this behavior is all I've ever known. |
You mean If that's what you are asking, then no, that log message never shows. |
It can't be that only a handful of people are experiencing this. This is a remote postgres server that I don't manage, and it still clogs at the commit. Unfortunately I wasn't able to figure out a way to see which queries are related to this commit. But everything points to some strange issue in the way LND uses postgres. Sorry if I can't be of better help for now. |
FWIW I tried to repro this on similar hardware, a RockPro64 with Setup:
Results:
The second-to-last one seems to have been an outlier. |
Give it more time, it's always fast in the beginning. |
Could you try this patch #8019 and see if it helps? On my machine it speeds up |
IIUC, the OP is also running Docker in 32-bit mode, which may contribute to some thrashing that can slow things down. As mentioned above, if you're running everything on a single machine, without any replication at all, then Zooming out: you're seeing postgres hang on commit, this isn't related to You should also attempt to increase
|
Just to add more details regarding the docker that I am using (which is the latest one in raspbian repos):
Regarding sqlite, I wanted to extend my postgres with remote replication, which is why I am using it in the first place. But keep in mind that even when I used an external postgres on a dedicated server with the specs I mentioned above, it had Tomorrow I'll try @yyforyongyu's PR, and also figure out if I can get a better docker. |
I changed the OS on my Raspi 4 from Raspbian 64bit to Ubuntu Server 23 64bit. It seems Raspbian 64bit uses the 32bit docker and the armhf architecture when fetching and building apps, instead of arm64. Changing to Ubuntu 23 64bit solved this issue, and so far, LND is no longer clogged. So in sum, a Raspberry Pi 4 can't run LND stably in 32 bits. |
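For anyone wanting to check for the same mismatch, a small sketch: it compares the machine type reported by the kernel with the pointer width of the running process. Run inside the container, a 64-bit kernel name next to a 32-bit pointer size would suggest a 32-bit userland like the one described above. The function names are illustrative only.

```python
import platform
import struct


def pointer_bits() -> int:
    """Pointer width, in bits, of the running interpreter (its userland ABI)."""
    return struct.calcsize("P") * 8


def describe() -> str:
    """Combine the kernel-reported machine type with the process pointer width."""
    return (
        f"kernel machine: {platform.machine()}, "
        f"userland pointer size: {pointer_bits()}-bit"
    )


if __name__ == "__main__":
    print(describe())
```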
Yeah, they'll also run into database size issues. Closing this now as it was an OS issue. |
I wouldn't close this so fast with a reason like "it was a OS issue", the same way that if I implemented a very inefficient sorting algorithm that people couldn't run on a slower OS, I wouldn't blame it on that OS. LND has an issue, which might, or might not be interesting to fix: Is this an OS problem? Certainly not. But it is a problem when apps require better performance, or require 64-bit addressing capabilities to run more efficiently, which is the case of LND. So, for anyone using LND with a Raspbian OS, be advised of these potential issues. I am actually curious how raspblitz fixed this, they also use some version of Raspbian. |
No, the issue here is that you rely on Docker, which because of the OS was only installed as 32bit and therefore could only run 32bit applications. RaspiBlitz has been using the 64bit version of |
Are you using the official |
I am starting to think that LND is just not compatible with Docker, no matter the bits. |
I've been running lnd within docker for many years, never had any issue. I'm actually starting to believe that Docker on a Raspberry Pi has issues. |
Well, not the first time I have issues with LND on Docker, my previous time was in an Intel 64 Ubuntu server. The common denominator starts to become clear (me, or LND lol). |
I haven't had the opportunity to try this. Could it be the answer? |
It depends. Do you still see the long-running commit in Postgres? Maybe you've gotten rid of that and are now running into the issue addressed by #8019. |
yes I still see long commits: And as ruled out above, it doesn't matter if postgres is on docker, on a pi, or on a quantum computer. |
I did some digging with the user on this issue and tried to repro the problem using his node plus my postgres server. Here's what I've analyzed. There is approx 100 ms network latency between his node and my postgres server. First, let's look at an example of what appears to be a hanging commit:
Note PID
The postgres lifecycle of a session:
If I understand correctly, this implies that the commit most likely did complete on postgres. |
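As a rough back-of-the-envelope aid for the roughly 100 ms latency mentioned above: if the statements of a transaction are sent one after another and each waits for its reply, latency alone puts a floor on the transaction's duration. The sketch below only does that arithmetic; the statement count is a hypothetical input, not a measurement from this node.

```python
def min_tx_seconds(statements: int, rtt_ms: float) -> float:
    """Lower bound for a transaction whose statements run sequentially,
    each costing at least one client/server round trip."""
    return statements * rtt_ms / 1000.0


if __name__ == "__main__":
    # Hypothetical: 50 statements (including BEGIN and COMMIT) at 100 ms RTT.
    print(min_tx_seconds(50, 100))
```

This does not explain a commit that hangs for minutes, but it helps separate slowness that is expected from latency from slowness that is not.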
Could you try running |
Yep, I think that's the next thing we're going to try if we can't figure out more from postgres state. |
OK seeing wildly different performance between timeout 0 and 30s (0 is faster) |
There are no errors in the logs, other commands like walletbalance and getnetworkinfo return (they take a few seconds tho), but getinfo and listpeers and maybe others take forever to return... no error in the logs. I am using 0.17 RC3 on raspberry pi 4 8GB. The CPU is at 1% or less most of the time, with some spikes at 40-70% either from LND or Bitcoind...
Logs: https://pastebin.com/W10vaKgF
LND: lncli version 0.17.0-beta.rc3 commit=v0.17.0-beta.rc3-19-g9f4a8836d
OS: Linux raspberrypi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux
Examples:
LND.CONF:
Bitcoind seems ok: