
[bug]: lncli getinfo and LND in general, getting stuck at COMMIT when using Postgres #8009

Closed
FeatureSpitter opened this issue Sep 19, 2023 · 67 comments · Fixed by #8019
Labels: bug (Unintended code behaviour), needs triage

Comments

@FeatureSpitter

FeatureSpitter commented Sep 19, 2023

There are no errors in the logs. Other commands like walletbalance and getnetworkinfo return (they take a few seconds, though), but getinfo and listpeers and maybe others take forever to return. I am using 0.17 RC3 on a Raspberry Pi 4 8GB. The CPU is at 1% or less most of the time, with some spikes to 40-70% from either LND or bitcoind.

Logs: https://pastebin.com/W10vaKgF

[screenshot]

LND: lncli version 0.17.0-beta.rc3 commit=v0.17.0-beta.rc3-19-g9f4a8836d

OS: Linux raspberrypi 6.1.21-v8+ #1642 SMP PREEMPT Mon Apr 3 17:24:16 BST 2023 aarch64 GNU/Linux

Examples:

pi@raspberrypi:~ $ time ll walletbalance
{
    "total_balance": "33268",
    "confirmed_balance": "33268",
    "unconfirmed_balance": "0",
    "locked_balance": "0",
    "reserved_balance_anchor_chan": "0",
    "account_balance": {
        "default": {
            "confirmed_balance": "33268",
            "unconfirmed_balance": "0"
        }
    }
}

real	0m0.678s
user	0m0.102s
sys	0m0.073s

$ time ll getinfo
^C[lncli] rpc error: code = Canceled desc = context canceled

real	3m5.953s
user	0m0.087s
sys	0m0.095s

(I pressed Ctrl+C to make it stop)

$ time ll listchannels
^C[lncli] rpc error: code = Canceled desc = context canceled

real	3m28.391s
user	0m0.083s
sys	0m0.159s

(I pressed Ctrl+C to make it stop)

LND.CONF:

[Application Options]
debuglevel=info
maxpendingchannels=10
alias=*******************
color=#ffffff

rpclisten=0.0.0.0:10009
listen=0.0.0.0:9735
restlisten=0.0.0.0:8001

[Bitcoin]
bitcoin.active=1
bitcoin.testnet=0
bitcoin.mainnet=1
bitcoin.node=bitcoind

; The base fee in millisatoshi we will charge for forwarding payments on our
; channels.
bitcoin.basefee=5

; The fee rate used when forwarding payments on our channels. The total fee
; charged is basefee + (amount * feerate / 1000000), where amount is the
; forwarded amount.
bitcoin.feerate=1

[Bitcoind]
bitcoind.rpchost=bitcoind:8332
bitcoind.rpcuser=*****************************
bitcoind.rpcpass=***********************
bitcoind.zmqpubrawblock=tcp://bitcoind:28332
bitcoind.zmqpubrawtx=tcp://bitcoind:28333

; Fee estimate mode for bitcoind. It must be either "ECONOMICAL" or "CONSERVATIVE".
; If unset, the default value is "CONSERVATIVE".
bitcoind.estimatemode=CONSERVATIVE

[tor]
tor.active=true
tor.v3=true
tor.streamisolation=false
tor.skip-proxy-for-clearnet-targets=false
tor.socks=tor:9050
tor.control=tor:9051
tor.password=*********************
tor.targetipaddress=10.5.0.6

[wtclient]

; Activate Watchtower Client. To get more information or configure watchtowers
; run `lncli wtclient -h`.
wtclient.active=true

; Specify the fee rate with which justice transactions will be signed. This fee
; rate should be chosen as a maximum fee rate one is willing to pay in order to
; sweep funds if a breach occurs while being offline. The fee rate should be
; specified in sat/vbyte.
wtclient.sweep-fee-rate=30

[db]
db.backend=postgres

[postgres]
; Postgres connection string.
; Default:
;   db.postgres.dsn=
; Example:
db.postgres.dsn=postgres://postgres:********************************************@pgsql:5432/lnd?sslmode=disable

; Postgres connection timeout. Valid time units are {s, m, h}. Set to zero to
; disable.
db.postgres.timeout=3m

; Postgres maximum number of connections. Set to zero for unlimited. It is
; recommended to set a limit that is below the server connection limit.
; Otherwise errors may occur in lnd under high-load conditions.
; Default:
db.postgres.maxconnections=10

Bitcoind seems ok:

$ btc getblockchaininfo
{
  "chain": "main",
  "blocks": 808403,
  "headers": 808403,
  "bestblockhash": "0000000000000000000314d5ce9e52d2677c08d3dc8a617690a13897203f53fa",
  "difficulty": 54150142369480,
  "time": 1695107587,
  "mediantime": 1695104573,
  "verificationprogress": 0.9999999672514687,
  "initialblockdownload": false,
  "chainwork": "000000000000000000000000000000000000000056098a6c3da24d6620caa504",
  "size_on_disk": 581069847788,
  "pruned": false,
  "warnings": ""
}
FeatureSpitter added the bug (Unintended code behaviour) and needs triage labels on Sep 19, 2023
@guggero
Collaborator

guggero commented Sep 19, 2023

It's hard to say; we've heard a similar report from someone else. So it would be really helpful if you could help us debug this.

Since you're already on commit 9f4a883, can you please add the following to your config and restart:

profile=1234
blockingprofile=1
mutexprofile=1

And then while such a blocking request is going on (e.g. you're waiting for the response, but before aborting the command), please run the following three commands and upload the output (shouldn't contain any sensitive info, just some memory addresses):

curl http://localhost:1234/debug/pprof/goroutine?debug=2
curl http://localhost:1234/debug/pprof/block
curl http://localhost:1234/debug/pprof/mutex

Thanks!

@FeatureSpitter
Author

Ok, I'll come back to you in a moment

@FeatureSpitter
Author

FeatureSpitter commented Sep 19, 2023

Now when I restarted it (the 4th time this morning), it started working perfectly... even though after the 3rd restart earlier it still wasn't responding.
LND is just messing with me...

It will eventually start acting up again, it always does, and when it does I'll run those commands and post the output here.

@FeatureSpitter
Author

[goroutine dump attached]

@guggero
Collaborator

guggero commented Sep 19, 2023

@FeatureSpitter thanks for the goroutine dump, that's very helpful indeed.

@Roasbeef I think we might have another mutex locking issue here, possibly amplified by the single write lock of Postgres.

Here's the full dump: goroutinedump.txt

What looks suspicious:

--> 8 peers seem to be stuck here:

goroutine 714932 [semacquire, 61 minutes]:
sync.runtime_Semacquire(0x4232f90)
	runtime/sema.go:62 +0x3c
sync.(*WaitGroup).Wait(0x4232f88)
	sync/waitgroup.go:116 +0x68
github.com/lightningnetwork/lnd/peer.(*Brontide).WaitForDisconnect(0x4232d80, 0x5122e80)
	github.com/lightningnetwork/lnd/peer/brontide.go:1102 +0xd8
github.com/lightningnetwork/lnd.(*server).peerTerminationWatcher(0x4372b60, 0x4232d80, 0x5122e80)
	github.com/lightningnetwork/lnd/server.go:4017 +0x74
created by github.com/lightningnetwork/lnd.(*server).peerInitializer in goroutine 714931
	github.com/lightningnetwork/lnd/server.go:3973 +0x114

--> 19 goroutines seem to be stuck here:

goroutine 2264025 [sync.RWMutex.RLock, 30 minutes]:
sync.runtime_SemacquireRWMutexR(0x4372bcc, 0x0, 0x0)
	runtime/sema.go:82 +0x3c
sync.(*RWMutex).RLock(0x4372bc0)
	sync/rwmutex.go:71 +0x58
github.com/lightningnetwork/lnd.(*server).BroadcastMessage(0x4372b60, 0x0, {0x1fe16f8, 0x0, 0x0})
	github.com/lightningnetwork/lnd/server.go:3287 +0x2c
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).sendLocalBatch(0xcf32310, {0x54e4000, 0x0, 0x6e})
	github.com/lightningnetwork/lnd/discovery/gossiper.go:1286 +0x70
github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).splitAndSendAnnBatch.func2()
	github.com/lightningnetwork/lnd/discovery/gossiper.go:1260 +0x198
created by github.com/lightningnetwork/lnd/discovery.(*AuthenticatedGossiper).splitAndSendAnnBatch in goroutine 50911
	github.com/lightningnetwork/lnd/discovery/gossiper.go:1251 +0x200


--> Waiting for `p.startReady`:

goroutine 2262974 [select, 30 minutes]:
github.com/lightningnetwork/lnd/peer.(*Brontide).Disconnect(0xeeb6b40, {0x1575da0, 0xdb130c0})
	github.com/lightningnetwork/lnd/peer/brontide.go:1115 +0x94
github.com/lightningnetwork/lnd.(*server).removePeer(0x4372b60, 0xeeb6b40)
	github.com/lightningnetwork/lnd/server.go:4309 +0x100
github.com/lightningnetwork/lnd.(*server).InboundPeerConnected(0x4372b60, {0x1583294, 0x15632960})
	github.com/lightningnetwork/lnd/server.go:3586 +0x6f0
created by github.com/btcsuite/btcd/connmgr.(*ConnManager).listenHandler in goroutine 714406
	github.com/btcsuite/[email protected]/connmgr/connmanager.go:507 +0x250

Looking closer, I think this might actually be because goroutine 714108 is holding the single Postgres write lock while waiting for a response:

goroutine 714108 [runnable]:
syscall.Syscall(0x4, 0x3b, 0x810e800, 0x43)
	syscall/syscall_linux.go:69 +0x8
syscall.write(0x3b, {0x810e800, 0x43, 0x400})
	syscall/zsyscall_linux_arm.go:949 +0x48
syscall.Write(...)
	syscall/syscall_unix.go:209
internal/poll.ignoringEINTRIO(...)
	internal/poll/fd_unix.go:736
internal/poll.(*FD).Write(0x11648780, {0x810e800, 0x43, 0x400})
	internal/poll/fd_unix.go:380 +0x304
net.(*netFD).Write(0x11648780, {0x810e800, 0x43, 0x400})
	net/fd_posix.go:96 +0x38
net.(*conn).Write(0x1eb5aa98, {0x810e800, 0x43, 0x400})
	net/net.go:191 +0x48
github.com/jackc/pgconn.(*PgConn).execExtendedSuffix(0x7bd0000, {0x810e800, 0x2d, 0x400}, 0x7bd0054)
	github.com/jackc/[email protected]/pgconn.go:1153 +0x138
github.com/jackc/pgconn.(*PgConn).ExecPrepared(0x7bd0000, {0x1580210, 0x12f3dea0}, {0x11a29240, 0xe}, {0x154a2840, 0x1, 0x4}, {0x115a11a0, 0x1, ...}, ...)
	github.com/jackc/[email protected]/pgconn.go:1108 +0x118
github.com/jackc/pgx/v4.(*Conn).Query(0x114e8580, {0x1580210, 0x12f3dea0}, {0x64218c0, 0x56}, {0x479fa70, 0x2, 0x2})
	github.com/jackc/pgx/[email protected]/conn.go:655 +0xc38
github.com/jackc/pgx/v4/stdlib.(*Conn).QueryContext(0xb19bcc0, {0x1580210, 0x12f3dea0}, {0x64218c0, 0x56}, {0x21152870, 0x1, 0x1})
	github.com/jackc/pgx/[email protected]/stdlib/sql.go:436 +0x330
database/sql.ctxDriverQuery({0x1580210, 0x12f3dea0}, {0xef3c0cb8, 0xb19bcc0}, {0x0, 0x0}, {0x64218c0, 0x56}, {0x21152870, 0x1, ...})
	database/sql/ctxutil.go:48 +0x94
database/sql.(*DB).queryDC.func1()
	database/sql/sql.go:1748 +0x178
database/sql.withLock({0x157ce68, 0x7d725a0}, 0x6db5a1c)
	database/sql/sql.go:3502 +0x84
database/sql.(*DB).queryDC(0x42b0bd0, {0x1580210, 0x12f3dea0}, {0x15801d0, 0x82d30b0}, 0x7d725a0, 0xaec03e0, {0x64218c0, 0x56}, {0x6db5b40, ...})
	database/sql/sql.go:1743 +0x174
database/sql.(*Tx).QueryContext(0xdba5bc0, {0x1580210, 0x12f3dea0}, {0x64218c0, 0x56}, {0x6db5b40, 0x1, 0x1})
	database/sql/sql.go:2497 +0xbc
database/sql.(*Tx).QueryRowContext(...)
	database/sql/sql.go:2515
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*readWriteTx).QueryRow(0x15ace3a8, {0x64218c0, 0x56}, {0x6db5b40, 0x1, 0x1})
	github.com/lightningnetwork/lnd/[email protected]/sqlbase/readwrite_tx.go:180 +0x60
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*readWriteCursor).Next(0x6db5b80)
	github.com/lightningnetwork/lnd/[email protected]/sqlbase/readwrite_cursor.go:93 +0x1f8
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*readWriteBucket).ForEach(0x5130c10, 0x1fe64c60)
	github.com/lightningnetwork/lnd/[email protected]/sqlbase/readwrite_bucket.go:62 +0x7c
github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).pruneGraphNodes(0x448bf80, {0x158597c, 0x5130b20}, {0x158597c, 0x5130c10})
	github.com/lightningnetwork/lnd/channeldb/graph.go:1458 +0x194
github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).PruneGraphNodes.func1({0x15844f4, 0x15ace3a8})
	github.com/lightningnetwork/lnd/channeldb/graph.go:1410 +0x11c
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*db).executeTransaction.func2()
	github.com/lightningnetwork/lnd/[email protected]/sqlbase/db.go:311 +0x30
github.com/lightningnetwork/lnd/kvdb/sqlbase.catchPanic(0x6db5cb0)
	github.com/lightningnetwork/lnd/[email protected]/sqlbase/db.go:196 +0x68
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*db).executeTransaction(0x439a5c0, 0x14ce83e0, 0x13dcd34, 0x0)
	github.com/lightningnetwork/lnd/[email protected]/sqlbase/db.go:311 +0x144
github.com/lightningnetwork/lnd/kvdb/sqlbase.(*db).Update(0x439a5c0, 0x14ce83e0, 0x13dcd34)
	github.com/lightningnetwork/lnd/[email protected]/sqlbase/db.go:227 +0x34
github.com/lightningnetwork/lnd/kvdb.Update(...)
	github.com/lightningnetwork/lnd/[email protected]/interface.go:16
github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).PruneGraphNodes(0x448bf80)
	github.com/lightningnetwork/lnd/channeldb/graph.go:1396 +0x84
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).pruneZombieChans(0x5622840)
	github.com/lightningnetwork/lnd/routing/router.go:990 +0x76c
github.com/lightningnetwork/lnd/routing.(*ChannelRouter).networkHandler(0x5622840)
	github.com/lightningnetwork/lnd/routing/router.go:1232 +0xbc4
created by github.com/lightningnetwork/lnd/routing.(*ChannelRouter).Start in goroutine 1
	github.com/lightningnetwork/lnd/routing/router.go:689 +0x944

So I'm not sure whether goroutines blocking on the server's main mutex is a problem in itself, or whether it only becomes possible because there is only a single DB writer.

@FeatureSpitter can you please do the following:

  • Check the Postgres log to see if there's anything suspicious that would explain why a single query is taking so long.
  • Try reducing the db.postgres.timeout=3m value to something smaller, e.g. 1m.
  • Try increasing the db.postgres.maxconnections=10 value (but first check the value actually configured in Postgres, which must be at least as large as the number you're using in lnd; see the query sketch below).
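A minimal sketch of how that server-side limit can be checked and raised from psql (assuming superuser access; the value 200 is just an example and only takes effect after a Postgres restart):

-- Show the connection limit currently configured on the Postgres server.
SHOW max_connections;

-- Raise it persistently (written to postgresql.auto.conf); requires a restart to apply.
ALTER SYSTEM SET max_connections = 200;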

@FeatureSpitter
Author

I'll try that.

The next time it hangs (it shouldn't take long), I'll fetch the curl outputs again, along with the Postgres logs.

@FeatureSpitter
Author

FeatureSpitter commented Sep 19, 2023

@guggero here it goes.

I increased the Postgres max connection limit to 200 (in both the LND and Postgres configs).
I reduced the timeout to 30s.
Here are the Postgres logs from startup until the moment getinfo got stuck again:

pgsql       | 
pgsql       | PostgreSQL Database directory appears to contain a database; Skipping initialization
pgsql       | 
pgsql       | 2023-09-19 11:31:00.704 UTC [1] LOG:  starting PostgreSQL 15.4 (Debian 15.4-2.pgdg120+1) on arm-unknown-linux-gnueabihf, compiled by gcc (Debian 12.2.0-14) 12.2.0, 32-bit
pgsql       | 2023-09-19 11:31:00.705 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
pgsql       | 2023-09-19 11:31:00.705 UTC [1] LOG:  listening on IPv6 address "::", port 5432
pgsql       | 2023-09-19 11:31:00.857 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
pgsql       | 2023-09-19 11:31:00.877 UTC [30] LOG:  database system was shut down at 2023-09-19 11:28:58 UTC
pgsql       | 2023-09-19 11:31:00.907 UTC [1] LOG:  database system is ready to accept connections
pgsql       | 2023-09-19 11:36:00.979 UTC [28] LOG:  checkpoint starting: time
pgsql       | 2023-09-19 11:36:42.366 UTC [28] LOG:  checkpoint complete: wrote 414 buffers (2.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=41.286 s, sync=0.063 s, total=41.388 s; sync files=17, longest=0.005 s, average=0.004 s; distance=1623 kB, estimate=1623 kB
pgsql       | 2023-09-19 11:41:00.463 UTC [28] LOG:  checkpoint starting: time
pgsql       | 2023-09-19 11:41:06.929 UTC [28] LOG:  checkpoint complete: wrote 64 buffers (0.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=6.356 s, sync=0.053 s, total=6.467 s; sync files=11, longest=0.015 s, average=0.005 s; distance=366 kB, estimate=1497 kB
pgsql       | 2023-09-19 11:46:01.027 UTC [28] LOG:  checkpoint starting: time
pgsql       | 2023-09-19 11:47:15.293 UTC [28] LOG:  checkpoint complete: wrote 694 buffers (4.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=74.147 s, sync=0.073 s, total=74.266 s; sync files=18, longest=0.009 s, average=0.005 s; distance=5666 kB, estimate=5666 kB
pgsql       | 2023-09-19 11:51:01.391 UTC [28] LOG:  checkpoint starting: time
pgsql       | 2023-09-19 11:51:41.717 UTC [28] LOG:  checkpoint complete: wrote 397 buffers (2.4%); 0 WAL file(s) added, 0 removed, 1 recycled; write=40.183 s, sync=0.078 s, total=40.327 s; sync files=17, longest=0.020 s, average=0.005 s; distance=4033 kB, estimate=5502 kB
pgsql       | 2023-09-19 11:56:01.774 UTC [28] LOG:  checkpoint starting: time
pgsql       | 2023-09-19 11:59:04.401 UTC [28] LOG:  checkpoint complete: wrote 1810 buffers (11.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=182.531 s, sync=0.038 s, total=182.628 s; sync files=9, longest=0.010 s, average=0.005 s; distance=13725 kB, estimate=13725 kB
pgsql       | 2023-09-19 12:01:01.499 UTC [28] LOG:  checkpoint starting: time
pgsql       | 2023-09-19 12:02:55.945 UTC [28] LOG:  checkpoint complete: wrote 941 buffers (5.7%); 0 WAL file(s) added, 0 removed, 1 recycled; write=114.329 s, sync=0.050 s, total=114.447 s; sync files=15, longest=0.007 s, average=0.004 s; distance=11368 kB, estimate=13489 kB
pgsql       | 2023-09-19 12:11:02.129 UTC [28] LOG:  checkpoint starting: time
pgsql       | 2023-09-19 12:11:44.533 UTC [28] LOG:  checkpoint complete: wrote 423 buffers (2.6%); 0 WAL file(s) added, 0 removed, 0 recycled; write=42.321 s, sync=0.041 s, total=42.404 s; sync files=12, longest=0.007 s, average=0.004 s; distance=4030 kB, estimate=12543 kB

And here are the new debug files: https://www.dropbox.com/scl/fo/h9xfa3k2yt08xr2lwygd5/h?rlkey=bbd6f3fh70674v4kxazza9trs&dl=0

As a curiosity, here is the database size (per table):
[screenshot: per-table database sizes]
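A sketch of how such a per-table size overview can be produced directly in psql, using standard catalog views:

-- Total on-disk size per user table (including indexes and TOAST), largest first.
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;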

@FeatureSpitter
Author

FeatureSpitter commented Sep 19, 2023

I just did another test:

After waiting 25 minutes for getinfo to return (it didn't), I sent SIGTERM to Postgres and started it again.

While it was down, lnd complained, as expected, and as soon as Postgres was back up it stopped complaining and went back to normal logs:

lnd         | 2023-09-19 12:30:00.003 [INF] WTCL: (legacy) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
lnd         | 2023-09-19 12:30:00.011 [INF] WTCL: (anchor) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
lnd         | 2023-09-19 12:31:00.003 [INF] WTCL: (legacy) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
lnd         | 2023-09-19 12:31:00.011 [INF] WTCL: (anchor) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
lnd         | 2023-09-19 12:31:41.042 [ERR] CRTR: Failed to update mission control store: unknown postgres error: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)
lnd         | 2023-09-19 12:31:42.043 [ERR] CRTR: Failed to update mission control store: unknown postgres error: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)
lnd         | 2023-09-19 12:31:43.046 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:44.046 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:45.046 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:46.045 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:47.045 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:48.045 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:49.045 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:50.045 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:51.045 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:52.046 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:53.045 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:54.045 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:55.045 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:31:56.045 [ERR] CRTR: Failed to update mission control store: failed to connect to `host=pgsql user=postgres database=lnd`: hostname resolving error (lookup pgsql on 127.0.0.11:53: no such host)
lnd         | 2023-09-19 12:32:00.003 [INF] WTCL: (legacy) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
lnd         | 2023-09-19 12:32:00.011 [INF] WTCL: (anchor) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
lnd         | 2023-09-19 12:32:13.037 [INF] DISC: GossipSyncer(02d419c9af624d0e7a7c90a60b9ffa35f4934973a9d7d3b4175cc3cc28b7ffcde1): applying gossipFilter(start=0001-01-01 00:00:00 +0000 UTC, end=0001-01-01 00:00:00 +0000 UTC)
lnd         | 2023-09-19 12:32:13.038 [INF] DISC: GossipSyncer(0390cc12f782259427bdc45a283705f4c20ccbdff064ed43f6b74fefcafe9eb1f7): applying gossipFilter(start=2023-09-19 12:32:13.038208524 +0000 UTC m=+3669.907464421, end=2159-10-26 19:00:28.038208524 +0000 UTC)
lnd         | 2023-09-19 12:33:00.002 [INF] WTCL: (legacy) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)
lnd         | 2023-09-19 12:33:00.011 [INF] WTCL: (anchor) Client stats: tasks(received=0 accepted=0 ineligible=0) sessions(acquired=0 exhausted=0)

However, getinfo is still stuck.
Since Postgres was terminated gracefully, I would expect any "hanging" connection on Postgres to be terminated as well, thus un-clogging my getinfo command (or at least returning some error). But that was not the case: the command is still hanging, and if I cancel and run getinfo again, it hangs again.

So whatever is clogging this seems to be isolated to the lnd process. Restarting Postgres didn't make any difference besides these log messages. Only restarting lnd (without restarting Postgres) unclogs it, until it clogs again after ~10-20 minutes of runtime.

This test probably rules out Postgres as the point of failure in this issue.

FeatureSpitter changed the title from "[bug]: What can cause lncli getinfo to be so slow to the point it hasn't returned anything after 5 minutes of waiting?" to "[bug]: What can cause lncli getinfo to be so slow to the point it hasn't returned anything after 25 minutes of waiting?" on Sep 19, 2023
@FeatureSpitter
Author

FeatureSpitter commented Sep 19, 2023

Meanwhile I measured the query statistics on Postgres while it was clogged, and I learned this:

[screenshots]

Look at that commit operation going...!

@FeatureSpitter
Author

The commit is still going strong.

[screenshots]

@guggero
Collaborator

guggero commented Sep 19, 2023

This does look like something's not working properly with Postgres... Why would a commit take that long? Is there a way to find out all the queries of that transaction? So basically, what is it committing?
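One way to approach that from the Postgres side is to list the sessions that currently have a transaction open and see what they are waiting on; a minimal sketch against pg_stat_activity (the database name lnd is taken from the DSN above):

-- Sessions with an open transaction, oldest first, with their current wait state.
SELECT pid,
       now() - xact_start AS xact_age,
       state,
       wait_event_type,
       wait_event,
       left(query, 80) AS current_query
FROM pg_stat_activity
WHERE datname = 'lnd'
  AND xact_start IS NOT NULL
ORDER BY xact_start;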

@FeatureSpitter
Author

It now finished the commit and started a new one, but the other SHOW TRANSACTION ISOLATION LEVEL is still there:

[screenshot]

@guggero
Collaborator

guggero commented Sep 19, 2023

I would guess that the SHOW TRANSACTION ISOLATION LEVEL query is probably from the SQL UI you're using in the screenshot itself, not from lnd.

@FeatureSpitter
Author

FeatureSpitter commented Sep 19, 2023

I would guess that the SHOW TRANSACTION ISOLATION LEVEL query is probably from the SQL UI you're using in the screenshot itself, not from lnd.

I'm using DataGrip.

I don't see any hung query in DataGrip; it could be some background query.

I am trying to figure out if there's a way to see the commit's statements without having to go through all the query history. Maybe it would be easier to log this on LND's side.

Edit: This SHOW TRANSACTION ISOLATION LEVEL seems to be used to probe the status of a committing transaction. I'm not sure, I've never used it, but if so, it could be related to whatever driver LND/Go is using to handle DB transactions.

Edit 2: I restarted DataGrip and the SHOW TRANSACTION queries are gone, so maybe your guess is right.

@FeatureSpitter
Author

FeatureSpitter commented Sep 19, 2023

Lots of commits.

What's the most common or largest commit operation LND is performing?

[screenshot]

I wonder how other users and node images run LND on a Raspberry Pi if the DB is a bottleneck on this system.

@FeatureSpitter
Author

These are the slowest write queries:

[screenshot: slowest write queries]
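Such a ranking can be pulled from the pg_stat_statements extension, assuming it is installed and preloaded; a sketch (column names as of Postgres 13+):

-- Top 10 statements by mean execution time.
SELECT calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 80) AS statement
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;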

@FeatureSpitter
Author

I'm logging all the statements now to see if I can find a pattern. Meanwhile, it seems not all query arguments are passed as prepared-statement parameters (parent_id is inlined):

SELECT value FROM channeldb_kv WHERE parent_id=23 AND key=$1

@guggero
Collaborator

guggero commented Sep 19, 2023

I'm not really that well versed in Postgres, but is there a way to find out what queries are in a transaction? If yes, then I'd try to inspect the transaction the next time a commit takes several minutes. Because all individual queries seem to be very quick.

@FeatureSpitter
Author

I'm not really that well versed in Postgres, but is there a way to find out what queries are in a transaction? If yes, then I'd try to inspect the transaction the next time a commit takes several minutes. Because all individual queries seem to be very quick.

I don't know if there's a way to somehow filter per transaction.
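One hedged way to get that grouping with plain logging is to tag every logged statement with its transaction ID (%x in log_line_prefix), so the statements belonging to a single commit can be grepped together afterwards; a sketch that assumes superuser access:

-- Log every statement and prefix log lines with timestamp, backend pid and transaction id.
-- Note: transactions that never get an xid assigned (pure reads) show xid=0.
ALTER SYSTEM SET log_statement = 'all';
ALTER SYSTEM SET log_line_prefix = '%m [%p] xid=%x ';
SELECT pg_reload_conf();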

@FeatureSpitter
Author

FeatureSpitter commented Sep 19, 2023

There's not much to tell from the logs other than these SQL statements are constantly showing up:

[screenshot]

There's huge activity in walletdb. Look at the timestamps, it's a constant flood.

But Postgres is not under strain. I doubt this flood is causing much of the clogging.

The I/O seems normal:

[screenshot]

So I don't understand why that goroutine holds the mutex forever.

@FeatureSpitter
Author

By the way, I am compiling LND with these tags:

&&  make release-install tags="signrpc walletrpc chainrpc invoicesrpc watchtowerrpc peersrpc autopilotrpc dev kvdb_postgres"

I am using this image: https://github.com/lightningnetwork/lnd/blob/master/Dockerfile

Which means I changed line 27 to add the tags to the make release-install command.

Probably not related to the issue but at this point I am out of ideas.

@FeatureSpitter
Author

I just tried without the line 27 modification, i.e. https://github.com/lightningnetwork/lnd/blob/master/Dockerfile as it is, and the same thing happens.

I have no more ideas.

@Roasbeef
Member

@FeatureSpitter btw, you should remove the blocking and mutex profile args if you don't need to obtain a profile, as they add overhead in the background even when you aren't gathering one.

@Roasbeef
Member

Can you increase the CHDB log level to debug? I want to see if we have any of these logs: https://github.com/lightningnetwork/lnd/blob/master/kvdb/sqlbase/db.go#L277-L279

@Roasbeef
Member

@FeatureSpitter did you see this behavior before updating? Which version did you update from?

@FeatureSpitter
Author

@FeatureSpitter did you see this behavior before updating? Which version did you update from?

I started with 0.17, I didn't update, so this behavior is all I've ever known.

@FeatureSpitter
Author

FeatureSpitter commented Sep 19, 2023

Can you increase the CHDB log level to debug? I want to see if we have any of these logs: https://github.com/lightningnetwork/lnd/blob/master/kvdb/sqlbase/db.go#L277-L279

You mean debuglevel=debug?

If that's what you are asking, then no, that log message never shows.

@FeatureSpitter
Author

FeatureSpitter commented Sep 20, 2023

[screenshot]

It can't be that only a handful of people are experiencing this.

This is a remote Postgres server not managed by me, and it still clogs at the commit.

Unfortunately I wasn't able to figure out a way to see which queries are related to this commit.

But everything points to some strange issue in the way LND uses Postgres. Sorry I can't be of better help for now.

FeatureSpitter changed the title from "[bug]: What can cause lncli getinfo to be so slow to the point it hasn't returned anything after 25 minutes of waiting?" to "[bug]: lncli getinfo and LND in general, getting stuck at COMMIT when using Postgres" on Sep 20, 2023
@djkazic
Contributor

djkazic commented Sep 20, 2023

FWIW I tried to repro this on similar hardware, a RockPro64 with aarch64. I was not able to repro it there, so I suspect the issue is possibly with bitcoind as @yyforyongyu mentioned. Here are some captured timings for lncli getinfo:

Setup:

  • Postgres and Bitcoin Core on an x86 machine
  • ARM SBC set up with the postgres backend
  • lncli create to init a new wallet
  • Bash script to run lncli getinfo with time at 3-minute intervals

Results:

[1]
time lncli getinfo > /dev/null
real    0m3.769s
user    0m0.120s
sys     0m0.060s

[2]
time lncli getinfo > /dev/null
real    0m3.429s
user    0m0.140s
sys     0m0.041s

[3]
time lncli getinfo > /dev/null
real    0m1.489s
user    0m0.098s
sys     0m0.073s

[4]
time lncli getinfo > /dev/null
real    0m6.491s
user    0m0.123s
sys     0m0.041s

[5]
time lncli getinfo > /dev/null
real    0m0.476s
user    0m0.139s
sys     0m0.064s

=== synced to chain here ===
During graph sync we can observe slower response times, which seems to track, since we're presumably doing many writes during that period:

[0]
real    0m53.563s
user    0m0.131s
sys     0m0.259s

[1]
real    0m20.201s
user    0m0.079s
sys     0m0.187s

[2]
real    0m34.065s
user    0m0.127s
sys     0m0.135s

The second-to-last one seems to have been an outlier.

@FeatureSpitter
Author

FWIW I tried to repro this on similar hardware, a RockPro64 with aarch64. I was not able to repro it there, so I suspect the issue is possibly with bitcoind as @yyforyongyu mentioned. [...]

Give it more time; in the beginning it is always fast.

@yyforyongyu
Member

Could you try this patch #8019 and see if it helps? On my machine it speeds up getinfo quite a lot.

@Roasbeef
Member

Roasbeef commented Sep 20, 2023

IIUC, the OP is also running Docker in 32-bit mode, which may contribute to some thrashing that can slow things down. As mentioned above, if you're running everything on a single machine, without any replication at all, then sqlite is a better fit for your hardware configuration.

Zooming out: you're seeing Postgres hang on commit; this isn't related to lnd itself, as lnd needs to wait for a commit to finish before it can proceed. As is, we only have a single write commit going at any given time. #7992 will enable multiple write transactions to commit at a time.

You should also attempt to increase bitcoind.blockpollinginterval to something like bitcoind.blockpollinginterval=5m, which will reduce load.

But everything points to some strange issue in the way LND uses Postgres. Sorry I can't be of better help for now.

lnd just exports everything as a key-value store, which we know isn't optimal. Future PRs will be able to take better advantage of server-side query filtering. At this point, what we know is happening is that Postgres is blocking on a long-running operation; if Postgres blocks, then lnd does too, due to the current application-level transaction locking.

@FeatureSpitter
Author

IIUC, the OP is also running Docker in 32-bit mode, which may contribute to some thrashing that can slow things down. [...]

Just to add more details regarding the Docker version I am using (which is the latest one in the Raspbian repos):

pi@raspberrypi:~ $ docker version
Client:
 Version:           20.10.5+dfsg1
 API version:       1.41
 Go version:        go1.15.15
 Git commit:        55c4c88
 Built:             Mon May 30 18:34:49 2022
 OS/Arch:           linux/arm
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.5+dfsg1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.15.15
  Git commit:       363e9a8
  Built:            Mon May 30 18:34:49 2022
  OS/Arch:          linux/arm
  Experimental:     false
 containerd:
  Version:          1.4.13~ds1
  GitCommit:        1.4.13~ds1-1~deb11u4
 runc:
  Version:          1.0.0~rc93+ds1
  GitCommit:        1.0.0~rc93+ds1-5+deb11u2
 docker-init:
  Version:          0.19.0
  GitCommit:        

pi@raspberrypi:~ $ docker-compose version
docker-compose version 1.25.0, build unknown
docker-py version: 4.1.0
CPython version: 3.9.2
OpenSSL version: OpenSSL 1.1.1n  15 Mar 2022

Regarding sqlite: I want to extend my Postgres setup with remote replication, which is why I am using Postgres in the first place.

But keep in mind that even when I used an external Postgres on a dedicated server with the specs I mentioned above, it had commit operations that ran for minutes. So it is not an issue caused by having everything running on the same machine, since externalizing Postgres made no difference in this case.

Tomorrow I'll try @yyforyongyu's PR, and also figure out if I can get a better Docker setup.

@FeatureSpitter
Author

I changed the OS on my Raspberry Pi 4 from Raspbian 64-bit to Ubuntu Server 23 64-bit.

It seems Raspbian 64-bit uses 32-bit Docker and the armhf architecture when fetching and building apps, instead of arm64.

Changing to Ubuntu 23 64-bit solved this, and so far LND is no longer clogged.

So in sum, a Raspberry Pi 4 can't run LND in 32-bit mode with stability.

@Roasbeef
Member

So in sum, a Raspberry Pi 4 can't run LND in 32-bit mode with stability.

Yeah, and they'll also run into database size issues.

Closing this now as it was an OS issue.

@FeatureSpitter
Author

So in sum, a Raspberry Pi 4 can't run LND in 32-bit mode with stability.

Yeah, and they'll also run into database size issues.

Closing this now as it was an OS issue.

I wouldn't close this so fast with a reason like "it was an OS issue", the same way that if I implemented a very inefficient sorting algorithm that people couldn't run on a slower OS, I wouldn't blame the OS.

LND has an issue, which might or might not be worth fixing:
It can't run well on the currently latest Raspbian OS 64-bit version, because that OS uses ARMHF repositories instead of ARM64 ones.

Is this an OS problem? Certainly not. But it is a problem when apps require better performance, or 64-bit addressing capabilities to run more efficiently, which is the case for LND.

So, for anyone using LND on Raspbian OS, be advised of these potential issues.

I am actually curious how RaspiBlitz fixed this; they also use some version of Raspbian.

@guggero
Collaborator

guggero commented Sep 22, 2023

No, the issue here is that you rely on Docker, which because of the OS was only installed as 32-bit and therefore could only run 32-bit applications. RaspiBlitz has been using the 64-bit version of lnd for a long time, as RaspiBlitz doesn't use Docker to run things.
There are other known issues with lnd on 32-bit architectures, even before Postgres comes into the picture. And we've been recommending not to use 32-bit architectures anymore, though maybe not explicitly enough.

@FeatureSpitter
Author

Actually, I think the problem is back.

24h after LND started, it is clogged again, this time on Ubuntu Server 23 64-bit:

[screenshot]

@FeatureSpitter
Author

[screenshot]

@guggero
Collaborator

guggero commented Sep 22, 2023

Are you using the official lnd docker image? Can you post the image hash please?

@FeatureSpitter
Author

I am starting to think that LND is just not compatible with Docker, no matter the bits.

@guggero
Collaborator

guggero commented Sep 22, 2023

I am starting to think that LND is just not compatible with Docker, no matter the bits.

I've been running lnd within docker for many years, never had any issue. I'm actually starting to believe that Docker on a Raspberry Pi has issues.

@FeatureSpitter
Author

FeatureSpitter commented Sep 22, 2023

I am starting to think that LND is just not compatible with Docker, no matter the bits.

I've been running lnd within docker for many years, never had any issue. I'm actually starting to believe that Docker on a Raspberry Pi has issues.

REPOSITORY          TAG                DIGEST                                                                    IMAGE ID       CREATED         SIZE
postgres            15                 sha256:2fe3743ab3bfe0246c96b1cf397017b3e5477d161c9ddaeb0e1d4a8405d7abd1   ef13bf480f5b   44 hours ago    433MB
lightninglabs/lnd   v0.17.0-beta.rc4   sha256:263c226e0fb78b8bf4ec04e6dfe4f49fb4dfb321eea61ccb132369ad68efc7f1   3179976b7682   3 days ago      114MB
lncm/bitcoind       v25.0              sha256:fad11d4874f1c2dc4373f6fea913bf95e0f0491f377b9a0930f488565e6266f0   624e63a7e93d   3 months ago    35.7MB
btcpayserver/tor    0.4.7.10           sha256:583b9970b099f175d6a842816919c5828d5408116c5623e1abc186aa1570c9fe   f9f519447e06   10 months ago   97.3MB

Well, it's not the first time I've had issues with LND on Docker; the previous time was on an Intel 64-bit Ubuntu server.

The common denominator starts to become clear (me, or LND lol).

@FeatureSpitter
Author

Could you try this patch #8019 and see if it helps? On my machine it speeds up getinfo quite a lot.

I haven't had the opportunity to try this. Could it be the answer?

@guggero
Collaborator

guggero commented Sep 22, 2023

It depends. Do you still see the long-running commit in Postgres? Maybe you've gotten rid of that and are now running into the issue addressed by #8019.

@FeatureSpitter
Author

It depends. Do you still see the long-running commit in Postgres? Maybe you've gotten rid of that and are now running into the issue addressed by #8019.

Yes, I still see long commits:

[screenshot]

And as ruled out above, it doesn't matter if postgres is on docker, on a pi, or on a quantum computer.

@FeatureSpitter
Author

4 mins and counting

[screenshot]

@djkazic
Contributor

djkazic commented Sep 22, 2023

I did some digging with the user on this issue and tried to repro the problem using his node plus my postgres server. Here's what I've analyzed. There is approx 100 ms network latency between his node and my postgres server.

First, let's look at an example of what appears to be a hanging commit:

   pid   |       query_age        |  usename   |                                                                                                      query
---------+------------------------+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      32 |                        | postgres   |
      31 |                        |            |
      28 |                        |            |
      27 |                        |            |
      30 |                        |            |
 3494740 | 00:00:00.000558        | postgres   | SELECT                                                                                                                                                                                                          +
         |                        |            |     pid,                                                                                                                                                                                                        +
         |                        |            |     age(clock_timestamp(), query_start) AS query_age,                                                                                                                                                           +
         |                        |            |     usename,                                                                                                                                                                                                    +
         |                        |            |     query                                                                                                                                                                                                       +
         |                        |            | FROM                                                                                                                                                                                                            +
         |                        |            |     pg_stat_activity                                                                                                                                                                                            +
         |                        |            | ORDER BY                                                                                                                                                                                                        +
         |                        |            |     query_start DESC;
 3470432 | 00:00:00.060476        | quantumlnd | SELECT value FROM channeldb_kv WHERE parent_id=23 AND key=$1
 3466372 | 00:00:02.434555        | quantumlnd | commit
 3491954 | 00:00:14.600606        | postgres   | SELECT datname, pid, usename, application_name, backend_start, xact_start, query_start, state_change, wait_event_type, wait_event, state FROM pg_stat_activity WHERE wait_event != 'f' AND usename='quantumlnd';
 3471146 | 00:08:25.988444        | quantumlnd | commit
      33 | 4 days 22:16:28.146506 | replicator | START_REPLICATION SLOT "replication_slot_slave1" 1E/50000000 TIMELINE 1

Note PID 3471146.

SELECT datname, pid, usename, application_name, backend_start, xact_start, query_start, state_change, wait_event_type, wait_event, state FROM pg_stat_activity WHERE wait_event != 'f' AND usename='quantumlnd';
  datname   |   pid   |  usename   | application_name |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type | wait_event |        state
------------+---------+------------+------------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------+---------------------
 quantumlnd | 3466372 | quantumlnd |                  | 2023-09-22 16:25:55.245563+00 |                               | 2023-09-22 16:51:37.315327+00 | 2023-09-22 16:51:37.315364+00 | Client          | ClientRead | idle
 quantumlnd | 3471146 | quantumlnd |                  | 2023-09-22 16:33:05.410526+00 |                               | 2023-09-22 16:43:05.677998+00 | 2023-09-22 16:43:05.678045+00 | Client          | ClientRead | idle
 quantumlnd | 3470432 | quantumlnd |                  | 2023-09-22 16:32:17.499777+00 | 2023-09-22 16:33:05.032095+00 | 2023-09-22 16:51:37.353961+00 | 2023-09-22 16:51:37.353981+00 | Client          | ClientRead | idle in transaction

The postgres lifecycle of a session:

  • Connection establishment
  • Command issuance
  • Query execution
  • ClientRead: awaiting next command <-- we are here

If I understand correctly, this implies that the commit most likely did complete on Postgres.
The ClientRead wait event indicates that the session is now waiting for the next command from the client, so I think LND isn't "reacting" to the commit in a timely fashion.
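That reading can be cross-checked on the Postgres side: if the stall were inside the database, some backend should show up as blocked on another backend's locks. A sketch using pg_blocking_pids (an empty result means nothing is blocked inside Postgres):

-- Backends currently blocked by another backend's locks.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;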

@guggero
Collaborator

guggero commented Sep 22, 2023

Could you try running lnd outside of Docker? Maybe there's a weird (virtual) networking issue going on? Just to confirm it's not something "between" lnd and the DB that's causing the issue.

@djkazic
Contributor

djkazic commented Sep 22, 2023

Yep, I think that's the next thing we're going to try if we can't figure out more from postgres state.

@djkazic
Contributor

djkazic commented Sep 22, 2023

OK, I'm seeing wildly different performance between timeout 0 and 30s (0 is faster).
