[Bug]: Data-node is getting SQL error when replaying from block 0 #10701

daniel1302 · 2024-02-19T08:15:35Z

Problem encountered

I am replaying a node from block 0, and the error occurs every 300-500k blocks. Validators have hit the same issue multiple times.

Observed behaviour

Error in data-node logs during replay

Expected behaviour

Vega should be able to replay from block 0 without needing repeated restarts.

Steps to reproduce

1. Replay a node from block 0

Software version

v0.71.5

Failing test

No response

Jenkins run

No response

Configuration used

I have tested multiple configurations:

1. PostgreSQL on the same node
2. PostgreSQL on a separate node

Both setups use fairly large machines:
1. 128GB RAM, 16 cores, 4TB NVMe
2. PostgreSQL: 64GB RAM, 8 cores, 2TB SSD; Vega + data node: 32GB RAM, 6 cores, 2TB SSD

Relevant log output

Feb 18 17:29:17 data-node visor[13892]: 2024-02-18T17:29:17.348Z        ERROR        datanode.start.runNode        start/node.go:175        Vega data node stopped with error        {"error": "failed to flush subscriber:flushing margin levels: flushing margin levels: failed to copy margin_levels entries into database:ERROR: could not open relation with OID 206443 (SQLSTATE XX000)"}

daniel1302 · 2024-02-23T08:43:27Z

I also got the following error today:

Feb 22 18:26:35 data-node2 visor[3077636]: 2024-02-22T18:26:35.854Z        ERROR        datanode.start.runNode        start/node.go:175        Vega data node stopped with error        {"error": "failed to flush subscriber:failed to copy orders entries into database:ERROR: deadlock detected (SQLSTATE 40P01)"}
Feb 22 18:26:35 data-node2 visor[3077636]: vega data node stopped with error: failed to flush subscriber:failed to copy orders entries into database:ERROR: deadlock detected (SQLSTATE 40P01)
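
For anyone triaging these logs, here is a minimal sketch (not part of Vega or data-node) that pulls the SQLSTATE code out of a log line and checks whether it falls in PostgreSQL's class 40 (transaction rollback), which is where `40P01` (`deadlock_detected`) lives. `XX000` is the generic `internal_error` code. The helper names `extract_sqlstate` and `is_transaction_rollback` are hypothetical, and whether data-node could safely retry a flush after a class 40 error is an assumption that this issue does not establish.

```python
import re

# SQLSTATE codes seen in the logs above, with their standard PostgreSQL condition names.
KNOWN_SQLSTATES = {
    "XX000": "internal_error",
    "40P01": "deadlock_detected",
}

# Matches the "(SQLSTATE 40P01)" suffix that appears in the error strings.
SQLSTATE_RE = re.compile(r"\(SQLSTATE ([0-9A-Z]{5})\)")


def extract_sqlstate(log_line):
    """Return the five-character SQLSTATE code in a log line, or None if absent."""
    match = SQLSTATE_RE.search(log_line)
    return match.group(1) if match else None


def is_transaction_rollback(sqlstate):
    """True if the code belongs to SQLSTATE class 40 (transaction rollback)."""
    return sqlstate is not None and sqlstate.startswith("40")


if __name__ == "__main__":
    lines = [
        "failed to copy margin_levels entries into database:ERROR: "
        "could not open relation with OID 206443 (SQLSTATE XX000)",
        "failed to copy orders entries into database:ERROR: "
        "deadlock detected (SQLSTATE 40P01)",
    ]
    for line in lines:
        code = extract_sqlstate(line)
        name = KNOWN_SQLSTATES.get(code, "unknown")
        print(code, name, "class 40:", is_transaction_rollback(code))
```

The two errors land in different classes, so they may need to be investigated separately: the deadlock is a transaction-level conflict, while `XX000` is an internal error.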

daniel1302 added the bug label Feb 19, 2024

daniel1302 assigned gordsport Feb 19, 2024

vega-issues added this to Core Kanban Feb 19, 2024

gordsport added this to the 🏛️ Colosseo milestone Feb 19, 2024

gordsport removed their assignment Feb 22, 2024

gordsport modified the milestones: 🏛️ Colosseo, ⏭️ TBC Feb 23, 2024

gordsport moved this to Todo in Core Kanban Feb 23, 2024
