[Bug]: Data-node is getting SQL error when replaying from block 0 #10701

Open
daniel1302 opened this issue Feb 19, 2024 · 1 comment

@daniel1302
Contributor

Problem encountered

I am replaying a node from block 0, and the error occurs every 300-500k blocks. Validators have hit the same issue multiple times.

Observed behaviour

Error in data-node logs during replay

Expected behaviour

Vega should be able to replay from block 0 without needing to be restarted repeatedly.

Steps to reproduce

1. Replay a node from block 0

Software version

v0.71.5

Failing test

No response

Jenkins run

No response

Configuration used

I have tested two configurations:

1. PostgreSQL on the same node
2. PostgreSQL on a separate node

Both setups use large machines:
1. 128 GB RAM, 16 cores, 4 TB NVMe
2. PostgreSQL: 64 GB RAM, 8 cores, 2 TB SSD; Vega + data node: 32 GB RAM, 6 cores, 2 TB SSD

Relevant log output

Feb 18 17:29:17 data-node visor[13892]: 2024-02-18T17:29:17.348Z        ERROR        datanode.start.runNode        start/node.go:175        Vega data node stopped with error        {"error": "failed to flush subscriber:flushing margin levels: flushing margin levels: failed to copy margin_levels entries into database:ERROR: could not open relation with OID 206443 (SQLSTATE XX000)"}
@daniel1302 daniel1302 added the bug label Feb 19, 2024
@gordsport gordsport added this to the 🏛️ Colosseo milestone Feb 19, 2024
@gordsport gordsport removed their assignment Feb 22, 2024
@daniel1302
Contributor Author

I also got the following error today:


Feb 22 18:26:35 data-node2 visor[3077636]: 2024-02-22T18:26:35.854Z        ERROR        datanode.start.runNode        start/node.go:175        Vega data node stopped with error        {"error": "failed to flush subscriber:failed to copy orders entries into database:ERROR: deadlock detected (SQLSTATE 40P01)"}
Feb 22 18:26:35 data-node2 visor[3077636]: vega data node stopped with error: failed to flush subscriber:failed to copy orders entries into database:ERROR: deadlock detected (SQLSTATE 40P01)
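
For triage across many restarts, it may help to pull the SQLSTATE codes out of the data-node logs and group them. The following is only a minimal sketch, not part of Vega: the helper names (`extract_sqlstates`, `describe`, `sqlstate_class`) are hypothetical, and it assumes every error line ends with the `(SQLSTATE xxxxx)` suffix seen in the two errors reported in this issue. The code meanings come from the PostgreSQL error-code appendix.

```python
import re

# Matches the "(SQLSTATE xxxxx)" suffix seen in the log lines in this issue.
SQLSTATE_RE = re.compile(r"\(SQLSTATE ([0-9A-Z]{5})\)")

# Condition names for the two codes reported in this issue
# (taken from the PostgreSQL error-code appendix).
KNOWN_STATES = {
    "40P01": "deadlock_detected",
    "XX000": "internal_error",
}


def extract_sqlstates(log_line):
    """Return the distinct SQLSTATE codes found in one log line, in order."""
    seen = []
    for code in SQLSTATE_RE.findall(log_line):
        if code not in seen:
            seen.append(code)
    return seen


def describe(code):
    """Map a SQLSTATE code to its condition name, or 'unknown'."""
    return KNOWN_STATES.get(code, "unknown")


def sqlstate_class(code):
    """Return the two-character class, e.g. '40' (transaction rollback)."""
    return code[:2]


if __name__ == "__main__":
    # Shortened copy of the error text from the comment above.
    line = (
        "failed to flush subscriber:failed to copy orders entries "
        "into database:ERROR: deadlock detected (SQLSTATE 40P01)"
    )
    for code in extract_sqlstates(line):
        print(code, describe(code), sqlstate_class(code))
```

Grouping by class separates the deadlock (class 40, transaction rollback) from the `could not open relation with OID` failure (class XX, internal error), which may have different causes.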

@gordsport gordsport moved this to Todo in Core Kanban Feb 23, 2024