indexer-alt: sum_obj_types pipeline #20054
Conversation
crates/sui-indexer-alt/migrations/2024-10-27-150938_sum_obj_types/up.sql
```sql
ON sum_obj_types (owner_kind, owner_id, object_id, object_version);

CREATE INDEX IF NOT EXISTS sum_obj_types_pkg
ON sum_obj_types (package, object_id, object_version);
```
Do we need to include the `object_version` in these indices, since this only tracks the live objects set?
Yeah, I think you're right, we can get rid of it -- I added it for ordering purposes, originally.
Actually, I take it back -- we need the version in the index because we are querying for it. By including the version in the index we can keep this as an index-only scan.
What about:

```sql
CREATE INDEX IF NOT EXISTS sum_obj_types_pkg
ON sum_obj_types (package, object_id) INCLUDE (object_version)
```
The only benefit of doing this is if the index is otherwise unique on `(package, object_id)`, which we do not need, so I prefer to use the form that doesn't require the extra SQL features.
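The index-only-scan reasoning above can be checked directly in Postgres. The following is a hypothetical query, not part of the PR (the column names come from the migration; the `package` literal is illustrative): when `object_version` is part of the index, the plan should report an Index Only Scan rather than a heap fetch.

```sql
-- Hypothetical check: with object_version in the index, this query can be
-- answered from the index alone (Postgres reports an "Index Only Scan").
EXPLAIN (ANALYZE, BUFFERS)
SELECT object_id, object_version
FROM sum_obj_types
WHERE package = '\x02';  -- illustrative literal
```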
crates/sui-indexer-alt/migrations/2024-10-27-150938_sum_obj_types/up.sql
## Description

The `sum_obj_types` sequential pipeline is the `sui-indexer-alt` equivalent of `objects` and `objects_snapshot`. It is a `sum`-mary of the object owner and type information as of a particular checkpoint (its watermark). It differs from `objects` and `objects_snapshot` in the following ways:

- It only contains owner and type information (no coin balance, content, digest, or dynamic field information), but has a more "complete" set of indices for this data.
- The type information has been standardised into the same style as used by event indices.
- It uses the painter's algorithm to avoid processing redundant object updates (process transactions in a checkpoint in reverse order).
- Updates to the underlying table are atomic (all writes are sent out within a single transaction, along with a watermark update).
- It avoids redundant writes across checkpoints as well (e.g. if an object is modified multiple times, only the last modification will be turned into a DB update).

This PR does not include the logic to delay updates (necessary to replicate the behaviour of `objects_snapshot`). This will be added at a later date.

## Test plan

Ran the indexer on the first 1M checkpoints in mainnet (produced 2481 live objects by the end):

```
sui$ cargo run -p sui-indexer-alt --release -- \
  --database-url "postgres://postgres:postgrespw@localhost:5432/sui_indexer_alt" \
  --remote-store-url https://checkpoints.mainnet.sui.io \
  --last-checkpoint 1000000
```
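The painter's algorithm bullet above can be sketched in plain Rust. This is an illustrative sketch, not the crate's actual types: walk a checkpoint's updates newest-first and keep only the first update seen per object ID, so superseded earlier updates are never processed.

```rust
use std::collections::HashSet;

/// Illustrative update record; the real pipeline uses richer types.
#[derive(Debug, PartialEq)]
struct ObjectUpdate {
    object_id: u64,
    payload: &'static str,
}

/// Walk a checkpoint's updates in reverse (last transaction first) and keep
/// only the newest update per object, skipping redundant earlier ones.
/// The result is in newest-first order.
fn latest_updates(updates: &[ObjectUpdate]) -> Vec<&ObjectUpdate> {
    let mut seen = HashSet::new();
    updates
        .iter()
        .rev()
        // `insert` returns false for IDs we have already kept.
        .filter(|u| seen.insert(u.object_id))
        .collect()
}

fn main() {
    let updates = vec![
        ObjectUpdate { object_id: 1, payload: "v1" },
        ObjectUpdate { object_id: 2, payload: "v1" },
        ObjectUpdate { object_id: 1, payload: "v2" },
    ];
    let latest = latest_updates(&updates);
    // Object 1 keeps only its last update ("v2"); object 2 keeps "v1".
    assert_eq!(latest.len(), 2);
    assert_eq!(latest[0].payload, "v2");
}
```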
## Description

Trying to make the purposes of various tuning parameters clearer by renaming them.

## Test plan

:eyes:

## Stack

- #20050
- #20051
- #20052
- #20053
- #20054

---

## Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

- [ ] Protocol:
- [ ] Nodes (Validators and Full nodes):
- [ ] Indexer:
- [ ] JSON-RPC:
- [ ] GraphQL:
- [ ] CLI:
- [ ] Rust SDK:
- [ ] REST API:
```rust
        .execute(conn)
});

let deleted: usize = try_join_all(delete_chunks).await?.into_iter().sum();
```
Here, updates and deletions are handled separately and sequentially, which dilutes the tuning power of `BATCH_SIZE` and will likely generate small DB commits, especially for deletions. Wdyt of splitting updates and deletions earlier?
I'm not sure I understand the suggestion (to split updates and deletions earlier), could you elaborate?

The reason why I ended up having to split them up here was because of access to the `conn`ection -- Rust did not like me sharing it between two closures (the `map` over update chunks, and the `map` over deletion chunks), because it is a `&mut`.

There are other ways around this, where we split the updates, chunk them up, and then recombine them into a single stream of updates. Then we can map over these updates in a single pass, using one closure that can take ownership of the `conn`. We can do that if we notice this is a bottleneck.
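The "recombine into a single stream" alternative can be sketched with mock types (`Conn` and `Chunk` here are hypothetical stand-ins, not the indexer's real types): tagging each chunk with its kind lets a single pass, and therefore a single closure, own the `&mut` connection.

```rust
/// Stand-in for the DB connection; the real code holds a `&mut db::Connection`.
struct Conn {
    committed: usize,
}

/// A chunk tagged with the kind of write it carries.
enum Chunk {
    Mutations(Vec<u64>),
    Deletions(Vec<u64>),
}

impl Conn {
    fn execute(&mut self, chunk: &Chunk) -> usize {
        let n = match chunk {
            Chunk::Mutations(m) => m.len(),
            Chunk::Deletions(d) => d.len(),
        };
        self.committed += n;
        n
    }
}

/// Interleave mutation and deletion chunks into one stream, so one pass
/// (and one closure) can hold the mutable connection.
fn commit_all(conn: &mut Conn, mutations: &[u64], deletions: &[u64], chunk_size: usize) -> usize {
    let chunks: Vec<Chunk> = mutations
        .chunks(chunk_size)
        .map(|c| Chunk::Mutations(c.to_vec()))
        .chain(deletions.chunks(chunk_size).map(|c| Chunk::Deletions(c.to_vec())))
        .collect();
    chunks.iter().map(|c| conn.execute(c)).sum()
}

fn main() {
    let mut conn = Conn { committed: 0 };
    let total = commit_all(&mut conn, &[1, 2, 3], &[9], 2);
    assert_eq!(total, 4);
    assert_eq!(conn.committed, 4);
}
```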
Yes, I saw the `&mut conn` sharing issue too. By splitting earlier, I meant having separate functions for `commit_mutations` and `commit_deletions`, so that we can batch the mutations and deletions "earlier" and avoid committing very small deletion chunks, because the deletion count is often smaller than the mutation count: https://metrics.sui.io/goto/OcJUs1WNg?orgId=1

It does make the trait more verbose, so it has pros and cons.
```rust
    }
}

async fn commit(values: &Self::Batch, conn: &mut db::Connection<'_>) -> anyhow::Result<usize> {
```
Can you remind me why you chose a `conn` instead of a `pool` here, as naively `try_join_all` will interleave on execution?
The connection is tied to a transaction -- all these commits are done atomically, alongside the watermark update. This is required for consistency, but it is something that the current indexer does not do.
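A toy model of the atomicity argument (the `Db` type and its `transaction` method are illustrative, not `diesel`'s actual API): the object writes and the watermark update either both land or both roll back.

```rust
/// Mock database: a table plus the pipeline's watermark.
#[derive(Default)]
struct Db {
    rows: Vec<u64>,
    watermark: u64,
}

impl Db {
    /// Run `f` transactionally: if it returns `Err`, restore the snapshot
    /// taken at the start, discarding all of `f`'s changes.
    fn transaction<E>(&mut self, f: impl FnOnce(&mut Db) -> Result<(), E>) -> Result<(), E> {
        let snapshot = (self.rows.clone(), self.watermark);
        f(&mut *self).map_err(|e| {
            (self.rows, self.watermark) = snapshot;
            e
        })
    }
}

fn main() {
    let mut db = Db::default();

    // Writes and the watermark update land together...
    db.transaction::<&str>(|db| {
        db.rows.extend([1, 2, 3]);
        db.watermark = 42;
        Ok(())
    })
    .unwrap();
    assert_eq!((db.rows.len(), db.watermark), (3, 42));

    // ...and a failure mid-transaction rolls both back.
    let res = db.transaction(|db| {
        db.rows.push(4);
        db.watermark = 43;
        Err("constraint violation")
    });
    assert!(res.is_err());
    assert_eq!((db.rows.len(), db.watermark), (3, 42));
}
```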