indexer-alt: sum_obj_types pipeline #20054

Merged: 5 commits into main from amnn/idx-sum-obj-types, Oct 29, 2024
Conversation

amnn (Contributor) commented Oct 28, 2024

Description

The `sum_obj_types` sequential pipeline is the `sui-indexer-alt` equivalent of `objects` and `objects_snapshot`. It is a `sum`-mary of the object owner and type information as of a particular checkpoint (its watermark).

It differs from `objects` and `objects_snapshot` in the following ways:

  • It only contains owner and type information (no coin balance, content, digest, or dynamic field information), but has a more "complete" set of indices for this data.
  • The type information has been standardised into the same style as used by event indices.
  • It uses the painter's algorithm to avoid processing redundant object updates (transactions in a checkpoint are processed in reverse order); see the sketch after this list.
  • Updates to the underlying table are atomic (all writes are sent out within a single transaction, along with a watermark update).
  • It avoids redundant writes across checkpoints as well (e.g. if an object is modified multiple times, only the last modification will be turned into a DB update).

This PR does not include the logic to delay updates (necessary to replicate the behaviour of `objects_snapshot`). This will be added at a later date.

Test plan

Ran the indexer on the first 1M checkpoints in mainnet (produced 2481 live objects by the end):

```
sui$ cargo run -p sui-indexer-alt --release --                                   \
  --database-url "postgres://postgres:postgrespw@localhost:5432/sui_indexer_alt" \
  --remote-store-url https://checkpoints.mainnet.sui.io                          \
  --last-checkpoint 1000000
```

Stack

  • #20050
  • #20051
  • #20052
  • #20053
  • #20054

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

  • Protocol:
  • Nodes (Validators and Full nodes):
  • Indexer:
  • JSON-RPC:
  • GraphQL:
  • CLI:
  • Rust SDK:
  • REST API:

@amnn amnn self-assigned this Oct 28, 2024

```
ON sum_obj_types (owner_kind, owner_id, object_id, object_version);

CREATE INDEX IF NOT EXISTS sum_obj_types_pkg
ON sum_obj_types (package, object_id, object_version);
```
Contributor:
Do we need to include `object_version` in these indices, since this table only tracks the live object set?

Contributor (author):
Yeah, I think you're right, we can get rid of it -- I added it for ordering purposes, originally.

Contributor (author):
Actually, I take it back -- we need the version in the index because we are querying for it. By including the version in the index we can keep this as an index-only scan.
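For context, a hypothetical Diesel-style query of the shape this index is meant to serve (the schema module, `pkg_bytes`, and `conn` are illustrative assumptions, not code from this PR):

```
use crate::schema::sum_obj_types::dsl::*; // hypothetical schema module
use diesel::prelude::*;
use diesel_async::RunQueryDsl;

// Everything the query touches -- the `package` filter and the selected
// `object_id` / `object_version` columns -- lives in the index itself, so
// Postgres can answer it with an index-only scan, without heap fetches.
let live: Vec<(Vec<u8>, i64)> = sum_obj_types
    .select((object_id, object_version))
    .filter(package.eq(pkg_bytes))
    .load(conn)
    .await?;
```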

Contributor:
What about

```
CREATE INDEX IF NOT EXISTS sum_obj_types_pkg
ON sum_obj_types (package, object_id) INCLUDE (object_version)
```

Contributor (author):
The only benefit of doing this is if the index is otherwise unique on (package, object_id), which we do not need, so I prefer to use the form that doesn't require the extra SQL features.

crates/sui-indexer-alt/src/models/objects.rs (resolved)
@amnn amnn force-pushed the amnn/idx-sum-obj-types branch from d2b7457 to ffb298b on October 29, 2024 at 14:49
Base automatically changed from amnn/idx-seq to amnn/idx-proc October 29, 2024 14:52
Base automatically changed from amnn/idx-proc to main October 29, 2024 16:46
@amnn amnn requested a review from wlmyng October 29, 2024 17:01
amnn added 5 commits October 29, 2024 17:40
@amnn amnn merged commit 27978f1 into main Oct 29, 2024
52 checks passed
@amnn amnn deleted the amnn/idx-sum-obj-types branch October 29, 2024 18:31
amnn added a commit that referenced this pull request Oct 29, 2024
## Description
Trying to make the purposes of various tuning parameters clearer by
renaming them.

## Test plan
:eyes:

## Stack

- #20050 
- #20051 
- #20052 
- #20053 
- #20054 

```
.execute(conn)
});

let deleted: usize = try_join_all(delete_chunks).await?.into_iter().sum();
```
Contributor:
Here, updates and deletions are handled separately and sequentially, which dilutes the tuning power of `BATCH_SIZE` and will likely generate small DB commits, especially for deletions. What do you think of splitting updates and deletions earlier?

Contributor (author):
I'm not sure I understand the suggestion (to split updates and deletions earlier); could you elaborate?

The reason I ended up having to split them here was access to the connection -- Rust did not like me sharing it between two closures (the map over update chunks and the map over deletion chunks), because it is a `&mut`.

There are other ways around this, where we split the updates, chunk them up, and then recombine them into a single stream of updates. Then we can map over these updates in a single pass, using one closure that can take ownership of the conn. We can do that if we notice this is a bottleneck.
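A minimal sketch of that recombination idea, assuming hypothetical `StoredObject`, `BATCH_SIZE`, `insert_chunk`, and `delete_chunk` names; tagging the pre-chunked work lets a single pass own the connection:

```
// Hypothetical: wrap the already-chunked updates and deletions in one enum so
// a single loop can own `conn` and drive both kinds of write.
enum Work {
    Update(Vec<StoredObject>), // rows to upsert
    Delete(Vec<Vec<u8>>),      // object IDs to delete
}

async fn commit_all(
    updates: &[StoredObject],
    deletes: &[Vec<u8>],
    conn: &mut db::Connection<'_>,
) -> anyhow::Result<usize> {
    let work = updates
        .chunks(BATCH_SIZE)
        .map(|c| Work::Update(c.to_vec()))
        .chain(deletes.chunks(BATCH_SIZE).map(|c| Work::Delete(c.to_vec())));

    // Chunks run one after another here; on a single connection inside one
    // transaction they would serialise anyway.
    let mut affected = 0;
    for job in work {
        affected += match job {
            Work::Update(chunk) => insert_chunk(&chunk, conn).await?,
            Work::Delete(chunk) => delete_chunk(&chunk, conn).await?,
        };
    }
    Ok(affected)
}
```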

Contributor:
Yes, I saw the `&mut conn` sharing issue too. By splitting earlier, I meant having separate functions for `commit_mutations` and `commit_deletions`, so that we can batch the mutations and deletions "earlier" and avoid committing very small deletion chunks, because the deletion count is often smaller than the mutation count (https://metrics.sui.io/goto/OcJUs1WNg?orgId=1). It does make the trait more verbose, so it has pros and cons.
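For concreteness, a rough sketch of the more verbose trait shape being suggested (all names here are illustrative, not the actual trait in this PR):

```
// Hypothetical: separate entry points let the framework chunk mutations and
// deletions independently, so a small deletion set does not drag the
// (usually larger) mutation batches down to its size.
#[async_trait::async_trait]
trait SequentialHandler {
    type Mutation: Send + Sync;
    type Deletion: Send + Sync;

    async fn commit_mutations(
        batch: &[Self::Mutation],
        conn: &mut db::Connection<'_>,
    ) -> anyhow::Result<usize>;

    async fn commit_deletions(
        batch: &[Self::Deletion],
        conn: &mut db::Connection<'_>,
    ) -> anyhow::Result<usize>;
}
```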

```
}
}

async fn commit(values: &Self::Batch, conn: &mut db::Connection<'_>) -> anyhow::Result<usize> {
```
Contributor:
Can you remind me why you chose a `conn` instead of a pool here? Naively, `try_join_all` will interleave execution.

Contributor (author):
The connection is tied to a transaction -- all these commits are done atomically, alongside the watermark update. This is required for consistency, but it is something the current indexer does not do.
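A minimal sketch of that shape, assuming diesel-async's transaction API and hypothetical `commit` and `update_watermark` helpers:

```
use diesel_async::scoped_futures::ScopedFutureExt;
use diesel_async::AsyncConnection;

// The pipeline's writes and the watermark update share one connection and
// one transaction, so they become visible atomically, or not at all.
conn.transaction(|conn| {
    async move {
        commit(&batch, conn).await?;
        update_watermark(checkpoint_hi, conn).await?;
        Ok::<_, anyhow::Error>(())
    }
    .scope_boxed()
})
.await?;
```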
