Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD-0186: Loaded Transaction Data Size Specification #186

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
152 changes: 152 additions & 0 deletions proposals/0186-transaction-data-size-specification.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,152 @@
---
simd: '0186'
title: Transaction Data Size Specification
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This title reads as if the proposal specifies the size of a serialized transaction rather than size of the account data loaded for processing a transaction. How about Transaction Account Data Size Specification?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went with Loaded Transaction Data Size Specification

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went with Loaded Transaction Data Size Specification

authors:
- Hanako Mumei
category: Standard
type: Core
status: Review
created: 2024-10-20
feature: (fill in with feature tracking issues once accepted)
---

## Summary

Before a transaction can be executed, every account it may read from or write to
must be loaded, including any programs it may call. The amount of data a
transaction is allowed to load is capped, and if it exceeds that limit, loading
is aborted. This functionality is already implemented in the validator. The
purpose of this SIMD is to explicitly define how loaded transaction data size is
calculated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last 2 sentences here make it sound like the SIMD just documents the existing behavior.
Let's change the last sentence so its' clear there is a proposal for a new algorithm for calculating loaded account size.


## Motivation

Transaction data size accounting is currently unspecified, and the
implementation-defined algorithm used in the Agave client exhibits some
surprising behaviors:

* BPF loaders required by top-level programs are counted against transaction
data size. BPF loaders required by CPI programs are not. If a required BPF
loader is also included in the accounts list, it is counted twice.
* The size of a program owned by LoaderV3 may or may not include the size of its
programdata depending on how the program account is used on the transaction.
Programdata is also itself counted if included in the transaction accounts list.
This means programdata may be counted zero, one, or two times per transaction.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This means programdata may be counted zero, one, or two times per transaction.
This means programdata may be counted zero, one, or two times per transaction.

🫠

* Due to certain quirks of implementation, accounts owned by loaders which do
not contain valid programs for execution may or may not be counted against the
transaction data size total depending on how it is used on the transaction. This
includes, but is not limited to, LoaderV3 buffer accounts.

All validator clients must arrive at precisely the same transaction data size
for all transactions because a difference of one byte can determine whether a
transaction is executed or failed, and thus affects consensus. Also, we want the
calculated transaction data size to correspond well with the actual amount of
data the transaction requests.

Therefore, this SIMD seeks to specify an algorithm that is straightforward to
implement in a client-agnostic way, while also accurately accounting for the
total data required by the transaction.

## New Terminology

One term is defined within the scope of this SIMD:

* Valid program: a program that has been loaded, or a builtin. This definition
excludes programs that have failed verification, or LoaderV3 programs that have
been closed or have delayed visibility due to being deployed or modified in the
current slot.

These terms are not new, however we define them for clarity:

* Top-level program: the program corresponding to the program id on a given
instruction.
* Instruction account: an account passed to an instruction, which allows its
program to view the actual bytes of the account. CPI can only happen through
programs provided as instruction accounts.
* Transaction accounts list: all accounts for the transaction, which includes
top-level programs, the fee-payer, all instruction accounts, and any extra
accounts added to the list but not used for any purpose.

## Detailed Design

The proposed algorithm is as follows:

1. Every account explicitly included on the transaction accounts list is counted
once and only once.
2. A valid program owned by LoaderV3 also includes the size of its programdata.
3. Other than point 2, no accounts are implicitly added to the total data size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edge-case for consideration.

Does it matter if the account is used as a top-level or instruction account? Realistically we should not see these if people construct txs reasonably, but if an account is:

  1. not the fee-payer
  2. not passed to any instructions
  3. not used a program

we could reasonably choose to not load it at all?

With the proposed algorithm we would still need to load it in order to find the size.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as-written it wouldnt matter, all accounts are treated the same

but thats an interesting idea, i wasnt thinking about account loading with this simd. i was only focused on how to arrive at a correct total given a set of accounts. but youre right, i cant think of a reason why we should load these kinds of accounts at all, other than the fact that we already do it now

if we do that (im in favor), there are several formulas that seem plausible to me

Option 1a: sum the sizes of the unique set of accounts instruction accounts + program ids + fee payer. any account in that set which is a LoaderV3 program account also includes its programdata

Option 1b: the same as above but programdata is only added to program size if the program is valid for execution in the current slot

Option 2: sum the sizes of the unique set of accounts instruction accounts + fee payer. add it to the sum of the sizes of the unique set of program ids, which include programdata sizes for LoaderV3 program ids

i think Option 1 is preferable. Option 2 fails to count cpi programs, but it sidesteps any issues with tombstoned programs because (after fee-only transactions... i say that a lot lately) the sizes dont matter, transactions that use them as program ids would fail anyway

Option 1a and 1b behave the same for closed programs since there is no programdata. they differ on handling programs which fail validation and programs that are changed in-slot. this is especially germaine to simd83 since transactions will be able to mess with programs in the same batch

all the options can double-count LoaderV3-owned accounts but i think this is appropriate, the compiled program and the account data if accessed in an instruction are basically their own separate things

im undecided between 1a and 1b. do you have an opinion?

Copy link
Contributor

@apfitzge apfitzge Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Difference of a and b is just with valid programs v3 program's programdata?

A bit of a detail, but can we determine if it's valid or not without loading the programdata? In the general case** assuming not in cache already.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, you need programdata because once the program account is deployed it never changes

1a is probably safer than 1b on reflection, i was thinking 1b is easy because we can just check program cache tombstones (and firedancer has to have something similar to conform) but the FailedVerification case means both our elf verifiers would need to produce identical results to maintain consensus

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah 1a seems better option to me in that case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on 1a. And I think I agree that it's best to double count programdata accounts if they are included in the ix account list since the SVM would effectively keep 2 copies in memory anyways.

Transactions may include a
`ComputeBudgetInstruction::SetLoadedAccountsDataSizeLimit` instruction to define
a data size limit for the transaction. Otherwise, the default limit is 64MiB
(`64 * 1024 * 1024` bytes).

If a transaction exceeds its data size limit, account loading is aborted and the
transaction is failed. Fees will be charged once
`enable_transaction_loading_failure_fees` is enabled.

Adding required loaders to transaction data size is abolished. They are treated
the same as any other account: counted if on the transaction accounts list, not
counted otherwise.

Read-only and writable accounts are treated the same. In the future, when direct
mapping is enabled, this SIMD may be amended to count them differently.

As a consequence of 1 and 2, for LoaderV3 programs, programdata is counted twice
if a transaction includes both programdata and the program account itself in the
accounts list, unless the program is not valid for execution. This is partly
done for ease of implementation: we always want to count programdata when the
program is included, and there is no reason for any transaction to include both
accounts except during initial deployment, in which case the program is not yet
valid.

We include programdata size in program size for LoaderV3 programs because in
nearly all cases a transaction will include the program account (the only way to
invoke the program) and will not include the programdata account because
including it serves no purpose. Including the program account forces an
unconditional load of the programdata account because it is required to compile
the program for execution. Therefore we always count it, even when the program
is an instruction account.

There is no special handling for programs owned by the native loader, LoaderV1,
or LoaderV2.

There is no special handling for non-executable accounts owned by the loaders.
These are to be included in the transaction data size total under all
circumstances.

Account size for programs owned by LoaderV4 is left undefined. This SIMD should
be amended to define the required semantics before LoaderV4 is enabled on any
network.

## Alternatives Considered

* Transaction data size accounting is already enabled, so the null option is to
enshrine the current Agave behavior in the protocol. This is undesirable because
the current behavior is highly idiosyncratic, and LoaderV3 program sizes are
routinely undercounted.
* Builtin programs are backed by accounts that only contain the program name as
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's also in the works to make (some/most) builtins just normal programs, so adding any special cases here would probably be nullified in the future anyway

a string, typically making them 15-40 bytes. We could make them free when not
instruction accounts, since they're part of the validator. However this
adds complexity for no real benefit.

## Impact

The primary impact is this SIMD makes correctly implementing transaction data
size accounting much easier for other validator clients.

It makes transactions which include program accounts for CPI somewhat larger,
but given the generous 64MiB limit, it is unlikely that any existing users will
be affected.

## Security Considerations

Security impact is minimal because this SIMD merely simplifies an existing
feature.

This SIMD requires a feature gate.

## Backwards Compatibility

Transactions that currently have a total transaction data size close to the
64MiB limit, which call LoaderV3 programs via CPI, may now exceed it and fail.
Loading