Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] snarkVM fails to find transaction ID when executing programs with complex call graph #2570

Open
HarukaMa opened this issue Nov 9, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@HarukaMa
Copy link
Contributor

HarukaMa commented Nov 9, 2024

🐛 Bug Report

When trying to figure out the transition ordering rule, I tried to write some programs with a fairly complex call graph, but the test execution on my local network was rejected by snarkVM with reason Child transition ID not found.

https://github.com/AleoNet/snarkVM/blob/bdfc0c43975df5832c44e3184f3b4a245880a1aa/synthesizer/process/src/finalize.rs#L266-L275

Steps to Reproduce

The program used:

outer.aleo
import inner.aleo;
import nofin.aleo;
import mid.aleo;

program outer.aleo;

function call_mid:
    input r0 as field.public;
    add r0 1field into r1;
    call nofin.aleo/do r1 into r2;
    call mid.aleo/save_mid_rand r0 into r3 r4;
    call nofin.aleo/do r2 into r5;
    call mid.aleo/save_mid_rand r3 into r6 r7;
    call nofin.aleo/do r5 into r8;
    async call_mid r4 r7 into r9;
    output r8 as field.public;
    output r9 as outer.aleo/call_mid.future;

finalize call_mid:
    input r0 as mid.aleo/save_mid_rand.future;
    input r1 as mid.aleo/save_mid_rand.future;
    await r0;
    await r1;

mid.aleo
import inner.aleo;
import nofin.aleo;
program mid.aleo;

mapping rand_store:
    key as u8.public;
    value as u128.public;

function save_mid_rand:
    input r0 as field.public;
    call nofin.aleo/do r0 into r1;
    call inner.aleo/save_inner_rand r1 into r2;
    call nofin.aleo/do r1 into r3;
    async save_mid_rand r2 into r4;
    output r3 as field.public;
    output r4 as mid.aleo/save_mid_rand.future;

finalize save_mid_rand:
    input r0 as inner.aleo/save_inner_rand.future;
    await r0;
    rand.chacha into r1 as u128;
    set r1 into rand_store[0u8];
inner.aleo
// The 'inner.aleo' program.
program inner.aleo;

mapping rand_store:
    key as u8.public;
    value as u128.public;

function save_inner_rand:
    input r0 as field.public;
    async save_inner_rand r0 into r1;
    output r1 as inner.aleo/save_inner_rand.future;

finalize save_inner_rand:
    input r0 as field.public;
    rand.chacha r0 into r1 as u128;
    set r1 into rand_store[0u8];
nofin.aleo
program nofin.aleo;


function do:
    input r0 as field.public;
    mul r0 100field into r1;
    output r1 as field.public;

Call graph:

outer
|-nofin
|-mid
| |-nofin
| |-inner
| \-nofin
|-nofin
|-mid
| |-nofin
| |-inner
| \-nofin
\-nofin

outer.aleo/call_mid calls will be rejected by snarkVM.

Additional test case: switch the order of the two await commands in outer.aleo.

Expected Behavior

The program should execute without being rejected.

Your Environment

  • snarkVM canary-v1.1.1
  • Rust 1.81
  • Debian
@HarukaMa HarukaMa added the bug Something isn't working label Nov 9, 2024
@d0cd
Copy link
Collaborator

d0cd commented Nov 12, 2024

Thank you for filing this issue @HarukaMa!

Fundamentally, while the call graph is constructed correctly, it is used in an unsound way to find the transition ID that corresponds to Future that is passed in as an operand to an await command (source). The child_transition_id linked above is used to initialize the FinalizeRegisters. The transition_id in FinalizeRegisters is only used to initialize an RNG in the rand instructions. The purpose is the ensure that on-chain logic (in finalize scopes) have a unique seed each time they are invoked.

The current implementation has the following issues:

  1. In the case where an async function calls a mix of async and non-async functions, a non-existent transition is used in the call graph. This result in the transaction getting rejected.
  2. In the case where only async calls are made, but are awaited in a different order than the call graph, an incorrect transition ID is used for the FinalizeRegisters.
    As a result, the impacted programs are those that use a mix of async/non-async external calls and those that use the rand opcode.

There are a number of possible ways this could be addressed:

  1. Update the implementation to get the correct transition ID. This would require more information to be passed in during finalization.
  2. Add a transition ID to the Future. This would be a breaking change in the data format, but conceptually sound.
  3. Fix the use of the transition ID to not require the precise sub-transition, while maintaining the invariant that RNG seeds are unique. This solution is prototyped here.

Some points that the solution should address are:

  • how is backwards compatibility ensured?
  • how should existing programs migrate?

@vicsn
Copy link
Collaborator

vicsn commented Nov 12, 2024

Amazing find and analysis so far! Starting to wrap my head around this and have some questions, hope it helps to nail down the right solution:

how should existing programs migrate?

Can you explain what you mean with this, with an example? Are you referring to potential programs which are deployed which encounter this specific call graph issue?

Fix the use of the transition ID to not require the precise sub-transition, while maintaining the invariant that RNG seeds are unique. This solution is prototyped here.

In your prototype, is the returned child_transition_id actually a child? If not, should we consider renaming?

Would using the existing call_counter variable instead of the new nonce in the preimage suit the same goal?

Also, is this comment in the wrong conditional?

@HarukaMa
Copy link
Contributor Author

I think it should be mostly fine for existing programs, as:

  1. While programs with rand commands might have generated their random data with wrong seeds before, they are still random data after all, and functionally it should still be the same after we update the underlying implementation;
  2. Programs whose execution was previously rejected will be "fixed" by the update, making it work as intended instead of breaking something else.

@d0cd
Copy link
Collaborator

d0cd commented Nov 12, 2024

Can you explain what you mean with this, with an example? Are you referring to potential programs which are deployed which encounter this specific call graph issue?

At a high level, we'd want to determine which programs have been impacted and offer users a path for migration if relevant.
I don't suspect that many programs will need to, but it would be the right thing to do.

In your prototype, is the returned child_transition_id actually a child? If not, should we consider renaming?

They are not and we could definitely rename for more clarity.

Would using the existing call_counter variable instead of the new nonce in the preimage suit the same goal?

The existing call_counter would not work, as it is not unique w.r.t to the async execution of the top-level future.
For example, suppose you have the following call graph:

A  
|-B 
|-B

All instances of A and B will have the same call_counter (0) when they are initialized, and thus the RNG seed would also be the same. The call_counter is only incremented when an async scope has an await and it is only incremented locally. We could modify the call_counter so that it functions as a unique counter, but I feel that it would be less clear.

Also, is this comment in the wrong conditional?

Good catch, will clean that up.

@d0cd
Copy link
Collaborator

d0cd commented Jan 17, 2025

@HarukaMa would love to get your input on #2596 if you've got the time! The tests in the PR are inspired by your design in the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants