Verifiable Compute for Akash Network #614

sriramvish · 2024-06-27T17:55:46Z

sriramvish
Jun 27, 2024

Introduction

Verifiable computing is an entire class of algorithms or systems, where a particular portion of the compute stack is verifiable/provable in a trustless manner to participants within a decentralized network. Verifiable computing can take many forms, including:
Verifiable provisioning of hardware: This corresponds to the case where we desire to verify the nature and extent to which a piece of hardware is provisioned for the Akash network.

Specifically, if a 4090 GPU were to be incorporated in the Akash network, verifiable provisioning ensures that it indeed matches its hardware specifications, and it is genuinely allocated for functions on the Akash network.

Verifiable execution of program/software: This corresponds to the case where a program (any AI program, ranging from inference to training) is correctly executed on a node/set of nodes in the Akash network. For example, that a particular piece of code was executed correctly in a cluster of 4090s on the Akash network. Verifiable execution of programs/software also comes in multiple flavors, including:
Non-real-time: An offline verification mechanism that presents a proof in non-real-time, where the proof has no time or size constraints.
Optimistic, real-time proofs: An optimistic proof mechanism that can be verified or contested in (near) real time.
Zero knowledge, real-time proofs: A zero knowledge proof mechanism (that does not reveal anything about the inputs but can still be verified, in (near) real time.

In this proposal, for the first year of this project, we focus on only the first type of verifiability: That of provisioning of hardware. After the completion of this first portion of the project, a further proposal will be submitted on non-real-time and subsequently, real-time verifiable computing within the Akash network.

Benefits to Akash Network

The need for verifiable provisioning of hardware is significant for a variety of reasons, including the elimination/reduction of Sybil attacks, and of other forms of misrepresentation and abuse in the network.

Verifiable Hardware Provisioning

Verifiable hardware provisioning can be achieved in a variety of ways: by using schemes uniquely associated with particular types of hardware, by using access patterns and footprints associated with a particular make and model, and other ways. However, these schemes are dependent on hardware configurations and do not necessarily generalize well. In order to develop a scalable, universal solution, we take a trusted enclave (trusted execution environment) approach as follows:
Akash providers that intend to be “hardware verifiable” are equipped with a TEE, configured by Akash (such as Trusty [1], for more information on TEE, see tutorial [2]). Such a TEE contains a physically unclonable function (a PUF, see [3]) that can securely sign transactions. To ensure uniformity, this TEE will be designed to be a USB A/C dongle that can be attached to any hardware configuration.

We will verify that the USB A/C dongle can be attached to any hardware configuration and provide a detailed set of instructions to install and use this dongle to enable each provider to become “hardware verifiable” on Akash.

This TEE will periodically perform the following two tasks, based on an internal pseudo-random timer:

Identification task:

Following a pseudo-random clock, the TEE will query every GPU in the specific Akash provider on its status and device-level details.

Provisioning task:

Periodically and randomly, a random machine learning task will be assigned to the GPUs within this provider. These provisioning tasks are based on existing, well known benchmarks on the performance of GPUs to certain deep learning tasks, including particular types of models [4], more general deep learning models [5] and other tasks that are well known benchmarks on existing GPUs [6].

After the conclusion of each type of pseudorandomly repeated task, the TEE will securely sign the message, and will share the secure message with the Akash network.

The tasks are used to ensure the following properties:

Identification task:

The identification task sets up the base configuration for each GPU cluster, and assigns a unique signature associated with the TEE with that cluster. As the identification is performed at the operating system level, it can potentially be spoofed, and therefore, the provisioning/benchmarking tasks are required.

Provisioning task:

The provisioning/benchmarking tasks verifies the identification while simultaneously ensuring that the associated GPUs are dedicated for the Akash network and are not prioritizing other tasks. In case they are not provisioned for Akash network, they will fail the provisioning task.

A key point is that both the entire system (user, operating system) cannot differentiate between a provisioning/benchmarking tasks and a regular AI workload provided by the Akash network, and therefore cannot selectively serve a particular type of workload/task. This ensures that the GPUs are both correctly identified and are made available to Akash network-centric tasks at all times.

Team

The team for this project is led by Prof. Sriram Vishwanath from The University of Texas, Austin. Sriram Vishwanath is a professor at The University of Texas, Austin and Shruti Raghavan is a PhD candidate in Computer Science at UT Austin. They are working together with the Harvard Medical School and MITRE on the design of new foundation/base models in healthcare, with causal learning incorporated into such a platform.

Sriram Vishwanath received the B. Tech. degree in Electrical Engineering from the Indian Institute of Technology (IIT), Madras, India in 1998, the M.S. degree in Electrical Engineering from California Institute of Technology (Caltech, Pasadena USA in 1999, and the Ph.D. degree in Electrical Engineering from Stanford University, Stanford, CA USA in 2003. Currently, he is Professor in the Chandra Department of Electrical and Computer Engineering at The University of Texas at Austin, and recently, a Technical Fellow for Distributed Systems and Machine Learning at MITRE Labs.

Timeline

The timeline for this project is as follows:
Open Discussions: Starting end of June 2024
Governance Proposal: Through first half of July, 2024
Design Phase: Through Q3 and Q4 2024
Hacknet TEE Phase: Q1 2025
Devent TEE Phase: Q2 2025
Conclusion of Hardware Provisioning testing and handover to Akash Team: End of Q2 2025
Note: This is subject to change based on feedback

Deliverables

Q3 2024 - High Level Design
Q4 2024 - Design Specification
Q1 2025 - Initial Hacknet Prototype
Q2 2025 - Devnet and Conclusion of Testing

Budget

The tentative budget for this project is presented in the spreadsheet attached here: https://docs.google.com/spreadsheets/d/1asmvyi5r7QgKRjsImZInAENXptr_cwoW/edit?usp=sharing&ouid=103645797398143147236&rtpof=true&sd=true).

The high-level breakdown for the budget is:
R&D Costs (Student salaries + tuition + University Overhead): 146,547
Akash Computing/Hardware Costs: 75,000

Disbursement:

Disbursement will happen in two increments, coinciding with the few weeks before the beginning of each semester - Fall 2024 (on July 22nd 2024) and Spring 2025 (December 15 2024).

FAQ

Among the different families of GPUs on Akash Network, which sets will be prioritized in this project?

Response: As H100s come with integrated TEEs, we will focus this project on older generations of GPUs, with a particular emphasis on RTX series of GPUs.

This FAQ section will be populated as soon as we have questions arising from discussions.

References

[1] Trusty TEE: Android Open Source Project https://source.android.com/docs/security/features/trusty
[2] TEE 101 White Paper https://www.securetechalliance.org/wp-content/uploads/TEE-101-White-Paper-FINAL2-April-2018.pdf
[3] Shamsoshoara, Alireza, et al. "A survey on physical unclonable function (PUF)-based security solutions for Internet of Things." Computer Networks 183 (2020): 107593.
[4] Wang, Yu Emma, Gu-Yeon Wei, and David Brooks. "Benchmarking TPU, GPU, and CPU platforms for deep learning." arXiv preprint arXiv:1907.10701 (2019).
[5] Shi, Shaohuai, et al. "Benchmarking state-of-the-art deep learning software tools." 2016 7th International Conference on Cloud Computing and Big Data (CCBD). IEEE, 2016.
[6] Araujo, Gabriell, et al. "NAS Parallel Benchmarks with CUDA and beyond." Software: Practice and Experience 53.1 (2023): 53-80.

brewsterdrinkwater · 2024-07-02T22:13:21Z

brewsterdrinkwater
Jul 2, 2024
Maintainer

Sriram talked about discussion 614 during the Akash Steering Committee meeting on June 27th, 2024.

The video can be found here, starting at the 24 minute mark: June 27th, 2024 Akash Network Steering Committee Monthly Meeting Video Recording

You can catch up on all past Akash Steering Committee meetings here.

0 replies

anilmurty · 2024-07-03T15:59:53Z

anilmurty
Jul 3, 2024
Maintainer

Thank you for writing this incredibly important proposal that adds a great amount of value to Akash Network by solving one of the harder challenges in growing the only fully permissionless network as well as enabling onchain incentives.

As @brewsterdrinkwater mentioned above - this was presented and discussed at the last Steering Comittee meeting with support from everyone there. As such, the core team (at OCL) will be working with @sriramvish to get this proposal Onchain for a formal vote, in the next day or two

0 replies

31trainman · 2024-07-04T15:05:01Z

31trainman
Jul 4, 2024

I'm working on something similar with Naoris protocol. I'd be very interested in talking with you on discord. My discord is 31trainman.

0 replies

instafinanzas · 2024-07-06T15:04:04Z

instafinanzas
Jul 6, 2024

Thanks for stepping in and sharing, I still need to catch up with the last steering committee call recording, however, this text proposal looks solid and the benefits this would bring to Akash network are clear: reducing abuse within the Akash network through verifiable provisioning of hardware

0 replies

88plug · 2024-07-08T17:20:12Z

88plug
Jul 8, 2024

I do not think this is the proper way to approach verifiable compute for the Akash network. There is a much simpler and economical solution which also has already been tested and produced very good results. There is an existing benchmarking API which simply needs funding for 24/7/365 operation that can easily check all providers on the network. Additionally I do not think there is any value in a 3rd party USB key that plugs into anything. If you run a provider and learn how Kubernetes works, this is not a reasonable solution at scale. Your plan also requires paying your tuition and hiring graduate students to "research" solutions, I do not think community funds should be used for this and instead should be used for builders who have working solutions (functional POC and beyond) and deep familiarity with the Akash network. Your budget does not make sense and is unreasonable.

$96,216 in Tuition and "Research Assistants"

Other - Tuition | Use "Other" General Cost in SmartForm. |   |   |   | 29,016

Graduate Research Assistant | Grad Student | 33,600
-- | -- | --
Graduate Research Assistant | Grad Student | 33,600

$52,907 "indirect costs" with no explanation.

Total Indirect Costs |   |   | 52,907
-- | -- | -- | --

$75k for Akash hardware - no explanation or rational behind this amount in your proposal.

Akash Computing/Hardware Costs |   | 75,000
-- | -- | --

Finally, please feel free to reach out to me to work on the Benchmarking API solutions that have already been developed.

I voted No and I would recommend others vote No as well.

3 replies

sriramvish Jul 8, 2024
Author

Dear 88plug, Thank you for the feedback. On scalability of the solution, as it does the benchmarking and identification offchain, it will work scalably and does not need a central coordinator. The USB itself is inexpensive (is effectively equivalent to an e-SIM). As it pseudorandomly generates the workload, it is able to make sure the provisioning is also correct. Happy to answer any further questions you might have as well.

KamuelBob Jul 9, 2024

We have also voted No in the absence of sufficient responses to 88plug's comments.

88plug Jul 9, 2024

Dear 88plug, Thank you for the feedback. On scalability of the solution, as it does the benchmarking and identification offchain, it will work scalably and does not need a central coordinator. The USB itself is inexpensive (is effectively equivalent to an e-SIM). As it pseudorandomly generates the workload, it is able to make sure the provisioning is also correct. Happy to answer any further questions you might have as well.

Dear sriramvish,

Thank you for your input, but I must strongly disagree with the current proposal for several reasons:

Scalability and Practicality:
- The proposal's reliance on USB dongles and TEEs is impractical at scale. These physical devices add unnecessary complexity and cost, particularly when existing software-based solutions are available and have proven effective.
Existing Solutions:
- There is already a functional and efficient benchmarking API that can continuously monitor and verify hardware on the Akash network. This API does not require the additional overhead of physical devices or the extensive R&D proposed. Investing in this existing solution would be far more economical and efficient.
Budget Concerns:
- The proposed budget is excessively high, particularly the $96,216 allocated for tuition and research assistants. Community funds should prioritize actionable solutions over academic research costs.
- The $52,907 in "indirect costs" lacks clear justification. Transparent and detailed budget allocations are essential for community trust.
- The $75,000 for Akash hardware also lacks detailed explanation. Why is this amount necessary, and what specific hardware will it be used for?
- Overall, the budget lacks sufficient detail to understand precisely where and how the funds will be allocated. Without clear, itemized expenses, it's challenging to ensure that the funds will be used effectively and responsibly.
Community Focus:
- Community funds should be directed towards builders with proven solutions and deep familiarity with the Akash network. This proposal seems to focus more on academic research rather than practical, immediate solutions for the network.

Given these points, I urge others to consider voting NO on this proposal and to support more practical, cost-effective solutions like the existing benchmarking API. Let’s focus our resources on solutions that are already working and can be scaled efficiently without unnecessary overhead.

Thank you for considering these points.

Best regards,
88plug

Zblocker64 · 2024-07-10T17:42:03Z

Zblocker64
Jul 10, 2024

I will be voting no. With no real answer to some questions on the discussion, it looks like it needs more discussion in my opinion. It’s not that I’m against someone developing a verifiable compute add on, I just don’t think the USB stick makes sense and without reply’s to discussion questions makes me want to know more before I can vote yes.

0 replies

lechenghiskhan · 2024-07-12T16:49:37Z

lechenghiskhan
Jul 12, 2024
Collaborator

Thank you all for contributing your thoughts @Zblocker64 @88plug @KamuelBob. Your feedback and participation is what helps make our community so robust.

I would like to point out some of the practical points that make this effort extremely beneficial to Akash Network:

Using a third-party USB device and the concern around the scalability of a hardware device. This might seem foreign, but we have already adapted this into our lives as crypto natives. I'll point to the Ledger, Trezor, and other hardware wallets we use to secure our funds. These third-party hardware devices initially seemed foreign but ultimately became the gold standard across crypto. I posit that perhaps a simple, secure USB device, purpose-built for verifying compute for DePIN marketplaces can fulfill its intended purpose? Either way, Akash Network has the opportunity to be the first to pioneer a solution to the verifiable compute challenge.
No matter how verification is done, it is essential for one particular purpose for Akash Network Provider Incentives: unlocking Liquidity Pool-styled incentives, which will scale Akash's provider lineup dynamically and seamlessly alongside the Committed Pool's programs. Without proper, robust verification for the programmatic distribution of incentives, spoofing and Sybil attacks become an untenable problem. This has already played out in multiple other DePIN clouds outside of Akash. This effort, if successful, can help dramatically scale the number of quality providers on Akash programmatically vs slower, manually administered systems.
Akash Network stands to benefit from the reputational gain from working with well-respected researchers and Universities, which also paves the way for an entirely new niche of adoption for Akash Network. Powering and funding AI projects with major Universities opens up and formalizes a real cohort of users who are almost exclusively locked into traditional clouds. Students of today create companies of tomorrow and giving the universities and students the rails to build on Akash paves the way towards the pervasive mainstream adoption of Akash.

Improvements for future props and discussions

I would encourage a more detailed breakdown of future budgets.
Talk more thoroughly about the practical tradeoffs regarding the technical and practical choices the proposal's approach makes

Improvements to community feedback

@88plug you mention there's already something out there that accomplishes what this proposal aims to achieve. Can you show this off in a demon during a SIG or WG? Can you help us understand how this can be adopted and if it's feasible?

Good feedback should present tangible and reasonable solutions.

8 replies

31trainman · 2024-07-17T16:57:10Z

31trainman
Jul 17, 2024

Yo @sriramvish do you have a discord or anything? I'm working on integrating Naoris security protocol into akash network and your idea fits into my security integration.

0 replies

brewsterdrinkwater · 2024-09-16T15:19:25Z

brewsterdrinkwater
Sep 16, 2024
Maintainer

This proposal was put on chain and passed: https://www.mintscan.io/akash/proposals/261

Work will be tracked here: https://github.com/orgs/akash-network/projects/5?pane=issue&itemId=71129502

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Akash Network

Verifiable Compute for Akash Network #614

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 9 comments 11 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Verifiable Compute for Akash Network #614

Introduction

Benefits to Akash Network

Verifiable Hardware Provisioning

Identification task:

Provisioning task:

Identification task:

Provisioning task:

Team

Timeline

Deliverables

Budget

Disbursement:

FAQ

References

Replies: 9 comments · 11 replies

brewsterdrinkwater Jul 2, 2024 Maintainer

anilmurty Jul 3, 2024 Maintainer

sriramvish Jul 8, 2024 Author

lechenghiskhan Jul 12, 2024 Collaborator

I would like to point out some of the practical points that make this effort extremely beneficial to Akash Network:

Improvements for future props and discussions

Improvements to community feedback

brewsterdrinkwater Sep 16, 2024 Maintainer

Replies: 9 comments 11 replies

brewsterdrinkwater
Jul 2, 2024
Maintainer

anilmurty
Jul 3, 2024
Maintainer

sriramvish Jul 8, 2024
Author

lechenghiskhan
Jul 12, 2024
Collaborator

brewsterdrinkwater
Sep 16, 2024
Maintainer