
Deprecate IPFS as default publisher #3862

Closed
11 of 12 tasks
wdbaruni opened this issue Jan 25, 2024 · 5 comments
Labels
comp/publisher/ipfs — Issues related to IPFS publisher
th/production-readiness — Get ready for production workloads
Milestone
v1.3.1

Comments

@wdbaruni
Member

wdbaruni commented Jan 25, 2024

The Problem

Today, when users run bacalhau serve, an embedded, private ipfs node is also created within the bacalhau process. The original intention was to let people easily test bacalhau with a self-contained setup; it was never intended for production use, or for any use beyond just trying out bacalhau. The recommended approach has been for people to run their own ipfs node outside of bacalhau and connect to it using bacalhau serve --ipfs-connect <addr>

In addition, when users submit a job using bacalhau docker run without the --publisher flag, the CLI uses ipfs as the default publisher, and the job fails if the compute node does not support ipfs.
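
For concreteness, the current behaviour and the recommended setup look roughly like this; the multiaddr is a placeholder for an actual ipfs API address:

    # Current behaviour: serve spins up an embedded, private ipfs node
    # inside the bacalhau process.
    bacalhau serve

    # Recommended: run your own ipfs node and let bacalhau connect to it
    # as a client (placeholder multiaddr).
    bacalhau serve --ipfs-connect /ip4/127.0.0.1/tcp/5001

    # Without --publisher, the CLI currently defaults to ipfs, and the job
    # fails if the compute node does not support the ipfs publisher.
    bacalhau docker run ubuntu echo hello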

There are many problems with our current approach, including:

  1. Bacalhau compute nodes are acting as both compute and storage nodes. Even if we replace ipfs with something else, we should avoid coupling the two together, and instead let bacalhau integrate with different storage services as a client rather than embedding them. This enables bacalhau to work well with existing data, simplifies operations for users, and gives operators the choice of deploying storage on the same machines as the compute nodes, or next to them in the same rack or region.
  2. ipfs is meant for public networks, and we had to take extra steps to make it work as a private network. Still, I am confident it is not yet secure enough, with many potential leaks: for example, if an authorized node joins the private network while also peering with the public network, it could make all the private data accessible to the public network (I think)
  3. I am not sure what value ipfs is providing for our users and our use case. ipfs solves content discovery and routing, but we already know where the data is and where to get it from, since we keep track of the compute node that handled the job and generated the results. If someone is already using ipfs or has other use cases for it, great; we will continue to support an integration with ipfs as a client using --ipfs-connect. But there is no value in introducing ipfs to fresh new users as the default and recommended storage for bacalhau

The Proposal

  1. Option 1: My first recommendation is to not provide any default publisher. If a user runs a job that prints results to stdout, they can use bacalhau job logs to get the output (I understand logs need some improvements). If they want to publish results to a remote destination, they can specify the URI to publish to (e.g. s3://..., ipfs://), and the job will be routed to compatible compute nodes (a CLI sketch follows this list). This means that bacalhau serve without any flags or configuration will support jobs that don't publish results to remote destinations, which is more than enough for testing out bacalhau. In the future, as we've discussed, the requester node could populate job defaults, such as a default publisher pointing at some s3 bucket, so that users no longer need to specify the remote destination in their job submission
  2. Option 2: Implement a new local publisher that keeps results locally on the compute nodes and makes them accessible through a simple HTTP-based API, so that local results can be sent from compute nodes directly back to the client (a retrieval sketch follows below). This should enable bacalhau serve to work out of the box, and bacalhau get to retrieve results that were not published to stdout. The critical point I am hoping to make clear is that this should only be for testing purposes, to simplify trying out bacalhau, and shouldn't be used for actual workloads. Bacalhau is not a storage service: we don't do replication, storage monitoring, or moving data around when a node is shutting down. To avoid repeating the mistake of the embedded ipfs node becoming more than a testing tool, we should:
    1. Print a warning when a user calls bacalhau get on results stored with the local publisher
    2. The requester will NOT do any tunneling or act as a gateway that retrieves data from compute nodes and forwards it to the client. This means the client must be on the same network as the compute node to fetch the results, which should be acceptable for someone trying out bacalhau. This limitation not only avoids additional complexity in the requester node, but also makes it explicit that this publisher is not for production workloads
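
To make Option 1 concrete, here is a rough sketch of the proposed CLI flow. This is illustrative of the proposal, not of current behaviour; the bucket name and job ID are placeholders:

    # No default publisher: a stdout-only job still works, and its output
    # is retrieved via job logs rather than a published result.
    bacalhau docker run ubuntu echo hello
    bacalhau job logs <job-id>

    # Publishing is explicit: the user names the destination URI, and the
    # job is routed to compute nodes that support that publisher.
    bacalhau docker run --publisher s3://my-results-bucket/outputs ubuntu echo hello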

I've discussed this with @aronchick before, and he wasn't inclined toward the first option. I am putting both options here for completeness and to make sure the tradeoffs are clear.
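
For Option 2, here is a minimal sketch of what client-side retrieval might look like, assuming a hypothetical HTTP endpoint exposed by the compute node; the host, port, and path are illustrative only, not an implemented API:

    # Hypothetical: fetch locally-stored results directly from the compute
    # node. The requester does not proxy, so the client must be on the same
    # network as the compute node.
    curl -o results.tar.gz http://compute-node.local:6001/results/<execution-id>

    # bacalhau get would resolve the compute node's address for the job and
    # print a warning that the results live on local, unreplicated storage.
    bacalhau get <job-id>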

Priority

  • Priority: 1 (High)
  • Quarter: 2024Q1
  • Justification:
    • ipfs is causing a lot of pain, and it will be great to deprecate it as soon as we can
    • we are launching on the marketplace soon, and we need to figure out what else we need to deploy on our nodes and what services users need to be aware of
    • we are not storing results in a very secure way

Tasks

  1. comp/publisher/ipfs type/epic — assignee: wdbaruni
  2. assignee: MichaelHoepler
@wdbaruni wdbaruni transferred this issue from bacalhau-project/bacalhau Jan 25, 2024
@wdbaruni wdbaruni added the th/production-readiness Get ready for production workloads label Jan 25, 2024
@rossjones
Contributor

@wdbaruni for a user who may have signed up via their organisation's expanso-managed account, the publisher would most likely be defined organisation-wide. Previously, when I've been through security audits, a lot of the concern has been that data might be exfiltrated from the cluster, and in those situations it would potentially be useful for us to disallow users from specifying their own publisher and rely solely on the organisation's configured publisher.

@rossjones rossjones self-assigned this Feb 5, 2024
@rossjones
Contributor

rossjones commented Feb 20, 2024

IPFS is no longer the default publisher; the default is now Noop, but it is settable. Any changes we want to make to the local publisher we can put into other tickets.

@rossjones
Contributor

The IPFS node is still embedded; re-opening to make sure we remove it.

@rossjones rossjones reopened this Mar 27, 2024
@rossjones
Contributor

Have updated #3816 with a recommendation for making this change over two releases.

@wdbaruni wdbaruni transferred this issue from another repository Apr 21, 2024
@wdbaruni wdbaruni added the comp/publisher/ipfs Issues related to IPFS publisher label Apr 22, 2024
@frrist frrist added this to the v1.3.1 milestone Apr 29, 2024
@frrist
Member

frrist commented May 9, 2024

I am going to close this issue, as IPFS as the default publisher has been deprecated in favor of the local publisher. The remaining items of work can be completed outside of this issue, which has grown into an epic.

@frrist frrist closed this as completed May 9, 2024
@github-project-automation github-project-automation bot moved this from Inbox to Done in Engineering Planning May 9, 2024