
Deprecate IPFS as default publisher #3862

Closed
11 of 12 tasks
wdbaruni opened this issue Jan 25, 2024 · 5 comments
Labels
comp/publisher/ipfs — Issues related to IPFS publisher
th/production-readiness — Get ready for production workloads
Milestone
v1.3.1

Comments

@wdbaruni
Member

wdbaruni commented Jan 25, 2024

The Problem

Today, when users run bacalhau serve, an embedded, private ipfs node is also created within the bacalhau process. The original intention was to let people easily test bacalhau with a self-contained setup; it was never intended for production use, or for any use beyond just trying out bacalhau. The recommended approach has been for people to run their own ipfs node outside of bacalhau and connect to it using bacalhau serve --ipfs-connect <addr>

In addition, when users submit a job using bacalhau docker run without the --publisher flag, the CLI uses ipfs as the default publisher, and the job fails if the compute node does not support ipfs.
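
For concreteness, the current behaviour and the recommended setup look roughly like this; the multiaddr is a placeholder for an actual ipfs API address:

    # Current behaviour: serve spins up an embedded, private ipfs node
    # inside the bacalhau process.
    bacalhau serve

    # Recommended: run your own ipfs node and let bacalhau connect to it
    # as a client (placeholder multiaddr).
    bacalhau serve --ipfs-connect /ip4/127.0.0.1/tcp/5001

    # Without --publisher, the CLI currently defaults to ipfs, and the job
    # fails if the compute node does not support the ipfs publisher.
    bacalhau docker run ubuntu echo hello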

There are many problems with our current approach, including:

  1. Bacalhau compute nodes are acting as both compute and storage nodes. Even if we replace ipfs with something else, we should avoid coupling the two together, and instead let bacalhau integrate with different storage services as a client rather than embedding them. This enables bacalhau to work well with existing data, simplifies operations for users, and gives operators the choice of deploying storage on the same machines as the compute nodes, or next to them in the same rack or region.
  2. ipfs is meant for public networks, and we had to take extra steps to make it work as a private network. Still, I am confident it is not yet secure enough, with many potential leaks: for example, if an authorized node joins the private network while also peering with the public network, it could make all the private data accessible to the public network (I think)
  3. I am not sure what value ipfs is providing for our users and our use case. ipfs solves content discovery and routing, but we already know where the data is and where to get it from, since we keep track of the compute node that handled the job and generated the results. If someone is already using ipfs or has other use cases for it, great; we will continue to support an integration with ipfs as a client using --ipfs-connect. But there is no value in introducing ipfs to fresh new users as the default and recommended storage for bacalhau

The Proposal

  1. Option 1: My first recommendation is to not provide any default publisher. If a user runs a job that prints results to stdout, they can use bacalhau job logs to get the output (I understand logs need some improvements). If they want to publish results to a remote destination, they can specify the URI to publish to (e.g. s3://..., ipfs://), and the job will be routed to compatible compute nodes (a CLI sketch follows this list). This means that bacalhau serve without any flags or configuration will support jobs that don't publish results to remote destinations, which is more than enough for testing out bacalhau. In the future, as we've discussed, the requester node could populate job defaults, such as a default publisher pointing at some s3 bucket, so that users no longer need to specify the remote destination in their job submission
  2. Option 2: Implement a new local publisher that keeps results locally on the compute nodes and makes them accessible through a simple HTTP-based API, so that local results can be sent from compute nodes directly back to the client (a retrieval sketch follows below). This should enable bacalhau serve to work out of the box, and bacalhau get to retrieve results that were not published to stdout. The critical point I am hoping to make clear is that this should only be for testing purposes, to simplify trying out bacalhau, and shouldn't be used for actual workloads. Bacalhau is not a storage service: we don't do replication, storage monitoring, or moving data around when a node is shutting down. To avoid repeating the mistake of the embedded ipfs node becoming more than a testing tool, we should:
    1. Print a warning when a user calls bacalhau get on results stored with the local publisher
    2. The requester will NOT do any tunneling or act as a gateway that retrieves data from compute nodes and forwards it to the client. This means the client must be on the same network as the compute node to fetch the results, which should be acceptable for someone trying out bacalhau. This limitation not only avoids additional complexity in the requester node, but also makes it explicit that this publisher is not for production workloads
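
To make Option 1 concrete, here is a rough sketch of the proposed CLI flow. This is illustrative of the proposal, not of current behaviour; the bucket name and job ID are placeholders:

    # No default publisher: a stdout-only job still works, and its output
    # is retrieved via job logs rather than a published result.
    bacalhau docker run ubuntu echo hello
    bacalhau job logs <job-id>

    # Publishing is explicit: the user names the destination URI, and the
    # job is routed to compute nodes that support that publisher.
    bacalhau docker run --publisher s3://my-results-bucket/outputs ubuntu echo hello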

I've discussed this with @aronchick before, and he wasn't inclined toward the first option. I am putting both options here for completeness and to make sure the tradeoffs are clear.
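
For Option 2, here is a minimal sketch of what client-side retrieval might look like, assuming a hypothetical HTTP endpoint exposed by the compute node; the host, port, and path are illustrative only, not an implemented API:

    # Hypothetical: fetch locally-stored results directly from the compute
    # node. The requester does not proxy, so the client must be on the same
    # network as the compute node.
    curl -o results.tar.gz http://compute-node.local:6001/results/<execution-id>

    # bacalhau get would resolve the compute node's address for the job and
    # print a warning that the results live on local, unreplicated storage.
    bacalhau get <job-id>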

Priority

  • Priority: 1 (High)
  • Quarter: 2024Q1
  • Justification:
    • ipfs is causing a lot of pain, and it will be great to deprecate it as soon as we can
    • we are launching on the marketplace soon, and we need to figure out what else we need to deploy on our nodes and what services users need to be aware of
    • we are not storing results in a very secure way

Tasks

  1. comp/publisher/ipfs type/epic — assignee: wdbaruni
  2. assignee: MichaelHoepler
@wdbaruni wdbaruni transferred this issue from bacalhau-project/bacalhau Jan 25, 2024
@wdbaruni wdbaruni added the th/production-readiness Get ready for production workloads label Jan 25, 2024
@rossjones
Contributor

@wdbaruni for a user who may have signed up via their organisation's expanso-managed account, the publisher would most likely be defined organisation-wide. Previously, when I've been through security audits, a lot of the concern has been that data might be exfiltrated from the cluster, and in those situations it would potentially be useful for us to disallow users from specifying their own publisher and rely solely on the organisation's configured publisher.

@rossjones rossjones self-assigned this Feb 5, 2024
@rossjones
Contributor

rossjones commented Feb 20, 2024

IPFS is no longer the default publisher; the default is now Noop, but it is settable. Any changes we want to make to the local publisher we can put into other tickets.

@rossjones
Contributor

The IPFS node is still embedded; re-opening to make sure we remove it.

@rossjones rossjones reopened this Mar 27, 2024
@rossjones
Contributor

Have updated #3816 with a recommendation for making this change over two releases.

@wdbaruni wdbaruni transferred this issue from another repository Apr 21, 2024
@wdbaruni wdbaruni added the comp/publisher/ipfs Issues related to IPFS publisher label Apr 22, 2024
@frrist frrist added this to the v1.3.1 milestone Apr 29, 2024
@frrist
Member

frrist commented May 9, 2024

I am going to close this issue, as IPFS as the default publisher has been deprecated in favor of the local publisher. The remaining items of work can be completed outside of this issue, which has grown into an epic.

@frrist frrist closed this as completed May 9, 2024
@github-project-automation github-project-automation bot moved this from Inbox to Done in Engineering Planning May 9, 2024