forked from zenml-io/mlstacks
Add lakeFS data lake recipe for GCP & k3d #1
Open
AdrianoKF wants to merge 17 commits into develop from data-lake-lakefs
Conversation
Co-authored-by: Max Mynter <[email protected]>
The kubectl provider appears to be abandoned and has problems with more recent versions of cloud Kubernetes clusters; see, for example: gavinbunney/terraform-provider-kubectl#270.
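As a maintained alternative (an assumption on my part, not necessarily what this PR does), the hashicorp/kubernetes provider's `kubernetes_manifest` resource can apply raw manifests. A minimal sketch, assuming a single-document `manifests/lakefs.yaml` exists in the recipe:

```hcl
# Sketch only: the manifest path is hypothetical. kubernetes_manifest
# comes from the maintained hashicorp/kubernetes provider and can stand
# in for gavinbunney/kubectl's kubectl_manifest for single-document files.
resource "kubernetes_manifest" "lakefs" {
  manifest = yamldecode(file("${path.module}/manifests/lakefs.yaml"))
}
```

One caveat: `kubernetes_manifest` requires the target cluster to be reachable at plan time, which can matter when the cluster is created in the same apply.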
This solves the problem of having to guard every subresource required for lakeFS individually, based on the configuration of the GCP module.
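For illustration, the usual Terraform pattern looks roughly like this (variable and module names are hypothetical, not taken from this PR):

```hcl
# Hypothetical feature flag; the actual variable in gcp-modular may differ.
variable "enable_data_lake" {
  type    = bool
  default = false
}

# Guarding the whole module once (Terraform >= 0.13) means none of the
# resources inside it need their own conditional `count` expression.
module "lakefs" {
  source = "./lakefs"
  count  = var.enable_data_lake ? 1 : 0
}
```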
In some cases, it might be desirable to have subfolders inside a Terraform recipe folder (e.g., for local modules). The previous implementation ignored these, breaking the initialization of the copied Terraform modules due to missing files. This commit changes the logic in terraform_utils.py to copy any subfolder of a Terraform module (except those starting with a leading dot, like `.terraform`) to the destination.
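A minimal sketch of that copy logic as a standalone function (hypothetical names, not the exact code in terraform_utils.py):

```python
import shutil
from pathlib import Path


def copy_module_subfolders(src: Path, dst: Path) -> None:
    """Copy subfolders of a Terraform module, skipping dot-directories.

    Hidden directories such as `.terraform` hold provider caches and
    local state and should not be copied to the destination.
    """
    for child in src.iterdir():
        if child.is_dir() and not child.name.startswith("."):
            shutil.copytree(child, dst / child.name, dirs_exist_ok=True)
```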
AdrianoKF commented on Mar 28, 2024
```hcl
    kubernetes_namespace.k8s-workloads,
  ]
}

# TODO: Find out why this is failing
```
Needs addressing
Describe changes
This PR adds support for the lakeFS data lake to MLStacks. It adds a new high-level component type `data_lake` with `lakefs` as its only flavor. Everything is up for debate and discussion at this point; I just wanted to put together an end-to-end example to have a starting point for the review.
Note
This design choice is up for discussion. Data lakes are conceptually not covered in the current component types and don't really fit anywhere else (like artifact stores).
Defining a new component type would also impact the ZenML side of things, which hasn't been considered yet.
Feedback in the review is highly appreciated! 🙏🏻
At this point, two deployment targets are supported: a local k3d cluster with MinIO as the storage backend, and Google Cloud Platform with GCS for data storage and Cloud SQL (with private VPC connectivity) as the database backend.
The lakeFS instance is exposed via a Kubernetes ingress, initial admin credentials are provisioned automatically on k3d clusters, and all user-relevant options are exposed through the stack output (under the `data_lake_configuration` key).
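For illustration only (service name, namespace, and port are assumptions, not the recipe's actual values), exposing lakeFS through an ingress with the hashicorp/kubernetes provider might look like:

```hcl
# Sketch only: hypothetical names and port.
resource "kubernetes_ingress_v1" "lakefs" {
  metadata {
    name      = "lakefs"
    namespace = "lakefs"
  }

  spec {
    rule {
      http {
        path {
          path      = "/"
          path_type = "Prefix"

          backend {
            service {
              name = "lakefs"

              port {
                number = 80
              }
            }
          }
        }
      }
    }
  }
}
```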
Open questions/issues
- `gcp-modular`: The VPC service networking resource does not destroy cleanly, since it depends on the k8s cluster being removed from the VPC first (see the sketch after this list for a possible mitigation).
- `gcp-modular`: Open TODO: https://github.com/aai-institute/mlstacks/pull/1/files#diff-1e6881032f0907b314f95cf6b5fe4e113b4fdcf4b140d9967c050072787ea7f1R41
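One possible mitigation for the first issue (a sketch under assumed resource names, not something this PR implements): declare an explicit `depends_on` edge from the cluster to the service networking connection, so Terraform destroys the cluster before it tears down the VPC peering.

```hcl
# Sketch only: names are hypothetical and the surrounding VPC network
# and reserved peering range are elided behind variables.
variable "network_id" { type = string }
variable "peering_range_name" { type = string }

resource "google_service_networking_connection" "private_vpc" {
  network                 = var.network_id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [var.peering_range_name]
}

resource "google_container_cluster" "gke" {
  name               = "mlstacks"
  location           = "europe-west3"
  initial_node_count = 1

  # Terraform destroys dependents before their dependencies, so this
  # explicit edge removes the cluster from the VPC before the peering
  # connection is destroyed.
  depends_on = [google_service_networking_connection.private_vpc]
}
```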
Example stack definitions
Below are the stacks used during development/testing:
GCP:
k3d local:
Pre-requisites
Please ensure you have done the following:
- If my change requires a change to docs, I have updated the documentation accordingly.
- My branch is based on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the contribution guide on rebasing a branch to develop.
Types of changes
- New feature (non-breaking change)