Update Databricks docs #3360
And also dbx:
https://docs.databricks.com/en/archive/dev-tools/dbx/index.html
I'll take a look at this in an upcoming sprint -- we did some updates for the asset bundles recently, as suggested by Harmony.
Another couple of things I found in our Databricks workspace guide
It's true that creating a Databricks Repo synced with a GitHub repo gives some nice advantages, like being able to edit the code in an actual IDE (whether a local editor or a cloud development environment like Gitpod or GitHub Codespaces). And it's also true that Databricks recommends in different places that data should live in the DBFS root. However, it would be nice to consider the shortest and simplest guide we can write for users to get started with Kedro on Databricks, and then build from there.
To clarify on the initial comment:
Both things above require Databricks CLI version 0.205 or above. Apart from that, the commands haven't changed, so what we should do here is make sure we're not sending users to legacy docs, and that's it.
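Since the 0.205 minimum matters, a small check like the following could go in the docs (a sketch only; the exact format of `databricks -v` output varies between releases, so the parsing here is an assumption):

```python
# Hypothetical helper: parse the output of `databricks -v` and check whether
# the installed CLI meets the 0.205 minimum required for asset bundles.
import re

def meets_minimum(version_output: str, minimum=(0, 205)) -> bool:
    # Look for the first "major.minor" pair in the output string.
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if not match:
        return False
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= minimum

print(meets_minimum("Databricks CLI v0.218.0"))  # → True (new CLI)
print(meets_minimum("Version 0.17.0"))           # → False (legacy CLI)
```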
To summarise:
@astrojuanlu Is this something the team can pick up, or do we need to ask for time from Jannic or another Databricks expert (maybe @deepyaman could ultimately review)? How are we prioritising this? I'm guessing it's relatively high importance to keep the Databricks docs tracking with their tech.
We need to build Databricks expertise in the team, so I hope we don't need to ask external experts to do it (it's OK if they give assistance, but we need to own this).
Added this to the Inbox so that we prioritise it.
At this point, it's been almost 4 years since I've used Databricks (and I don't currently have any interest in getting back into it), so I'd defer to somebody else. 🙂
More than fair enough @deepyaman! Good to confirm though.
I'm adding one more item:
Every time I give a talk or workshop, invariably somebody from the audience asks "how does the Kedro Catalog play along with Databricks Unity Catalog?". Our reference docs for … and there's one subtle mention of it in `databricks.ManagedTableDataset` ("the name of the catalog in Unity"). The broader question of Delta datasets is a topic for kedro-org/kedro-plugins#542.
Maybe this could help:
This looks really cool. @JenspederM, do you want to share a bit more insight on how far you intend to go with your project?
Hey @astrojuanlu! Actually, I don't really know if there's more to do. I almost want the project to be as barebones as possible. The way I left it now is with a very simple … As for the DAB resource generator, I'm considering whether I could find a better way for users to set defaults such as job clusters, instance pools, etc. One thing that is generally lacking is documentation, so that will definitely receive some attention once I have the time. Do you have any suggestions?
I gave two Kedro on Databricks demos yesterday, so I'm sharing that very simple notebook here: https://github.com/astrojuanlu/kedro-databricks-demo. Hopefully it can be the basis of what I proposed in #3360 (comment) (still no Kedro Framework there).
@JenspederM I gave your …
@astrojuanlu Go for it! I've been a bit busy these last few days and haven't had the chance to make any progress. But it's always nice to have some concrete issues to address. 😉
@astrojuanlu Just FYI, I'll merge quite a big PR soon, so hopefully that will address most of the issues you found. The substitution algorithm was a bit more cumbersome than first anticipated.
There is now a community plugin: https://github.com/JenspederM/kedro-databricks. We need to update the documentation according to its README, walking through the steps to set up on Databricks.
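For reference, the plugin's setup flow is roughly the following (a sketch based on the README at the time of writing; command names and behaviour may change between versions, so check the README first):

```shell
# Install the community plugin alongside Kedro
pip install kedro-databricks

# Inside a Kedro project:
kedro databricks init      # generate the databricks.yml bundle configuration
kedro databricks bundle    # generate Databricks Asset Bundle resources
kedro databricks deploy    # upload and deploy the bundle to the workspace
```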
This comment is also 10 months old. Do we want to mention the …
There is also a comment about Unity Catalog: is it in the scope of this ticket, or should we separate it out? I'm not quite sure what should be done; maybe we can mention that there are some Databricks datasets that can work with …
Do we want to mention it?
Yes!
It's in the scope of this ticket because this is a parent ticket. No need to do everything in the same PR. The key thing is that we explain clearly how the Kedro Catalog and Unity Catalog can be used together. Example: https://github.com/astrojuanlu/kedro-databricks-demo/blob/main/First%20Steps%20with%20Kedro%20on%20Databricks.ipynb
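As a sketch of where the two catalogs meet: a Kedro catalog entry can point at a Unity Catalog managed table via `databricks.ManagedTableDataset` from kedro-datasets. The names below are placeholders, and the exact parameter set may differ between kedro-datasets versions:

```yaml
# conf/base/catalog.yml -- hypothetical entry; catalog/database/table names
# are placeholders for your Unity Catalog hierarchy.
model_input_table:
  type: databricks.ManagedTableDataset
  catalog: my_unity_catalog   # Unity Catalog name
  database: my_schema         # schema within the catalog
  table: model_input_table    # managed table name
  write_mode: overwrite
```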
Do you mean https://docs.databricks.com/en/dev-tools/vscode-ext/index.html ? Not sure there's anything Kedro-specific about that extension. We can discuss that in another ticket.
Not directly, but it could be a better way to work with Databricks for the IDE experience (no manual sync, etc.). Will keep this outside of the ticket. Also, I think we should add new pages instead of just deleting the old ones. I think there is still a significant number of users using …
Steps to use …
Yeah, I'm not too happy about having to change user paths either. What is done is simply to check that paths referring to DBFS use the correct package name. This is mostly a legacy feature that helped developer experience with the old version of the starter, where DBFS paths were not aligned with the package name. Since that has now changed, I might also remove this from the init command.
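The check described above could be as simple as the following (a minimal sketch of the behaviour, not the plugin's actual code; the function name and the path layout are assumptions):

```python
# Hypothetical check: DBFS paths in project config should contain the
# project's package name, matching the layout the old starter assumed.
def dbfs_path_matches_package(path: str, package_name: str) -> bool:
    if not path.startswith("dbfs:/"):
        return True  # only DBFS paths are subject to the check
    return f"/{package_name}/" in path

print(dbfs_path_matches_package("dbfs:/FileStore/my_project/data/raw.csv", "my_project"))  # → True
print(dbfs_path_matches_package("dbfs:/FileStore/old_name/data/raw.csv", "my_project"))    # → False
```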
@DimedS Do you want to take the rest of the docs, as I am working on the asset bundles?
Yes, I’m ready to take care of them. I just need to wrap up the synthesis of our deployment interviews. Four of them were about Databricks, and we identified four distinct ways people deploy there. I believe we’ll complete the Databricks synthesis in the next two days. After that, I’ll share my thoughts on updating the Databricks docs here and will start making updates once we align on the changes.
Just want to share some notes from my DAB deployment experience with our own internal infra. There are two options: 1. Azure, 2. AWS. With Azure Databricks (the more common one), everything works well until step 8 (run job with CLI), because we don't have a job cluster in the sandbox. With AWS Databricks, we do have a job cluster, but I get stuck even earlier at step 7 (…
To proceed, I will go with the Azure Databricks option and use the UI to run the job as the last step. I expect users will likely encounter cryptic error messages related to security/permissions. I am not sure what kind of help we can provide other …
Hi, I'd like to share part of the synthesis of our recent interviews with four users who deploy projects to Databricks. In the following diagram, I’ve outlined their user flows, highlighting the main steps for deploying to Databricks. While there are many similarities, certain parts of the process differ. As you can see, the overall workflow is quite similar across users, but there are a few points where multiple options are available. I believe adding a diagram like this to our main Databricks documentation page (https://docs.kedro.org/en/stable/deployment/databricks/index.html) would be helpful. It could provide a concise overview of the steps required to deploy code and the various options for each step. This approach would be more efficient than detailing each complete workflow as we currently do. For instance, these two pages:
currently overlap by about 50%. Instead, we could focus on describing the specific differences and options at each step, such as:
This structure would allow users to better understand the options available at each step without redundant information.
This diagram is fantastic, and probably more useful than the current one we have. Also +1 on trying to reduce overlap between the pages. And finally, I noticed that some users didn't really want to …
Yes, from what I gathered, users didn’t find much benefit from packaging their projects - they only followed that approach because we recommended it. Some simplified the process and realised they could just upload code directly to their Databricks Repo; they mentioned they don't need to package their projects. One user, Miguel, figured out that by specifying the notebook that runs the Kedro project within the Databricks Job deployment, he could use the same notebook during development, making debugging easier.
Do you plan to write this documentation, or will it be created as a separate ticket from the current issue?
Another thing we could try to document at some point is how to make use of the VS Code extension plus … And the workaround seems to be …
I don’t think I'll proceed with it in the current sprint. Let’s create a new issue as a follow-up to the Databricks deployment research, #4317. We can probably close the current one after merging your PR, which transitions from dbx to Databricks asset bundles.
Current status of this issue:
Description
It looks like Databricks have deprecated their CLI tools, which has had the knock-on effect of breaking our docs. A quick fix in #3358 adds the necessary `/archive/` to the broken links, but maybe we should rethink the section as a whole? CC: @stichbury
docs: https://docs.kedro.org/en/stable/deployment/databricks/databricks_ide_development_workflow.html
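The quick fix above amounts to a URL rewrite along these lines (a sketch, not the actual change in #3358; the prefix and replacement are inferred from the dbx link earlier in this thread):

```python
# Point legacy Databricks dev-tools links at the /archive/ section so they
# no longer 404 after the CLI deprecation.
def archive_legacy_link(url: str) -> str:
    prefix = "https://docs.databricks.com/en/dev-tools/"
    if url.startswith(prefix):
        return url.replace("/en/dev-tools/", "/en/archive/dev-tools/", 1)
    return url  # leave non-matching links untouched

print(archive_legacy_link("https://docs.databricks.com/en/dev-tools/dbx/index.html"))
# → https://docs.databricks.com/en/archive/dev-tools/dbx/index.html
```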
Edit: See #3360 (comment) for current status of this parent issue