Template for single transform notebook examples #754
Comments
@Bytes-Explorer You had mentioned you had a template we can use. Thanks |
Team: please see these notebooks from one of my previous projects: https://github.ibm.com/data-readiness-for-ai/dart/tree/offering_dev/notebooks |
@sujee has a lot of experience giving engaging DPK demos over the past 2 months, showcasing multiple RAG notebooks, so we want to leverage his learning and feedback to come up with a sample template that is simple and engaging for a new user. |
I am going to jot down here my first proposal for a template, along with best practices as discussed with @sujee and some open points for discussion.
Open Questions
|
Proposed structure for notebook:
|
If you want to see a sample notebook here is my attempt 😄 : https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb
|
@agoyal26 should we move this into a discussion? |
Thanks, @agoyal26 and @sujee This is a good discussion, and I like https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb as a model to use. Of course, for a "single" transform, it will be simpler than this example, e.g., after the diagram, setup, and configuration, one step for data ingestion (zip, pdf, html) and conversion to parquet and then a second step for running a single transform (parquet to parquet). |
Team - please make your suggestions so that we can close on a template and share with transform owners. |
@agoyal26 I had to do one today for @sujee to use for his html2parquet, and in fact I ended up very close to what you were proposing above: #754 (comment). Here is the link to the notebook I did: https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb Here is how I adapted your proposal:
I am a big proponent of keeping things simple/minimal for this first iteration |
@touma-I This is a nice, simple template, and I am all for it. |
I think we should include the additional cells, as the initial data will usually not be in parquet format. |
@shahrokhDaijavad today, all transforms work the same way: they have an input folder and they produce an output folder. Most of the transforms expect files with the .parquet extension in the input folder, except for the ingest transforms, such as html2parquet, pdf2parquet, code2parquet, etc., which accept .html, .pdf, .py, etc. So the structure should still be the same for all examples; just the type of files in the input folder is different. |
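To make the input-folder convention in the comment above concrete, here is a minimal sketch in plain Python. It is not DPK code: the mapping `INGEST_EXTENSIONS` and the helpers `expected_extensions` and `list_input_files` are hypothetical names invented for illustration, and the extension for each ingest transform is only the one named in the comment (the real transforms may accept more).

```python
from pathlib import Path

# Hypothetical mapping for illustration only: it lists just the extensions
# mentioned in the comment above, not what the real transforms support.
INGEST_EXTENSIONS = {
    "html2parquet": {".html"},
    "pdf2parquet": {".pdf"},
    "code2parquet": {".py"},
}


def expected_extensions(transform_name):
    """Return the input extensions assumed for a transform (default: .parquet)."""
    return INGEST_EXTENSIONS.get(transform_name, {".parquet"})


def list_input_files(input_folder, transform_name):
    """List the files in input_folder that the named transform would read."""
    allowed = expected_extensions(transform_name)
    return sorted(
        path
        for path in Path(input_folder).iterdir()
        if path.is_file() and path.suffix.lower() in allowed
    )
```

Under these assumptions, a notebook for any transform can start from the same two variables (an input folder and an output folder), and only the file types found in the input folder change.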
@Ryan-Gordon-314159 Based on all the discussion above and the two notebook examples that @touma-I and @sujee are linking to, I think we have all the ingredients to build a nice template. |
I have been discussing this with @touma-I today, and we have concluded that the best way for us to make fast progress is to use an iterative method, i.e., instead of waiting for a "complete" template, start with a simple functional Notebook, a la what Maroun did for html2parquet (https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb), with an added explanation of what each cell does. Then, in the next iteration, all the niceties (a diagram, being able to run both locally and on Colab, nicer formatting, etc.) can come. This decision is also influenced by the discussion we had with Tsuzuku-san (PR #790) and Michele (PR #800) about adding some example code to their README files (which they have finished) and their argument that adding such code is redundant if a Notebook will be added. |
yes, 100%. Let's not get bogged down in creating the 'perfect template'. We have some good examples already. We can iterate quickly |
I agree, no need to boil the ocean. One small suggestion that I have is to add some comments before each cell and a few lines at the top describing the functionality being demonstrated in the notebook. @shahrokhDaijavad I believe you are already thinking of it. I would suggest making the notebook Colab compatible if possible; that really helps when we do hands-on demos. |
Sure, @Bytes-Explorer. I like the idea of making each notebook Colab compatible, even in the first iteration. |
@shahrokhDaijavad @Bytes-Explorer I would keep the Colab requirement as a nice-to-have but not a must-have. I agree it is easy to do, but we might hit some issues down the road. I think the ask here is for the developer to show us how their code is used in a notebook with as few constraints as possible. |
We have some different perspectives here, which is good, and it means we need some discussion on what the ROI from doing this work would be. I have started a thread on the internal channel, as that would be a good way to gather feedback from other users and people who have done socialisation activities with DPK in the past. |
@Bytes-Explorer @shahrokhDaijavad so what is the consensus? |
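As a rough illustration of the suggestion above (a few lines at the top, then an explanatory comment before each cell), the sketch below builds a notebook skeleton as plain nbformat 4 JSON using only the standard library. The helper names, the title, and the step texts are all hypothetical placeholders, not the agreed template, and the code cells contain only comments because the actual transform invocation is not specified in this thread.

```python
import json


def markdown_cell(text):
    """Build a markdown cell in the nbformat 4 JSON layout."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Build an unexecuted code cell in the nbformat 4 JSON layout."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def build_skeleton(title, summary, steps):
    """Put a summary at the top, then one explanation before each code cell."""
    cells = [markdown_cell(f"# {title}\n\n{summary}")]
    for explanation, source in steps:
        cells.append(markdown_cell(explanation))
        cells.append(code_cell(source))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# Placeholder content: the step names and comments are assumptions.
steps = [
    ("Configure the input and output folders.", "# set input_folder and output_folder here"),
    ("Run the transform.", "# invoke the transform here"),
    ("Inspect the output.", "# read and display the output files here"),
]
notebook = build_skeleton(
    "Example transform notebook",
    "This notebook demonstrates a single transform.",
    steps,
)
notebook_json = json.dumps(notebook, indent=1)
```

Writing `notebook_json` to a file with an `.ipynb` extension should give a notebook that opens in Jupyter, with each code cell preceded by its explanation.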
Adding comments from slack discussion here: |
Thanks for documenting this, @agoyal26. |
@shahrokhDaijavad if you can add a link to the updated notebook, we can capture it and close this issue. |
Sure, @agoyal26. Now that the notebook for pdf2parquet has been merged, let's declare this as our template: |
Search before asking
Component
Other
Feature
As related to the second task in #753, we need a notebook template as a starting point.
Are you willing to submit a PR?