Template for single transform notebook examples #754

Closed
1 of 2 tasks
shahrokhDaijavad opened this issue Oct 29, 2024 · 27 comments
Assignees
Labels
enhancement New feature or request simplify-DPK

Comments

@shahrokhDaijavad
Member

Search before asking

  • I searched the issues and found no similar issues.

Component

Other

Feature

As related to the second task in #753, we need a notebook template as a starting point.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@shahrokhDaijavad shahrokhDaijavad added the enhancement New feature or request label Oct 29, 2024
@touma-I
Collaborator

touma-I commented Nov 5, 2024

@Bytes-Explorer You had mentioned you had a template we can use. Thanks

@Bytes-Explorer
Collaborator

Team: please see these from one of my previous projects: https://github.ibm.com/data-readiness-for-ai/dart/tree/offering_dev/notebooks

@shahrokhDaijavad
Member Author

shahrokhDaijavad commented Nov 6, 2024

@agoyal26 Please discuss with @sujee and finalize and show us the outcome.

@agoyal26
Collaborator

agoyal26 commented Nov 6, 2024

Since @sujee has a lot of experience giving engaging DPK demos over the past 2 months, showcasing multiple RAG notebooks, we want to leverage his learnings and feedback to come up with a sample template that is simple and engaging for a new user.

@sujee
Contributor

sujee commented Nov 6, 2024

> @agoyal26 Please discuss with @sujee and finalize and show us the outcome.

@agoyal26 and I will be working on this 👍

@agoyal26
Collaborator

agoyal26 commented Nov 7, 2024

Here is my first proposal for the template, along with best practices as discussed with @sujee and some open points for discussion.
Best Practices

  1. Ideally, include a graphical/flowchart-style representation of the data transformation pipeline
  2. The notebook should be Google Colab friendly
  3. Please add a requirements.txt so users can see all packages in one place
  4. Number and label each notebook cell for easy reference
  5. Have one separate section right after the imports for setting config parameters; they should not be spread throughout the notebook
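
A minimal sketch of what best practice 5 could look like: one cell, placed right after the imports, that holds every tunable value. The names `NotebookConfig`, `input_folder`, `output_folder`, and `batch_size` are hypothetical and are not taken from DPK.

```python
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass(frozen=True)
class NotebookConfig:
    """All tunable values for the notebook in one place (hypothetical fields)."""
    input_folder: Path = Path("input")
    output_folder: Path = Path("output")
    batch_size: int = 100


# Single config cell: later cells read from CONFIG instead of hard-coding values.
CONFIG = NotebookConfig()

# Print the settings so they are visible in the checked-in notebook output.
for name, value in asdict(CONFIG).items():
    print(f"{name} = {value}")
```

Freezing the dataclass is a design choice for this sketch: it makes accidental reassignment in a later cell fail loudly rather than silently changing the run.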

Open Questions

  1. Should there be 2 notebooks (one Jupyter/conda based and one Google Colab based), or should they be integrated into 1?
  2. For notebooks, do we suggest pip installing modules or assume people have done a git clone? @sujee suggested the latter as it avoids installing packages on machines
  3. Do we check in the output of cells? Should we display the output of parquet files?
  4. What is the best way to show progress through the notebook, so that users can skim and learn if they don't want to run the notebook for now?
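
Whichever way question 3 is decided, it can be checked mechanically. The sketch below is a rough heuristic, not an existing DPK tool: it reads the notebook's JSON (nbformat 4 layout) and reports code cells that have source but no saved outputs. The helper name `cells_missing_output` and the in-memory example notebook are made up for illustration, and a cell that legitimately prints nothing would also be flagged.

```python
import json


def cells_missing_output(notebook: dict) -> list[int]:
    """Return indices of code cells that have source but no saved outputs."""
    missing = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        # "source" may be a single string or a list of strings in nbformat 4.
        has_source = bool("".join(cell.get("source", [])).strip())
        if has_source and not cell.get("outputs"):
            missing.append(index)
    return missing


# Tiny in-memory notebook used only for illustration.
example = json.loads("""
{
  "nbformat": 4, "nbformat_minor": 4, "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": ["## Step 1"]},
    {"cell_type": "code", "metadata": {}, "execution_count": 1,
     "source": ["print('hi')"],
     "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\\n"]}]},
    {"cell_type": "code", "metadata": {}, "execution_count": null,
     "source": ["x = 1"], "outputs": []}
  ]
}
""")

print(cells_missing_output(example))
```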

@agoyal26
Collaborator

agoyal26 commented Nov 7, 2024

Proposed structure for notebook:

  1. Import Libraries: Import necessary libraries and any additional dependencies for visualization or analysis.
  2. Set up Config parameters
  3. Import and Load input Data
  4. Module Application to input data and Demonstration of usage with clear comments about parameters
  5. Output/Analysis of output data

@sujee
Contributor

sujee commented Nov 7, 2024

If you want to see a sample notebook here is my attempt 😄 : https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb

  • unified notebook runs on local + Colab
  • workflow diagram at the top of the notebook
  • numbered steps for easy reference (Step 4, Step 4.1, etc.)
  • outputs are checked in, so users can skim the notebook and see what's going on (without running it)
  • I am printing out intermediate results (pq) as I go along, so we can track transformations.
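
For the "local + Colab" point, one commonly used check is whether the `google.colab` module is already loaded in the running kernel. The sketch below assumes that check; it is not necessarily how the linked notebook does it, and the `data_root` helper and both folder paths are hypothetical.

```python
import sys
from pathlib import Path


def running_in_colab() -> bool:
    """True when the google.colab module is already loaded in this kernel."""
    return "google.colab" in sys.modules


def data_root() -> Path:
    """Pick a base folder for input data (both paths are hypothetical)."""
    if running_in_colab():
        return Path("/content/data")
    return Path.cwd() / "data"


print("Colab:", running_in_colab())
print("Data root:", data_root())
```

Keeping the environment check in one small function means the rest of the notebook can stay identical in both environments.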

@sujee
Contributor

sujee commented Nov 7, 2024

@agoyal26 should we move this into a discussion?

@shahrokhDaijavad
Member Author

Thanks, @agoyal26 and @sujee. This is a good discussion, and I like https://github.com/IBM/data-prep-kit/blob/dev/examples/notebooks/intro/dpk_intro_1_python.ipynb as a model to use. Of course, for a "single" transform, it will be simpler than this example: after the diagram, setup, and configuration, there would be one step for data ingestion (zip, pdf, html) and conversion to parquet, and then a second step for running a single transform (parquet to parquet).
@sujee You are right that this would ideally go to the "discussion," but to keep it actionable, let's keep it in the issues.

@agoyal26
Collaborator

Team - please make your suggestions so that we can close on a template and share with transform owners.
@Bytes-Explorer @touma-I

@touma-I
Collaborator

touma-I commented Nov 13, 2024

@agoyal26 I had to do one today for @sujee to use for his html2parquet and, in fact, I ended up very close to what you were proposing above: #754 (comment)

Here is the link to the notebook I did: https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb

Here is how I adapted your proposal:

  1. pip install the required packages
  2. Import Libraries: Import necessary libraries and any additional dependencies for visualization or analysis.
  3. Set up Config parameters
  4. Invoke run-time
  5. Output/Analysis of output data
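
The five steps above can be turned into an empty starter notebook with a few lines of Python. This is only a sketch under stated assumptions: it builds the nbformat 4 JSON by hand with the standard library, the step titles come from the list above, and every cell body is a placeholder rather than real DPK code.

```python
import json

# Step titles follow the five-step outline; the cell contents are placeholders.
STEPS = [
    ("pip install the required packages", "# %pip install <package>  (placeholder)"),
    ("Import libraries", "import os"),
    ("Set up config parameters", "input_folder = 'input'\noutput_folder = 'output'"),
    ("Invoke the runtime", "# call the transform here (placeholder)"),
    ("Output / analysis of output data", "# inspect files in output_folder (placeholder)"),
]


def build_skeleton(steps) -> dict:
    """Build an nbformat 4 notebook dict with a numbered heading before each code cell."""
    cells = []
    for number, (title, code) in enumerate(steps, start=1):
        cells.append({"cell_type": "markdown", "metadata": {},
                      "source": f"## Step {number}: {title}"})
        cells.append({"cell_type": "code", "metadata": {}, "execution_count": None,
                      "outputs": [], "source": code})
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = build_skeleton(STEPS)
notebook_json = json.dumps(notebook, indent=1)  # save this text as a .ipynb file
print(len(notebook["cells"]), "cells generated")
```

Pairing each code cell with a numbered markdown heading also covers the "number and label each cell" suggestion made earlier in the thread.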

I am a big proponent of keeping things simple/minimal for this first iteration

cc: @shahrokhDaijavad @Bytes-Explorer

@shahrokhDaijavad
Member Author

@touma-I This is a nice, simple template, and I am all for it.
For transform owners who do parquet to parquet, should the input to the notebook be parquet or should the notebook have extra initial cells for starting with pdf or html, converting to parquet, before running the transform?

@agoyal26
Collaborator

I think we should include the additional cells - as usually the initial data will not be in parquet format

@touma-I
Collaborator

touma-I commented Nov 14, 2024

@shahrokhDaijavad today, all transforms work the same way: they have an input folder and they produce an output folder. Most of the transforms expect files with a .parquet extension in the input folder, except for the ingest transforms, such as html2parquet, pdf2parquet, code2parquet, etc., which accept .html, .pdf, .py, etc. So the structure should still be the same for all examples; just the type of files in the input folder is different.
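
To illustrate the input-folder/output-folder convention, here is a small standard-library sketch. It is not DPK code: `plan_outputs` is a hypothetical helper, and the assumption that each output keeps the input's base name with a .parquet extension is mine, not something stated in this thread.

```python
import tempfile
from pathlib import Path


def plan_outputs(input_folder: Path, output_folder: Path, extension: str) -> dict[Path, Path]:
    """Map each input file with the given extension to a .parquet path in the output folder."""
    return {
        source: output_folder / (source.stem + ".parquet")
        for source in sorted(input_folder.glob(f"*{extension}"))
    }


# Demonstration with a throwaway folder; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    input_folder = Path(tmp) / "input"
    output_folder = Path(tmp) / "output"
    input_folder.mkdir()
    for name in ("a.html", "b.html", "notes.txt"):
        (input_folder / name).touch()

    plan = plan_outputs(input_folder, output_folder, ".html")
    planned_names = [target.name for target in plan.values()]
    print(planned_names)
```

Only the `extension` argument changes between an ingest example (".html", ".pdf") and a parquet-to-parquet example (".parquet"), which matches the point that the notebook structure can stay the same.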

@shahrokhDaijavad
Member Author

@Ryan-Gordon-314159 Based on all the discussion above and the two notebook examples that @touma-I and @sujee are linking to, I think we have all the ingredients to build a nice template.

@shahrokhDaijavad
Member Author

I have been discussing this with @touma-I today, and we have concluded that the best way for us to make fast progress is to use an iterative method, i.e., instead of waiting for a "complete" template, start with a simple functional notebook, a la what Maroun did for html2parquet (https://github.com/touma-I/data-prep-kit-pkg/blob/html2parquet-example/transforms/language/html2parquet/notebooks/html2parquet.ipynb), with an added explanation of what each cell does. Then, in the next iteration, all the niceties (diagram, being able to run both locally and on Colab, nicer formatting, etc.) can come. This decision is also influenced by the discussion we had with Tsuzuku-san (PR #790) and Michele (PR #800) about adding some example code to their README files (which they have finished) and their argument that adding such code is redundant if a notebook will be added.

@sujee
Contributor

sujee commented Nov 15, 2024

yes, 100%. Let's not get bogged down in creating the 'perfect template'. We have some good examples already. We can iterate quickly

@Bytes-Explorer
Collaborator

Bytes-Explorer commented Nov 15, 2024

I agree, no need to boil the ocean. One small suggestion I have is to add some comments before each cell and a few lines at the top describing the functionality being demonstrated in the notebook. @shahrokhDaijavad I believe you are already thinking of it.

I would suggest making the notebook Colab compatible if possible; that really helps when we do hands-on demos.

@shahrokhDaijavad
Member Author

Sure, @Bytes-Explorer. I like the idea of making each notebook Colab compatible, even in the first iteration.

@touma-I
Collaborator

touma-I commented Nov 15, 2024

@shahrokhDaijavad @Bytes-Explorer I would keep the Colab requirement as a nice-to-have but not a must-have. I agree it is easy to do, but we might hit some issues down the road. I think the ask here is for the developer to show us how their code is used in a notebook with as few constraints as possible.

@Bytes-Explorer
Collaborator

We have some different perspectives here, which is good, and it means we need some discussion on what the ROI from doing this work would be. I have started a thread on an internal channel, as that would be a good way to gather feedback from other users and people who have done socialisation activities with DPK in the past.

@agoyal26
Collaborator

@Bytes-Explorer @shahrokhDaijavad so what is the consensus?

@agoyal26
Collaborator

Adding comments from the Slack discussion here:
Step 1: Let's build notebooks for usage demos as Maroun has suggested. These will live in the respective folders; web2parquet is a good example.
Step 2: We will build application-oriented demos as well as other simple usage demos that will live in the examples folder. These will use the pip install version to work with DPK.

@shahrokhDaijavad
Member Author

Thanks for documenting this, @agoyal26.
Here is a better notebook than web2parquet that has not been merged yet: https://github.com/IBM/data-prep-kit/blob/dol-update-readme-docs/transforms/language/pdf2parquet/pdf2parquet.ipynb
It is more relevant because it has an input and output folder that almost all our transforms have.
@touma-I did this today, and we have given it to Michele to finalize before we ask other transform owners to mimic it.

@agoyal26
Collaborator

@shahrokhDaijavad if you can add a link to the updated notebook, we can capture it and close this issue.

@shahrokhDaijavad
Member Author

Sure, @agoyal26. Now that the notebook for pdf2parquet has been merged, let's declare this as our template:
https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/pdf2parquet.ipynb

5 participants