Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doc Quality Transform: update readme and add sample notebook #790

Open
wants to merge 9 commits into
base: dev
Choose a base branch
from

Conversation

dtsuzuku-ibm
Copy link
Collaborator

@dtsuzuku-ibm dtsuzuku-ibm commented Nov 11, 2024

following template #753 (comment)

What I didn't include

  1. Header/Author, since we can see it in github
  2. Header/Date, since we can see it in github
  3. Changelog, since we may better consider a way to generate changelog from the commits as github action.
  4. Code Examples and Documentation, since we can provide it in jupyter notebook

Why are these changes needed?

Related issue number (if any).

#753

@shahrokhDaijavad
Copy link
Member

Thanks, @dtsuzuku-ibm. I am ok with most of the decisions you have made (like not including the information we can get from github about the Author and Date. However, @agoyal26, who created the template, may insist that we still provide such information in the README file.
Since we use your README as a model for others, let me not approve it yet, until @agoyal26 has also reviewed this and we decide what is a "must" in the README file. Again, thanks for doing this before all other transform owners!

@agoyal26
Copy link
Collaborator

I have made some suggestions to the read me. We can skip adding date of revisions but I think adding the list of contributors with their email would be good.

@shahrokhDaijavad
Copy link
Member

Where are your suggestions, @agoyal26? (other than adding the list of contributors and their emails)

@agoyal26
Copy link
Collaborator

I added them in readme.md itself as comments- are they visible? @shahrokhDaijavad

@shahrokhDaijavad
Copy link
Member

They are probably on your local copy of the file, @agoyal26. I don't see them. Do you want to just copy and paste your local README file here?

Copy link
Collaborator

@agoyal26 agoyal26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see some comments on readme to improve readability

transforms/language/doc_quality/python/README.md Outdated Show resolved Hide resolved
transforms/language/doc_quality/python/README.md Outdated Show resolved Hide resolved
transforms/language/doc_quality/python/README.md Outdated Show resolved Hide resolved
@shahrokhDaijavad
Copy link
Member

Thanks, @agoyal26 Now, I see your comments!

Signed-off-by: Daiki Tsuzuku <[email protected]>
@dtsuzuku-ibm
Copy link
Collaborator Author

dtsuzuku-ibm commented Nov 13, 2024

@agoyal26

Should we not call it content then instead of description in table header ?

I'm ok with that. Should I change the name of description column of the table in Output columns annotated by this transform to content, too? Both columns are for explaining what kind of data you'll see in each column listed in table.

@agoyal26
Copy link
Collaborator

oh my apologies - I get it now. Let it be description of the column. Also please add contributors and authors name and email id to readme. Then looks like we are good to go. Thank you

Signed-off-by: Daiki Tsuzuku <[email protected]>
Copy link
Collaborator

@agoyal26 agoyal26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@dtsuzuku-ibm
Copy link
Collaborator Author

@shahrokhDaijavad Could you assign any autorized user as reviewer?

Copy link
Member

@shahrokhDaijavad shahrokhDaijavad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you @dtsuzuku-ibm and @agoyal26

@shahrokhDaijavad
Copy link
Member

@dtsuzuku-ibm I think this is ready to merge. I will ask @touma-I to do this.

@dtsuzuku-ibm
Copy link
Collaborator Author

@shahrokhDaijavad I'm afraid I'm not authorized to merge this PR... Could you merge this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shahrokhDaijavad @dtsuzuku-ibm This shows how to run the example script once we have cloned the repo. I wonder if we should also add a section to it that would explain how to use it in a notebook or a python script without cloning the repo but only with pip install . i.e.:

!pip install data-prep-toolkit
!pip install data-prep-toolkit-transform[doc_quality]

from doc_quality_transform python import ...

params = {
...
}
...

laucher.launch()

Copy link
Collaborator Author

@dtsuzuku-ibm dtsuzuku-ibm Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@touma-I

Please add example. with a few lines of code on how one can do pip install, setup the parameter block for input/output folder and then invoke the runtime launcher using the default parameters . Thanks

We can do all of these in jupyter/collab notebook which will be added in the future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dtsuzuku-ibm . that is great. Once you have the notebook, you can reference it. For now, I don't think this is complete until we have one or the other or both.

Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add example. with a few lines of code on how one can do pip install, setup the parameter block for input/output folder and then invoke the runtime launcher using the default parameters . Thanks

@shahrokhDaijavad
Copy link
Member

@dtsuzuku-ibm Since we don't have the notebook example yet, @touma-I is suggesting adding these few lines of code as an example of how we will be using all transforms in the future using pip install and not cloning the repo. Once we have the notebook example, we may remove these lines and just refer to the notebook.

@agoyal26
Copy link
Collaborator

very good point @touma-I . I agree too- as we are promoting that DPK transforms are easy to use via pip install - we should include these steps.

@dtsuzuku-ibm
Copy link
Collaborator Author

@touma-I @shahrokhDaijavad @agoyal26
User can refer to src/doc_quality_local_python.py. Is adding link enough?
I want to avoid copying&pasting duplicated codes.

@touma-I
Copy link
Collaborator

touma-I commented Nov 14, 2024

@touma-I @shahrokhDaijavad @agoyal26 User can refer to src/doc_quality_local_python.py. Is adding link enough? I want to avoid copying&pasting duplicated codes.

Hi @dtsuzuku-ibm I understand the desire not to copy/paste and hopefully we will get to the point where we can automate some of this stuff maybe via CI/CD. But for now, it is not clear to me what is the minimum that the user needs to do to use this transform. I think having it explained in 5-6 lines of code will help get us to that point.

Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
@shahrokhDaijavad
Copy link
Member

@dtsuzuku-ibm and @touma-I, I added a minimal version of the notebook for this transform based on what we have decided as a notebook template for all transforms. I tried to mimic what I learned from Maroun yesterday, but this being my first time, I may have made mistakes. Please check and modify as needed.
After checking (testing), @dtsuzuku-ibm please link to this notebook from your README file and we will be ready to merge the PR.

shahrokhDaijavad and others added 2 commits November 21, 2024 13:34
Signed-off-by: SHAHROKH DAIJAVAD <[email protected]>
Signed-off-by: Daiki Tsuzuku <[email protected]>
@dtsuzuku-ibm
Copy link
Collaborator Author

@shahrokhDaijavad @touma-I I've added link to jupyter notebook. Could you review?

@shahrokhDaijavad
Copy link
Member

@dtsuzuku-ibm The link is fine. However, the output of the notebook you have run shows an error (ModuleNotFoundError: No module named 'doc_quality_transform'). I assume you were able to fix this error and run the notebook to the end, because the final output looks good. Am I right? It would be good to have a version of the notebook without output errors in the repo.

Signed-off-by: Daiki Tsuzuku <[email protected]>
@dtsuzuku-ibm
Copy link
Collaborator Author

@shahrokhDaijavad I've updated notebook. It seems my latest local change was not synced.

Copy link
Member

@shahrokhDaijavad shahrokhDaijavad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you, @dtsuzuku-ibm.
Question to @touma-I: Although pip installs are commented out in this notebook, we don't need a separate: pip install dpk-doc-quality-transform-python when we use: pip install data-prep-toolkit-transforms[all]>=0.2.2.dev3, right?

@touma-I
Copy link
Collaborator

touma-I commented Nov 22, 2024

LGTM. Thank you, @dtsuzuku-ibm. Question to @touma-I: Although pip installs are commented out in this notebook, we don't need a separate: pip install dpk-doc-quality-transform-python when we use: pip install data-prep-toolkit-transforms[all]>=0.2.2.dev3, right?

@shahrokhDaijavad That is correct. @dtsuzuku-ibm Please remove the pip install dpk-doc-quality-transform-python . We no longer publish individual transforms.

Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove reference to dpk-doc-quality-transform-python . We no longer package individual transforms. Thanks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove references to pip install dpk-doc-quality-transform-python (as we no longer publish the individual transforms) and to pip install data-prep-connector (as this notebook does not seem to have a dependency on the web crawler).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@touma-I

Please remove references to pip install dpk-doc-quality-transform-python

Without dpk-doc-quality-transform-python, we cannot import doc_quality_transform module. Could you tell me what kind of dependency we need to install instead?

pip install data-prep-connector

Got it 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dtsuzuku-ibm I think if you do: pip install ‘data-prep-toolkit-transforms[all]>=0.2.2.dev3', it has doc_quality_transform included. @touma-I can confirm this. Please see the comment at the top of the notebook that says: "These pip installs need to be adapted to use the appropriate release level."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @dtsuzuku-ibm.

@touma-I touma-I changed the title update readme Doc Quality Transform: update readme and add sample notebook Nov 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants