From 1a70530af57f530d5ac98acacafbf94512a977b3 Mon Sep 17 00:00:00 2001 From: Daiki Tsuzuku Date: Mon, 11 Nov 2024 12:13:53 +0900 Subject: [PATCH 1/9] update readme following template https://github.com/IBM/data-prep-kit/issues/753#issuecomment-2460867526 Signed-off-by: Daiki Tsuzuku --- .../language/doc_quality/python/README.md | 57 +++++++++++++++---- 1 file changed, 47 insertions(+), 10 deletions(-) diff --git a/transforms/language/doc_quality/python/README.md b/transforms/language/doc_quality/python/README.md index 38421f34f..f3944cdc0 100644 --- a/transforms/language/doc_quality/python/README.md +++ b/transforms/language/doc_quality/python/README.md @@ -1,13 +1,21 @@ # Document Quality Transform + Please see the set of [transform project conventions](../../../README.md#transform-project-conventions) for details on general project conventions, transform configuration, testing and IDE set up. -## Summary -This transform will calculate and annotate several metrics related to document, which are usuful to see the quality of document. +## Description +This transform will calculate and annotate several metrics related to document, which are usuful to see the quality of document. +Text is the type of data this transform operates on. + +### Input -In this transform, following metrics will be included: +| input column name | data type | descrition | +|-|-|-| +| the one specified in _doc_content_column_ configuration | string | text whose quality will be calculated by this transform | + +### Output columns annotated by this transform | output column name | data type | description | supported language | |-|-|-|-| @@ -27,7 +35,7 @@ In this transform, following metrics will be included: You can see more detailed backgrounds of some columns in [Deepmind's Gopher paper](https://arxiv.org/pdf/2112.11446.pdf) -## Configuration and command line Options +## Configuration The set of dictionary keys holding [DocQualityTransform](src/doc_quality_transform.py) configuration for values are as follows: @@ -36,13 +44,19 @@ configuration for values are as follows: * _doc_content_column_ - specifies column name that contains document text. By default, "contents" is used. * _bad_word_filepath_ - specifies a path to bad word file: local folder (file or directory) that points to bad word file. You don't have to set this parameter if you don't need to set bad words. -## Running +Example +``` +{ + text_lang_key: "en", + doc_content_column_key: "contents", + bad_word_filepath_key: os.path.join(basedir, "ldnoobw", "en"), +} +``` + +## Usage ### Launched Command Line Options -When running the transform with the Ray launcher (i.e. TransformLauncher), -the following command line arguments are available in addition to -the options provided by -the [python launcher](../../../../data-processing-lib/doc/python-launcher-options.md). +The following command line arguments are available ``` --docq_text_lang DOCQ_TEXT_LANG language used in the text content. By default, "en" is used. --docq_doc_content_column DOCQ_DOC_CONTENT_COLUMN column name that contain document text. By default, "contents" is used. @@ -70,6 +84,9 @@ ls output ``` To see results of the transform. +### Code example + +TBD (link to the notebook will be provided) ### Transforming data using the transform image @@ -77,7 +94,27 @@ To use the transform image to transform your data, please refer to the [running images quickstart](../../../../doc/quick-start/run-transform-image.md), substituting the name of this transform image and runtime as appropriate. 
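+
+If you prefer to invoke the transform directly from Python rather than through the image, the snippet below is a minimal sketch using the pure-python launcher. It assumes the `doc_quality_transform` and `doc_quality_transform_python` modules from this transform's `src` directory are importable, and the relative paths are illustrative only — adjust them to where your input data and bad-word list actually live.
+
+```
+import os
+import sys
+
+from data_processing.runtime.pure_python import PythonTransformLauncher
+from data_processing.utils import ParamsUtils
+from doc_quality_transform import (
+    bad_word_filepath_cli_param,
+    doc_content_column_cli_param,
+    text_lang_cli_param,
+)
+from doc_quality_transform_python import DocQualityPythonTransformConfiguration
+
+# Point the transform at local input/output folders (paths are illustrative).
+local_conf = {"input_folder": "test-data/input", "output_folder": "output"}
+params = {
+    # data access: only the required parameters are specified
+    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
+    # doc_quality parameters, corresponding to the configuration options described above
+    text_lang_cli_param: "en",
+    doc_content_column_cli_param: "contents",
+    bad_word_filepath_cli_param: os.path.join("ldnoobw", "en"),
+}
+
+# The launcher reads its configuration from the command line, so serialize the dict first.
+sys.argv = ParamsUtils.dict_to_req(d=params)
+launcher = PythonTransformLauncher(runtime_config=DocQualityPythonTransformConfiguration())
+launcher.launch()
+```
+
+After the run completes, the output folder should contain the annotated parquet files along with a `metadata.json` describing the execution.
+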
+## Testing + +Following [the testing strategy of data-processing-lib](../../../../data-processing-lib/doc/transform-testing.md) + +Currently we have: +- [Unit test](test/test_doc_quality_python.py) +- [Integration test](test/test_doc_quality.py) + + +## Further Resource + +- For those who want to learn C4 heuristic rules + - https://arxiv.org/pdf/1910.10683.pdf +- For those who want to learn Gopher statistics + - https://arxiv.org/pdf/2112.11446.pdf +- For those who want to see the source of badwords used by default + - https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words + + +## Consideration -## Troubleshooting guide +### Troubleshooting guide For M1 Mac user, if you see following error during make command, `error: command '/usr/bin/clang' failed with exit code 1`, you may better follow [this step](https://freeman.vc/notes/installing-fasttext-on-an-m1-mac) \ No newline at end of file From ecb87b0afd8042d122edc549639880c8b74d6ad5 Mon Sep 17 00:00:00 2001 From: Daiki Tsuzuku Date: Wed, 13 Nov 2024 10:23:10 +0900 Subject: [PATCH 2/9] fix typo and update description Signed-off-by: Daiki Tsuzuku --- transforms/language/doc_quality/python/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/transforms/language/doc_quality/python/README.md b/transforms/language/doc_quality/python/README.md index f3944cdc0..1e060018d 100644 --- a/transforms/language/doc_quality/python/README.md +++ b/transforms/language/doc_quality/python/README.md @@ -6,12 +6,12 @@ for details on general project conventions, transform configuration, testing and IDE set up. ## Description -This transform will calculate and annotate several metrics related to document, which are usuful to see the quality of document. -Text is the type of data this transform operates on. +This transform will calculate and annotate several metrics which are useful to assess the quality of the document. +The document quality transform operates on text documents only ### Input -| input column name | data type | descrition | +| input column name | data type | description | |-|-|-| | the one specified in _doc_content_column_ configuration | string | text whose quality will be calculated by this transform | From e3fae5db338ee16ae4fdcf6eced47da962e20b06 Mon Sep 17 00:00:00 2001 From: Daiki Tsuzuku Date: Wed, 13 Nov 2024 17:52:42 +0900 Subject: [PATCH 3/9] add name/email of contributor Signed-off-by: Daiki Tsuzuku --- transforms/language/doc_quality/python/README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/transforms/language/doc_quality/python/README.md b/transforms/language/doc_quality/python/README.md index 1e060018d..6a085ef05 100644 --- a/transforms/language/doc_quality/python/README.md +++ b/transforms/language/doc_quality/python/README.md @@ -5,6 +5,10 @@ Please see the set of for details on general project conventions, transform configuration, testing and IDE set up. +## Contributors + +- Daiki Tsuzuku (dtsuzuku@jp.ibm.com) + ## Description This transform will calculate and annotate several metrics which are useful to assess the quality of the document. 
The document quality transform operates on text documents only From 170af4bb5c0e95ac31ea6971791b29d1f818cbb4 Mon Sep 17 00:00:00 2001 From: SHAHROKH DAIJAVAD Date: Thu, 21 Nov 2024 12:22:38 -0800 Subject: [PATCH 4/9] first version of a notebook Signed-off-by: SHAHROKH DAIJAVAD --- .../language/doc_quality/doc_quality.ipynb | 169 ++++++++++++++++++ 1 file changed, 169 insertions(+) create mode 100644 transforms/language/doc_quality/doc_quality.ipynb diff --git a/transforms/language/doc_quality/doc_quality.ipynb b/transforms/language/doc_quality/doc_quality.ipynb new file mode 100644 index 000000000..99bab8ff3 --- /dev/null +++ b/transforms/language/doc_quality/doc_quality.ipynb @@ -0,0 +1,169 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "afd55886-5f5b-4794-838e-ef8179fb0394", + "metadata": {}, + "source": [ + "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n", + "```\n", + "make venv \n", + "source venv/bin/activate \n", + "pip install jupyterlab\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "## This is here as a reference only\n", + "# Users and application developers must use the right tag for the latest from pypi\n", + "#!pip install data-prep-toolkit\n", + "#!pip install data-prep-toolkit-transforms\n", + "#!pip install data-prep-connector" + ] + }, + { + "cell_type": "markdown", + "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "##### **** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration for values are as follows: \n", + "* text_lang - specifies language used in the text content. By default, \"en\" is used.\n", + "* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n", + "* bad_word_filepath - specifies a path to bad word file: local folder (file or directory) that points to bad word file. 
You don't have to set this parameter if you don't need to set bad words.\n", + "#####" + ] + }, + { + "cell_type": "markdown", + "id": "ebf1f782-0e61-485c-8670-81066beb734c", + "metadata": {}, + "source": [ + "##### ***** Import required classes and modules" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", + "metadata": {}, + "outputs": [], + "source": [ + "import ast\n", + "import os\n", + "import sys\n", + "\n", + "from data_processing.runtime.pure_python import PythonTransformLauncher\n", + "from data_processing.utils import ParamsUtils\n", + "from doc_quality_transform import (bad_word_filepath_cli_param, doc_content_column_cli_param, text_lang_cli_param,)\n", + "from doc_quality_transform_python import DocQualityPythonTransformConfiguration" + ] + }, + { + "cell_type": "markdown", + "id": "7234563c-2924-4150-8a31-4aec98c1bf33", + "metadata": {}, + "source": [ + "##### ***** Setup runtime parameters for this transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e90a853e-412f-45d7-af3d-959e755aeebb", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# create parameters\n", + "input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n", + "output_folder = os.path.join( \"python\", \"output\")\n", + "local_conf = {\n", + " \"input_folder\": input_folder,\n", + " \"output_folder\": output_folder,\n", + "}\n", + "params = {\n", + " # Data access. Only required parameters are specified\n", + " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", + " # execution info\n", + " \"runtime_pipeline_id\": \"pipeline_id\",\n", + " \"runtime_job_id\": \"job_id\",\n", + " \"runtime_code_location\": ParamsUtils.convert_to_ast(code_location),\n", + " # doc_quality params\n", + " text_lang_cli_param: \"en\",\n", + " doc_content_column_cli_param: \"contents\",\n", + " bad_word_filepath_cli_param: os.path.join(\"python\", \"ldnoobw\", \"en\"),\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a", + "metadata": {}, + "source": [ + "##### ***** Use python runtime to invoke the transform" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0775e400-7469-49a6-8998-bd4772931459", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "sys.argv = ParamsUtils.dict_to_req(d=params)\n", + "launcher = PythonTransformLauncher(runtime_config=DocQualityPythonTransformConfiguration())\n", + "launcher.launch()" + ] + }, + { + "cell_type": "markdown", + "id": "c3df5adf-4717-4a03-864d-9151cd3f134b", + "metadata": {}, + "source": [ + "##### **** The specified folder will include the transformed parquet files." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7276fe84-6512-4605-ab65-747351e13a7c", + "metadata": {}, + "outputs": [], + "source": [ + "import glob\n", + "glob.glob(\"python/output/*\")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.8" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 10851f6643fcf96c2417b332118842660d225d3d Mon Sep 17 00:00:00 2001 From: SHAHROKH DAIJAVAD Date: Thu, 21 Nov 2024 13:34:47 -0800 Subject: [PATCH 5/9] fixed code_location Signed-off-by: SHAHROKH DAIJAVAD --- .../language/doc_quality/doc_quality.ipynb | 52 ++++++++++++++++--- 1 file changed, 46 insertions(+), 6 deletions(-) diff --git a/transforms/language/doc_quality/doc_quality.ipynb b/transforms/language/doc_quality/doc_quality.ipynb index 99bab8ff3..c6617b2bc 100644 --- a/transforms/language/doc_quality/doc_quality.ipynb +++ b/transforms/language/doc_quality/doc_quality.ipynb @@ -52,7 +52,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", "metadata": {}, "outputs": [], @@ -77,7 +77,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "id": "e90a853e-412f-45d7-af3d-959e755aeebb", "metadata": {}, "outputs": [], @@ -90,6 +90,7 @@ " \"input_folder\": input_folder,\n", " \"output_folder\": output_folder,\n", "}\n", + "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n", "params = {\n", " # Data access. 
Only required parameters are specified\n", " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n", @@ -114,10 +115,30 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "id": "0775e400-7469-49a6-8998-bd4772931459", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "13:32:09 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': 'python/ldnoobw/en', 's3_cred': None, 'docq_data_factory': }\n", + "13:32:09 INFO - pipeline id pipeline_id\n", + "13:32:09 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n", + "13:32:09 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n", + "13:32:09 INFO - data factory data_ max_files -1, n_sample -1\n", + "13:32:09 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "13:32:09 INFO - orchestrator docq started at 2024-11-21 13:32:09\n", + "13:32:09 INFO - Number of files is 1, source profile {'max_file_size': 0.0009870529174804688, 'min_file_size': 0.0009870529174804688, 'total_file_size': 0.0009870529174804688}\n", + "13:32:09 INFO - Load badwords found locally from python/ldnoobw/en\n", + "13:32:11 INFO - Completed 1 files (100.0%) in 0.025 min\n", + "13:32:11 INFO - Done processing 1 files, waiting for flush() completion.\n", + "13:32:11 INFO - done flushing in 0.0 sec\n", + "13:32:11 INFO - Completed execution in 0.025 min, execution result 0\n" + ] + } + ], "source": [ "%%capture\n", "sys.argv = ParamsUtils.dict_to_req(d=params)\n", @@ -135,14 +156,33 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "id": "7276fe84-6512-4605-ab65-747351e13a7c", "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "['python/output/metadata.json', 'python/output/test1.parquet']" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "import glob\n", "glob.glob(\"python/output/*\")" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { From 7545872c6e059eb67f2f947418572d255bf66685 Mon Sep 17 00:00:00 2001 From: Daiki Tsuzuku Date: Fri, 22 Nov 2024 10:43:05 +0900 Subject: [PATCH 6/9] add link to jupyter notebook Signed-off-by: Daiki Tsuzuku --- .../language/doc_quality/doc_quality.ipynb | 30 +++++++++++++------ .../language/doc_quality/python/README.md | 2 +- 2 files changed, 22 insertions(+), 10 deletions(-) diff --git a/transforms/language/doc_quality/doc_quality.ipynb b/transforms/language/doc_quality/doc_quality.ipynb index c6617b2bc..f3978dc96 100644 --- a/transforms/language/doc_quality/doc_quality.ipynb +++ b/transforms/language/doc_quality/doc_quality.ipynb @@ -15,7 +15,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", "metadata": {}, "outputs": [], @@ -23,9 +23,10 @@ "%%capture\n", "## This is here as a reference only\n", "# Users and application developers must use the right tag for the latest from pypi\n", - "#!pip install data-prep-toolkit\n", - "#!pip install data-prep-toolkit-transforms\n", - "#!pip install data-prep-connector" + "%pip install data-prep-toolkit\n", + "%pip install 
data-prep-toolkit-transforms\n", + "%pip install data-prep-connector\n", + "%pip install dpk-doc-quality-transform-python" ] }, { @@ -52,12 +53,23 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 6, "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", "metadata": {}, - "outputs": [], + "outputs": [ + { + "ename": "ModuleNotFoundError", + "evalue": "No module named 'doc_quality_transform'", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", + "Cell \u001b[0;32mIn[6], line 6\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mdata_processing\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mruntime\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpure_python\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m PythonTransformLauncher\n\u001b[1;32m 5\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mdata_processing\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mutils\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m ParamsUtils\n\u001b[0;32m----> 6\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mdoc_quality_transform\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m (bad_word_filepath_cli_param, doc_content_column_cli_param, text_lang_cli_param,)\n\u001b[1;32m 7\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mdoc_quality_transform_python\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m DocQualityPythonTransformConfiguration\n", + "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'doc_quality_transform'" + ] + } + ], "source": [ - "import ast\n", "import os\n", "import sys\n", "\n", @@ -187,7 +199,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -201,7 +213,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.8" + "version": "3.11.0" } }, "nbformat": 4, diff --git a/transforms/language/doc_quality/python/README.md b/transforms/language/doc_quality/python/README.md index 6a085ef05..c10bc4b88 100644 --- a/transforms/language/doc_quality/python/README.md +++ b/transforms/language/doc_quality/python/README.md @@ -90,7 +90,7 @@ To see results of the transform. 
### Code example -TBD (link to the notebook will be provided) +[notebook](../doc_quality.ipynb) ### Transforming data using the transform image From 9ee506e341749765c59b8ab8430829fd442f4950 Mon Sep 17 00:00:00 2001 From: Daiki Tsuzuku Date: Fri, 22 Nov 2024 15:09:11 +0900 Subject: [PATCH 7/9] update notebook Signed-off-by: Daiki Tsuzuku --- .../language/doc_quality/doc_quality.ipynb | 52 +++++++------------ 1 file changed, 20 insertions(+), 32 deletions(-) diff --git a/transforms/language/doc_quality/doc_quality.ipynb b/transforms/language/doc_quality/doc_quality.ipynb index f3978dc96..91aafd74d 100644 --- a/transforms/language/doc_quality/doc_quality.ipynb +++ b/transforms/language/doc_quality/doc_quality.ipynb @@ -15,7 +15,7 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", "metadata": {}, "outputs": [], @@ -53,22 +53,10 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 8, "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", "metadata": {}, - "outputs": [ - { - "ename": "ModuleNotFoundError", - "evalue": "No module named 'doc_quality_transform'", - "output_type": "error", - "traceback": [ - "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", - "\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)", - "Cell \u001b[0;32mIn[6], line 6\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mdata_processing\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mruntime\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mpure_python\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m PythonTransformLauncher\n\u001b[1;32m 5\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mdata_processing\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mutils\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m ParamsUtils\n\u001b[0;32m----> 6\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mdoc_quality_transform\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m (bad_word_filepath_cli_param, doc_content_column_cli_param, text_lang_cli_param,)\n\u001b[1;32m 7\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m \u001b[38;5;21;01mdoc_quality_transform_python\u001b[39;00m \u001b[38;5;28;01mimport\u001b[39;00m DocQualityPythonTransformConfiguration\n", - "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'doc_quality_transform'" - ] - } - ], + "outputs": [], "source": [ "import os\n", "import sys\n", @@ -89,7 +77,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 9, "id": "e90a853e-412f-45d7-af3d-959e755aeebb", "metadata": {}, "outputs": [], @@ -127,7 +115,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 10, "id": "0775e400-7469-49a6-8998-bd4772931459", "metadata": {}, "outputs": [ @@ -135,19 +123,19 @@ "name": "stderr", "output_type": "stream", "text": [ - "13:32:09 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': 'python/ldnoobw/en', 's3_cred': None, 'docq_data_factory': }\n", - "13:32:09 INFO - pipeline id pipeline_id\n", - "13:32:09 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n", - "13:32:09 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n", - "13:32:09 INFO - data factory data_ max_files -1, n_sample -1\n", - "13:32:09 INFO - data factory 
data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "13:32:09 INFO - orchestrator docq started at 2024-11-21 13:32:09\n", - "13:32:09 INFO - Number of files is 1, source profile {'max_file_size': 0.0009870529174804688, 'min_file_size': 0.0009870529174804688, 'total_file_size': 0.0009870529174804688}\n", - "13:32:09 INFO - Load badwords found locally from python/ldnoobw/en\n", - "13:32:11 INFO - Completed 1 files (100.0%) in 0.025 min\n", - "13:32:11 INFO - Done processing 1 files, waiting for flush() completion.\n", - "13:32:11 INFO - done flushing in 0.0 sec\n", - "13:32:11 INFO - Completed execution in 0.025 min, execution result 0\n" + "10:38:40 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': 'python/ldnoobw/en', 's3_cred': None, 'docq_data_factory': }\n", + "10:38:40 INFO - pipeline id pipeline_id\n", + "10:38:40 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n", + "10:38:40 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n", + "10:38:40 INFO - data factory data_ max_files -1, n_sample -1\n", + "10:38:40 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "10:38:40 INFO - orchestrator docq started at 2024-11-22 10:38:40\n", + "10:38:40 INFO - Number of files is 1, source profile {'max_file_size': 0.0009870529174804688, 'min_file_size': 0.0009870529174804688, 'total_file_size': 0.0009870529174804688}\n", + "10:38:40 INFO - Load badwords found locally from python/ldnoobw/en\n", + "10:38:49 INFO - Completed 1 files (100.0%) in 0.146 min\n", + "10:38:49 INFO - Done processing 1 files, waiting for flush() completion.\n", + "10:38:49 INFO - done flushing in 0.0 sec\n", + "10:38:49 INFO - Completed execution in 0.146 min, execution result 0\n" ] } ], @@ -168,7 +156,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 11, "id": "7276fe84-6512-4605-ab65-747351e13a7c", "metadata": {}, "outputs": [ @@ -178,7 +166,7 @@ "['python/output/metadata.json', 'python/output/test1.parquet']" ] }, - "execution_count": 5, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } From cf133880deac097f18cc580dc9364c680f1a9623 Mon Sep 17 00:00:00 2001 From: Daiki Tsuzuku Date: Mon, 25 Nov 2024 09:55:49 +0900 Subject: [PATCH 8/9] stop installing data-prep-connector Signed-off-by: Daiki Tsuzuku --- transforms/language/doc_quality/doc_quality.ipynb | 1 - 1 file changed, 1 deletion(-) diff --git a/transforms/language/doc_quality/doc_quality.ipynb b/transforms/language/doc_quality/doc_quality.ipynb index 91aafd74d..5b87c91b8 100644 --- a/transforms/language/doc_quality/doc_quality.ipynb +++ b/transforms/language/doc_quality/doc_quality.ipynb @@ -25,7 +25,6 @@ "# Users and application developers must use the right tag for the latest from pypi\n", "%pip install data-prep-toolkit\n", "%pip install data-prep-toolkit-transforms\n", - "%pip install data-prep-connector\n", "%pip install dpk-doc-quality-transform-python" ] }, From edb605bb681c57db1f9eb5d3fe9f425681f57c2b Mon Sep 17 00:00:00 2001 From: Daiki Tsuzuku Date: Mon, 25 Nov 2024 12:39:31 +0900 Subject: [PATCH 9/9] use data-prep-toolkit-transforms==0.2.2.dev3 Signed-off-by: Daiki Tsuzuku --- .../language/doc_quality/doc_quality.ipynb | 41 
+++++++++---------- 1 file changed, 20 insertions(+), 21 deletions(-) diff --git a/transforms/language/doc_quality/doc_quality.ipynb b/transforms/language/doc_quality/doc_quality.ipynb index 5b87c91b8..bf91047b6 100644 --- a/transforms/language/doc_quality/doc_quality.ipynb +++ b/transforms/language/doc_quality/doc_quality.ipynb @@ -15,7 +15,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 1, "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695", "metadata": {}, "outputs": [], @@ -24,8 +24,7 @@ "## This is here as a reference only\n", "# Users and application developers must use the right tag for the latest from pypi\n", "%pip install data-prep-toolkit\n", - "%pip install data-prep-toolkit-transforms\n", - "%pip install dpk-doc-quality-transform-python" + "%pip install data-prep-toolkit-transforms==0.2.2.dev3" ] }, { @@ -52,7 +51,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 2, "id": "c2a12abc-9460-4e45-8961-873b48a9ab19", "metadata": {}, "outputs": [], @@ -76,7 +75,7 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": 3, "id": "e90a853e-412f-45d7-af3d-959e755aeebb", "metadata": {}, "outputs": [], @@ -114,7 +113,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 4, "id": "0775e400-7469-49a6-8998-bd4772931459", "metadata": {}, "outputs": [ @@ -122,19 +121,19 @@ "name": "stderr", "output_type": "stream", "text": [ - "10:38:40 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': 'python/ldnoobw/en', 's3_cred': None, 'docq_data_factory': }\n", - "10:38:40 INFO - pipeline id pipeline_id\n", - "10:38:40 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n", - "10:38:40 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n", - "10:38:40 INFO - data factory data_ max_files -1, n_sample -1\n", - "10:38:40 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", - "10:38:40 INFO - orchestrator docq started at 2024-11-22 10:38:40\n", - "10:38:40 INFO - Number of files is 1, source profile {'max_file_size': 0.0009870529174804688, 'min_file_size': 0.0009870529174804688, 'total_file_size': 0.0009870529174804688}\n", - "10:38:40 INFO - Load badwords found locally from python/ldnoobw/en\n", - "10:38:49 INFO - Completed 1 files (100.0%) in 0.146 min\n", - "10:38:49 INFO - Done processing 1 files, waiting for flush() completion.\n", - "10:38:49 INFO - done flushing in 0.0 sec\n", - "10:38:49 INFO - Completed execution in 0.146 min, execution result 0\n" + "12:39:07 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': 'python/ldnoobw/en', 's3_cred': None, 'docq_data_factory': }\n", + "12:39:07 INFO - pipeline id pipeline_id\n", + "12:39:07 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}\n", + "12:39:07 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n", + "12:39:07 INFO - data factory data_ max_files -1, n_sample -1\n", + "12:39:07 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n", + "12:39:07 INFO - orchestrator docq started at 2024-11-25 12:39:07\n", + "12:39:07 INFO - Number 
of files is 1, source profile {'max_file_size': 0.0009870529174804688, 'min_file_size': 0.0009870529174804688, 'total_file_size': 0.0009870529174804688}\n", + "12:39:07 INFO - Load badwords found locally from python/ldnoobw/en\n", + "12:39:09 INFO - Completed 1 files (100.0%) in 0.033 min\n", + "12:39:09 INFO - Done processing 1 files, waiting for flush() completion.\n", + "12:39:09 INFO - done flushing in 0.0 sec\n", + "12:39:09 INFO - Completed execution in 0.033 min, execution result 0\n" ] } ], @@ -155,7 +154,7 @@ }, { "cell_type": "code", - "execution_count": 11, + "execution_count": 5, "id": "7276fe84-6512-4605-ab65-747351e13a7c", "metadata": {}, "outputs": [ @@ -165,7 +164,7 @@ "['python/output/metadata.json', 'python/output/test1.parquet']" ] }, - "execution_count": 11, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" }