Create notebook tutorials for distributed data classifiers #415

Open
wants to merge 12 commits into base: main
Collaborator Author


The Hugging Face link will need to be replaced with the new one: https://huggingface.co/nvidia/domain-classifier.

@@ -4,11 +4,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Distributed Data Classification with Domain and Quality Classifiers\n",
"# Distributed Data Classification with NeMo Curator's `DomainClassifier`\n",
"\n",
"The notebook demonstrates the use of two classifiers for distributed data classification, including domain and quality classifiers. The [domain classifier](https://huggingface.co/nvidia/domain-classifier) is used to classify the domain of the data, while the [quality classifier](https://huggingface.co/nvidia/quality-classifier-deberta) is used to classify the quality of the data. These classifers help with annotation which helps data blending for foundation model training.\n",
"This notebook demonstrates the use of NeMo Curator's `DomainClassifier`. The [domain classifier](https://huggingface.co/nvidia/domain-classifier) is used to classify the domain of a text. It helps with data annotation, which is useful in data blending for foundation model training.\n",
"\n",
"The classifiers are accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intellegent batching and RAPIDS to accelerate the offline inference on large datasets."
"The domain classifier is accelerated using [CrossFit](https://github.com/rapidsai/crossfit), a library that leverages intelligent batching and RAPIDS to accelerate offline inference on large datasets."
]
},
{
@@ -39,7 +39,7 @@
"outputs": [],
"source": [
"from nemo_curator import get_client\n",
"from nemo_curator.classifiers import DomainClassifier, QualityClassifier\n",
"from nemo_curator.classifiers import DomainClassifier\n",
"from nemo_curator.datasets import DocumentDataset\n",
"import cudf\n",
"import dask_cudf"
@@ -49,7 +49,15 @@
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"cuDF Spilling is enabled\n"
]
}
],
"source": [
"client = get_client(cluster_type=\"gpu\")"
]
@@ -63,7 +71,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
@@ -74,23 +82,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create a Classifier"
"# Prepare Text Data and Initialize Classifier"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"classifier_type = \"DomainClassifier\" # or \"QualityClassifier\""
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Create sample DataFrame\n",
"text = [\n",
@@ -119,18 +118,11 @@
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"if classifier_type == \"DomainClassifier\":\n",
" classifier = DomainClassifier(batch_size=1024)\n",
"\n",
"elif classifier_type == \"QualityClassifier\":\n",
" classifier = QualityClassifier(batch_size=1024)\n",
"\n",
"else:\n",
" raise ValueError(\"Invalid classifier type\")"
"classifier = DomainClassifier(batch_size=1024)"
]
},
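For intuition, the `batch_size=1024` argument in the cell above controls how many rows are sent to the model per inference call. The chunking itself can be sketched in plain Python; this is a simplified illustration with a hypothetical keyword rule standing in for the model, not NeMo Curator's actual GPU implementation:

```python
# Simplified sketch of batched inference: split rows into fixed-size chunks
# and label each chunk. The real DomainClassifier runs a DeBERTa model on
# GPU via CrossFit; the keyword rule below is a hypothetical stand-in.

def batched(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def classify(rows, batch_size=1024):
    labels = []
    for batch in batched(rows, batch_size):
        # One "model call" per batch (here: a trivial keyword rule).
        labels.extend("Sports" if "game" in text else "Other" for text in batch)
    return labels

print(classify(["a great game", "quarterly earnings"], batch_size=1))
# → ['Sports', 'Other']
```

Larger batches amortize per-call overhead on the GPU at the cost of memory; 1024 is the notebook's chosen trade-off.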
{
@@ -139,35 +131,22 @@
"source": [
"# Run the Classifier\n",
"\n",
"Dask operations are lazy, so the the classifier will not run until we call a eager operation like `to_json`, `compute` or `persist`. "
"Dask operations are lazy, so the classifier will not run until we call an eager operation like `to_json`, `compute`, or `persist`. "
]
},
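The laziness described in the cell above can be illustrated without Dask. This plain-Python sketch (all names are hypothetical) records operations when they are declared and runs them only on an eager `compute()` call:

```python
# Plain-Python analogy for Dask's lazy execution: map() only records work;
# nothing runs until the eager compute() call.

class LazyPipeline:
    def __init__(self, data):
        self.data = data
        self.steps = []  # deferred operations, in declaration order

    def map(self, fn):
        self.steps.append(fn)  # record the step; do not execute yet
        return self

    def compute(self):
        # Eager trigger: every recorded step executes now.
        result = self.data
        for fn in self.steps:
            result = [fn(x) for x in result]
        return result

pipeline = LazyPipeline([1, 2, 3]).map(lambda x: x * 10)
# No work has happened yet; compute() (like Dask's compute/persist) runs it.
print(pipeline.compute())  # → [10, 20, 30]
```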
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Starting domain classifier inference\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"GPU: 0, Part: 0: 100%|██████████| 10/10 [00:04<00:00, 2.12it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Writing to disk complete for 1 partitions\n",
"CPU times: user 393 ms, sys: 244 ms, total: 638 ms\n",
"Wall time: 6.04 s\n"
"Starting domain classifier inference\n",
"Writing to disk complete for 1 partition(s)\n",
"CPU times: user 2.56 s, sys: 1.65 s, total: 4.21 s\n",
"Wall time: 19.5 s\n"
]
}
],
@@ -187,7 +166,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": null,
"metadata": {},
"outputs": [
{
@@ -268,20 +247,20 @@
"4 Traveling to Europe during the off-season can ... "
]
},
"execution_count": 9,
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"output_dataset = DocumentDataset.read_json(output_file_path, backend=\"cudf\", add_filename=write_to_filename)\n",
"output_dataset.df.head()"
"output_dataset.head()"
]
}
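The file written by `to_json` and read back above is JSON Lines (one JSON object per row). A minimal stdlib-only reader sketch follows; the sample records and the `domain_pred` field name are assumptions about the classifier's default output column, and `DocumentDataset.read_json` performs the same parsing at scale with cuDF/Dask:

```python
import io
import json

# Stdlib-only JSONL reading sketch; the sample records mimic classifier
# output. The "domain_pred" column name is an assumed default.
sample = io.StringIO(
    '{"text": "Quantum computing is fascinating.", "domain_pred": "Science"}\n'
    '{"text": "The game was intense.", "domain_pred": "Sports"}\n'
)
records = [json.loads(line) for line in sample if line.strip()]
print(records[0]["domain_pred"])  # → Science
```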
],
"metadata": {
"kernelspec": {
"display_name": "NeMo-Curator-env-2",
"display_name": "nemo_curator",
"language": "python",
"name": "python3"
},
@@ -295,7 +274,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
"version": "3.10.15"
}
},
"nbformat": 4,