hash verification for datasets used in vision transformer and neuralchat finetuning workflow interface examples (#936)

* hash verification

Signed-off-by: kta-intel <[email protected]>

* seeding cell got removed, re-adding it

Signed-off-by: kta-intel <[email protected]>

* lint fix

Signed-off-by: kta-intel <[email protected]>

* sha384 and simplifying hash computation

Signed-off-by: kta-intel <[email protected]>

* pep8 guideline fix

Signed-off-by: kta-intel <[email protected]>

---------

Signed-off-by: kta-intel <[email protected]>
Signed-off-by: manuelhsantana <[email protected]>
kta-intel authored and manuelhsantana committed Jul 9, 2024
1 parent 809e33f commit b2577f0
Showing 3 changed files with 111 additions and 47 deletions.
@@ -15,7 +15,7 @@
"id": "bd059520",
"metadata": {},
"source": [
"In this tutorial, we build on the ideas from the [first](https://github.com/intel/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb) quick start notebook, and demonstrate how to fine-tune an LLM in a federated learning workflow. \n",
"In this tutorial, we build on the ideas from the [first](https://github.com/intel/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb) quick start notebook, and demonstrate how to fine-tune a Large Language Model (LLM) in a federated learning workflow. \n",
"\n",
"We will fine-tune **Intel's [neural-chat-7b](https://huggingface.co/Intel/neural-chat-7b-v1)** model on the [MedQuAD](https://github.com/abachaa/MedQuAD) dataset, an open-source medical question-answer pair dataset collated from 12 NIH websites. To do this, we will leverage the **[Intel(R) Extension for Transformers](https://github.com/intel/intel-extension-for-transformers)**, which extends th [Hugging Face Transformers](https://github.com/huggingface/transformers) library with added features for optimal performance on Intel hardware.."
]
@@ -42,44 +42,30 @@
"metadata": {},
"source": [
"## Initial Setup\n",
"### Installing dependencies\n",
"Start by installing Intel(R) Extension for Transformers (for stability, we will use v1.2.2) and OpenFL"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56f4628e-7a1b-4576-bf6e-637757b2726d",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!pip install intel-extension-for-transformers==1.2.2\n",
"### Installing Intel(R) Extension for Transformers*\n",
"- Start by installing Intel(R) Extension for Transformers* and the required dependencies for the Neural Chat framework. \n",
"For successful installation, please follow the steps outlined in the [Installation Guide](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat#installation).\n",
"- For additional information, please refer to the [Official Documentation](https://intel.github.io/intel-extension-for-transformers/latest/docs/Welcome.html)\n",
"\n",
"# Requirements to run neuralchat on CPU\n",
"!wget https://raw.githubusercontent.com/intel/intel-extension-for-transformers/v1.2.2/intel_extension_for_transformers/neural_chat/requirements_cpu.txt\n",
"!pip install -r requirements_cpu.txt"
"*Note: This Jupyter Notebook has been tested and confirmed to work with `intel-extension-for-transformers==1.2.2`*"
]
},
{
"cell_type": "markdown",
"id": "124ae236-2e33-4349-9979-f506d796276d",
"metadata": {},
"source": [
"From here, we can install requirements needed to run OpenFL's workflow interface and Intel(R) Extension for Transformer's Neural Chat framework"
"### Installing OpenFL\n",
"- Lets now install OpenFL and the necessary dependencies for the workflow interface by running the cell below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63207a15-e1e3-4b7a-8a85-53618f8ec8ef",
"metadata": {
"scrolled": true
},
"id": "c808dd12-6795-4203-9221-0f6b43fc785f",
"metadata": {},
"outputs": [],
"source": [
"# Requirements to run workflow interface\n",
"!pip install git+https://github.com/intel/openfl.git\n",
"!pip install -r ../../requirements_workflow_interface.txt\n",
"!pip install numpy --upgrade"
@@ -101,9 +87,18 @@
"metadata": {},
"outputs": [],
"source": [
"!rm -rf MedQuAD\n",
"!git clone https://github.com/abachaa/MedQuAD.git"
]
},
{
"cell_type": "markdown",
"id": "98014201-01b6-4726-b483-6d7101a3aa51",
"metadata": {},
"source": [
"From here, we provide a preprocessing code to verify the dataset and prepare it to be readily ingestible by the fine-tuning pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -115,17 +110,11 @@
"\n",
"# User input for folder paths\n",
"input_base_folder = \"./MedQuAD/\"\n",
"subfolders = [\"1_CancerGov_QA\", \"2_GARD_QA\", \"3_GHR_QA\", \"4_MPlus_Health_Topics_QA\",\n",
" \"5_NIDDK_QA\", \"6_NINDS_QA\", \"7_SeniorHealth_QA\", \"8_NHLBI_QA_XML\", \"9_CDC_QA\"]\n",
"output_folder = \"./\"\n",
"\n",
"xml_to_json(input_base_folder, output_folder)"
]
},
{
"cell_type": "markdown",
"id": "98014201-01b6-4726-b483-6d7101a3aa51",
"metadata": {},
"source": [
"From here, we provide a preprocessing code to prepare the dataset to be readily ingestible by the fine-tuning pipeline"
"xml_to_json(input_base_folder, subfolders, output_folder, verify_hash=1)"
]
},
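As a quick sanity check before fine-tuning, you can inspect the files the preprocessing step writes out. A minimal sketch, assuming the cell above completed with `output_folder = "./"`; the file name comes from the `save_json` calls in `preprocess_dataset.py` shown further down:

```python
# Peek at the preprocessed training data (sketch; path assumes output_folder="./")
import json

with open("medquad_alpaca_train.json", encoding="utf-8") as f:
    records = json.load(f)

print(f"{len(records)} training pairs")
print(records[0])  # inspect one example question-answer record
```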
{
@@ -170,7 +159,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "1d5b078f-599a-4264-b575-9d15e14afb7e",
"id": "c9aa89c7-76f7-49a1-a50b-b4a8cabe22d3",
"metadata": {},
"outputs": [],
"source": [
@@ -214,7 +203,7 @@
"\n",
"data_args = DataArguments(\n",
" train_file=\"medquad_alpaca_train.json\",\n",
" validation_split_percentage=20\n",
" validation_split_percentage=20,\n",
")\n",
"\n",
"training_args = TrainingArguments(\n",
@@ -660,9 +649,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "NeuralChat Finetune",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "neuralchat_finetune"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
49 changes: 46 additions & 3 deletions openfl-tutorials/experimental/LLM/neuralchat/preprocess_dataset.py
@@ -5,9 +5,10 @@
import json
import os
import math
import hashlib


def xml_to_json(input_base_folder, output_folder):
def xml_to_json(input_base_folder, subfolders, output_folder, verify_hash=1):

    if not os.path.exists(input_base_folder):
        raise SystemExit(f"The folder '{input_base_folder}' does not exist.")
@@ -16,8 +17,11 @@ def xml_to_json(input_base_folder, output_folder):
    test_data = []
    train_count, test_count = 0, 0

    subfolders = ["1_CancerGov_QA", "2_GARD_QA", "3_GHR_QA", "4_MPlus_Health_Topics_QA",
                  "5_NIDDK_QA", "6_NINDS_QA", "7_SeniorHealth_QA", "8_NHLBI_QA_XML", "9_CDC_QA"]
    if verify_hash == 1:
        expected_hash = ('9d645c469ba37eb9ec2e121ae6ac90fbebccfb91f2aff7f'
                         'faabc0531f2ede54ab4c91bea775922e5910b276340c040e8')
        verify_aggregated_hashes(input_base_folder, subfolders,
                                 expected_hash=expected_hash)

for subfolder in subfolders:
folder_path = os.path.join(input_base_folder, subfolder)
@@ -36,6 +40,8 @@
new_data, count = process_xml_file(folder_path, xml_file)
test_data.extend(new_data)
test_count += count
else:
raise SystemError(f"{folder_path} does not exist")

# Save the data to JSON files
save_json(train_data, os.path.join(output_folder, 'medquad_alpaca_train.json'))
@@ -46,6 +52,8 @@
f.write(f"Training data pairs: {train_count}\n")
f.write(f"Test data pairs: {test_count}\n")

print("Preprocessing complete")


def process_xml_file(folder, xml_file):
    xml_path = os.path.join(folder, xml_file)
@@ -78,3 +86,38 @@ def process_xml_file(folder, xml_file):
def save_json(data, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


def compute_hash(file_path, hash_name='sha384'):
    """Compute the hash of a single file using SHA-384."""
    hash_func = getattr(hashlib, hash_name)()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            hash_func.update(chunk)
    return hash_func.hexdigest()


def verify_aggregated_hashes(input_base_folder, dir_list, expected_hash):
    """Verify the aggregated hash of all files against a single, hardcoded hash."""
    aggregated_hash_func = hashlib.sha384()

    for sub_directory in dir_list:
        directory = os.path.join(input_base_folder, sub_directory)
        if os.path.isdir(directory):
            for root, _, files in os.walk(directory):
                for file in files:
                    file_path = os.path.join(root, file)
                    file_hash = compute_hash(file_path)
                    aggregated_hash_func.update(file_hash.encode('utf-8'))
        else:
            raise SystemError(f"{directory} does not exist")

    # Compute the aggregated hash
    aggregated_hash = aggregated_hash_func.hexdigest()

    # Compare the aggregated hash with the expected, hardcoded hash
    if aggregated_hash != expected_hash:
        raise SystemError(
            "Verification failed. Downloaded hash doesn't match expected hash.")
    else:
        print("Verification passed")
@@ -50,9 +50,9 @@
"metadata": {},
"outputs": [],
"source": [
"# !pip install git+https://github.com/intel/openfl.git\n",
"# !pip install -r ../requirements_workflow_interface.txt\n",
"# !pip install -r requirements_vision_transformer.txt\n",
"!pip install git+https://github.com/intel/openfl.git\n",
"!pip install -r ../requirements_workflow_interface.txt\n",
"!pip install -r requirements_vision_transformer.txt\n",
"\n",
"# Uncomment this if running in Google Colab\n",
"#!pip install -r https://raw.githubusercontent.com/intel/openfl/develop/openfl-tutorials/experimental/requirements_workflow_interface.txt\n",
@@ -122,6 +122,38 @@
"DataClass = getattr(medmnist, info['python_class'])"
]
},
{
"cell_type": "markdown",
"id": "4b039af1-8806-4c90-839a-6919171ff181",
"metadata": {},
"source": [
"The cell below is download the PathMNIST dataset and perform hash verification. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e642d59-ce00-4490-a4a5-e8f4cc4118fd",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from urllib.request import urlretrieve\n",
"from openfl.utilities import validate_file_hash\n",
"\n",
"def download_and_verify_data():\n",
" datapath = os.path.join(os.path.expanduser('~'), '.medmnist')\n",
" os.makedirs(datapath, exist_ok=True)\n",
" \n",
" _ = urlretrieve('https://zenodo.org/records/6496656/files/pathmnist.npz', os.path.join(datapath, 'pathmnist.npz'))\n",
" \n",
" validate_file_hash(os.path.join(datapath, 'pathmnist.npz'), \n",
" '3f281f2cb6673bb06799d5d31ddbf6d87203e418970f92366d4fce3310749595c7e3b09798b98e0c3c50cc9a63012333')\n",
" print('Verification passed')\n",
"\n",
"download_and_verify_data()"
]
},
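Here `openfl.utilities.validate_file_hash` is the verification step; judging by the 96-character digest and the SHA-384 helpers in `preprocess_dataset.py`, it appears to compare a SHA-384 checksum of the file against the expected value. A rough standard-library equivalent, in case you want to check the downloaded archive without OpenFL installed (a sketch under that assumption):

```python
# Standalone SHA-384 check of the downloaded archive (sketch, standard library only)
import hashlib
import os

def sha384_of_file(path, chunk_size=8192):
    digest = hashlib.sha384()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

npz_path = os.path.join(os.path.expanduser('~'), '.medmnist', 'pathmnist.npz')
expected = '3f281f2cb6673bb06799d5d31ddbf6d87203e418970f92366d4fce3310749595c7e3b09798b98e0c3c50cc9a63012333'
assert sha384_of_file(npz_path) == expected, "pathmnist.npz hash mismatch"
print('Verification passed')
```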
{
"cell_type": "markdown",
"id": "2ed5bba7",
@@ -166,8 +198,8 @@
"\n",
"\n",
"# load the data\n",
"medmnist_train = DataClass(split='train', transform=train_transforms, download=True)\n",
"medmnist_test = DataClass(split='test', transform=test_transforms, download=True)\n",
"medmnist_train = DataClass(split='train', transform=train_transforms)\n",
"medmnist_test = DataClass(split='test', transform=test_transforms)\n",
"\n",
"# For demonstration purposes, we take a subset to reduce overall size and training time\n",
"##################\n",
@@ -668,9 +700,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "openfl_ViT",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "openfl_vit"
"name": "python3"
},
"language_info": {
"codemirror_mode": {