Commit

hash verification for datasets used in vision transformer and neuralchat finetuning workflow interface examples (#936)

* hash verification

Signed-off-by: kta-intel <[email protected]>

* seeding cell got removed, re-adding it

Signed-off-by: kta-intel <[email protected]>

* lint fix

Signed-off-by: kta-intel <[email protected]>

* sha384 and simplifying hash computing

Signed-off-by: kta-intel <[email protected]>

* pep8 guideline fix

Signed-off-by: kta-intel <[email protected]>

---------

Signed-off-by: kta-intel <[email protected]>
kta-intel authored Mar 4, 2024
1 parent 4b3aa6f commit 3d88229
Showing 3 changed files with 111 additions and 47 deletions.
Original file line number Diff line number Diff line change
@@ -15,7 +15,7 @@
"id": "bd059520",
"metadata": {},
"source": [
"In this tutorial, we build on the ideas from the [first](https://github.com/intel/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb) quick start notebook, and demonstrate how to fine-tune an LLM in a federated learning workflow. \n",
"In this tutorial, we build on the ideas from the [first](https://github.com/intel/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb) quick start notebook, and demonstrate how to fine-tune a Large Language Model (LLM) in a federated learning workflow. \n",
"\n",
"We will fine-tune **Intel's [neural-chat-7b](https://huggingface.co/Intel/neural-chat-7b-v1)** model on the [MedQuAD](https://github.com/abachaa/MedQuAD) dataset, an open-source medical question-answer pair dataset collated from 12 NIH websites. To do this, we will leverage the **[Intel(R) Extension for Transformers](https://github.com/intel/intel-extension-for-transformers)**, which extends the [Hugging Face Transformers](https://github.com/huggingface/transformers) library with added features for optimal performance on Intel hardware."
]
@@ -42,44 +42,30 @@
"metadata": {},
"source": [
"## Initial Setup\n",
"### Installing dependencies\n",
"Start by installing Intel(R) Extension for Transformers (for stability, we will use v1.2.2) and OpenFL"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56f4628e-7a1b-4576-bf6e-637757b2726d",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!pip install intel-extension-for-transformers==1.2.2\n",
"### Installing Intel(R) Extension for Transformers*\n",
"- Start by installing Intel(R) Extension for Transformers* and the required dependencies for the Neural Chat framework. \n",
"For successful installation, please follow the steps outlined in the [Installation Guide](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat#installation).\n",
"- For additional information, please refer to the [Official Documentation](https://intel.github.io/intel-extension-for-transformers/latest/docs/Welcome.html)\n",
"\n",
"# Requirements to run neuralchat on CPU\n",
"!wget https://raw.githubusercontent.com/intel/intel-extension-for-transformers/v1.2.2/intel_extension_for_transformers/neural_chat/requirements_cpu.txt\n",
"!pip install -r requirements_cpu.txt"
"*Note: This Jupyter Notebook has been tested and confirmed to work with `intel-extension-for-transformers==1.2.2`*"
]
},
{
"cell_type": "markdown",
"id": "124ae236-2e33-4349-9979-f506d796276d",
"metadata": {},
"source": [
"From here, we can install requirements needed to run OpenFL's workflow interface and Intel(R) Extension for Transformer's Neural Chat framework"
"### Installing OpenFL\n",
"- Let's now install OpenFL and the necessary dependencies for the workflow interface by running the cell below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63207a15-e1e3-4b7a-8a85-53618f8ec8ef",
"metadata": {
"scrolled": true
},
"id": "c808dd12-6795-4203-9221-0f6b43fc785f",
"metadata": {},
"outputs": [],
"source": [
"# Requirements to run workflow interface\n",
"!pip install git+https://github.com/intel/openfl.git\n",
"!pip install -r ../../requirements_workflow_interface.txt\n",
"!pip install numpy --upgrade"
@@ -101,9 +87,18 @@
"metadata": {},
"outputs": [],
"source": [
"!rm -rf MedQuAD\n",
"!git clone https://github.com/abachaa/MedQuAD.git"
]
},
{
"cell_type": "markdown",
"id": "98014201-01b6-4726-b483-6d7101a3aa51",
"metadata": {},
"source": [
"From here, we provide preprocessing code to verify the dataset and prepare it to be readily ingestible by the fine-tuning pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -115,17 +110,11 @@
"\n",
"# User input for folder paths\n",
"input_base_folder = \"./MedQuAD/\"\n",
"subfolders = [\"1_CancerGov_QA\", \"2_GARD_QA\", \"3_GHR_QA\", \"4_MPlus_Health_Topics_QA\",\n",
" \"5_NIDDK_QA\", \"6_NINDS_QA\", \"7_SeniorHealth_QA\", \"8_NHLBI_QA_XML\", \"9_CDC_QA\"]\n",
"output_folder = \"./\"\n",
"\n",
"xml_to_json(input_base_folder, output_folder)"
]
},
{
"cell_type": "markdown",
"id": "98014201-01b6-4726-b483-6d7101a3aa51",
"metadata": {},
"source": [
"From here, we provide a preprocessing code to prepare the dataset to be readily ingestible by the fine-tuning pipeline"
"xml_to_json(input_base_folder, subfolders, output_folder, verify_hash=1)"
]
},
{
@@ -170,7 +159,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "1d5b078f-599a-4264-b575-9d15e14afb7e",
"id": "c9aa89c7-76f7-49a1-a50b-b4a8cabe22d3",
"metadata": {},
"outputs": [],
"source": [
@@ -214,7 +203,7 @@
"\n",
"data_args = DataArguments(\n",
" train_file=\"medquad_alpaca_train.json\",\n",
" validation_split_percentage=20\n",
" validation_split_percentage=20,\n",
")\n",
"\n",
"training_args = TrainingArguments(\n",
@@ -660,9 +649,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "NeuralChat Finetune",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "neuralchat_finetune"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
49 changes: 46 additions & 3 deletions openfl-tutorials/experimental/LLM/neuralchat/preprocess_dataset.py
@@ -5,9 +5,10 @@
import json
import os
import math
import hashlib


def xml_to_json(input_base_folder, output_folder):
def xml_to_json(input_base_folder, subfolders, output_folder, verify_hash=1):

    if not os.path.exists(input_base_folder):
        raise SystemExit(f"The folder '{input_base_folder}' does not exist.")
@@ -16,8 +17,11 @@ def xml_to_json(input_base_folder, output_folder):
    test_data = []
    train_count, test_count = 0, 0

    subfolders = ["1_CancerGov_QA", "2_GARD_QA", "3_GHR_QA", "4_MPlus_Health_Topics_QA",
                  "5_NIDDK_QA", "6_NINDS_QA", "7_SeniorHealth_QA", "8_NHLBI_QA_XML", "9_CDC_QA"]
    if verify_hash == 1:
        expected_hash = ('9d645c469ba37eb9ec2e121ae6ac90fbebccfb91f2aff7f'
                         'faabc0531f2ede54ab4c91bea775922e5910b276340c040e8')
        verify_aggregated_hashes(input_base_folder, subfolders,
                                 expected_hash=expected_hash)

    for subfolder in subfolders:
        folder_path = os.path.join(input_base_folder, subfolder)
@@ -36,6 +40,8 @@ def xml_to_json(input_base_folder, output_folder):
new_data, count = process_xml_file(folder_path, xml_file)
test_data.extend(new_data)
test_count += count
else:
raise SystemError(f"{folder_path} does not exist")

# Save the data to JSON files
save_json(train_data, os.path.join(output_folder, 'medquad_alpaca_train.json'))
@@ -46,6 +52,8 @@ def xml_to_json(input_base_folder, output_folder):
f.write(f"Training data pairs: {train_count}\n")
f.write(f"Test data pairs: {test_count}\n")

print("Preprocessing complete")


def process_xml_file(folder, xml_file):
xml_path = os.path.join(folder, xml_file)
@@ -78,3 +86,38 @@ def process_xml_file(folder, xml_file):
def save_json(data, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)


def compute_hash(file_path, hash_name='sha384'):
    """Compute the hash of a single file using SHA-384."""
    hash_func = getattr(hashlib, hash_name)()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            hash_func.update(chunk)
    return hash_func.hexdigest()


def verify_aggregated_hashes(input_base_folder, dir_list, expected_hash):
    """Verify the aggregated hash of all files against a single, hardcoded hash."""
    aggregated_hash_func = hashlib.sha384()

    for sub_directory in dir_list:
        directory = os.path.join(input_base_folder, sub_directory)
        if os.path.isdir(directory):
            for root, _, files in os.walk(directory):
                for file in files:
                    file_path = os.path.join(root, file)
                    file_hash = compute_hash(file_path)
                    aggregated_hash_func.update(file_hash.encode('utf-8'))
        else:
            raise SystemError(f"{directory} does not exist")

    # Compute the aggregated hash
    aggregated_hash = aggregated_hash_func.hexdigest()

    # Compare the aggregated hash with the expected, hardcoded hash
    if aggregated_hash != expected_hash:
        raise SystemError(
            "Verification failed. Downloaded hash doesn\'t match expected hash.")
    else:
        print("Verification passed")
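The aggregated check above feeds per-file digests into a single SHA-384 in whatever order `os.walk` yields them, and that order is not guaranteed to be identical on every filesystem. As an illustration only, the sketch below shows a deterministic variant that sorts directories and files before hashing. The names `file_sha384` and `aggregate_sha384` are hypothetical and not part of this commit. Because the traversal order may differ from the one used to produce the hardcoded `expected_hash`, the digest from this sketch should not be expected to match it.

```python
import hashlib
import os


def file_sha384(path, chunk_size=8192):
    """Return the SHA-384 hex digest of one file, read in fixed-size chunks."""
    digest = hashlib.sha384()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def aggregate_sha384(base_folder, subfolders):
    """Hash every file under the given subfolders in a fixed, sorted order.

    Hypothetical helper: it mirrors the aggregation idea in the diff
    (hashing the hex digests of the individual files) but sorts the
    traversal so the result does not depend on filesystem ordering.
    """
    aggregate = hashlib.sha384()
    for subfolder in subfolders:
        directory = os.path.join(base_folder, subfolder)
        if not os.path.isdir(directory):
            raise FileNotFoundError(f"{directory} does not exist")
        for root, dirs, files in os.walk(directory):
            dirs.sort()  # in-place sort fixes the order in which os.walk descends
            for name in sorted(files):
                file_digest = file_sha384(os.path.join(root, name))
                aggregate.update(file_digest.encode('utf-8'))
    return aggregate.hexdigest()
```

A digest computed this way, for example with `aggregate_sha384("./MedQuAD/", subfolders)`, could be recorded once on a trusted copy of the dataset and compared on later downloads.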
Original file line number Diff line number Diff line change
@@ -50,9 +50,9 @@
"metadata": {},
"outputs": [],
"source": [
"# !pip install git+https://github.com/intel/openfl.git\n",
"# !pip install -r ../requirements_workflow_interface.txt\n",
"# !pip install -r requirements_vision_transformer.txt\n",
"!pip install git+https://github.com/intel/openfl.git\n",
"!pip install -r ../requirements_workflow_interface.txt\n",
"!pip install -r requirements_vision_transformer.txt\n",
"\n",
"# Uncomment this if running in Google Colab\n",
"#!pip install -r https://raw.githubusercontent.com/intel/openfl/develop/openfl-tutorials/experimental/requirements_workflow_interface.txt\n",
@@ -122,6 +122,38 @@
"DataClass = getattr(medmnist, info['python_class'])"
]
},
{
"cell_type": "markdown",
"id": "4b039af1-8806-4c90-839a-6919171ff181",
"metadata": {},
"source": [
"The cell below downloads the PathMNIST dataset and performs hash verification."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e642d59-ce00-4490-a4a5-e8f4cc4118fd",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from urllib.request import urlretrieve\n",
"from openfl.utilities import validate_file_hash\n",
"\n",
"def download_and_verify_data():\n",
" datapath = os.path.join(os.path.expanduser('~'), '.medmnist')\n",
" os.makedirs(datapath, exist_ok=True)\n",
" \n",
" _ = urlretrieve('https://zenodo.org/records/6496656/files/pathmnist.npz', os.path.join(datapath, 'pathmnist.npz'))\n",
" \n",
" validate_file_hash(os.path.join(datapath, 'pathmnist.npz'), \n",
" '3f281f2cb6673bb06799d5d31ddbf6d87203e418970f92366d4fce3310749595c7e3b09798b98e0c3c50cc9a63012333')\n",
" print('Verification passed')\n",
"\n",
"download_and_verify_data()"
]
},
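The diff does not show how `validate_file_hash` is implemented, so the following is only a sketch of the kind of check such a helper might perform: hash the downloaded file with SHA-384 and raise an error if the digest differs from the expected value. The function name `check_sha384` and the use of `ValueError` are assumptions made for illustration and are not OpenFL's API.

```python
import hashlib


def check_sha384(file_path, expected_hash, chunk_size=8192):
    """Raise ValueError if the file's SHA-384 digest differs from expected_hash.

    Hypothetical stand-in for a file-hash validator; returns the digest on success.
    """
    digest = hashlib.sha384()
    with open(file_path, 'rb') as f:
        # Read in chunks so a large archive is never loaded into memory at once
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    actual = digest.hexdigest()
    # hexdigest() is lowercase, so normalise the expected value before comparing
    if actual != expected_hash.lower():
        raise ValueError(f"Hash mismatch for {file_path}: got {actual}")
    return actual
```

Running the check before the dataset is loaded means a corrupted or tampered download fails early, before any training code reads the file.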
{
"cell_type": "markdown",
"id": "2ed5bba7",
@@ -166,8 +198,8 @@
"\n",
"\n",
"# load the data\n",
"medmnist_train = DataClass(split='train', transform=train_transforms, download=True)\n",
"medmnist_test = DataClass(split='test', transform=test_transforms, download=True)\n",
"medmnist_train = DataClass(split='train', transform=train_transforms)\n",
"medmnist_test = DataClass(split='test', transform=test_transforms)\n",
"\n",
"# For demonstration purposes, we take a subset to reduce overall size and training time\n",
"##################\n",
@@ -668,9 +700,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "openfl_ViT",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "openfl_vit"
"name": "python3"
},
"language_info": {
"codemirror_mode": {