hash verification for datasets used in vision transformer and neuralchat finetuning workflow interface examples #936

Merged
Original file line number Diff line number Diff line change
@@ -15,7 +15,7 @@
"id": "bd059520",
"metadata": {},
"source": [
"In this tutorial, we build on the ideas from the [first](https://github.com/intel/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb) quick start notebook, and demonstrate how to fine-tune an LLM in a federated learning workflow. \n",
"In this tutorial, we build on the ideas from the [first](https://github.com/intel/openfl/blob/develop/openfl-tutorials/experimental/Workflow_Interface_101_MNIST.ipynb) quick start notebook, and demonstrate how to fine-tune a Large Language Model (LLM) in a federated learning workflow. \n",
"\n",
"We will fine-tune **Intel's [neural-chat-7b](https://huggingface.co/Intel/neural-chat-7b-v1)** model on the [MedQuAD](https://github.com/abachaa/MedQuAD) dataset, an open-source medical question-answer pair dataset collated from 12 NIH websites. To do this, we will leverage the **[Intel(R) Extension for Transformers](https://github.com/intel/intel-extension-for-transformers)**, which extends th [Hugging Face Transformers](https://github.com/huggingface/transformers) library with added features for optimal performance on Intel hardware.."
]
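For orientation, here is a minimal, hypothetical sketch of loading and querying the base model directly with the Hugging Face transformers API, outside the federated workflow this notebook builds; it assumes transformers and torch are installed, and the prompt and generation settings are illustrative only:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/neural-chat-7b-v1"  # base model used in this tutorial
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative prompt in the spirit of the MedQuAD question-answer pairs
prompt = "What are the symptoms of high blood pressure?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))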
@@ -42,44 +42,30 @@
"metadata": {},
"source": [
"## Initial Setup\n",
"### Installing dependencies\n",
"Start by installing Intel(R) Extension for Transformers (for stability, we will use v1.2.2) and OpenFL"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56f4628e-7a1b-4576-bf6e-637757b2726d",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!pip install intel-extension-for-transformers==1.2.2\n",
"### Installing Intel(R) Extension for Transformers*\n",
"- Start by installing Intel(R) Extension for Transformers* and the required dependencies for the Neural Chat framework. \n",
"For successful installation, please follow the steps outlined in the [Installation Guide](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat#installation).\n",
"- For additional information, please refer to the [Official Documentation](https://intel.github.io/intel-extension-for-transformers/latest/docs/Welcome.html)\n",
"\n",
"# Requirements to run neuralchat on CPU\n",
"!wget https://raw.githubusercontent.com/intel/intel-extension-for-transformers/v1.2.2/intel_extension_for_transformers/neural_chat/requirements_cpu.txt\n",
"!pip install -r requirements_cpu.txt"
"*Note: This Jupyter Notebook has been tested and confirmed to work with `intel-extension-for-transformers==1.2.2`*"
]
},
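As a concrete illustration of the installation step described in the cell above, the commands below mirror what the previous revision of this cell ran, pinning the version noted above; treat them as a sketch and defer to the linked Installation Guide for the authoritative, up-to-date steps:

!pip install intel-extension-for-transformers==1.2.2

# CPU requirements for the Neural Chat framework, taken from the v1.2.2 tag
!wget https://raw.githubusercontent.com/intel/intel-extension-for-transformers/v1.2.2/intel_extension_for_transformers/neural_chat/requirements_cpu.txt
!pip install -r requirements_cpu.txt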
{
"cell_type": "markdown",
"id": "124ae236-2e33-4349-9979-f506d796276d",
"metadata": {},
"source": [
"From here, we can install requirements needed to run OpenFL's workflow interface and Intel(R) Extension for Transformer's Neural Chat framework"
"### Installing OpenFL\n",
"- Lets now install OpenFL and the necessary dependencies for the workflow interface by running the cell below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "63207a15-e1e3-4b7a-8a85-53618f8ec8ef",
"metadata": {
"scrolled": true
},
"id": "c808dd12-6795-4203-9221-0f6b43fc785f",
"metadata": {},
"outputs": [],
"source": [
"# Requirements to run workflow interface\n",
"!pip install git+https://github.com/intel/openfl.git\n",
"!pip install -r ../../requirements_workflow_interface.txt\n",
"!pip install numpy --upgrade"
@@ -101,9 +87,18 @@
"metadata": {},
"outputs": [],
"source": [
"!rm -rf MedQuAD\n",
"!git clone https://github.com/abachaa/MedQuAD.git"
]
},
{
"cell_type": "markdown",
"id": "98014201-01b6-4726-b483-6d7101a3aa51",
"metadata": {},
"source": [
"From here, we provide a preprocessing code to verify the dataset and prepare it to be readily ingestible by the fine-tuning pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -115,17 +110,11 @@
"\n",
"# User input for folder paths\n",
"input_base_folder = \"./MedQuAD/\"\n",
"subfolders = [\"1_CancerGov_QA\", \"2_GARD_QA\", \"3_GHR_QA\", \"4_MPlus_Health_Topics_QA\",\n",
" \"5_NIDDK_QA\", \"6_NINDS_QA\", \"7_SeniorHealth_QA\", \"8_NHLBI_QA_XML\", \"9_CDC_QA\"]\n",
"output_folder = \"./\"\n",
"\n",
"xml_to_json(input_base_folder, output_folder)"
]
},
{
"cell_type": "markdown",
"id": "98014201-01b6-4726-b483-6d7101a3aa51",
"metadata": {},
"source": [
"From here, we provide a preprocessing code to prepare the dataset to be readily ingestible by the fine-tuning pipeline"
"xml_to_json(input_base_folder, subfolders, output_folder, verify_hash=1)"
]
},
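A small usage note, sketched under the assumption that the xml_to_json signature shown in preprocess_dataset.py below stays as-is: hash verification only runs when verify_hash == 1, so passing verify_hash=0 skips the check, which can be handy when experimenting with a locally modified copy of MedQuAD.

# Sketch: skip the aggregated-hash check for a locally modified dataset copy
xml_to_json(input_base_folder, subfolders, output_folder, verify_hash=0)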
{
@@ -170,7 +159,7 @@
{
"cell_type": "code",
"execution_count": null,
"id": "1d5b078f-599a-4264-b575-9d15e14afb7e",
"id": "c9aa89c7-76f7-49a1-a50b-b4a8cabe22d3",
"metadata": {},
"outputs": [],
"source": [
@@ -214,7 +203,7 @@
"\n",
"data_args = DataArguments(\n",
" train_file=\"medquad_alpaca_train.json\",\n",
" validation_split_percentage=20\n",
" validation_split_percentage=20,\n",
")\n",
"\n",
"training_args = TrainingArguments(\n",
@@ -660,9 +649,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "NeuralChat Finetune",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "neuralchat_finetune"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
49 changes: 46 additions & 3 deletions openfl-tutorials/experimental/LLM/neuralchat/preprocess_dataset.py
Original file line number Diff line number Diff line change
@@ -5,9 +5,10 @@
import json
import os
import math
import hashlib


def xml_to_json(input_base_folder, output_folder):
def xml_to_json(input_base_folder, subfolders, output_folder, verify_hash=1):

if not os.path.exists(input_base_folder):
raise SystemExit(f"The folder '{input_base_folder}' does not exist.")
@@ -16,8 +17,11 @@ def xml_to_json(input_base_folder, output_folder):
test_data = []
train_count, test_count = 0, 0

subfolders = ["1_CancerGov_QA", "2_GARD_QA", "3_GHR_QA", "4_MPlus_Health_Topics_QA",
"5_NIDDK_QA", "6_NINDS_QA", "7_SeniorHealth_QA", "8_NHLBI_QA_XML", "9_CDC_QA"]
if verify_hash == 1:
expected_hash = ('9d645c469ba37eb9ec2e121ae6ac90fbebccfb91f2aff7f'
'faabc0531f2ede54ab4c91bea775922e5910b276340c040e8')
verify_aggregated_hashes(input_base_folder, subfolders,
expected_hash=expected_hash)

for subfolder in subfolders:
folder_path = os.path.join(input_base_folder, subfolder)
@@ -36,6 +40,8 @@
new_data, count = process_xml_file(folder_path, xml_file)
test_data.extend(new_data)
test_count += count
else:
raise SystemError(f"{folder_path} does not exist")

# Save the data to JSON files
save_json(train_data, os.path.join(output_folder, 'medquad_alpaca_train.json'))
@@ -46,6 +52,8 @@
f.write(f"Training data pairs: {train_count}\n")
f.write(f"Test data pairs: {test_count}\n")

print("Preprocessing complete")


def process_xml_file(folder, xml_file):
xml_path = os.path.join(folder, xml_file)
@@ -78,3 +86,38 @@
def save_json(data, filename):
with open(filename, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=4)


def compute_hash(file_path, hash_name='sha384'):
"""Compute the hash of a single file using SHA-384."""
hash_func = getattr(hashlib, hash_name)()
with open(file_path, 'rb') as f:
for chunk in iter(lambda: f.read(8192), b''):
hash_func.update(chunk)
return hash_func.hexdigest()


def verify_aggregated_hashes(input_base_folder, dir_list, expected_hash):
"""Verify the aggregated hash of all files against a single, hardcoded hash."""
aggregated_hash_func = hashlib.sha384()

for sub_directory in dir_list:
directory = os.path.join(input_base_folder, sub_directory)
if os.path.isdir(directory):
for root, _, files in os.walk(directory):
for file in files:
file_path = os.path.join(root, file)
file_hash = compute_hash(file_path)
aggregated_hash_func.update(file_hash.encode('utf-8'))
else:
raise SystemError(f"{directory} does not exist")

# Compute the aggregated hash
aggregated_hash = aggregated_hash_func.hexdigest()

# Compare the aggregated hash with the expected, hardcoded hash
if aggregated_hash != expected_hash:
raise SystemError(
"Verification failed. Downloaded hash doesn\'t match expected hash.")
else:
print("Verification passed")
Original file line number Diff line number Diff line change
@@ -50,9 +50,9 @@
"metadata": {},
"outputs": [],
"source": [
"# !pip install git+https://github.com/intel/openfl.git\n",
"# !pip install -r ../requirements_workflow_interface.txt\n",
"# !pip install -r requirements_vision_transformer.txt\n",
"!pip install git+https://github.com/intel/openfl.git\n",
"!pip install -r ../requirements_workflow_interface.txt\n",
"!pip install -r requirements_vision_transformer.txt\n",
"\n",
"# Uncomment this if running in Google Colab\n",
"#!pip install -r https://raw.githubusercontent.com/intel/openfl/develop/openfl-tutorials/experimental/requirements_workflow_interface.txt\n",
@@ -122,6 +122,38 @@
"DataClass = getattr(medmnist, info['python_class'])"
]
},
{
"cell_type": "markdown",
"id": "4b039af1-8806-4c90-839a-6919171ff181",
"metadata": {},
"source": [
"The cell below is download the PathMNIST dataset and perform hash verification. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6e642d59-ce00-4490-a4a5-e8f4cc4118fd",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from urllib.request import urlretrieve\n",
"from openfl.utilities import validate_file_hash\n",
"\n",
"def download_and_verify_data():\n",
" datapath = os.path.join(os.path.expanduser('~'), '.medmnist')\n",
" os.makedirs(datapath, exist_ok=True)\n",
" \n",
" _ = urlretrieve('https://zenodo.org/records/6496656/files/pathmnist.npz', os.path.join(datapath, 'pathmnist.npz'))\n",
" \n",
" validate_file_hash(os.path.join(datapath, 'pathmnist.npz'), \n",
" '3f281f2cb6673bb06799d5d31ddbf6d87203e418970f92366d4fce3310749595c7e3b09798b98e0c3c50cc9a63012333')\n",
" print('Verification passed')\n",
"\n",
"download_and_verify_data()"
]
},
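For reference, the 96-hex-character value passed to validate_file_hash above is consistent with a SHA-384 digest. A quick standard-library sketch (hashlib only; the path assumes the default ~/.medmnist location used above) for computing the digest of the downloaded archive yourself:

import hashlib
import os

path = os.path.join(os.path.expanduser('~'), '.medmnist', 'pathmnist.npz')

digest = hashlib.sha384()
with open(path, 'rb') as f:
    # Hash the file in chunks to avoid loading it fully into memory
    for chunk in iter(lambda: f.read(8192), b''):
        digest.update(chunk)

print(digest.hexdigest())  # compare against the expected value above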
{
"cell_type": "markdown",
"id": "2ed5bba7",
@@ -166,8 +198,8 @@
"\n",
"\n",
"# load the data\n",
"medmnist_train = DataClass(split='train', transform=train_transforms, download=True)\n",
"medmnist_test = DataClass(split='test', transform=test_transforms, download=True)\n",
"medmnist_train = DataClass(split='train', transform=train_transforms)\n",
"medmnist_test = DataClass(split='test', transform=test_transforms)\n",
"\n",
"# For demonstration purposes, we take a subset to reduce overall size and training time\n",
"##################\n",
@@ -668,9 +700,9 @@
],
"metadata": {
"kernelspec": {
"display_name": "openfl_ViT",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "openfl_vit"
"name": "python3"
},
"language_info": {
"codemirror_mode": {