Relevance Detector (#45)

* new changes
* changes
* changes
* changes in dependency
* Chore: pre-commit autoupdate
* changes in dependency

Signed-off-by: tanishq-ids <[email protected]>
Co-authored-by: tanishq-ids <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
os-climate · Aug 5, 2024 · 4063a26
1 parent 6ec250a
Showing 12 changed files with 603 additions and 446 deletions.
diff --git a/README.rst b/README.rst
@@ -13,7 +13,7 @@ OS-Climate Data Extraction Tool
 
 This project provides a CLI tool and Python scripts to train a HuggingFace Transformer model or a local Transformer model and perform inference with it. The primary goal of the inference is to determine the relevance between a given question and context.
 
-Installation
+Quick Start
 ^^^^^^^^^^^^^
 
 To install the OSC Transformer Based Extractor CLI, use pip:
@@ -22,21 +22,75 @@ To install the OSC Transformer Based Extractor CLI, use pip:
 
     $ pip install osc-transformer-based-extractor
 
-Alternatively, you can clone the repository from GitHub for a quick start:
+Afterwards you can use the tooling as a CLI tool by simply typing:
 
 .. code-block:: shell
 
-    $ git clone https://github.com/os-climate/osc-transformer-based-extractor/
+    $ osc-transformer-based-extractor
+
+The CLI is built with typer. All options and help texts are shown by the CLI
+tool itself and are not described in more detail here.
+
+**Example**: Assume the following folder structure:
+
+.. code-block:: text
+
+    project/
+    │
+    ├── kpi_mapping.csv
+    ├── training_data.csv
+    ├── data/
+    │   └── (json files for inference command)
+    ├── model/
+    │   └── (model-related files go here)
+    ├── saved__model/
+    │   └── (output files of trained models)
+    ├── output/
+    │   └── (output files from inference command)
+
+
+After installing osc-transformer-based-extractor, you can run
+the following command to fine-tune the model on the data:
+
+.. code-block:: shell
+
+  $ osc-transformer-based-extractor relevance-detector fine-tune \
+    --data_path "project/training_data.csv" \
+    --model_name "bert-base-uncased" \
+    --num_labels 2 \
+    --max_length 128 \
+    --epochs 3 \
+    --batch_size 16 \
+    --output_dir "project/saved__model/" \
+    --save_steps 500
+
+To perform inference, run the following command:
+
+.. code-block:: shell
+
+  $ osc-transformer-based-extractor relevance-detector perform-inference \
+    --folder_path "project/data/" \
+    --kpi_mapping_path "project/kpi_mapping.csv" \
+    --output_path "project/output/" \
+    --model_path "project/model/" \
+    --tokenizer_path "project/model/" \
+    --threshold 0.5
 
 
 ***************
 Training Data
 ***************
-To train the model, you need data from the curator module. The data is in CSV format. You can train the model either using the CLI or by calling the function directly in Python.
-To train the model, you need data from the curator module. The data is in CSV format. You can train the model either using the CLI or by calling the function directly in Python.
+
+Training File
+^^^^^^^^^^^^^^^
+
+To train the model, you need a CSV file with columns:
+     * ``Question``
+     * ``Context``
+     * ``Label``
+
+Alternatively, the output of the https://github.com/os-climate/osc-transformer-presteps module can be used. Its output looks like the following:
 Sample Data:
 
-.. list-table:: Company Information
+.. list-table:: training_data.csv
    :header-rows: 1
 
    * - Question
@@ -65,178 +119,76 @@ Sample Data:
      - 1022
 
 
+KPI Mapping File
+^^^^^^^^^^^^^^^^^^^^^
+The inference command needs a kpi_mapping.csv file, which looks like this:
 
+.. list-table:: kpi_mapping.csv
+   :header-rows: 1
 
-***************
-CLI Usage
-***************
-
-The CLI command `osc-transformer-based-extractor` provides two main functions: training and inference. You can access detailed information about these functions by running the CLI command without any arguments.
-
-**Commands**
-
-
-
-* ``fine-tune``  :  Fine-tune a pre-trained Hugging Face model on a custom dataset.
-* ``perform-inference`` :  Perform inference using a pre-trained sequence classification model.
+   * - kpi_id
+     - question
+     - sectors
+     - add_year
+     - kpi_category
+   * - 1
+     - In which year was the annual report or the sustainability report published?
+     - OG, CM, CU
+     - FALSE
+     - TEXT
 
-* ``fine-tune``  :  Fine-tune a pre-trained Hugging Face model on a custom dataset.
-* ``perform-inference`` :  Perform inference using a pre-trained sequence classification model.
 
 
 
 ************************
-Using Github Repository
+Developer Notes
 ************************
 
-Setting Up the Environment
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-To set up the working environment for this repository, follow these steps:
-
-1. **Clone the repository**:
-
-.. code-block:: shell
-
-	$ git clone https://github.com/os-climate/osc-transformer-based-extractor/
-    $ cd osc-transformer-based-extractor
-
-
-
-2. **Create a new virtual environment and activate it**:
-
-.. code-block:: shell
-
-   		$ python -m venv venv
-   		$ source venv/bin/activate  # On Windows use `venv\Scripts\activate`
-
-
-
-3. **Install PDM**:
-
-.. code-block:: shell
-
-   		$ pip install pdm
-
-
-
-4. **Sync the environment using PDM**:
-
-.. code-block:: shell
-
-   		$ pdm sync
-
-
-
-5. **Add any new library**:
-
-.. code-block:: shell
-
-   		$ pdm add <library-name>
-
-
-Train the model
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-To train the model, you can use the following code snippet:
-
-.. code-block:: shell
-
-    $ python fine_tune.py \
-      --data_path "data/train_data.csv" \
-      --model_name "sentence-transformers/all-MiniLM-L6-v2" \
-      --num_labels 2 \
-      --max_length 512 \
-      --epochs 2 \
-      --batch_size 4 \
-      --output_dir "./saved_models_during_training" \
-      --save_steps 500
-
-OR use function calling:
+Use code directly without CLI via GitHub Repository
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-.. code-block:: python
+First clone the repository to your local environment::
 
-    from fine_tune import fine_tune_model
-
-
-    fine_tune_model(
-        data_path="data/train_data.csv",
-        model_name="sentence-transformers/all-MiniLM-L6-v2",
-        num_labels=2,
-        max_length=512,
-        epochs=2,
-        batch_size=4,
-        output_dir="./saved_models_during_training",
-        save_steps=500
-    )
-
-**Parameters**
-
-* ``data_path (str)`` : Path to the training data CSV file.
-* ``model_name (str)`` : Pre-trained model name from HuggingFace.
-* ``num_labels (int)`` : Number of labels for the classification task.
-* ``max_length (int)`` : Maximum sequence length.
-* ``epochs (int)`` : Number of training epochs.
-* ``batch_size (int)`` : Batch size for training.
-* ``output_dir (str)`` : Directory to save the trained models.
-* ``save_steps (int)`` : Number of steps between saving checkpoints.
-
-
-Performing Inference
-^^^^^^^^^^^^^^^^^^^^^^^^^
-
-To perform inference and determine the relevance between a question and context, use the following code snippet:
-
-.. code-block:: python
-
-  $ python inference.py
-      --question "What is the capital of France?"
-      --context "Paris is the capital of France."
-      --model_path /path/to/model
-      --tokenizer_path /path/to/tokenizer
-
-OR use function calling:
-
-.. code-block:: python
-
-  from inference import get_inference
+    $ git clone https://github.com/os-climate/osc-transformer-based-extractor/
 
+We are using pdm to manage the packages and tox for a stable test framework.
+Hence, first install pdm (possibly in a virtual environment) via::
 
-  result = get_inference(
-      question="What is the relevance?",
-      context="This is a sample paragraph.",
-      model_path="path/to/model",
-      tokenizer_path="path/to/tokenizer" )
+    $ pip install pdm
 
+Afterwards, sync your environment via::
 
-**Parameters**
+    $ pdm sync
 
-* ``question (str)`` : The question for inference.
-* ``context (str)`` : The paragraph to be analyzed.
-* ``model_path (str)`` : Path to the pre-trained model.
-* ``tokenizer_path (str)`` : Path to the tokenizer of the pre-trained model.
+Several demos show how to continue from here. See the
+`demo <demo>`_ folder.
 
+pdm
+---
 
+To add new dependencies, use ``pdm add``.
+For example, add numpy via::
 
-************************
-Developer Notes
-************************
+    $ pdm add numpy
 
-For adding new dependencies use pdm. First install via pip::
+For a very detailed description check the homepage of the pdm project:
 
-    $ pip install pdm
+https://pdm-project.org/en/latest/
 
-And then you could add new packages via pdm add. For example numpy via::
 
-    $ pdm add numpy
+tox
+---
 
-For running linting tools just to the following::
+To run the linting tools we use tox, which you run outside of your virtual environment::
 
     $ pip install tox
     $ tox -e lint
     $ tox -e test
 
+This automatically applies some checks to your code and runs the provided pytest tests.
+For more details on tox, see the homepage of the tox project:
 
+https://tox.wiki/en/4.16.0/
 
 ************************
 Contributing
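
Note on the training file: the README diff above requires a CSV with the columns ``Question``, ``Context`` and ``Label``. The sketch below shows one way to check such a file before passing it to ``fine-tune``. It is not part of the repository; the function name ``validate_training_rows``, the sample rows, and the assumption that ``Label`` is ``0`` or ``1`` (inferred from ``--num_labels 2``) are all hypothetical.

```python
import csv
import io

# Column names taken from the README diff above.
REQUIRED_COLUMNS = ("Question", "Context", "Label")


def validate_training_rows(handle):
    """Parse a training CSV and return its rows, raising ValueError if malformed."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        # Assumption: labels are 0 or 1, matching --num_labels 2 in the README example.
        if row["Label"] not in ("0", "1"):
            raise ValueError(f"line {reader.line_num}: unexpected label {row['Label']!r}")
        rows.append(row)
    return rows


# Hypothetical sample content, not taken from the repository.
sample = io.StringIO(
    "Question,Context,Label\n"
    "What is the company name?,The company is called Example Corp.,1\n"
    "What is the company name?,Revenue grew in 2020.,0\n"
)
rows = validate_training_rows(sample)
print(len(rows))
```

Extra columns, such as those produced by the presteps module, pass through unchanged because only the three required names are checked.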

diff --git a/demo/training_demo/train_sentence_transformer.ipynb b/demo/training_demo/train_sentence_transformer.ipynb
@@ -19,9 +19,18 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "c:\\Users\\Tanishq\\Desktop\\IDS_WORK\\relevance-detector\\osc-transformer-based-extractor\\env\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+      "  from .autonotebook import tqdm as notebook_tqdm\n"
+     ]
+    }
+   ],
    "source": [
     "import pandas as pd\n",
     "import torch\n",
@@ -113,9 +122,18 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-MiniLM-L6-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
+      "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
+     ]
+    }
+   ],
    "source": [
     "MODEL_NAME = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
     "NUM_LABELS = 2\n",
@@ -261,9 +279,24 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 3,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "('saved_model\\\\tokenizer_config.json',\n",
+       " 'saved_model\\\\special_tokens_map.json',\n",
+       " 'saved_model\\\\vocab.txt',\n",
+       " 'saved_model\\\\added_tokens.json',\n",
+       " 'saved_model\\\\tokenizer.json')"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "# from transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
     "# Assuming \"saved_model\" is the directory to save your model & tokenizer\n",
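
Note on the two-label setup: the notebook uses ``NUM_LABELS = 2`` and the README's inference command takes ``--threshold 0.5``. This diff does not show how the tool applies that threshold. The sketch below is one plausible reading, assuming the model emits two logits, that index 1 is the "relevant" class, and that the threshold is compared against the softmax probability of that class. The names ``softmax`` and ``is_relevant`` are illustrative, not the package's API.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def is_relevant(logits, threshold=0.5):
    """Return True when the probability of class 1 reaches the threshold."""
    # Assumption: index 1 is the "relevant" label; the repository may order labels differently.
    return softmax(logits)[1] >= threshold


print(is_relevant([0.2, 1.3]))   # class 1 has the larger logit
print(is_relevant([2.0, -1.0]))  # class 0 dominates
```

Raising the threshold above 0.5 trades recall for precision under this reading, since fewer question and context pairs would be flagged as relevant.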

diff --git a/pyproject.toml b/pyproject.toml
@@ -39,6 +39,7 @@ dependencies = [
     "typer[all]>=0.12.3",
     "rich>=13.7.1",
     "numpy<2.0.0",
+    "openpyxl>=3.1.5",
 ]
 
 [project.urls]

diff --git a/src/osc_transformer_based_extractor/__init__.py b/src/osc_transformer_based_extractor/__init__.py
@@ -6,8 +6,7 @@
 for OSC (Open Source Communications) data.
 
 Module contents:
-- fine_tune: Module for fine-tuning transformer models.
-- inference: Module for performing inference with transformer models.
+- relevance_detector
 - main: Main module for orchestrating the execution of the extractor.
 
 This module should be imported to use the OSC transformer-based extractor package.