Relevance Detector (#45)
* new changes

Signed-off-by: tanishq-ids <[email protected]>

* changes

Signed-off-by: tanishq-ids <[email protected]>

* changes

Signed-off-by: tanishq-ids <[email protected]>

* changes in dependency

Signed-off-by: tanishq-ids <[email protected]>

* Chore: pre-commit autoupdate

* changes in dependency

Signed-off-by: tanishq-ids <[email protected]>

---------

Signed-off-by: tanishq-ids <[email protected]>
Co-authored-by: tanishq-ids <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Aug 5, 2024
1 parent 6ec250a commit 4063a26
Showing 12 changed files with 603 additions and 446 deletions.
248 changes: 100 additions & 148 deletions README.rst
@@ -13,7 +13,7 @@ OS-Climate Data Extraction Tool

This project provides a CLI tool and Python scripts to train a HuggingFace Transformer model or a local Transformer model and perform inference with it. The primary goal of the inference is to determine the relevance between a given question and context.

Installation
Quick Start
^^^^^^^^^^^^^

To install the OSC Transformer Based Extractor CLI, use pip:
@@ -22,21 +22,75 @@ To install the OSC Transformer Based Extractor CLI, use pip:
$ pip install osc-transformer-based-extractor
Alternatively, you can clone the repository from GitHub for a quick start:
Afterwards you can use the tooling as a CLI tool by simply typing:

.. code-block:: shell
$ git clone https://github.com/os-climate/osc-transformer-based-extractor/
We are using typer to have a nice CLI tool here. All details and help will be shown in the CLI
tool itself and are not described here in more detail.

**Example**: Assume the following folder structure:

.. code-block:: text
project/
├── kpi_mapping.csv
├── training_data.csv
├── data/
│ └── (json files for inference command)
├── model/
│ └── (model-related files go here)
├── saved__model/
│ └── (output files of trained models)
├── output/
│ └── (output files from inference command)
After installing osc-transformer-based-extractor, you can run
the following command to fine-tune the model on the data:

.. code-block:: shell
$ osc-transformer-based-extractor relevance-detector fine-tune \
--data_path "project/training_data.csv" \
--model_name "bert-base-uncased" \
--num_labels 2 \
--max_length 128 \
--epochs 3 \
--batch_size 16 \
--output_dir "project/saved__model/" \
--save_steps 500
Also, the following command can be run to perform inference:

.. code-block:: shell
$ osc-transformer-based-extractor relevance-detector perform-inference \
--folder_path "project/data/" \
--kpi_mapping_path "project/kpi_mapping.csv" \
--output_path "project/output/" \
--model_path "project/model/" \
--tokenizer_path "project/model/" \
--threshold 0.5
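
The ``--threshold`` option above suggests that a relevance probability is compared against a cut-off. The following is a minimal sketch of that idea, not the tool's actual implementation: it assumes a two-class model whose logit at index 1 corresponds to "relevant", and the helper names ``softmax`` and ``is_relevant`` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_relevant(logits, threshold=0.5):
    """Assumption: index 1 is the 'relevant' class of a two-label model."""
    return softmax(logits)[1] >= threshold


# A pair whose 'relevant' logit dominates passes the default threshold.
print(is_relevant([0.0, 2.0]))
# The same pair is rejected once the threshold is stricter than its probability.
print(is_relevant([0.0, 2.0], threshold=0.95))
```

Raising the threshold trades recall for precision: fewer question/context pairs are flagged, but those that are flagged carry a higher model confidence.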
***************
Training Data
***************
To train the model, you need data from the curator module. The data is in CSV format. You can train the model either using the CLI or by calling the function directly in Python.

Training File
^^^^^^^^^^^^^^^

To train the model, you need a CSV file with columns:
* ``Question``
* ``Context``
* ``Label``
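
Before starting a fine-tuning run it can help to check that the CSV really contains the three columns listed above. The sketch below uses only the Python standard library; the function name ``missing_columns`` and the sample row are illustrative assumptions and not part of the package.

```python
import csv
import io

# Column names taken from the list above.
REQUIRED_COLUMNS = {"Question", "Context", "Label"}


def missing_columns(csv_text):
    """Return the sorted list of required columns absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)


# Hypothetical one-row training file, used only for illustration.
sample = "Question,Context,Label\nWhat is the revenue?,Revenue was 10.,1\n"
print(missing_columns(sample))
```

For a real file, read its text with ``open(path, newline="", encoding="utf-8").read()`` and pass the result to ``missing_columns``; an empty list means the header is complete.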

Additionally, the output of the https://github.com/os-climate/osc-transformer-presteps module can be used. The output will look like the following:
Sample Data:

.. list-table:: Company Information
.. list-table:: training_data.csv
:header-rows: 1

* - Question
@@ -65,178 +119,76 @@ Sample Data:
- 1022


KPI Mapping File
^^^^^^^^^^^^^^^^^^^^^
The inference command needs a kpi_mapping.csv file, which looks like:

.. list-table:: kpi_mapping.csv
:header-rows: 1

***************
CLI Usage
***************

The CLI command `osc-transformer-based-extractor` provides two main functions: training and inference. You can access detailed information about these functions by running the CLI command without any arguments.

**Commands**



* ``fine-tune`` : Fine-tune a pre-trained Hugging Face model on a custom dataset.
* ``perform-inference`` : Perform inference using a pre-trained sequence classification model.
* - kpi_id
- question
- sectors
- add_year
- kpi_category
* - 1
- In which year was the annual report or the sustainability report published?
- OG, CM, CU
- FALSE
- TEXT
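
To illustrate how a row of this table could be read, here is a small sketch using the standard library. It rests on assumptions that the README does not confirm: that ``sectors`` is stored as one comma-separated cell, and that ``add_year`` is written as ``TRUE`` or ``FALSE``. The function ``parse_kpi_mapping`` is hypothetical and is not the package's own loader.

```python
import csv
import io


def parse_kpi_mapping(csv_text):
    """Parse kpi_mapping.csv text into a list of typed dictionaries."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append(
            {
                "kpi_id": int(row["kpi_id"]),
                "question": row["question"],
                # Assumption: sectors are comma-separated inside a single cell.
                "sectors": [s.strip() for s in row["sectors"].split(",")],
                # Assumption: booleans are spelled TRUE / FALSE.
                "add_year": row["add_year"].strip().upper() == "TRUE",
                "kpi_category": row["kpi_category"],
            }
        )
    return rows


# The single row shown in the table above, written as CSV text.
sample = (
    "kpi_id,question,sectors,add_year,kpi_category\n"
    "1,In which year was the annual report or the sustainability report "
    'published?,"OG, CM, CU",FALSE,TEXT\n'
)
kpis = parse_kpi_mapping(sample)
print(kpis[0]["sectors"], kpis[0]["add_year"])
```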

* ``fine-tune`` : Fine-tune a pre-trained Hugging Face model on a custom dataset.
* ``perform-inference`` : Perform inference using a pre-trained sequence classification model.



************************
Using Github Repository
Developer Notes
************************

Setting Up the Environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To set up the working environment for this repository, follow these steps:

1. **Clone the repository**:

.. code-block:: shell
$ git clone https://github.com/os-climate/osc-transformer-based-extractor/
$ cd osc-transformer-based-extractor
2. **Create a new virtual environment and activate it**:

.. code-block:: shell
$ python -m venv venv
$ source venv/bin/activate # On Windows use `venv\Scripts\activate`
3. **Install PDM**:

.. code-block:: shell
$ pip install pdm
4. **Sync the environment using PDM**:

.. code-block:: shell
$ pdm sync
5. **Add any new library**:

.. code-block:: shell
$ pdm add <library-name>
Train the model
^^^^^^^^^^^^^^^^^^^^^^^^^

To train the model, you can use the following code snippet:

.. code-block:: shell
$ python fine_tune.py \
--data_path "data/train_data.csv" \
--model_name "sentence-transformers/all-MiniLM-L6-v2" \
--num_labels 2 \
--max_length 512 \
--epochs 2 \
--batch_size 4 \
--output_dir "./saved_models_during_training" \
--save_steps 500
OR use function calling:
Use code directly without CLI via GitHub Repository
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python
First clone the repository to your local environment::

from fine_tune import fine_tune_model
fine_tune_model(
data_path="data/train_data.csv",
model_name="sentence-transformers/all-MiniLM-L6-v2",
num_labels=2,
max_length=512,
epochs=2,
batch_size=4,
output_dir="./saved_models_during_training",
save_steps=500
)
**Parameters**

* ``data_path (str)`` : Path to the training data CSV file.
* ``model_name (str)`` : Pre-trained model name from HuggingFace.
* ``num_labels (int)`` : Number of labels for the classification task.
* ``max_length (int)`` : Maximum sequence length.
* ``epochs (int)`` : Number of training epochs.
* ``batch_size (int)`` : Batch size for training.
* ``output_dir (str)`` : Directory to save the trained models.
* ``save_steps (int)`` : Number of steps between saving checkpoints.


Performing Inference
^^^^^^^^^^^^^^^^^^^^^^^^^

To perform inference and determine the relevance between a question and context, use the following code snippet:

.. code-block:: shell
$ python inference.py \
--question "What is the capital of France?" \
--context "Paris is the capital of France." \
--model_path /path/to/model \
--tokenizer_path /path/to/tokenizer
OR use function calling:

.. code-block:: python
from inference import get_inference
$ git clone https://github.com/os-climate/osc-transformer-based-extractor/

We are using pdm to manage the packages and tox for a stable test framework.
Hence, first install pdm (possibly in a virtual environment) via::

result = get_inference(
question="What is the relevance?",
context="This is a sample paragraph.",
model_path="path/to/model",
tokenizer_path="path/to/tokenizer" )
$ pip install pdm

Afterwards sync your system via::

**Parameters**
$ pdm sync

* ``question (str)`` : The question for inference.
* ``context (str)`` : The paragraph to be analyzed.
* ``model_path (str)`` : Path to the pre-trained model.
* ``tokenizer_path (str)`` : Path to the tokenizer of the pre-trained model.
Multiple demos show how to proceed from here. See the
`demo <demo>`_ folder.

pdm
---

For adding new dependencies use pdm. You could add new packages via pdm add.
For example numpy via::

************************
Developer Notes
************************
$ pdm add numpy

For adding new dependencies use pdm. First install via pip::
For a very detailed description check the homepage of the pdm project:

$ pip install pdm
https://pdm-project.org/en/latest/

And then you could add new packages via pdm add. For example numpy via::

$ pdm add numpy
tox
---

For running linting tools just do the following::
For running linting tools we use tox, which you run outside of your virtual environment::

$ pip install tox
$ tox -e lint
$ tox -e test

This will automatically apply some checks on your code and run the provided pytests. See
more details on tox on the homepage of the tox project:

https://tox.wiki/en/4.16.0/

************************
Contributing
45 changes: 39 additions & 6 deletions demo/training_demo/train_sentence_transformer.ipynb
@@ -19,9 +19,18 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\Tanishq\\Desktop\\IDS_WORK\\relevance-detector\\osc-transformer-based-extractor\\env\\lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
" from .autonotebook import tqdm as notebook_tqdm\n"
]
}
],
"source": [
"import pandas as pd\n",
"import torch\n",
@@ -113,9 +122,18 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of BertForSequenceClassification were not initialized from the model checkpoint at sentence-transformers/all-MiniLM-L6-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']\n",
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
}
],
"source": [
"MODEL_NAME = \"sentence-transformers/all-MiniLM-L6-v2\"\n",
"NUM_LABELS = 2\n",
@@ -261,9 +279,24 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"('saved_model\\\\tokenizer_config.json',\n",
" 'saved_model\\\\special_tokens_map.json',\n",
" 'saved_model\\\\vocab.txt',\n",
" 'saved_model\\\\added_tokens.json',\n",
" 'saved_model\\\\tokenizer.json')"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# from transformers import AutoModelForSequenceClassification, AutoTokenizer\n",
"# Assuming \"saved_model\" is the directory to save your model & tokenizer\n",
1 change: 1 addition & 0 deletions pyproject.toml
@@ -39,6 +39,7 @@ dependencies = [
"typer[all]>=0.12.3",
"rich>=13.7.1",
"numpy<2.0.0",
"openpyxl>=3.1.5",
]

[project.urls]
3 changes: 1 addition & 2 deletions src/osc_transformer_based_extractor/__init__.py
@@ -6,8 +6,7 @@
for OSC (Open Source Communications) data.
Module contents:
- fine_tune: Module for fine-tuning transformer models.
- inference: Module for performing inference with transformer models.
- relevance_detector
- main: Main module for orchestrating the execution of the extractor.
This module should be imported to use the OSC transformer-based extractor package.