
Data relevance #41

Merged

Changes from all 31 commits
0b7beb5  Update README.rst (tanishq-ids, Jul 15, 2024)
e3e245e  Delete src/osc_transformer_based_extractor/OSC directory (tanishq-ids, Jul 15, 2024)
dbdaaa4  Delete src/osc_transformer_based_extractor directory (tanishq-ids, Jul 15, 2024)
0fb69c3  Delete src/pytests directory (tanishq-ids, Jul 15, 2024)
2bd2f01  Create __init__.py (tanishq-ids, Jul 15, 2024)
0732004  Create fine_tune.py (tanishq-ids, Jul 15, 2024)
c2fa88b  added inference and main.py (tanishq-ids, Jul 15, 2024)
61df612  Create test_main.py (tanishq-ids, Jul 15, 2024)
019b70e  added test_fine_tune and test_inference.py (tanishq-ids, Jul 15, 2024)
6b0a711  Update pyproject.toml (tanishq-ids, Jul 15, 2024)
6eb4caa  Update tox.ini (tanishq-ids, Jul 15, 2024)
b6d23e1  Update .coveragerc (tanishq-ids, Jul 15, 2024)
1cdcce2  Create result.json (tanishq-ids, Jul 15, 2024)
0ff9aa5  uploaded data files in demo (tanishq-ids, Jul 15, 2024)
a683eb2  Create dummy.txt (tanishq-ids, Jul 15, 2024)
a712549  added inference_demo.ipynb (tanishq-ids, Jul 15, 2024)
1e987d4  Create dummy.txt (tanishq-ids, Jul 15, 2024)
0338251  added train_sentence_transformer.ipynb (tanishq-ids, Jul 15, 2024)
032b2be  Delete demo/inference_demo/dummy.txt (tanishq-ids, Jul 15, 2024)
3e67b4f  Delete demo/training_demo/dummy.txt (tanishq-ids, Jul 15, 2024)
082a157  Update test_inference.py (tanishq-ids, Jul 15, 2024)
28d1a6b  changes in file structure (Jul 15, 2024)
6e718ba  Chore: pre-commit autoupdate (pre-commit-ci[bot], Jul 16, 2024)
35fcd94  Update fine_tune.py (tanishq-ids, Jul 16, 2024)
91208d1  Update inference.py (tanishq-ids, Jul 16, 2024)
aceaf2c  Update test_fine_tune.py (tanishq-ids, Jul 16, 2024)
474a326  Chore: pre-commit autoupdate (pre-commit-ci[bot], Jul 16, 2024)
03e24ab  Update test_main.py (tanishq-ids, Jul 16, 2024)
487b22c  Chore: pre-commit autoupdate (pre-commit-ci[bot], Jul 16, 2024)
693d46f  Update test_inference.py (tanishq-ids, Jul 16, 2024)
13408d5  Chore: pre-commit autoupdate (pre-commit-ci[bot], Jul 16, 2024)
.coveragerc (2 changes: 1 addition & 1 deletion)
@@ -1,7 +1,7 @@
 # .coveragerc to control coverage.py
 [run]
 branch = True
-source = osc_data_extractor
+source = osc_transformer_based_extractor
 # omit = bad_file.py

 [paths]
README.rst (290 changes: 263 additions & 27 deletions)
@@ -1,48 +1,284 @@
#############################################
OSC Transformer Based Extractor
#############################################

💬 Important
|osc-climate-project| |osc-climate-slack| |osc-climate-github| |pypi| |build-status| |pdm| |PyScaffold|

On June 26, 2024, the Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation (`FINOS <https://finos.org>`_), with OS-Climate, an open source community dedicated to building data technologies, modelling, and analytic tools that will drive global capital flows into climate change mitigation and resilience. OS-Climate projects are in the process of transitioning to the `FINOS governance framework <https://community.finos.org/docs/governance>`_; read more at `finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg <https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg>`_.


.. image:: https://img.shields.io/badge/OS-Climate-blue
***********************************
OS-Climate Data Extraction Tool
***********************************


This project provides a CLI tool and Python scripts to train a Hugging Face transformer model or a local transformer model and to perform inference with it. The primary goal of the inference is to determine the relevance between a given question and context.

Installation
^^^^^^^^^^^^^

To install the OSC Transformer Based Extractor CLI, use pip:

.. code-block:: shell

    $ pip install osc-transformer-based-extractor

Alternatively, you can clone the repository from GitHub for a quick start:

.. code-block:: shell

    $ git clone https://github.com/os-climate/osc-transformer-based-extractor/


***************
Training Data
***************
To train the model, you need data from the curator module, provided as a CSV file. You can train the model either via the CLI or by calling the function directly in Python.

Sample data:

.. list-table:: Sample training data
:header-rows: 1

* - Question
- Context
- Label
- Company
- Source File
- Source Page
- KPI ID
- Year
- Answer
- Data Type
- Annotator
- Index
* - What is the company name?
- The Company is exposed to a risk of losses caused by counterparties failing to meet their contractual financial obligations when due, and in particular depends on the reliability of the banks where the Company deposits its available cash.
- 0
- NOVATEK
- 04_NOVATEK_AR_2016_ENG_11.pdf
- ['0']
- 0
- 2016
- PAO NOVATEK
- TEXT
- train_anno_large.xlsx
- 1022
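
A quick way to sanity-check curator output before training is to load it with pandas. This is a minimal sketch, not part of the package: the lower-case ``question``, ``context`` and ``label`` column names follow the training notebook, and the file path is illustrative.

.. code-block:: python

    import pandas as pd

    # Load the curator output and verify the columns the trainer expects.
    df = pd.read_csv("data/train_data.csv")
    required = {"question", "context", "label"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    # Labels are binary: 1 = the context is relevant to the question.
    print(df["label"].value_counts())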




***************
CLI Usage
***************

The CLI command ``osc-transformer-based-extractor`` provides two main functions: training and inference. You can access detailed information about these functions by running the CLI command without any arguments; see the example after the command list.

**Commands**



* ``fine-tune`` : Fine-tune a pre-trained Hugging Face model on a custom dataset.
* ``perform-inference`` : Perform inference using a pre-trained sequence classification model.
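
For example, run the bare command to list the available commands, then ask for details on one of them (shown with the conventional ``--help`` flag; the exact output depends on the installed version):

.. code-block:: shell

    $ osc-transformer-based-extractor
    $ osc-transformer-based-extractor fine-tune --help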



******************************
Using the GitHub Repository
******************************

Setting Up the Environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To set up the working environment for this repository, follow these steps:

1. **Clone the repository**:

.. code-block:: shell

    $ git clone https://github.com/os-climate/osc-transformer-based-extractor/
    $ cd osc-transformer-based-extractor



2. **Create a new virtual environment and activate it**:

.. code-block:: shell

    $ python -m venv venv
    $ source venv/bin/activate  # On Windows use `venv\Scripts\activate`



3. **Install PDM**:

.. code-block:: shell

    $ pip install pdm



4. **Sync the environment using PDM**:

.. code-block:: shell

    $ pdm sync



5. **Add any new library**:

.. code-block:: shell

    $ pdm add <library-name>


Training the Model
^^^^^^^^^^^^^^^^^^^^^^^^^

To train the model, you can use the following code snippet:

.. code-block:: shell

    $ python fine_tune.py \
        --data_path "data/train_data.csv" \
        --model_name "sentence-transformers/all-MiniLM-L6-v2" \
        --num_labels 2 \
        --max_length 512 \
        --epochs 2 \
        --batch_size 4 \
        --output_dir "./saved_models_during_training" \
        --save_steps 500

Or call the function directly in Python:

.. code-block:: python

    from fine_tune import fine_tune_model

    fine_tune_model(
        data_path="data/train_data.csv",
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        num_labels=2,
        max_length=512,
        epochs=2,
        batch_size=4,
        output_dir="./saved_models_during_training",
        save_steps=500,
    )

**Parameters**

* ``data_path (str)`` : Path to the training data CSV file.
* ``model_name (str)`` : Pre-trained model name from HuggingFace.
* ``num_labels (int)`` : Number of labels for the classification task.
* ``max_length (int)`` : Maximum sequence length.
* ``epochs (int)`` : Number of training epochs.
* ``batch_size (int)`` : Batch size for training.
* ``output_dir (str)`` : Directory to save the trained models.
* ``save_steps (int)`` : Number of steps between saving checkpoints.
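
If you just want to smoke-test the training loop before real curator data is available, a minimal CSV with the expected columns can be generated by hand. This is a sketch under the assumption that the trainer reads the lower-case ``question``, ``context`` and ``label`` columns, as the training notebook does:

.. code-block:: python

    import pandas as pd

    # Two toy rows: label 1 marks a relevant question/context pair,
    # label 0 an irrelevant one.
    rows = [
        {"question": "What is the company name?",
         "context": "The company name is PAO NOVATEK.",
         "label": 1},
        {"question": "What is the company name?",
         "context": "Revenue grew by four percent year over year.",
         "label": 0},
    ]
    pd.DataFrame(rows).to_csv("data/train_data.csv", index=False)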


Performing Inference
^^^^^^^^^^^^^^^^^^^^^^^^^

To perform inference and determine the relevance between a question and context, use the following code snippet:

.. code-block:: shell

    $ python inference.py \
        --question "What is the capital of France?" \
        --context "Paris is the capital of France." \
        --model_path /path/to/model \
        --tokenizer_path /path/to/tokenizer

Or call the function directly in Python:

.. code-block:: python

    from inference import get_inference

    result = get_inference(
        question="What is the relevance?",
        context="This is a sample paragraph.",
        model_path="path/to/model",
        tokenizer_path="path/to/tokenizer",
    )


**Parameters**

* ``question (str)`` : The question for inference.
* ``context (str)`` : The paragraph to be analyzed.
* ``model_path (str)`` : Path to the pre-trained model.
* ``tokenizer_path (str)`` : Path to the tokenizer of the pre-trained model.
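
Building on this, a small wrapper can score several contexts against one question. The helper below is hypothetical and assumes ``get_inference`` returns the predicted label, with ``1`` meaning the context is relevant; check the function's return value in your installed version before relying on it:

.. code-block:: python

    from inference import get_inference

    # Hypothetical helper: keep only the contexts predicted as relevant
    # (assumes get_inference returns an integer label where 1 == relevant).
    def relevant_contexts(question, contexts, model_path, tokenizer_path):
        return [
            context
            for context in contexts
            if get_inference(
                question=question,
                context=context,
                model_path=model_path,
                tokenizer_path=tokenizer_path,
            ) == 1
        ]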



************************
Developer Notes
************************

To add new dependencies, use PDM. First install it via pip::

    $ pip install pdm

Then you can add new packages via ``pdm add``, for example NumPy::

    $ pdm add numpy

To run the linting and test suites, do the following::

    $ pip install tox
    $ tox -e lint
    $ tox -e test



************************
Contributing
************************

Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.

All contributions (including pull requests) must agree to the Developer Certificate of Origin (DCO) version 1.1. This is exactly the same DCO created and used by the Linux kernel developers and posted at http://developercertificate.org/. It certifies that the contributor has the right to submit the patch for inclusion in the project. Submitting a contribution implies this agreement; however, please include a "Signed-off-by" tag in every patch (this tag is a conventional way to confirm that you agree to the DCO).
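
In practice, the simplest way to add the tag is Git's built-in ``-s`` flag, which appends a ``Signed-off-by`` line using your configured name and email (the commit message here is only an example):

.. code-block:: shell

    $ git commit -s -m "Fix tokenizer path handling"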









.. |osc-climate-project| image:: https://img.shields.io/badge/OS-Climate-blue
:alt: An OS-Climate Project
:target: https://os-climate.org/

.. image:: https://img.shields.io/badge/slack-osclimate-brightgreen.svg?logo=slack
.. |osc-climate-slack| image:: https://img.shields.io/badge/slack-osclimate-brightgreen.svg?logo=slack
:alt: Join OS-Climate on Slack
:target: https://os-climate.slack.com

.. image:: https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white
.. |osc-climate-github| image:: https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white
:alt: Source code on GitHub
:target: https://github.com/ModeSevenIndustrialSolutions/osc-transformer-based-extractor
:target: https://github.com/ModeSevenIndustrialSolutions/osc-data-extractor

.. image:: https://img.shields.io/pypi/v/osc-transformer-based-extractor.svg
.. |pypi| image:: https://img.shields.io/pypi/v/osc-data-extractor.svg
:alt: PyPI package
:target: https://pypi.org/project/osc-transformer-based-extractor/
:target: https://pypi.org/project/osc-data-extractor/

.. image:: https://api.cirrus-ci.com/github/os-climate/osc-transformer-based-extractor.svg?branch=main
.. |build-status| image:: https://api.cirrus-ci.com/github/os-climate/osc-data-extractor.svg?branch=main
:alt: Build Status
:target: https://cirrus-ci.com/github/os-climate/osc-transformer-based-extractor
:target: https://cirrus-ci.com/github/os-climate/osc-data-extractor

.. image:: https://img.shields.io/badge/PDM-Project-purple
.. |pdm| image:: https://img.shields.io/badge/PDM-Project-purple
:alt: Built using PDM
:target: https://pdm-project.org/latest/

.. image:: https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold
.. |PyScaffold| image:: https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold
:alt: Project generated with PyScaffold
:target: https://pyscaffold.org/



===============================
osc-transformer-based-extractor
===============================

OS-Climate Data Extraction Tool

.. _notes:

Notes
=====

Placeholder notes content
File renamed without changes.
File renamed without changes.
@@ -46,7 +46,7 @@
 "outputs": [],
 "source": [
 "# Load your dataset into a pandas DataFrame\n",
-"df = pd.read_csv(\"data/train_data.csv\")"
+"df = pd.read_csv(\"data/output_curator.csv\")"
 ]
 },
 {
@@ -62,7 +62,7 @@
 " Args:\n",
 " tokenizer (transformers.PreTrainedTokenizer): tokenizing input text.\n",
 " questions (list): List of questions.\n",
-" paragraphs (list): List of corresponding paragraphs.\n",
+" contexts (list): List of corresponding contexts.\n",
 " labels (list): List of labels.\n",
 " max_length (int): Maximum length of input sequences.\n",
 "\n",
@@ -72,13 +72,13 @@
 "\n",
 " Example:\n",
 " dataset = CustomDataset(tokenizer, questions,\n",
-" paragraphs, labels, max_length)\n",
+" contexts, labels, max_length)\n",
 " \"\"\"\n",
 "\n",
-" def __init__(self, tokenizer, questions, paragraphs, labels, max_length):\n",
+" def __init__(self, tokenizer, questions, contexts, labels, max_length):\n",
 " self.tokenizer = tokenizer\n",
 " self.questions = questions\n",
-" self.paragraphs = paragraphs\n",
+" self.contexts = contexts\n",
 " self.labels = labels\n",
 " self.max_length = max_length\n",
 "\n",
@@ -87,11 +87,11 @@
 "\n",
 " def __getitem__(self, idx):\n",
 " question = str(self.questions[idx])\n",
-" paragraph = str(self.paragraphs[idx])\n",
+" context = str(self.contexts[idx])\n",
 " label = self.labels[idx]\n",
 "\n",
 " inputs = self.tokenizer(\n",
-" question, paragraph, truncation=True, padding=\"max_length\", max_length=self.max_length, return_tensors=\"pt\"\n",
+" question, context, truncation=True, padding=\"max_length\", max_length=self.max_length, return_tensors=\"pt\"\n",
 " )\n",
 "\n",
 " input_ids = inputs[\"input_ids\"].squeeze()\n",
@@ -143,10 +143,10 @@
 "MAX_LENGTH = 512\n",
 "\n",
 "# Create training dataset\n",
-"train_dataset = CustomDataset(tokenizer, train_df[\"question\"], train_df[\"paragraph\"], train_df[\"label\"], MAX_LENGTH)\n",
+"train_dataset = CustomDataset(tokenizer, train_df[\"question\"], train_df[\"context\"], train_df[\"label\"], MAX_LENGTH)\n",
 "\n",
 "# Create evaluation dataset\n",
-"eval_dataset = CustomDataset(tokenizer, eval_df[\"question\"], eval_df[\"paragraph\"], eval_df[\"label\"], MAX_LENGTH)"
+"eval_dataset = CustomDataset(tokenizer, eval_df[\"question\"], eval_df[\"context\"], eval_df[\"label\"], MAX_LENGTH)"
 ]
 },
 {