
Data relevance #41

Merged

Changes from all 31 commits
0b7beb5  Update README.rst (tanishq-ids, Jul 15, 2024)
e3e245e  Delete src/osc_transformer_based_extractor/OSC directory (tanishq-ids, Jul 15, 2024)
dbdaaa4  Delete src/osc_transformer_based_extractor directory (tanishq-ids, Jul 15, 2024)
0fb69c3  Delete src/pytests directory (tanishq-ids, Jul 15, 2024)
2bd2f01  Create __init__.py (tanishq-ids, Jul 15, 2024)
0732004  Create fine_tune.py (tanishq-ids, Jul 15, 2024)
c2fa88b  added inference and main.py (tanishq-ids, Jul 15, 2024)
61df612  Create test_main.py (tanishq-ids, Jul 15, 2024)
019b70e  added test_fine_tune and test_inference.py (tanishq-ids, Jul 15, 2024)
6b0a711  Update pyproject.toml (tanishq-ids, Jul 15, 2024)
6eb4caa  Update tox.ini (tanishq-ids, Jul 15, 2024)
b6d23e1  Update .coveragerc (tanishq-ids, Jul 15, 2024)
1cdcce2  Create result.json (tanishq-ids, Jul 15, 2024)
0ff9aa5  uploaded data files in demo (tanishq-ids, Jul 15, 2024)
a683eb2  Create dummy.txt (tanishq-ids, Jul 15, 2024)
a712549  added inference_demo.ipynb (tanishq-ids, Jul 15, 2024)
1e987d4  Create dummy.txt (tanishq-ids, Jul 15, 2024)
0338251  added train_sentence_transformer.ipynb (tanishq-ids, Jul 15, 2024)
032b2be  Delete demo/inference_demo/dummy.txt (tanishq-ids, Jul 15, 2024)
3e67b4f  Delete demo/training_demo/dummy.txt (tanishq-ids, Jul 15, 2024)
082a157  Update test_inference.py (tanishq-ids, Jul 15, 2024)
28d1a6b  changes in file structure (Jul 15, 2024)
6e718ba  Chore: pre-commit autoupdate (pre-commit-ci[bot], Jul 16, 2024)
35fcd94  Update fine_tune.py (tanishq-ids, Jul 16, 2024)
91208d1  Update inference.py (tanishq-ids, Jul 16, 2024)
aceaf2c  Update test_fine_tune.py (tanishq-ids, Jul 16, 2024)
474a326  Chore: pre-commit autoupdate (pre-commit-ci[bot], Jul 16, 2024)
03e24ab  Update test_main.py (tanishq-ids, Jul 16, 2024)
487b22c  Chore: pre-commit autoupdate (pre-commit-ci[bot], Jul 16, 2024)
693d46f  Update test_inference.py (tanishq-ids, Jul 16, 2024)
13408d5  Chore: pre-commit autoupdate (pre-commit-ci[bot], Jul 16, 2024)
.coveragerc (2 changes: 1 addition & 1 deletion)
@@ -1,7 +1,7 @@
 # .coveragerc to control coverage.py
 [run]
 branch = True
-source = osc_data_extractor
+source = osc_transformer_based_extractor
 # omit = bad_file.py

 [paths]
README.rst (290 changes: 263 additions & 27 deletions)
@@ -1,48 +1,284 @@
#############################################
OSC Transformer Based Extractor
#############################################

💬 Important
|osc-climate-project| |osc-climate-slack| |osc-climate-github| |pypi| |build-status| |pdm| |PyScaffold|

On June 26, 2024, the Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation (`FINOS <https://finos.org>`_), with OS-Climate, an open source community dedicated to building data technologies, modelling, and analytic tools that will drive global capital flows into climate change mitigation and resilience. OS-Climate projects are in the process of transitioning to the `FINOS governance framework <https://community.finos.org/docs/governance>`_; read more at `finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg <https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg>`_.


.. image:: https://img.shields.io/badge/OS-Climate-blue
***********************************
OS-Climate Data Extraction Tool
***********************************


This project provides a CLI tool and Python scripts to train a Hugging Face transformer model or a local transformer model and to perform inference with it. The primary goal of the inference is to determine the relevance between a given question and context.

Installation
^^^^^^^^^^^^^

To install the OSC Transformer Based Extractor CLI, use pip:

.. code-block:: shell

    $ pip install osc-transformer-based-extractor

Alternatively, you can clone the repository from GitHub for a quick start:

.. code-block:: shell

    $ git clone https://github.com/os-climate/osc-transformer-based-extractor/


***************
Training Data
***************
To train the model, you need data from the curator module, provided as a CSV file. You can train the model either via the CLI or by calling the function directly in Python.

Sample data:

.. list-table:: Sample training data
:header-rows: 1

* - Question
- Context
- Label
- Company
- Source File
- Source Page
- KPI ID
- Year
- Answer
- Data Type
- Annotator
- Index
* - What is the company name?
- The Company is exposed to a risk of losses caused by counterparties failing to meet their contractual financial obligations when due, and in particular depends on the reliability of the banks where the Company deposits its available cash.
- 0
- NOVATEK
- 04_NOVATEK_AR_2016_ENG_11.pdf
- ['0']
- 0
- 2016
- PAO NOVATEK
- TEXT
- train_anno_large.xlsx
- 1022
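
A quick way to sanity-check curator output before training is to load it with pandas. This is a minimal sketch, not part of the package: the lower-case ``question``, ``context`` and ``label`` column names follow the training notebook, and the file path is illustrative.

.. code-block:: python

    import pandas as pd

    # Load the curator output and verify the columns the trainer expects.
    df = pd.read_csv("data/train_data.csv")
    required = {"question", "context", "label"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    # Labels are binary: 1 = the context is relevant to the question.
    print(df["label"].value_counts())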




***************
CLI Usage
***************

The CLI command ``osc-transformer-based-extractor`` provides two main functions: training and inference. You can access detailed information about these functions by running the CLI command without any arguments; see the example after the command list.

**Commands**



* ``fine-tune`` : Fine-tune a pre-trained Hugging Face model on a custom dataset.
* ``perform-inference`` : Perform inference using a pre-trained sequence classification model.
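
For example, run the bare command to list the available commands, then ask for details on one of them (shown with the conventional ``--help`` flag; the exact output depends on the installed version):

.. code-block:: shell

    $ osc-transformer-based-extractor
    $ osc-transformer-based-extractor fine-tune --help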



******************************
Using the GitHub Repository
******************************

Setting Up the Environment
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To set up the working environment for this repository, follow these steps:

1. **Clone the repository**:

.. code-block:: shell

    $ git clone https://github.com/os-climate/osc-transformer-based-extractor/
    $ cd osc-transformer-based-extractor



2. **Create a new virtual environment and activate it**:

.. code-block:: shell

    $ python -m venv venv
    $ source venv/bin/activate  # On Windows use `venv\Scripts\activate`



3. **Install PDM**:

.. code-block:: shell

    $ pip install pdm



4. **Sync the environment using PDM**:

.. code-block:: shell

    $ pdm sync



5. **Add any new library**:

.. code-block:: shell

    $ pdm add <library-name>


Training the Model
^^^^^^^^^^^^^^^^^^^^^^^^^

To train the model, you can use the following code snippet:

.. code-block:: shell

    $ python fine_tune.py \
        --data_path "data/train_data.csv" \
        --model_name "sentence-transformers/all-MiniLM-L6-v2" \
        --num_labels 2 \
        --max_length 512 \
        --epochs 2 \
        --batch_size 4 \
        --output_dir "./saved_models_during_training" \
        --save_steps 500

Or call the function directly in Python:

.. code-block:: python

    from fine_tune import fine_tune_model

    fine_tune_model(
        data_path="data/train_data.csv",
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        num_labels=2,
        max_length=512,
        epochs=2,
        batch_size=4,
        output_dir="./saved_models_during_training",
        save_steps=500,
    )

**Parameters**

* ``data_path (str)`` : Path to the training data CSV file.
* ``model_name (str)`` : Pre-trained model name from HuggingFace.
* ``num_labels (int)`` : Number of labels for the classification task.
* ``max_length (int)`` : Maximum sequence length.
* ``epochs (int)`` : Number of training epochs.
* ``batch_size (int)`` : Batch size for training.
* ``output_dir (str)`` : Directory to save the trained models.
* ``save_steps (int)`` : Number of steps between saving checkpoints.
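
If you just want to smoke-test the training loop before real curator data is available, a minimal CSV with the expected columns can be generated by hand. This is a sketch under the assumption that the trainer reads the lower-case ``question``, ``context`` and ``label`` columns, as the training notebook does:

.. code-block:: python

    import pandas as pd

    # Two toy rows: label 1 marks a relevant question/context pair,
    # label 0 an irrelevant one.
    rows = [
        {"question": "What is the company name?",
         "context": "The company name is PAO NOVATEK.",
         "label": 1},
        {"question": "What is the company name?",
         "context": "Revenue grew by four percent year over year.",
         "label": 0},
    ]
    pd.DataFrame(rows).to_csv("data/train_data.csv", index=False)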


Performing Inference
^^^^^^^^^^^^^^^^^^^^^^^^^

To perform inference and determine the relevance between a question and context, use the following code snippet:

.. code-block:: shell

    $ python inference.py \
        --question "What is the capital of France?" \
        --context "Paris is the capital of France." \
        --model_path /path/to/model \
        --tokenizer_path /path/to/tokenizer

Or call the function directly in Python:

.. code-block:: python

    from inference import get_inference

    result = get_inference(
        question="What is the relevance?",
        context="This is a sample paragraph.",
        model_path="path/to/model",
        tokenizer_path="path/to/tokenizer",
    )


**Parameters**

* ``question (str)`` : The question for inference.
* ``context (str)`` : The paragraph to be analyzed.
* ``model_path (str)`` : Path to the pre-trained model.
* ``tokenizer_path (str)`` : Path to the tokenizer of the pre-trained model.
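
Building on this, a small wrapper can score several contexts against one question. The helper below is hypothetical and assumes ``get_inference`` returns the predicted label, with ``1`` meaning the context is relevant; check the function's return value in your installed version before relying on it:

.. code-block:: python

    from inference import get_inference

    # Hypothetical helper: keep only the contexts predicted as relevant
    # (assumes get_inference returns an integer label where 1 == relevant).
    def relevant_contexts(question, contexts, model_path, tokenizer_path):
        return [
            context
            for context in contexts
            if get_inference(
                question=question,
                context=context,
                model_path=model_path,
                tokenizer_path=tokenizer_path,
            ) == 1
        ]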



************************
Developer Notes
************************

To add new dependencies, use PDM. First install it via pip::

    $ pip install pdm

Then you can add new packages via ``pdm add``, for example NumPy::

    $ pdm add numpy

To run the linting and test suites, do the following::

    $ pip install tox
    $ tox -e lint
    $ tox -e test



************************
Contributing
************************

Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.

All contributions (including pull requests) must agree to the Developer Certificate of Origin (DCO) version 1.1. This is exactly the same DCO created and used by the Linux kernel developers and posted at http://developercertificate.org/. It certifies that the contributor has the right to submit the patch for inclusion in the project. Submitting a contribution implies this agreement; however, please include a "Signed-off-by" tag in every patch (this tag is a conventional way to confirm that you agree to the DCO).
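
In practice, the simplest way to add the tag is Git's built-in ``-s`` flag, which appends a ``Signed-off-by`` line using your configured name and email (the commit message here is only an example):

.. code-block:: shell

    $ git commit -s -m "Fix tokenizer path handling"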









.. |osc-climate-project| image:: https://img.shields.io/badge/OS-Climate-blue
:alt: An OS-Climate Project
:target: https://os-climate.org/

.. image:: https://img.shields.io/badge/slack-osclimate-brightgreen.svg?logo=slack
.. |osc-climate-slack| image:: https://img.shields.io/badge/slack-osclimate-brightgreen.svg?logo=slack
:alt: Join OS-Climate on Slack
:target: https://os-climate.slack.com

.. image:: https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white
.. |osc-climate-github| image:: https://img.shields.io/badge/GitHub-100000?logo=github&logoColor=white
:alt: Source code on GitHub
:target: https://github.com/ModeSevenIndustrialSolutions/osc-transformer-based-extractor
:target: https://github.com/ModeSevenIndustrialSolutions/osc-data-extractor

.. image:: https://img.shields.io/pypi/v/osc-transformer-based-extractor.svg
.. |pypi| image:: https://img.shields.io/pypi/v/osc-data-extractor.svg
:alt: PyPI package
:target: https://pypi.org/project/osc-transformer-based-extractor/
:target: https://pypi.org/project/osc-data-extractor/

.. image:: https://api.cirrus-ci.com/github/os-climate/osc-transformer-based-extractor.svg?branch=main
.. |build-status| image:: https://api.cirrus-ci.com/github/os-climate/osc-data-extractor.svg?branch=main
:alt: Build Status
:target: https://cirrus-ci.com/github/os-climate/osc-transformer-based-extractor
:target: https://cirrus-ci.com/github/os-climate/osc-data-extractor

.. image:: https://img.shields.io/badge/PDM-Project-purple
.. |pdm| image:: https://img.shields.io/badge/PDM-Project-purple
:alt: Built using PDM
:target: https://pdm-project.org/latest/

.. image:: https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold
.. |PyScaffold| image:: https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold
:alt: Project generated with PyScaffold
:target: https://pyscaffold.org/



===============================
osc-transformer-based-extractor
===============================

OS-Climate Data Extraction Tool

.. _notes:

Notes
=====

Placeholder notes content
File renamed without changes.
File renamed without changes.
@@ -46,7 +46,7 @@
 "outputs": [],
 "source": [
 "# Load your dataset into a pandas DataFrame\n",
-"df = pd.read_csv(\"data/train_data.csv\")"
+"df = pd.read_csv(\"data/output_curator.csv\")"
 ]
 },
 {
@@ -62,7 +62,7 @@
 " Args:\n",
 " tokenizer (transformers.PreTrainedTokenizer): tokenizing input text.\n",
 " questions (list): List of questions.\n",
-" paragraphs (list): List of corresponding paragraphs.\n",
+" contexts (list): List of corresponding contexts.\n",
 " labels (list): List of labels.\n",
 " max_length (int): Maximum length of input sequences.\n",
 "\n",
@@ -72,13 +72,13 @@
 "\n",
 " Example:\n",
 " dataset = CustomDataset(tokenizer, questions,\n",
-" paragraphs, labels, max_length)\n",
+" contexts, labels, max_length)\n",
 " \"\"\"\n",
 "\n",
-" def __init__(self, tokenizer, questions, paragraphs, labels, max_length):\n",
+" def __init__(self, tokenizer, questions, contexts, labels, max_length):\n",
 " self.tokenizer = tokenizer\n",
 " self.questions = questions\n",
-" self.paragraphs = paragraphs\n",
+" self.contexts = contexts\n",
 " self.labels = labels\n",
 " self.max_length = max_length\n",
 "\n",
@@ -87,11 +87,11 @@
 "\n",
 " def __getitem__(self, idx):\n",
 " question = str(self.questions[idx])\n",
-" paragraph = str(self.paragraphs[idx])\n",
+" context = str(self.contexts[idx])\n",
 " label = self.labels[idx]\n",
 "\n",
 " inputs = self.tokenizer(\n",
-" question, paragraph, truncation=True, padding=\"max_length\", max_length=self.max_length, return_tensors=\"pt\"\n",
+" question, context, truncation=True, padding=\"max_length\", max_length=self.max_length, return_tensors=\"pt\"\n",
 " )\n",
 "\n",
 " input_ids = inputs[\"input_ids\"].squeeze()\n",
@@ -143,10 +143,10 @@
 "MAX_LENGTH = 512\n",
 "\n",
 "# Create training dataset\n",
-"train_dataset = CustomDataset(tokenizer, train_df[\"question\"], train_df[\"paragraph\"], train_df[\"label\"], MAX_LENGTH)\n",
+"train_dataset = CustomDataset(tokenizer, train_df[\"question\"], train_df[\"context\"], train_df[\"label\"], MAX_LENGTH)\n",
 "\n",
 "# Create evaluation dataset\n",
-"eval_dataset = CustomDataset(tokenizer, eval_df[\"question\"], eval_df[\"paragraph\"], eval_df[\"label\"], MAX_LENGTH)"
+"eval_dataset = CustomDataset(tokenizer, eval_df[\"question\"], eval_df[\"context\"], eval_df[\"label\"], MAX_LENGTH)"
 ]
 },
 {