This project provides a CLI tool and Python scripts to train a Hugging Face Transformer model (or a local Transformer model) and perform inference with it. The primary goal of the inference step is to determine the relevance between a given question and a context paragraph.
To install the OSC Transformer Based Extractor CLI, use pip:
$ pip install osc-transformer-based-extractor
Alternatively, you can clone the repository from GitHub for a quick start:
$ git clone https://github.com/os-climate/osc-transformer-based-extractor/
To train the model, you need data from the curator module, provided in CSV format. You can train the model either via the CLI or by calling the function directly in Python. Sample data:
| Question | Context | Label | Company | Source File | Source Page | KPI ID | Year | Answer | Data Type | Annotator | Index |
|---|---|---|---|---|---|---|---|---|---|---|---|
| What is the company name? | The Company is exposed to a risk of losses caused by counterparties failing to meet their contractual financial obligations when due, and in particular depends on the reliability of the banks with which the Company deposits its available cash. | 0 | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | ['0'] | 0 | 2016 | PAO NOVATEK | TEXT | train_anno_large.xlsx | 1022 |
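To illustrate the expected CSV layout, here is a minimal sketch that writes and reads back one row with Python's standard csv module. The column names follow the sample above; which columns the trainer actually consumes beyond Question, Context, and Label is an assumption here.

```python
import csv
import io

# Columns matching the sample row above. Question, Context, and Label
# are the fields used for relevance classification (assumption).
FIELDS = ["Question", "Context", "Label", "Company", "Source File",
          "Source Page", "KPI ID", "Year", "Answer", "Data Type",
          "Annotator", "Index"]

row = {
    "Question": "What is the company name?",
    "Context": "The Company is exposed to a risk of losses ...",
    "Label": "0",
    "Company": "NOVATEK",
    "Source File": "04_NOVATEK_AR_2016_ENG_11.pdf",
    "Source Page": "['0']",
    "KPI ID": "0",
    "Year": "2016",
    "Answer": "PAO NOVATEK",
    "Data Type": "TEXT",
    "Annotator": "train_anno_large.xlsx",
    "Index": "1022",
}

# Write one row to an in-memory CSV, then read it back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)

buf.seek(0)
records = list(csv.DictReader(buf))
print(records[0]["Label"])  # "0": this pair is labeled not relevant
```

In practice you would point `data_path` at a file on disk produced by the curator module rather than an in-memory buffer.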
The CLI command osc-transformer-based-extractor provides two main functions: training and inference. You can access detailed information about these functions by running the CLI command without any arguments.
Commands
fine-tune
: Fine-tune a pre-trained Hugging Face model on a custom dataset.

perform-inference
: Perform inference using a pre-trained sequence classification model.
To set up the working environment for this repository, follow these steps:
- Clone the repository:
$ git clone https://github.com/os-climate/osc-transformer-based-extractor/
$ cd osc-transformer-based-extractor
- Create a new virtual environment and activate it:
$ python -m venv venv
$ source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install PDM:
$ pip install pdm
- Sync the environment using PDM:
$ pdm sync
- Add any new library:
$ pdm add <library-name>
To train the model, you can use the following code snippet:
$ python fine_tune.py \
--data_path "data/train_data.csv" \
--model_name "sentence-transformers/all-MiniLM-L6-v2" \
--num_labels 2 \
--max_length 512 \
--epochs 2 \
--batch_size 4 \
--output_dir "./saved_models_during_training" \
--save_steps 500
OR use function calling:
from fine_tune import fine_tune_model

fine_tune_model(
    data_path="data/train_data.csv",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    num_labels=2,
    max_length=512,
    epochs=2,
    batch_size=4,
    output_dir="./saved_models_during_training",
    save_steps=500,
)
Parameters
data_path (str)
: Path to the training data CSV file.

model_name (str)
: Pre-trained model name from Hugging Face.

num_labels (int)
: Number of labels for the classification task.

max_length (int)
: Maximum sequence length.

epochs (int)
: Number of training epochs.

batch_size (int)
: Batch size for training.

output_dir (str)
: Directory to save the trained models.

save_steps (int)
: Number of steps between saving checkpoints.
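To make the interaction of epochs, batch_size, and save_steps concrete, here is a small illustrative calculation. It is not part of the package, and it assumes one optimization step per batch with no gradient accumulation:

```python
import math

def checkpoint_steps(num_examples, batch_size, epochs, save_steps):
    """Return the global steps at which checkpoints would be written,
    assuming one optimization step per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return list(range(save_steps, total_steps + 1, save_steps))

# 4000 training rows with batch_size=4 and epochs=2 gives 2000 steps,
# so save_steps=500 would produce four checkpoints.
print(checkpoint_steps(4000, batch_size=4, epochs=2, save_steps=500))
# [500, 1000, 1500, 2000]
```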
To perform inference and determine the relevance between a question and context, use the following code snippet:
$ python inference.py \
    --question "What is the capital of France?" \
    --context "Paris is the capital of France." \
    --model_path /path/to/model \
    --tokenizer_path /path/to/tokenizer
OR use function calling:
from inference import get_inference

result = get_inference(
    question="What is the relevance?",
    context="This is a sample paragraph.",
    model_path="path/to/model",
    tokenizer_path="path/to/tokenizer",
)
Parameters
question (str)
: The question for inference.

context (str)
: The paragraph to be analyzed.

model_path (str)
: Path to the pre-trained model.

tokenizer_path (str)
: Path to the tokenizer of the pre-trained model.
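Conceptually, a sequence classification model scores the (question, context) pair and outputs one logit per label; the predicted label is the argmax, and a softmax turns the logits into probabilities. A minimal sketch of that final step follows; the logit values are made up for illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for num_labels=2: index 0 = not relevant,
# index 1 = relevant.
logits = [-1.2, 2.3]
probs = softmax(logits)
label = probs.index(max(probs))

print(label)  # 1: the context is judged relevant to the question
```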
To add new dependencies, use PDM. First, install it via pip:
$ pip install pdm
Then you can add new packages via pdm add, for example NumPy:
$ pdm add numpy
To run the linting and testing tools, do the following:
$ pip install tox
$ tox -e lint
$ tox -e test
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.
All contributions (including pull requests) must agree to the Developer Certificate of Origin (DCO) version 1.1. This is exactly the same one created and used by the Linux kernel developers and posted at http://developercertificate.org/. It certifies that the contributor has the right to submit the patch for inclusion in the project. Submitting a contribution implies this agreement; however, please include a "Signed-off-by" tag in every patch (this tag is the conventional way to confirm that you agree to the DCO).
On June 26, 2024, the Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation ([FINOS](https://finos.org)), with OS-Climate, an open source community dedicated to building data technologies, modeling, and analytic tools that will drive global capital flows into climate change mitigation and resilience. OS-Climate projects are in the process of transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance). Read more at [finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg](https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg).