This project provides a CLI tool and Python scripts to train a Hugging Face Transformer model (or a local Transformer model) and perform inference with it. The primary goal of the inference step is to determine the relevance between a given question and a context paragraph.
To install the OSC Transformer Based Extractor CLI, use pip:
$ pip install osc-transformer-based-extractor
Alternatively, you can clone the repository from GitHub for a quick start:
$ git clone https://github.com/os-climate/osc-transformer-based-extractor/
To train the model, you need data from the curator module, provided in CSV format. You can train the model either via the CLI or by calling the function directly in Python. Sample data:
| Question | Context | Label | Company | Source File | Source Page | KPI ID | Year | Answer | Data Type | Annotator | Index |
|---|---|---|---|---|---|---|---|---|---|---|---|
| What is the company name? | The Company is exposed to a risk of losses caused by counterparties failing to meet their contractual financial obligations when due, and in particular depends on the reliability of the banks with which the Company deposits its available cash. | 0 | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | ['0'] | 0 | 2016 | PAO NOVATEK | TEXT | train_anno_large.xlsx | 1022 |
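To illustrate the expected CSV layout, here is a minimal sketch that writes and reads back one row with Python's standard csv module. The column names follow the sample above; which columns the trainer actually consumes beyond Question, Context, and Label is an assumption here.

```python
import csv
import io

# Columns matching the sample row above. Question, Context, and Label
# are the fields used for relevance classification (assumption).
FIELDS = ["Question", "Context", "Label", "Company", "Source File",
          "Source Page", "KPI ID", "Year", "Answer", "Data Type",
          "Annotator", "Index"]

row = {
    "Question": "What is the company name?",
    "Context": "The Company is exposed to a risk of losses ...",
    "Label": "0",
    "Company": "NOVATEK",
    "Source File": "04_NOVATEK_AR_2016_ENG_11.pdf",
    "Source Page": "['0']",
    "KPI ID": "0",
    "Year": "2016",
    "Answer": "PAO NOVATEK",
    "Data Type": "TEXT",
    "Annotator": "train_anno_large.xlsx",
    "Index": "1022",
}

# Write one row to an in-memory CSV, then read it back.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)

buf.seek(0)
records = list(csv.DictReader(buf))
print(records[0]["Label"])  # "0": this pair is labeled not relevant
```

In practice you would point `data_path` at a file on disk produced by the curator module rather than an in-memory buffer.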
The CLI command osc-transformer-based-extractor provides two main functions: training and inference. You can access detailed information about these functions by running the CLI command without any arguments.
Commands
fine-tune
: Fine-tune a pre-trained Hugging Face model on a custom dataset.

perform-inference
: Perform inference using a pre-trained sequence classification model.
To set up the working environment for this repository, follow these steps:
- Clone the repository:
$ git clone https://github.com/os-climate/osc-transformer-based-extractor/
$ cd osc-transformer-based-extractor
- Create a new virtual environment and activate it:
$ python -m venv venv
$ source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install PDM:
$ pip install pdm
- Sync the environment using PDM:
$ pdm sync
- Add any new library:
$ pdm add <library-name>
To train the model, you can use the following code snippet:
$ python fine_tune.py \
--data_path "data/train_data.csv" \
--model_name "sentence-transformers/all-MiniLM-L6-v2" \
--num_labels 2 \
--max_length 512 \
--epochs 2 \
--batch_size 4 \
--output_dir "./saved_models_during_training" \
--save_steps 500
OR use function calling:
from fine_tune import fine_tune_model

fine_tune_model(
    data_path="data/train_data.csv",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    num_labels=2,
    max_length=512,
    epochs=2,
    batch_size=4,
    output_dir="./saved_models_during_training",
    save_steps=500,
)
Parameters
data_path (str)
: Path to the training data CSV file.

model_name (str)
: Pre-trained model name from Hugging Face.

num_labels (int)
: Number of labels for the classification task.

max_length (int)
: Maximum sequence length.

epochs (int)
: Number of training epochs.

batch_size (int)
: Batch size for training.

output_dir (str)
: Directory to save the trained models.

save_steps (int)
: Number of steps between saving checkpoints.
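To make the interaction of epochs, batch_size, and save_steps concrete, here is a small illustrative calculation. It is not part of the package, and it assumes one optimization step per batch with no gradient accumulation:

```python
import math

def checkpoint_steps(num_examples, batch_size, epochs, save_steps):
    """Return the global steps at which checkpoints would be written,
    assuming one optimization step per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return list(range(save_steps, total_steps + 1, save_steps))

# 4000 training rows with batch_size=4 and epochs=2 gives 2000 steps,
# so save_steps=500 would produce four checkpoints.
print(checkpoint_steps(4000, batch_size=4, epochs=2, save_steps=500))
# [500, 1000, 1500, 2000]
```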
To perform inference and determine the relevance between a question and context, use the following code snippet:
$ python inference.py \
    --question "What is the capital of France?" \
    --context "Paris is the capital of France." \
    --model_path /path/to/model \
    --tokenizer_path /path/to/tokenizer
OR use function calling:
from inference import get_inference

result = get_inference(
    question="What is the relevance?",
    context="This is a sample paragraph.",
    model_path="path/to/model",
    tokenizer_path="path/to/tokenizer",
)
Parameters
question (str)
: The question for inference.

context (str)
: The paragraph to be analyzed.

model_path (str)
: Path to the pre-trained model.

tokenizer_path (str)
: Path to the tokenizer of the pre-trained model.
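Conceptually, a sequence classification model scores the (question, context) pair and outputs one logit per label; the predicted label is the argmax, and a softmax turns the logits into probabilities. A minimal sketch of that final step follows; the logit values are made up for illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for num_labels=2: index 0 = not relevant,
# index 1 = relevant.
logits = [-1.2, 2.3]
probs = softmax(logits)
label = probs.index(max(probs))

print(label)  # 1: the context is judged relevant to the question
```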
To add new dependencies, use PDM. First, install it via pip:
$ pip install pdm
Then you can add new packages via pdm add, for example NumPy:
$ pdm add numpy
To run the linting and testing tools, do the following:
$ pip install tox
$ tox -e lint
$ tox -e test
Contributions are welcome! Please fork the repository and submit a pull request for any enhancements or bug fixes.
All contributions (including pull requests) must agree to the Developer Certificate of Origin (DCO) version 1.1. This is exactly the same one created and used by the Linux kernel developers and posted at http://developercertificate.org/. It certifies that the contributor has the right to submit the patch for inclusion in the project. Submitting a contribution implies this agreement; however, please include a "Signed-off-by" tag in every patch (this tag is the conventional way to confirm that you agree to the DCO).
On June 26, 2024, the Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation ([FINOS](https://finos.org)), with OS-Climate, an open source community dedicated to building data technologies, modeling, and analytic tools that will drive global capital flows into climate change mitigation and resilience. OS-Climate projects are in the process of transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance). Read more at [finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg](https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg).