Fix: Broken links in README.rst banner text [skip ci] #33

Merged · 2 commits · Jul 2, 2024
4 changes: 2 additions & 2 deletions README.rst
@@ -1,7 +1,7 @@

- [IMPORTANT]
+ 💬 Important

- On June 26 2024, Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation ([FINOS](https://finos.org)), with OS-Climate, an open source community dedicated to building data technologies, modeling, and analytic tools that will drive global capital flows into climate change mitigation and resilience; OS-Climate projects are in the process of transitioning to the [FINOS governance framework](https://community.finos.org/docs/governance); read more on [finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg](https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg)
+ On June 26 2024, Linux Foundation announced the merger of its financial services umbrella, the Fintech Open Source Foundation (`FINOS <https://finos.org>`_), with OS-Climate, an open source community dedicated to building data technologies, modelling, and analytic tools that will drive global capital flows into climate change mitigation and resilience; OS-Climate projects are in the process of transitioning to the `FINOS governance framework <https://community.finos.org/docs/governance>`_; read more on `finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg <https://finos.org/press/finos-join-forces-os-open-source-climate-sustainability-esg>`_


.. image:: https://img.shields.io/badge/OS-Climate-blue
93 changes: 55 additions & 38 deletions src/osc_transformer_based_extractor/README.md
@@ -1,52 +1,55 @@
---

# Relevance Detector

This folder contains a set of scripts and notebooks designed to process data, train a sentence transformer model, and perform inference to detect whether a given paragraph is relevant to a question. Below is a detailed description of each file and folder included in this repository.

## How to Use This Repository

1. **Prepare Training Data**:

- You need data from the curator module, which is used to train the model. The curator module produces a CSV file like the following:

### Example Snippet

| question | context | company | source_file | source_page | kpi_id | year | answer | data_type | relevant_paragraphs | annotator | Index | label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| What is the company name? | The Company is exposed to a risk of by losses counterparties their contractual financial obligations when due, and in particular depends on the reliability of banks the Company deposits its available cash. | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | ['0'] | 0 | 2016 | PAO NOVATEK | TEXT | ["PAO NOVATEK ANNUAL REPORT 2016"] | train_anno_large.xlsx | 1022 | 0 |

- If you have CSV data from the curator module, run `make_training_data_from_curator.py` to process it and save the result in the `Data` folder (a rough sketch of this step appears below).
- Alternatively, you can use `make_sample_training_data.ipynb` to generate sample data from a sample CSV file.

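For orientation, here is a minimal sketch of what this preprocessing step amounts to, assuming the column names shown in the example snippet above; the input path and exact column selection are illustrative, not taken from `make_training_data_from_curator.py` itself.

```python
# Minimal sketch: reduce the curator CSV to (question, context, label)
# training rows. Column names follow the example snippet above; the real
# script may select, clean, or rename columns differently.
import pandas as pd

curator_df = pd.read_csv("curator_output.csv")  # assumed input path
train_df = curator_df[["question", "context", "label"]].dropna()
train_df.to_csv("Data/train_data.csv", index=False)  # input for training
```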

2. **Train the Model**:

- Use `train_sentence_transformer.ipynb` or `train_sentence_transformer.py` to train a sentence transformer model with the processed data from the `Data` folder and save it locally. Follow the steps in the notebook or script to configure and start the training process.

- To train the model by calling the function directly:

```python
from train_sentence_transformer import fine_tune_model
fine_tune_model(
    data_path="data/train_data.csv",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    num_labels=2,
    max_length=512,
    epochs=2,
    batch_size=4,
    output_dir="./saved_models_during_training",
    save_steps=500
)
```

**Parameters**:
- `data_path (str)`: Path to the training data CSV file.
- `model_name (str)`: Pre-trained model name from HuggingFace.
- `num_labels (int)`: Number of labels for the classification task.
- `max_length (int)`: Maximum sequence length.
- `epochs (int)`: Number of training epochs.
- `batch_size (int)`: Batch size for training.
- `output_dir (str)`: Directory to save the trained models.
- `save_steps (int)`: Number of steps between saving checkpoints.

- To train the model from the command line, run `fine_tune.py` with the required arguments:

```bash
python fine_tune.py \
    --data_path "data/train_data.csv" \
    --model_name "sentence-transformers/all-MiniLM-L6-v2" \
    --num_labels 2 \
    --max_length 512 \
    --epochs 2 \
    --batch_size 4 \
    --output_dir "./saved_models_during_training" \
    --save_steps 500
```

@@ -62,38 +65,44 @@
3. **Perform Inference**:
- Use `inference_demo.ipynb` to perform inferences with your trained model. Specify the model and tokenizer paths (either local or from HuggingFace) and run the notebook cells to see the results.
- For programmatic inference, you can use the function provided in `inference.py` (a hedged sketch of the internals follows this list):

```python
from inference import get_inference

result = get_inference(
    question="What is the relevance?",
    paragraph="This is a sample paragraph.",
    model_path="path/to/model",
    tokenizer_path="path/to/tokenizer",
)
```

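For a rough idea of what a helper like `get_inference` does under the hood, here is a hedged sketch assuming a standard HuggingFace sequence-classification setup; the class choices and the label convention (1 = relevant) are assumptions, not read from `inference.py`.

```python
# Hedged sketch of a question/paragraph relevance check with a standard
# HuggingFace sequence-classification model; not the repository's code.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def get_inference_sketch(question, paragraph, model_path, tokenizer_path):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    # Encode the (question, paragraph) pair as a single classifier input.
    inputs = tokenizer(question, paragraph, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))  # assumed: 1 = relevant, 0 = not
```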

## Repository Contents

### Python Scripts

1. **`inference.py`**

- This script contains the function to make inferences using the trained model.
- **Usage**: Import this script and use the provided function to predict the relevance of new data.
- **Example**:

```python
from inference import get_inference

result = get_inference(
    question="What is the relevance?",
    paragraph="This is a sample paragraph.",
    model_path="path/to/model",
    tokenizer_path="path/to/tokenizer",
)
```

**Parameters**:
- `question (str)`: The question for inference.
- `paragraph (str)`: The paragraph to be analyzed.
- `model_path (str)`: Path to the pre-trained model.
- `tokenizer_path (str)`: Path to the tokenizer of the pre-trained model.

2. **`make_training_data_from_curator.py`**

- This script processes CSV data obtained from a module named `curator` to make it suitable for training the model.
- **Usage**: Run this script to generate training data from the curator's output and save it in the `Data` folder.

3. **`train_sentence_transformer.py`**

- This script defines a function to train a sentence transformer model, which can be called from other scripts or notebooks; a hedged sketch of the training internals follows this item.
- **Usage**: Import and call the `fine_tune_model` function to train your model.
- **Example**:

```python
from train_sentence_transformer import fine_tune_model
fine_tune_model(
    data_path="data/train_data.csv",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    num_labels=2,
    max_length=512,
    epochs=2,
    batch_size=4,
    output_dir="./saved_models_during_training",
    save_steps=500
)
```

**Parameters**:
- `data_path (str)`: Path to the training data CSV file.
- `model_name (str)`: Pre-trained model name from HuggingFace.
- `num_labels (int)`: Number of labels for the classification task.
- `max_length (int)`: Maximum sequence length.
- `epochs (int)`: Number of training epochs.
- `batch_size (int)`: Batch size for training.
- `output_dir (str)`: Directory to save the trained models.
- `save_steps (int)`: Number of steps between saving checkpoints.

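As noted above, here is a hedged sketch of the kind of HuggingFace `Trainer` loop a `fine_tune_model`-style helper typically wraps; every internal detail here is an assumption for illustration, not code from the script.

```python
# Illustrative fine-tuning loop for a binary relevance classifier using
# the HuggingFace Trainer API; the real fine_tune_model may differ.
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

def fine_tune_sketch(data_path, model_name, num_labels, max_length,
                     epochs, batch_size, output_dir, save_steps):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)

    # Assumed: the training CSV has question, context, and label columns.
    dataset = Dataset.from_pandas(pd.read_csv(data_path))
    dataset = dataset.map(lambda row: tokenizer(
        row["question"], row["context"],
        truncation=True, max_length=max_length, padding="max_length"))

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        save_steps=save_steps)
    Trainer(model=model, args=args, train_dataset=dataset).train()
```
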
4. **`fine_tune.py`**
- This script allows you to train a sentence transformer model from the command line.
- **Usage**: Run this script from the command line with the necessary arguments.
- **Example**:

```bash
python fine_tune.py \
    --data_path "data/train_data.csv" \
    --model_name "sentence-transformers/all-MiniLM-L6-v2" \
    --num_labels 2 \
    --max_length 512 \
    --epochs 2 \
    --batch_size 4 \
    --output_dir "./saved_models_during_training" \
    --save_steps 500
```

@@ -136,11 +147,13 @@
### Jupyter Notebooks

1. **`inference_demo.ipynb`**

- A notebook to demonstrate how to perform inferences using a custom model and tokenizer.
- **Features**: Allows specifying model and tokenizer paths, which can be local paths or HuggingFace paths.
- **Usage**: Open this notebook and follow the instructions to test inference with your own models.

2. **`make_sample_training_data.ipynb`**

- This notebook was used to create sample training data from a sample CSV file.
- **Usage**: Open and run this notebook to understand the process of creating sample data for training.

@@ -153,34 +166,38 @@
- **`Data/`**
- This folder contains the processed training data obtained from the `curator` module. It serves as the input for training the sentence transformer model.


## Setting Up the Environment

To set up the working environment for this repository, follow these steps:

1. **Clone the repository**:

```bash
git clone https://github.com/yourusername/folder-relevance-detector.git
cd folder-relevance-detector
```

2. **Create a new virtual environment and activate it**:

```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```

3. **Install PDM**:

```bash
pip install pdm
```

4. **Sync the environment using PDM**:

```bash
pdm sync
```

5. **Add any new library**:

```bash
pdm add <library-name>
```