Fix: Specify correct repo to pull when updating branch [skip ci] (#37)
* Fix: Specify correct repo to pull when updating branch [skip ci]

Signed-off-by: Matthew Watkins <[email protected]>

* Fix: Specify correct repo to pull when updating branch

Signed-off-by: Matthew Watkins <[email protected]>

---------

Signed-off-by: Matthew Watkins <[email protected]>
ModeSevenIndustrialSolutions authored Jul 8, 2024
1 parent e13ac99 commit a993b40
Showing 4 changed files with 68 additions and 65 deletions.
3 changes: 1 addition & 2 deletions .github/workflows/bootstrap.yaml
@@ -214,8 +214,7 @@ jobs:
  else
    # The -B flag swaps branch and creates it if NOT present
    git checkout -B "$AUTOMATION_BRANCH"
-   git pull upstream "$AUTOMATION_BRANCH"
-   git pull
+   git pull origin "$AUTOMATION_BRANCH"
  fi
  # Only if NOT running in GitHub
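
The change pulls `$AUTOMATION_BRANCH` from `origin` rather than `upstream`. A minimal sketch of the remote layout this assumes (the remote names come from the diff; the URLs and organisation names are hypothetical):

```bash
# Hypothetical remotes: "origin" owns the automation branch being updated,
# "upstream" is the parent project the workflow was previously pulling from.
git remote add origin https://github.com/example-org/example-repo.git
git remote add upstream https://github.com/parent-org/example-repo.git

git checkout -B "$AUTOMATION_BRANCH"   # switch, creating the branch if absent
git pull origin "$AUTOMATION_BRANCH"   # fetch + merge from the repo that owns it
```
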
128 changes: 66 additions & 62 deletions src/osc_transformer_based_extractor/README.md
@@ -4,72 +4,73 @@ This folder contains a set of scripts and notebooks designed to process data, tr

## How to Use This Repository

**Prepare Training Data**:

- You need data from the curator module to train the model; the curator module produces a CSV file structured as follows:

### Example Snippet

| question | context | company | source_file | source_page | kpi_id | year | answer | data_type | relevant_paragraphs | annotator | Index | label |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | ----------------------------- | ----------- | ------ | ---- | ----------- | --------- | ---------------------------------- | --------------------- | ----- | ----- |
| What is the company name? | The Company is exposed to a risk of by losses counterparties their contractual financial obligations when due, and in particular depends on the reliability of banks the Company deposits its available cash. | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | ['0'] | 0 | 2016 | PAO NOVATEK | TEXT | ["PAO NOVATEK ANNUAL REPORT 2016"] | train_anno_large.xlsx | 1022 | 0 |

- If you have CSV data from the curator module, run `make_training_data_from_curator.py` to process it and save the result in the `Data` folder (a rough sketch of this step follows below).
- Alternatively, you can use `make_sample_training_data.ipynb` to generate sample data from a sample CSV file.
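
As a rough illustration of this step (the input path, output name, and column selection here are assumptions, not the script's documented behaviour), turning the curator CSV into trainer-ready data might look like:

```python
import pandas as pd

# Load the curator output (path assumed for illustration)
df = pd.read_csv("curator_output.csv")

# Keep the columns the trainer needs: the question, the candidate context,
# and the 0/1 relevance label
train_df = df[["question", "context", "label"]].dropna()

# Save into the Data folder for the training scripts to pick up
train_df.to_csv("Data/train_data.csv", index=False)
```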

**Train the Model**:

- Use `train_sentence_transformer.ipynb` or `train_sentence_transformer.py` to train a sentence transformer model on the processed data from the `Data` folder and save it locally. Follow the steps in the notebook or script to configure and start training (a hedged sketch of the underlying loop appears after this list).

- To train the model via a function call:

```python
from train_sentence_transformer import fine_tune_model
fine_tune_model(
data_path="data/train_data.csv",
model_name="sentence-transformers/all-MiniLM-L6-v2",
num_labels=2,
max_length=512,
epochs=2,
batch_size=4,
output_dir="./saved_models_during_training",
save_steps=500
)
```

**Parameters**:

- `data_path (str)`: Path to the training data CSV file.
- `model_name (str)`: Pre-trained model name from HuggingFace.
- `num_labels (int)`: Number of labels for the classification task.
- `max_length (int)`: Maximum sequence length.
- `epochs (int)`: Number of training epochs.
- `batch_size (int)`: Batch size for training.
- `output_dir (str)`: Directory to save the trained models.
- `save_steps (int)`: Number of steps between saving checkpoints.

- To train the model from the command line, run `fine_tune.py` with the required arguments:

```bash
python fine_tune.py \
--data_path "data/train_data.csv" \
--model_name "sentence-transformers/all-MiniLM-L6-v2" \
--num_labels 2 \
--max_length 512 \
--epochs 2 \
--batch_size 4 \
--output_dir "./saved_models_during_training" \
--save_steps 500
```
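
For orientation, here is a minimal sketch of the kind of fine-tuning loop a function like `fine_tune_model` wraps, built on the Hugging Face `transformers` and `datasets` libraries. The helper name, column handling, and training options below are assumptions, not the module's actual implementation:

```python
# A hedged sketch, not the module's actual code: fine-tune a pre-trained
# encoder as a relevance classifier over (question, context) pairs.
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def sketch_fine_tune(data_path, model_name, num_labels, max_length,
                     epochs, batch_size, output_dir, save_steps):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )

    df = pd.read_csv(data_path)  # expects question/context/label columns

    def tokenize(batch):
        # Encode each question together with its candidate context
        return tokenizer(batch["question"], batch["context"],
                         truncation=True, padding="max_length",
                         max_length=max_length)

    dataset = Dataset.from_pandas(df).map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        save_steps=save_steps,
    )
    Trainer(model=model, args=args, train_dataset=dataset).train()
```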

**Perform Inference**:

- Use `inference_demo.ipynb` to perform inferences with your trained model. Specify the model and tokenizer paths (either local or from HuggingFace) and run the notebook cells to see the results.
- For programmatic inference, you can use the function provided in `inference.py` (a batch example follows below):

```python
from inference import get_inference
result = get_inference(question="What is the relevance?", paragraph="This is a sample paragraph.", model_path="path/to/model", tokenizer_path="path/to/tokenizer")
```
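
To score many paragraphs programmatically, a small loop over `get_inference` is enough. The keyword names follow the README example above; the input file and column names are assumptions:

```python
import pandas as pd
from inference import get_inference

df = pd.read_csv("Data/train_data.csv")  # any table with question/context columns

# Score each (question, paragraph) pair with the trained model
df["prediction"] = [
    get_inference(question=q, paragraph=p,
                  model_path="path/to/model",
                  tokenizer_path="path/to/tokenizer")
    for q, p in zip(df["question"], df["context"])
]
```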

## Repository Contents

@@ -87,6 +88,7 @@ This folder contains a set of scripts and notebooks designed to process data, tr
```

**Parameters**:

- `question (str)`: The question for inference.
- `paragraph (str)`: The paragraph to be analyzed.
- `model_path (str)`: Path to the pre-trained model.
@@ -118,6 +120,7 @@ This folder contains a set of scripts and notebooks designed to process data, tr
```

**Parameters**:

- `data_path (str)`: Path to the training data CSV file.
- `model_name (str)`: Pre-trained model name from HuggingFace.
- `num_labels (int)`: Number of labels for the classification task.
@@ -128,6 +131,7 @@ This folder contains a set of scripts and notebooks designed to process data, tr
- `save_steps (int)`: Number of steps between saving checkpoints.

4. **`fine_tune.py`**

- This script allows you to train a sentence transformer model from the command line.
- **Usage**: Run this script from the command line with the necessary arguments.
- **Example**:
1 change: 0 additions & 1 deletion src/osc_transformer_based_extractor/inference.py
@@ -92,7 +92,6 @@ def get_inference(question: str, context: str, model_path: str, tokenizer_path:
print(f"Predicted Label ID: {result}")



'''python inference.py
--question "What is the capital of France?"
--context "Paris is the capital of France."
1 change: 1 addition & 0 deletions src/pytests/tess.py
@@ -13,6 +13,7 @@ def train(self):
      def evaluate(self, dataset):
          return {"eval_loss": 0.1, "eval_accuracy": 0.95}

+
  @pytest.fixture
  def mock_trainer():
      return MockTrainer()
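
The added blank line appears to restore PEP 8 spacing before the fixture. For context, a test consuming `mock_trainer` might look like this (test name hypothetical; metric values as defined in `MockTrainer` above):

```python
def test_mock_trainer_evaluate(mock_trainer):
    # evaluate() ignores its dataset argument and returns fixed metrics
    metrics = mock_trainer.evaluate(dataset=None)
    assert metrics["eval_loss"] == 0.1
    assert metrics["eval_accuracy"] == 0.95
```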