Fix: Specify correct repo to pull when updating branch [skip ci] (#37)
* Fix: Specify correct repo to pull when updating branch [skip ci]

Signed-off-by: Matthew Watkins <[email protected]>

* Fix: Specify correct repo to pull when updating branch

Signed-off-by: Matthew Watkins <[email protected]>

---------

Signed-off-by: Matthew Watkins <[email protected]>
ModeSevenIndustrialSolutions authored Jul 8, 2024
1 parent e13ac99 commit a993b40
Showing 4 changed files with 68 additions and 65 deletions.
3 changes: 1 addition & 2 deletions .github/workflows/bootstrap.yaml
@@ -214,8 +214,7 @@ jobs:
  else
    # The -B flag swaps branch and creates it if NOT present
    git checkout -B "$AUTOMATION_BRANCH"
-   git pull upstream "$AUTOMATION_BRANCH"
-   git pull
+   git pull origin "$AUTOMATION_BRANCH"
  fi
  # Only if NOT running in GitHub
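
The change pulls `$AUTOMATION_BRANCH` from `origin` rather than `upstream`. A minimal sketch of the remote layout this assumes (the remote names come from the diff; the URLs and organisation names are hypothetical):

```bash
# Hypothetical remotes: "origin" owns the automation branch being updated,
# "upstream" is the parent project the workflow was previously pulling from.
git remote add origin https://github.com/example-org/example-repo.git
git remote add upstream https://github.com/parent-org/example-repo.git

git checkout -B "$AUTOMATION_BRANCH"   # switch, creating the branch if absent
git pull origin "$AUTOMATION_BRANCH"   # fetch + merge from the repo that owns it
```
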
128 changes: 66 additions & 62 deletions src/osc_transformer_based_extractor/README.md
@@ -4,72 +4,73 @@ This folder contains a set of scripts and notebooks designed to process data, tr

## How to Use This Repository

**Prepare Training Data**:

- You need data from the curator module to train the model; the curator module produces a CSV file structured as follows:

### Example Snippet

| question | context | company | source_file | source_page | kpi_id | year | answer | data_type | relevant_paragraphs | annotator | Index | label |
| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | ----------------------------- | ----------- | ------ | ---- | ----------- | --------- | ---------------------------------- | --------------------- | ----- | ----- |
| What is the company name? | The Company is exposed to a risk of by losses counterparties their contractual financial obligations when due, and in particular depends on the reliability of banks the Company deposits its available cash. | NOVATEK | 04_NOVATEK_AR_2016_ENG_11.pdf | ['0'] | 0 | 2016 | PAO NOVATEK | TEXT | ["PAO NOVATEK ANNUAL REPORT 2016"] | train_anno_large.xlsx | 1022 | 0 |

- If you have CSV data from the curator module, run `make_training_data_from_curator.py` to process it and save the result in the `Data` folder (a rough sketch of this step follows below).
- Alternatively, you can use `make_sample_training_data.ipynb` to generate sample data from a sample CSV file.
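
As a rough illustration of this step (the input path, output name, and column selection here are assumptions, not the script's documented behaviour), turning the curator CSV into trainer-ready data might look like:

```python
import pandas as pd

# Load the curator output (path assumed for illustration)
df = pd.read_csv("curator_output.csv")

# Keep the columns the trainer needs: the question, the candidate context,
# and the 0/1 relevance label
train_df = df[["question", "context", "label"]].dropna()

# Save into the Data folder for the training scripts to pick up
train_df.to_csv("Data/train_data.csv", index=False)
```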

**Train the Model**:

- Use `train_sentence_transformer.ipynb` or `train_sentence_transformer.py` to train a sentence transformer model on the processed data from the `Data` folder and save it locally. Follow the steps in the notebook or script to configure and start training (a hedged sketch of the underlying loop appears after this list).

- To train the model via a function call:

```python
from train_sentence_transformer import fine_tune_model
fine_tune_model(
data_path="data/train_data.csv",
model_name="sentence-transformers/all-MiniLM-L6-v2",
num_labels=2,
max_length=512,
epochs=2,
batch_size=4,
output_dir="./saved_models_during_training",
save_steps=500
)
```

**Parameters**:

- `data_path (str)`: Path to the training data CSV file.
- `model_name (str)`: Pre-trained model name from HuggingFace.
- `num_labels (int)`: Number of labels for the classification task.
- `max_length (int)`: Maximum sequence length.
- `epochs (int)`: Number of training epochs.
- `batch_size (int)`: Batch size for training.
- `output_dir (str)`: Directory to save the trained models.
- `save_steps (int)`: Number of steps between saving checkpoints.

- To train the model from the command line, run `fine_tune.py` with the required arguments:

```bash
python fine_tune.py \
--data_path "data/train_data.csv" \
--model_name "sentence-transformers/all-MiniLM-L6-v2" \
--num_labels 2 \
--max_length 512 \
--epochs 2 \
--batch_size 4 \
--output_dir "./saved_models_during_training" \
--save_steps 500
```
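
For orientation, here is a minimal sketch of the kind of fine-tuning loop a function like `fine_tune_model` wraps, built on the Hugging Face `transformers` and `datasets` libraries. The helper name, column handling, and training options below are assumptions, not the module's actual implementation:

```python
# A hedged sketch, not the module's actual code: fine-tune a pre-trained
# encoder as a relevance classifier over (question, context) pairs.
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def sketch_fine_tune(data_path, model_name, num_labels, max_length,
                     epochs, batch_size, output_dir, save_steps):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )

    df = pd.read_csv(data_path)  # expects question/context/label columns

    def tokenize(batch):
        # Encode each question together with its candidate context
        return tokenizer(batch["question"], batch["context"],
                         truncation=True, padding="max_length",
                         max_length=max_length)

    dataset = Dataset.from_pandas(df).map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        save_steps=save_steps,
    )
    Trainer(model=model, args=args, train_dataset=dataset).train()
```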

**Perform Inference**:

- Use `inference_demo.ipynb` to perform inferences with your trained model. Specify the model and tokenizer paths (either local or from HuggingFace) and run the notebook cells to see the results.
- For programmatic inference, you can use the function provided in `inference.py` (a batch example follows below):

```python
from inference import get_inference
result = get_inference(question="What is the relevance?", paragraph="This is a sample paragraph.", model_path="path/to/model", tokenizer_path="path/to/tokenizer")
```
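
To score many paragraphs programmatically, a small loop over `get_inference` is enough. The keyword names follow the README example above; the input file and column names are assumptions:

```python
import pandas as pd
from inference import get_inference

df = pd.read_csv("Data/train_data.csv")  # any table with question/context columns

# Score each (question, paragraph) pair with the trained model
df["prediction"] = [
    get_inference(question=q, paragraph=p,
                  model_path="path/to/model",
                  tokenizer_path="path/to/tokenizer")
    for q, p in zip(df["question"], df["context"])
]
```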

## Repository Contents

@@ -87,6 +88,7 @@ This folder contains a set of scripts and notebooks designed to process data, tr
```

**Parameters**:

- `question (str)`: The question for inference.
- `paragraph (str)`: The paragraph to be analyzed.
- `model_path (str)`: Path to the pre-trained model.
@@ -118,6 +120,7 @@ This folder contains a set of scripts and notebooks designed to process data, tr
```

**Parameters**:

- `data_path (str)`: Path to the training data CSV file.
- `model_name (str)`: Pre-trained model name from HuggingFace.
- `num_labels (int)`: Number of labels for the classification task.
@@ -128,6 +131,7 @@ This folder contains a set of scripts and notebooks designed to process data, tr
- `save_steps (int)`: Number of steps between saving checkpoints.

4. **`fine_tune.py`**

- This script allows you to train a sentence transformer model from the command line.
- **Usage**: Run this script from the command line with the necessary arguments.
- **Example**:
1 change: 0 additions & 1 deletion src/osc_transformer_based_extractor/inference.py
@@ -92,7 +92,6 @@ def get_inference(question: str, context: str, model_path: str, tokenizer_path:
print(f"Predicted Label ID: {result}")



'''python inference.py
--question "What is the capital of France?"
--context "Paris is the capital of France."
1 change: 1 addition & 0 deletions src/pytests/tess.py
@@ -13,6 +13,7 @@ def train(self):
      def evaluate(self, dataset):
          return {"eval_loss": 0.1, "eval_accuracy": 0.95}

+
  @pytest.fixture
  def mock_trainer():
      return MockTrainer()
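
The added blank line appears to restore PEP 8 spacing before the fixture. For context, a test consuming `mock_trainer` might look like this (test name hypothetical; metric values as defined in `MockTrainer` above):

```python
def test_mock_trainer_evaluate(mock_trainer):
    # evaluate() ignores its dataset argument and returns fixed metrics
    metrics = mock_trainer.evaluate(dataset=None)
    assert metrics["eval_loss"] == 0.1
    assert metrics["eval_accuracy"] == 0.95
```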