This repository hosts the code and outputs for the Master's thesis. The data folder is too large to upload here, so it is available via Google Drive.

The repository contains the following folders:
- code: contains the code for the thesis project.
- confusion_matrix: contains the confusion matrices for the dev and test sets across the various temperature settings and confidence thresholds.
- gold_data: contains the original gold data created by the expert, the preprocessed versions, and the Word List (the true transcriptions). gold_data_dev.xlsx and gold_test_data_with_features.xlsx are the final Excel files for checking the gold labels of the dev and test data.
- graph: contains the graphs that visualize the outputs; they can be regenerated by running the notebook in the code folder.
- json_files: stores all the outputs in several subfolders and a JSON file:
  - metrics_output: contains the model's metrics under different settings (e.g., metrics01_cs03.json means the temperature is 0.1 and the confidence threshold is 0.3).
  - test: contains the JSON files for the test set:
    - transcription_output: a folder that stores the raw and processed transcriptions generated by the Whisper model
    - TP_to_FN.json (instances of the unintelligible class that were True Positives, i.e. correctly classified, but became False Negatives after the confidence threshold was applied)
    - false_negatives_transcriptions.json (all FN instances)
    - false_positives_negatives.json (all errors, both FP and FN)
    - false_positives_transcriptions.json (all FP instances)
    - misclassified_word_frequencies.json
    - real_false_negatives.json (all FN instances that were already FN before the confidence threshold was applied)
    - test_NAs_info.json (the nonsensical transcriptions of the test set)
    - test_metrics_output.json (the metrics results for the test set)
  - transcription_output: stores all the outputs of the dev set:
    - processed_output: contains the processed outputs for each temperature setting (from 0 to 1 in increments of 0.1). For instance, output00.json is the output obtained when the temperature is 0.
    - raw_tanscription_info: contains the raw transcription data generated by the Whisper model for each temperature setting.
    - raw_text: contains the raw transcriptions (text only) for each temperature setting.
  - NAs_info.json (a JSON file that stores the nonsensical transcriptions of the dev set)
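
The file names above encode the settings as two-digit numbers. The following is a minimal sketch of how those names could be decoded, assuming that each two-digit field is the value in tenths (so 01 means 0.1) and that a temperature of 1.0 is written as 10. The helper names `parse_metrics_name` and `processed_output_names` are hypothetical and are not part of the thesis code.

```python
import re


def parse_metrics_name(filename: str) -> tuple[float, float]:
    """Return (temperature, confidence_threshold) for a metrics file name.

    Assumes the pattern metricsTT_csCC.json, where TT and CC are tenths.
    """
    match = re.fullmatch(r"metrics(\d{2})_cs(\d{2})\.json", filename)
    if match is None:
        raise ValueError(f"unexpected metrics file name: {filename}")
    temperature = int(match.group(1)) / 10
    threshold = int(match.group(2)) / 10
    return temperature, threshold


def processed_output_names() -> list[str]:
    """List the expected processed output files for temperatures 0.0 to 1.0.

    Assumes the pattern outputTT.json, with output10.json for temperature 1.0.
    """
    return [f"output{tenths:02d}.json" for tenths in range(11)]


if __name__ == "__main__":
    print(parse_metrics_name("metrics01_cs03.json"))
    print(processed_output_names())
```

If the actual naming differs for some settings (for example, thresholds with more than one decimal place), the regular expression would need to be adjusted.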