Is your feature request related to a problem? Please describe.
Currently, the autolabel function performs a strict comparison between the output generated by the language model and the predefined list of valid labels. This raises an OUTPUT_GUIDELINES_NOT_FOLLOWED_ERROR whenever the generated output closely aligns with a label but does not match it exactly. For instance, when using the Llama-7B model with the banking dataset, outputs such as "Sure! Here is the label for your input: Input: I want to close my account. Output: terminate_account" are generated, which, despite the LLM having guessed the correct label, do not strictly match any label and therefore trigger the error.
Describe the solution you'd like
To address this issue, I propose adding an option to choose a more flexible label comparison mechanism when running a labelling task. Two possible mechanisms are described below (a rough code sketch of both follows the list):
Inclusion Check: Evaluate whether any of the predefined labels are contained within the language model's output. If exactly one label is found within the output, it should be designated as the generated label. However, this approach may be less effective when labels are commonly used words, due to the risk of false positives.
Similarity Assessment: Utilizing a similarity metric, such as the ROUGE Score, could offer a more nuanced evaluation of the relationship between the model's output and the potential labels. The label with the highest similarity score would be deemed the most appropriate. This approach should only categorize an output as successfully labeled if the top similarity score significantly surpasses a set threshold or is distinctly higher than other scores.
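As a rough illustration (not tied to autolabel's internals), here is a minimal Python sketch of both mechanisms. The function names (`inclusion_check`, `similarity_match`), the threshold and margin values, and the use of `difflib.SequenceMatcher` as a stand-in for ROUGE are all assumptions for illustration; a real implementation could plug in the `rouge_score` package or any other similarity metric instead.

```python
import difflib
from typing import List, Optional


def inclusion_check(output: str, labels: List[str]) -> Optional[str]:
    """Return the label if exactly one predefined label appears in the output."""
    matches = [label for label in labels if label.lower() in output.lower()]
    return matches[0] if len(matches) == 1 else None


def similarity_match(
    output: str,
    labels: List[str],
    threshold: float = 0.8,   # assumed value; would need tuning per dataset
    margin: float = 0.1,      # assumed value; gap required over the runner-up
) -> Optional[str]:
    """Return the most similar label, but only if its score clearly wins."""
    # Score every label against the raw model output (SequenceMatcher here,
    # but a ROUGE score or another metric could be substituted).
    scored = sorted(
        (
            (difflib.SequenceMatcher(None, output.lower(), label.lower()).ratio(), label)
            for label in labels
        ),
        reverse=True,
    )
    best_score, best_label = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    # Accept only if the top score passes the threshold or is distinctly
    # higher than the second-best score; otherwise treat it as unlabeled.
    if best_score >= threshold or (best_score - runner_up) >= margin:
        return best_label
    return None


# Illustrative usage with a few banking intent labels:
labels = ["terminate_account", "edit_personal_details", "card_arrival"]
print(inclusion_check("Sure! Here is the label: terminate_account", labels))  # terminate_account
print(similarity_match("terminate_acount", labels))  # tolerates the misspelling -> terminate_account
```

The second function illustrates the guard described above: a near-miss such as a misspelled label still resolves, while an output that is not clearly closest to any single label returns None and can still be reported as an error.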
The second approach significantly enhanced classification accuracy in a project where I fine-tuned Llama-2-70B for categorizing physicians' diagnostic notes into specific types of cancer. Given the complexity and length of oncological terminology, the model was prone to minor spelling inaccuracies in its classifications. Implementing this method resulted in a marked improvement in the precision of the model's categorizations.