Eng 815 updated tests #13

Closed
36 commits
7345cc9
fixed disambiguation and negation tests
twerkmeister Feb 1, 2024
2b8d8f3
fixed utterance name
twerkmeister Feb 1, 2024
e45f65b
test fixes
twerkmeister Feb 4, 2024
33d8da2
removed cancel confirmation for non transfer money flow
twerkmeister Feb 4, 2024
17d09a0
removed second time string from test
twerkmeister Feb 4, 2024
1b91278
reinstated misunderstood test
twerkmeister Feb 5, 2024
78c8891
moved back two tests to passing where they originally where
twerkmeister Feb 12, 2024
e3ff5ba
using intentless instead of enterprise search for testability
twerkmeister Feb 14, 2024
5eba7a4
reinstated removed test
twerkmeister Feb 14, 2024
2627af5
Add info about setting up RASA_DUCKLING_HTTP_URL environment variable…
OgnjenFrancuski Feb 14, 2024
78cca6d
Merge branch 'main' into ENG-766-test-fixes-multi-prompting-spike
twerkmeister Feb 16, 2024
ca3319a
readded slot sets for restaurant time and date but without specified …
twerkmeister Feb 16, 2024
5a880f5
added checks for setting to none
twerkmeister Feb 16, 2024
5374401
moved knowledge tests to passing bc using intentless makes them possible
twerkmeister Feb 16, 2024
7b004db
added duckling url secret to env
twerkmeister Feb 16, 2024
1ab5ef7
add prompts
varunshankar Feb 19, 2024
a94feba
moved slot sets to the right user message
twerkmeister Feb 19, 2024
76eb79a
setting temp and top p explicitly
twerkmeister Feb 19, 2024
2bd8e9c
add prompts
varunshankar Feb 20, 2024
720daa1
fix prompts
varunshankar Feb 23, 2024
2dd63b3
fix review comments
varunshankar Feb 27, 2024
09953ae
update readme
varunshankar Feb 28, 2024
eeadf77
Update PROMPT_README.md
varunshankar Feb 29, 2024
8081121
Merge pull request #9 from RasaHQ/ENG-814-add-additional-prompts
varunshankar Feb 29, 2024
290a6a9
Update rasa and rasa-plus to 3.7.8 (#11)
OgnjenFrancuski Mar 4, 2024
dd4b16c
Merge branch 'main' into ENG-766-test-fixes-multi-prompting-spike
twerkmeister Mar 4, 2024
1244eff
added flaky test category, ci step, and PR template
twerkmeister Mar 5, 2024
4524b4c
moved another test to flaky
twerkmeister Mar 5, 2024
bceb133
passing step for flaky tests
twerkmeister Mar 5, 2024
3c8c38e
moved tests that are actually passing out of failing category
twerkmeister Mar 6, 2024
a830b45
moved test to flaky
twerkmeister Mar 6, 2024
d64266b
Update .github/pull_request_template.md
twerkmeister Mar 6, 2024
1435f9f
Merge pull request #6 from RasaHQ/ENG-766-test-fixes-multi-prompting-…
twerkmeister Mar 6, 2024
4ba2a5c
merge main into annotated commands
twerkmeister Mar 8, 2024
e85b129
ordering of clarification options, replace none with null
twerkmeister Mar 11, 2024
a7716a1
triggering knowledge search
twerkmeister Mar 11, 2024
4 changes: 4 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,4 @@
## Description

## TODOs
[ ] compared flaky tests with the [known list of flaky tests steps](https://www.notion.so/rasa/Flaky-E2E-Test-Steps-63864d3d8c7b4427a0f3df8052e39f21)
16 changes: 14 additions & 2 deletions .github/workflows/continous-integration.yml
@@ -69,7 +69,7 @@ jobs:
OPENAI_API_KEY: ${{secrets.OPENAI_API_KEY}}
RASA_PRO_LICENSE: ${{secrets.RASA_PRO_LICENSE}}
RASA_PRO_BETA_INTENTLESS: true
-DUCKLING_URL: ${{secrets.DUCKLING_URL}}
+RASA_DUCKLING_HTTP_URL: ${{secrets.DUCKLING_URL}}
run: |
make train

@@ -160,16 +160,28 @@ jobs:
env:
OPENAI_API_KEY: ${{secrets.OPENAI_API_KEY}}
RASA_PRO_LICENSE: ${{secrets.RASA_PRO_LICENSE}}
RASA_DUCKLING_HTTP_URL: ${{secrets.DUCKLING_URL}}
RASA_PRO_BETA_INTENTLESS: true
run: |
make actions &
make test-passing

- name: Run e2e flaky tests
env:
OPENAI_API_KEY: ${{secrets.OPENAI_API_KEY}}
RASA_PRO_LICENSE: ${{secrets.RASA_PRO_LICENSE}}
RASA_DUCKLING_HTTP_URL: ${{secrets.DUCKLING_URL}}
RASA_PRO_BETA_INTENTLESS: true
run: |
make actions &
make test-flaky || true

- name: Run e2e failing tests
env:
OPENAI_API_KEY: ${{secrets.OPENAI_API_KEY}}
RASA_PRO_LICENSE: ${{secrets.RASA_PRO_LICENSE}}
RASA_DUCKLING_HTTP_URL: ${{secrets.DUCKLING_URL}}
RASA_PRO_BETA_INTENTLESS: true
run: |
make actions &
-make test-failing | grep '0 passed'
+make test-failing | grep '0 passed'
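
The three e2e steps above gate the job in different ways: `make test-passing` must exit successfully, `make test-flaky || true` can never fail the job, and `make test-failing | grep '0 passed'` succeeds only if the output contains the text `0 passed`. The sketch below mirrors that logic in Python. The function name `step_passes` and the sample summary strings are illustrative assumptions; the real output format of `rasa test e2e` is not shown in this diff.

```python
def step_passes(kind: str, exit_code: int, output: str) -> bool:
    """Mirror how each CI step decides success.

    kind is "passing", "flaky" or "failing", after the Makefile targets.
    """
    if kind == "passing":
        # `make test-passing`: the step succeeds only if the command does.
        return exit_code == 0
    if kind == "flaky":
        # `make test-flaky || true`: `|| true` discards any failure.
        return True
    if kind == "failing":
        # `make test-failing | grep '0 passed'`: grep matches a substring,
        # so a line such as "10 passed" would also satisfy it.
        return any("0 passed" in line for line in output.splitlines())
    raise ValueError(f"unknown step kind: {kind}")
```

Because the check is a plain substring match, a stricter pattern (for example one anchored on a word boundary) may be worth considering if the failing category ever grows to ten or more tests.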
3 changes: 3 additions & 0 deletions Makefile
@@ -28,6 +28,9 @@ actions:
test-passing: .EXPORT_ALL_VARIABLES
poetry run rasa test e2e e2e_tests/passing

test-flaky: .EXPORT_ALL_VARIABLES
poetry run rasa test e2e e2e_tests/flaky

test-failing: .EXPORT_ALL_VARIABLES
poetry run rasa test e2e e2e_tests/failing

2 changes: 2 additions & 0 deletions README.md
@@ -170,6 +170,7 @@ Prerequisites:
`poetry self update`
- python (3.10.12), e.g. using [pyenv](https://github.com/pyenv/pyenv)
`pyenv install 3.10.12`
- set up and running [Duckling](https://github.com/facebook/duckling) server

After you cloned the repository and are authenticated, follow the installation steps:

@@ -191,6 +192,7 @@ After you cloned the repository and are authenticated, follow the installation s
```bash
RASA_PRO_LICENSE=<your rasa pro license key>
OPENAI_API_KEY=<your openai api key>
RASA_DUCKLING_HTTP_URL=<url to the duckling server>
```

### Training the bot
2 changes: 1 addition & 1 deletion actions/entity_extractor.py
@@ -4,7 +4,7 @@
from rasa.nlu.extractors.duckling_entity_extractor import DucklingEntityExtractor

load_dotenv()
-duckling_url = os.environ.get("DUCKLING_URL")
+duckling_url = os.environ.get("RASA_DUCKLING_HTTP_URL")

duckling_config = {
**DucklingEntityExtractor.get_default_config(),
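
The change above switches the action server from `DUCKLING_URL` to `RASA_DUCKLING_HTTP_URL`. A minimal sketch of reading the variable with an explicit error when it is missing follows. The helper name `resolve_duckling_url` and the fallback to the old variable name are hypothetical and not part of this PR.

```python
import os
from typing import Mapping, Optional


def resolve_duckling_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Duckling server URL from the environment.

    Hypothetical helper: prefers RASA_DUCKLING_HTTP_URL and falls back to
    the old DUCKLING_URL name so that existing .env files keep working.
    """
    env = os.environ if env is None else env
    url = env.get("RASA_DUCKLING_HTTP_URL") or env.get("DUCKLING_URL")
    if not url:
        raise RuntimeError("Set RASA_DUCKLING_HTTP_URL to the Duckling server URL")
    return url
```

Failing early with a clear message avoids passing `None` into the extractor config, which is what `os.environ.get` returns when the variable is unset.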
2 changes: 2 additions & 0 deletions config.yml
@@ -7,6 +7,8 @@ pipeline:
llm:
model_name: gpt-4
request_timeout: 7
temperature: 0.0
top_p: 0.0

policies:
- name: FlowPolicy
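
The config change above pins `temperature` and `top_p` to `0.0`, presumably to make command generation as repeatable as possible for the e2e tests. The sketch below illustrates the general idea of temperature scaling in plain Python. Treating temperature 0 as a greedy argmax is an assumption of this sketch; how a given provider handles the value is not stated in this PR, and hosted models may still vary between runs.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities; temperature 0 is treated as greedy."""
    if temperature <= 0.0:
        # Greedy choice: all probability mass on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures concentrate probability on the top candidate, which is why a value of zero is a common choice when test output should not drift.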
6 changes: 6 additions & 0 deletions data/flows/add_card.yml
@@ -0,0 +1,6 @@
flows:
add_card:
description: add a card to your account
name: add a card
steps:
- action: utter_card_added
4 changes: 3 additions & 1 deletion data/flows/patterns.yml
@@ -25,11 +25,13 @@ flows:
steps:
- action: action_trigger_chitchat

# using chitchat here so that intentless is used for better testability
pattern_search:
description: handle knowledge-based requests using enterprise search
steps:
# - action: action_trigger_chitchat
- action: action_trigger_search

pattern_cancel_flow:
description: A meta flow that's started when a flow is cancelled.
steps:
28 changes: 28 additions & 0 deletions data/prompts/PROMPT_README.md
@@ -0,0 +1,28 @@
# Prompts README

This README provides information on how to use the prompts and includes the results of the end-to-end (e2e) tests for different models.

## Usage

```
name: LLMCommandGenerator
prompt: path_to_prompt_file.jinja2
```
This [component](https://rasa.com/docs/rasa-pro/concepts/dialogue-understanding/#using-the-llmcommandgenerator) generates commands using a LLM based on the given prompt file and should be included in the `pipeline` section of the `config.yml` file.

## E2E Test Results

The `e2e_tests` folder contains the test cases across different conversation categories that are used to evaluate the models.

The conversations are modeled using `flows`. The `domain` file contains the definition of bot utterances, slots, and actions that are used in the test cases.

The following are the results of the e2e tests conducted for different models using designated prompts.

| Model | Accuracy | Prompt file |
|---------|----------|-------------|
| gpt-4 | 88.09% | default |
| gpt-4-1106-preview | 71.42% | default |
| gpt-4-0125-preview | 67.86% | default |
| gpt-3.5-turbo | 63.1% | [data/prompts/gpt_3-5_turbo_cmd_gen_prompt.jinja2](gpt_3-5_turbo_cmd_gen_prompt.jinja2) |
| gpt-3.5-turbo-1106 | 52.38% | [data/prompts/gpt_3-5_turbo_1106_cmd_gen_prompt.jinja2](gpt_3-5_turbo_1106_cmd_gen_prompt.jinja2) |
| mistral-medium | 44.05% | [data/prompts/mistral_medium_cmd_gen_prompt.jinja2](mistral_medium_cmd_gen_prompt.jinja2) |
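
The table does not state how accuracy is computed. A minimal sketch, assuming it is the share of e2e test cases that pass, is shown below. The function name `accuracy_percent` is hypothetical, and the function is meant to be fed real counts from a test run.

```python
def accuracy_percent(passed: int, total: int) -> float:
    """Share of passing e2e test cases as a percentage, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return round(100.0 * passed / total, 2)
```

Recording the raw passed and total counts next to each percentage would make the figures easier to compare when the number of test cases changes.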
60 changes: 60 additions & 0 deletions data/prompts/gpt_3-5_turbo_1106_cmd_gen_prompt.jinja2
@@ -0,0 +1,60 @@
Your task is to analyze the current conversation context and generate a list of actions to start new business processes that we call flows, to extract slots, or respond to small talk and knowledge requests.
Believe in your abilities and strive for excellence. Your hard work will yield remarkable results. You can do it!

These are the flows that can be started, with their description and slots:
{% for flow in available_flows %}
{{ flow.name }}: {{ flow.description }}
{% for slot in flow.slots -%}
slot: {{ slot.name }}{% if slot.description %} ({{ slot.description }}){% endif %}{% if slot.allowed_values %}, allowed values: {{ slot.allowed_values }}{% endif %}
{% endfor %}
{%- endfor %}

===
{% if current_flow != None %}
You are currently in the flow "{{ current_flow }}".
You have just asked the user for the slot "{{ current_slot }}"{% if current_slot_description %} ({{ current_slot_description }}){% endif %}.

{% if flow_slots|length > 0 %}
Here are the slots of the currently active flow:
{% for slot in flow_slots -%}
- name: {{ slot.name }}, value: {{ slot.value }}, type: {{ slot.type }}, description: {{ slot.description}}{% if slot.allowed_values %}, allowed values: {{ slot.allowed_values }}{% endif %}
{% endfor %}
{% endif %}
{% else %}
You are currently not in any flow and so there are no active slots.
This means you can only set a slot if you first start a flow that requires that slot.
{% endif %}
If you start a flow, first start the flow and then optionally fill that flow's slots with information the user provided in their message.

===
Based on this information generate a list of actions you want to take. Any logic of what happens afterwards is handled by the flow engine. These are your available actions:
* Slot setting, described by "SetSlot(slot_name, slot_value)". An example would be "SetSlot(recipient, Freddy)". Only set a slot when it is explicitly mentioned by the user, do not set a slot with abstract or unspecific values.
* Starting a flow, described by "StartFlow(flow_name)". An example would be "StartFlow(transfer_money)".
* Canceling/Stopping the current flow, described by "CancelFlow()". Examples of user canceling flow phrases are: "stop that", "cancel this".
* Clarifying which flow should be started. An example would be Clarify(list_contacts, add_contact, remove_contact) if the user just wrote "contacts" and there are multiple potential candidates. It also works with a single flow name to confirm you understood correctly, as in Clarify(transfer_money).
* Intercepting and handle user messages with the intent to bypass the current step in the flow, described by "SkipQuestion()". Examples of user skip phrases are: "Go to the next question", "Ask me something else".
* Responding to knowledge-oriented user messages, that needs further information from a knowledge base, described by "SearchAndReply()".
* Responding to a casual, non-task-oriented user message, described by "ChitChat()". Do not predict "ChitChat()" if the message contains valuable information, such as slots.
* Handing off to a human, in case the user seems frustrated or explicitly asks to speak to one, described by "HumanHandoff()".

===
Do not fill slots with abstract values or placeholders.
You can only fill a slot when a flow is active.
Only use information provided by the user.
If the user asks for two things which seem contradictory, clarify before starting a flow.
If it's not clear whether the user wants to skip the step or to cancel the flow, cancel the flow.
Strictly adhere to the provided action types listed above.
Focus on the last message and take it one step at a time.
Use the previous conversation steps only to aid understanding.
Only predict "ChitChat()" if there is no other action to take.
A flow can be interrupted by another flow.

===
Here is what happened previously in the conversation:
{{ current_conversation }}

The user just said """{{ user_message }}""".

===
Think this through step by step manner, go through the context, surfacing important information that could be useful, and first write an analysis of the last user message. Pay close attention to the descriptions of slots. Do not fill slots with abstract values before the user has mentioned or referenced the values. Do not add any unnecessary actions.
Afterwards, write out the actions you want to take, one per line.
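
The prompt above asks the model to write an analysis first and then its actions, one per line, in forms such as `SetSlot(slot_name, slot_value)` and `StartFlow(flow_name)`. A minimal sketch of turning such output into structured data follows. The regular expression and the name `parse_actions` are illustrative assumptions, not the parser Rasa uses.

```python
import re
from typing import List, Tuple

# Action names are taken from the prompt text; the pattern is an assumption.
ACTION_PATTERN = re.compile(
    r"\b(SetSlot|StartFlow|CancelFlow|Clarify|SkipQuestion"
    r"|SearchAndReply|ChitChat|HumanHandoff)\(([^)]*)\)"
)


def parse_actions(llm_output: str) -> List[Tuple[str, List[str]]]:
    """Extract (action_name, arguments) pairs from free-form model output.

    Lines without a recognised action, such as the analysis, are ignored.
    Simplification: slot values containing commas or parentheses are not handled.
    """
    actions = []
    for line in llm_output.splitlines():
        for name, raw_args in ACTION_PATTERN.findall(line):
            args = [a.strip().strip("\"'") for a in raw_args.split(",") if a.strip()]
            actions.append((name, args))
    return actions
```

Note that this sketch would also pick up an action name that the model merely mentions in its analysis, so a real parser would likely need a stricter convention for where the action list starts.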
59 changes: 59 additions & 0 deletions data/prompts/gpt_3-5_turbo_cmd_gen_prompt.jinja2
@@ -0,0 +1,59 @@
Your task is to analyze the current conversation context and generate a list of actions to start new business processes that we call flows, to extract slots, or respond to small talk and knowledge requests.

These are the flows that can be started, with their description and slots:
{% for flow in available_flows %}
{{ flow.name }}: {{ flow.description }}
{% for slot in flow.slots -%}
slot: {{ slot.name }}{% if slot.description %} ({{ slot.description }}){% endif %}{% if slot.allowed_values %}, allowed values: {{ slot.allowed_values }}{% endif %}
{% endfor %}
{%- endfor %}

===
{% if current_flow != None %}
You are currently in the flow "{{ current_flow }}".
You have just asked the user for the slot "{{ current_slot }}"{% if current_slot_description %} ({{ current_slot_description }}){% endif %}.

{% if flow_slots|length > 0 %}
Here are the slots of the currently active flow:
{% for slot in flow_slots -%}
- name: {{ slot.name }}, value: {{ slot.value }}, type: {{ slot.type }}, description: {{ slot.description}}{% if slot.allowed_values %}, allowed values: {{ slot.allowed_values }}{% endif %}
{% endfor %}
{% endif %}
{% else %}
You are currently not in any flow and so there are no active slots.
This means you can only set a slot if you first start a flow that requires that slot.
{% endif %}
If you start a flow, first start the flow and then optionally fill that flow's slots with information the user provided in their message.

===
Based on this information generate a list of actions you want to take. Any logic of what happens afterwards is handled by the flow engine. These are your available actions:
* Slot setting, described by "SetSlot(slot_name, slot_value)". An example would be "SetSlot(recipient, Freddy)". Only set a slot when it is explicitly mentioned by the user, do not set a slot with abstract or unspecific values.
* Starting a flow, described by "StartFlow(flow_name)". An example would be "StartFlow(transfer_money)".
* Canceling/Stopping the current flow, described by "CancelFlow()". Examples of user canceling flow phrases are: "stop that", "cancel this".
* Clarifying which flow should be started. An example would be Clarify(list_contacts, add_contact, remove_contact) if the user just wrote "contacts" and there are multiple potential candidates. It also works with a single flow name to confirm you understood correctly, as in Clarify(transfer_money).
* Intercepting and handle user messages with the intent to bypass the current step in the flow, described by "SkipQuestion()". Examples of user skip phrases are: "Go to the next question", "Ask me something else".
* Responding to knowledge-oriented user messages, that needs further information from a knowledge base, described by "SearchAndReply()".
* Responding to a casual, non-task-oriented user message, described by "ChitChat()". Do not predict "ChitChat()" if the message contains valuable information, such as slots.
* Handing off to a human, in case the user seems frustrated or explicitly asks to speak to one, described by "HumanHandoff()".

===
Do not fill slots with abstract values or placeholders.
You can only fill a slot when a flow is active.
Only use information provided by the user.
If the user asks for two things which seem contradictory, clarify before starting a flow.
If it's not clear whether the user wants to skip the step or to cancel the flow, cancel the flow.
Strictly adhere to the provided action types listed above.
Focus on the last message and take it one step at a time.
Use the previous conversation steps only to aid understanding.
Only predict "ChitChat()" if there is no other action to take.
A flow can be interrupted by another flow.

===
Here is what happened previously in the conversation:
{{ current_conversation }}

The user just said """{{ user_message }}""".

===
Think this through step by step manner, go through the context, surfacing important information that could be useful, and first write an analysis of the last user message. Pay close attention to the descriptions of slots. Do not fill slots with abstract values before the user has mentioned or referenced the values. Do not add any unnecessary actions.
Afterwards, write out the actions you want to take, one per line.
62 changes: 62 additions & 0 deletions data/prompts/mistral_medium_cmd_gen_prompt.jinja2
@@ -0,0 +1,62 @@
<s>[INST] Your task is to analyze the current conversation context and generate a list of actions to start new business processes that we call flows, to extract slots, or respond to small talk and knowledge requests.
Believe in your abilities and strive for excellence. Your hard work will yield remarkable results. You can do it!

These are the flows that can be started, with their description and slots:
{% for flow in available_flows %}
{{ flow.name }}: {{ flow.description }}
{% for slot in flow.slots -%}
slot: {{ slot.name }}{% if slot.description %} ({{ slot.description }}){% endif %}{% if slot.allowed_values %}, allowed values: {{ slot.allowed_values }}{% endif %}
{% endfor %}
{%- endfor %}

===
{% if current_flow != None %}
You are currently in the flow "{{ current_flow }}".
You have just asked the user for the slot "{{ current_slot }}"{% if current_slot_description %} ({{ current_slot_description }}){% endif %}.

{% if flow_slots|length > 0 %}
Here are the slots of the currently active flow:
{% for slot in flow_slots -%}
- name: {{ slot.name }}, value: {{ slot.value }}, type: {{ slot.type }}, description: {{ slot.description}}{% if slot.allowed_values %}, allowed values: {{ slot.allowed_values }}{% endif %}
{% endfor %}
{% endif %}
{% else %}
You are currently not in any flow and so there are no active slots.
This means you can only set a slot if you first start a flow that requires that slot.
{% endif %}
If you start a flow, first start the flow and then optionally fill that flow's slots with information the user provided in their message.

===
Based on this information generate a list of actions you want to take. Any logic of what happens afterwards is handled by the flow engine. These are your available actions:
* Slot setting, described by "SetSlot(slot_name, slot_value)". An example would be "SetSlot(recipient, Freddy)". Only set a slot when it is explicitly mentioned by the user, do not set a slot with abstract or unspecific values.
* Starting a flow, described by "StartFlow(flow_name)". An example would be "StartFlow(transfer_money)".
* Canceling/Stopping the current flow, described by "CancelFlow()". Examples of user canceling flow phrases are: "stop that", "cancel this".
* Clarifying which flow should be started. An example would be Clarify(list_contacts, add_contact, remove_contact) if the user just wrote "contacts" and there are multiple potential candidates. It also works with a single flow name to confirm you understood correctly, as in Clarify(transfer_money).
* Intercepting and handle user messages with the intent to bypass the current step in the flow, described by "SkipQuestion()". Examples of user skip phrases are: "Go to the next question", "Ask me something else".
* Responding to knowledge-oriented user messages, that needs further information from a knowledge base, described by "SearchAndReply()".
* Responding to a casual, non-task-oriented user message, described by "ChitChat()". Do not predict "ChitChat()" if the message contains valuable information, such as slots.
* Handing off to a human, in case the user seems frustrated or explicitly asks to speak to one, described by "HumanHandoff()".

===
Do not fill slots with abstract values or placeholders.
You can only fill a slot when a flow is active.
Only use information provided by the user.
If the user asks for two things which seem contradictory, clarify before starting a flow.
If it's not clear whether the user wants to skip the step or to cancel the flow, cancel the flow.
Strictly adhere to the provided action types listed above.
Focus on the last message and take it one step at a time.
Use the previous conversation steps only to aid understanding.
Only predict "ChitChat()" if there is no other action to take.
A flow can be interrupted by another flow.
[/INST]
===
Here is what happened previously in the conversation:
{{ current_conversation }}

The user just said """{{ user_message }}""".
</s>
===
[INST]
Think this through step by step manner, go through the context, surfacing important information that could be useful, and first write an analysis of the last user message. Pay close attention to the descriptions of slots. Do not fill slots with abstract values before the user has mentioned or referenced the values. Do not add any unnecessary actions.
Afterwards, write out the actions you want to take, one per line.
[/INST]
5 changes: 5 additions & 0 deletions domain/add_card.yml
@@ -0,0 +1,5 @@
version: "3.1"

responses:
utter_card_added:
- text: "Okay, added another card."
7 changes: 3 additions & 4 deletions domain/patterns.yml
@@ -14,12 +14,11 @@ responses:
title: Yes
- payload: no
title: No, please keep the previous information
metadata:
metadata:
rephrase: True
template: jinja

utter_not_corrected_previous_input:
- text: "Ok, I did not correct the previous input."
metadata:
metadata:
rephrase: True

@@ -17,7 +17,7 @@ test_cases:
- set_slot:
- transfer_money_amount_of_money: "100"
- utter: utter_ask_transfer_money_final_confirmation
-- user: Yes
+- user: "Yes"
- commands:
- set_slot:
- transfer_money_final_confirmation: "True"
12 changes: 12 additions & 0 deletions e2e_tests/flaky/happy_path/user_sets_up_recurrent_payment.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
test_cases:
- test_case: user wants to set up a new recurrent payment, but specifies the type incompletely, example 3
steps:
- user: I want to set up a new recurrent payment
- commands:
- start_flow: setup_recurrent_payment
- utter: utter_ask_recurrent_payment_type
- user: stand order
- commands:
- set_slot:
- recurrent_payment_type: "standing order"
- utter: utter_ask_recipient