Operator-to-robot Text-to-Speech #64

Merged
merged 20 commits into from
Jul 15, 2024

hello-amal
Collaborator

@hello-amal hello-amal commented Jul 5, 2024

Description

This PR adds operator-to-robot text-to-speech capabilities. Specifically, it adds:

  1. Backend:
    1. A TextToSpeechEngine abstract class that can be used to support multiple engines in a plug-and-play fashion. gTTS and pyttsx3 are implemented.
      1. This abstract class allows multiple voices, two speeds (slow and default), and interrupting an ongoing utterance.
    2. A ROS2 node (and corresponding custom message) that takes in text and additional metadata (voice, speed, whether to interrupt) from a topic and executes it using the specified engine (currently gTTS).
  2. Frontend:
    1. A new basic component, DropdownInput, that behaves like Dropdown but has a textarea to the left of the dropdown arrow.
    2. A web app component, on the same level as "Movement Recorder," that allows users to type arbitrary text, save/delete it, play it on the robot, and stop a robot's utterance.
    3. The data flows through WebRTC and ROSLibJS to enable the above to work.
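A minimal sketch of what a plug-and-play engine interface along these lines could look like. Only the names TextToSpeechEngine and TextToSpeechEngineType come from this PR; the method names, the enum values, and the LoggingEngine stand-in are assumptions made for illustration, not the code in this branch.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import List


class TextToSpeechEngineType(Enum):
    """Engines named in this PR; the numeric values are placeholders."""

    PYTTSX3 = 1
    GTTS = 2


class TextToSpeechEngine(ABC):
    """Hypothetical engine interface: voices, two speeds, and interruption."""

    def __init__(self, voice_ids: List[str], can_interrupt: bool) -> None:
        self.voice_ids = voice_ids
        self.can_interrupt = can_interrupt
        self.voice_id = voice_ids[0]
        self.is_slow = False  # Two speeds only: slow and default.

    def set_voice(self, voice_id: str) -> None:
        # Reject voices that this engine does not offer.
        if voice_id not in self.voice_ids:
            raise ValueError(f"Unsupported voice: {voice_id}")
        self.voice_id = voice_id

    @abstractmethod
    def say(self, text: str) -> None:
        """Speak the text with the current voice and speed."""

    @abstractmethod
    def stop(self) -> bool:
        """Interrupt the ongoing utterance; return False if unsupported."""


class LoggingEngine(TextToSpeechEngine):
    """Stand-in engine that records utterances instead of producing audio."""

    def __init__(self) -> None:
        super().__init__(voice_ids=["voice_a", "voice_b"], can_interrupt=False)
        self.log: List[str] = []

    def say(self, text: str) -> None:
        speed = "slow" if self.is_slow else "default"
        self.log.append(f"[{self.voice_id}, {speed}] {text}")

    def stop(self) -> bool:
        # This stand-in cannot interrupt, so it reports that to the caller.
        return self.can_interrupt


engine = LoggingEngine()
engine.set_voice("voice_b")
engine.is_slow = True
engine.say("Hello, my name is Stretch")
print(engine.log)
```

Having stop() report whether interruption is supported is one way to let the caller handle engines that cannot interrupt an utterance.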

Select design decisions

  • On the web app side, if the user clicks "Play" while an utterance is playing, the second utterance is queued. This is for two reasons: (a) if the operator wants to interrupt the first utterance, they can click "Stop" followed by "Play"; and (b) if the operator wants the robot to speak a long utterance, they can enter it one sentence at a time, avoiding a long pause while they type.
  • On the web app, when the user clicks on the text area, all of its text is highlighted. This makes it easier to delete text when typing one sentence at a time and speed is important (e.g., live conversation).
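The "Play" and "Stop" semantics described above can be sketched as a small queue. This is an illustration under assumptions: the class and method names are hypothetical, and the sketch does not say where the queue lives in the actual implementation.

```python
from collections import deque
from typing import Deque, List, Optional


class UtteranceQueue:
    """Hypothetical model: Play enqueues, Stop clears everything."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()
        self.current: Optional[str] = None  # Utterance being spoken, if any.
        self.spoken: List[str] = []  # Utterances that finished playing.

    def play(self, text: str) -> None:
        # Never interrupt: append, and start only if nothing is playing.
        self._pending.append(text)
        if self.current is None:
            self._start_next()

    def stop(self) -> None:
        # Interrupt the ongoing utterance and drop everything queued behind it.
        self._pending.clear()
        self.current = None

    def on_utterance_finished(self) -> None:
        # Called when the engine finishes speaking the current utterance.
        if self.current is not None:
            self.spoken.append(self.current)
        self.current = None
        self._start_next()

    def _start_next(self) -> None:
        if self._pending:
            self.current = self._pending.popleft()


queue = UtteranceQueue()
queue.play("First sentence.")
queue.play("Second sentence.")  # Queued; does not interrupt the first.
queue.on_utterance_finished()   # First finishes, second starts.
queue.play("Third sentence.")
queue.stop()                    # Second is interrupted, third never plays.
print(queue.spoken, queue.current)
```

Clearing the pending utterances on "Stop" matches the web app test in this PR where pressing "Stop" after queuing a second utterance prevents it from playing.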

Testing procedure

  • Pull this PR, install new requirements (pip3 install -r requirements.txt), rebuild the workspace, and launch the web app (./launch_interface.sh).
  • Test the ROS nodes:
    • gTTS:
      • Test a single slow utterance in an American accent: ros2 topic pub /text_to_speech stretch_web_teleop/msg/TextToSpeech "{text: 'Hello, my name is Stretch, and I am a robot here to assist you', voice: 'com', is_slow: true}" --times 1
      • Test an interrupt with a fast utterance in a British accent: Run the above publication, and then immediately run ros2 topic pub /text_to_speech stretch_web_teleop/msg/TextToSpeech "{text: 'Shut up, you are talking too slowly', voice: 'co.uk', override_behavior: 1}" --times 1. Verify that the former is interrupted, the latter plays, it is faster than the former, and it has a British accent.
      • Test queuing: Run the first command, then run ros2 topic pub /text_to_speech stretch_web_teleop/msg/TextToSpeech "{text: 'I was waiting for you to finish before speaking', voice: 'co.in', is_slow: false}" --times 1. Verify that the latter does not interrupt the former.
      • Test a long queue: Run the first and third commands above 3 times each (6 total commands). Verify that they all execute in order, and the voice and speed change as expected.
    • Switch to pyttsx3 by changing this line to engine_type=TextToSpeechEngineType.PYTTSX3.
      • Test a single slow utterance in a male voice: ros2 topic pub /text_to_speech stretch_web_teleop/msg/TextToSpeech "{text: 'Hello, my name is Stretch, and I am a robot here to assist you', voice: 'english+m1', is_slow: true}" --times 1
      • Test an interrupt with a fast utterance in a female voice: Run the above publication, and then immediately run ros2 topic pub /text_to_speech stretch_web_teleop/msg/TextToSpeech "{text: 'Shut up, you are talking too slowly', voice: 'english+f1', override_behavior: 1}" --times 1. Verify that the former is not interrupted (because interrupting is not supported in pyttsx3), and that the latter is faster than the former and has a female voice.
      • Test queuing: Run the first command, then run ros2 topic pub /text_to_speech stretch_web_teleop/msg/TextToSpeech "{text: 'I was waiting for you to finish before speaking', voice: 'english+f4', is_slow: false}" --times 1. Verify that the latter does not interrupt the former.
      • Test a long queue: Run the first and third commands above 3 times each (6 total commands). Verify that they all execute in order, and the voice and speed change as expected.
  • Test the web app:
    • Verify that initially, text-to-speech does not display.
    • Go into "Customize" and click "On" to display text-to-speech.
    • Enter an utterance into the input and click "Play". Verify it plays.
    • Click back on the text area. Verify that when you click on it, initially all the text highlights.
    • Enter an utterance into the input, click "Play," but as soon as it starts playing click "Stop." Verify it stops.
    • Save the utterance you entered. Verify that the "Save" button has turned to "Delete." Click the dropdown and verify the utterance appears.
    • Type a new utterance and play it. Immediately, load the saved utterance and play that. Verify that the saved utterance is spoken after the first utterance, but does not interrupt it.
    • Repeat the above, but press "Stop" after clicking "Play" for the second utterance. Verify that the first utterance stops and the second utterance doesn't play.
    • Load the saved utterance. Type a non-whitespace character after the utterance. Verify that the "Delete" button goes back to "Save." Delete the character. Verify it goes back to "Delete."
    • Click "Delete." Verify that the utterance no longer exists in the dropdown.
    • Save a multi-line utterance (~5 lines). Verify that in the dropdown, it shows all of the text.
    • Save enough multi-line utterances such that the dropdown popup goes beyond the bottom of the screen. Verify that a scrollbar appears and you can scroll through all utterances.
    • Resize the screen while the dropdown popup is open. Verify the scrolling still allows you to go from the top to bottom of the entire popup.
    • Unmute the robot-to-operator speech and then play an utterance. Ensure it is audible with low latency.
    • Load movement recorder while text-to-speech is open. Verify that both appear. Resize the screen. Verify both stay there and adjust their position responsively.
    • Save a couple of poses in movement recorder. Open the dropdown. Verify it still renders correctly.
  • On a "new" Stretch (e.g., without this PR), install the dependencies and verify that everything works properly.
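For repeating the ROS node tests above with different text or voices, a small helper can assemble the ros2 topic pub command. This is a convenience sketch, not part of the PR: the function name is hypothetical, and it assumes the message fields are exactly those used in the commands above (text, voice, is_slow, override_behavior). It writes the payload as JSON on the assumption that the ROS 2 CLI's YAML parser accepts it for simple values like these.

```python
import json
import shlex
from typing import Optional

TOPIC = "/text_to_speech"
MSG_TYPE = "stretch_web_teleop/msg/TextToSpeech"


def build_tts_command(
    text: str,
    voice: str,
    is_slow: bool = False,
    override_behavior: Optional[int] = None,
) -> str:
    """Return a shell command that publishes one TextToSpeech message."""
    payload = {"text": text, "voice": voice, "is_slow": is_slow}
    if override_behavior is not None:
        # 1 is the value used in the interrupt tests above.
        payload["override_behavior"] = override_behavior
    # shlex.quote keeps apostrophes and spaces in the text safe for the shell.
    return f"ros2 topic pub {TOPIC} {MSG_TYPE} {shlex.quote(json.dumps(payload))} --times 1"


print(build_tts_command("Hello, my name is Stretch", voice="com", is_slow=True))
print(build_tts_command("I'm interrupting", voice="co.uk", override_behavior=1))
```

Serializing with json.dumps rather than hand-writing the mapping avoids quoting problems when the utterance itself contains an apostrophe.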

Before opening a pull request

From the top-level of this repository, run:

  • pre-commit run --all-files

To merge

  • Squash & Merge

@hello-amal hello-amal marked this pull request as draft July 5, 2024 00:27
@hello-amal hello-amal marked this pull request as ready for review July 10, 2024 20:54
@hello-amal hello-amal changed the title from "[WIP] Operator-to-robot Text-to-Speech" to "Operator-to-robot Text-to-Speech" Jul 10, 2024
@hello-amal hello-amal requested a review from hello-vinitha July 10, 2024 23:50
@hello-amal
Collaborator Author

Ran all tests on 3030 except the test that ensures the requirements are complete. @hello-vinitha, can you run that on your robot, since it is a "clean" install (i.e., it shouldn't have any of these audio libraries)?

@hello-vinitha
Collaborator

@hello-amal All the tests pass on 2051. A couple of questions/suggestions:

  • Is there any clear benefit to having pyttsx3, or a scenario where a user would want to use pyttsx3 over gTTS? If yes, then I recommend adding this as an optional flag to the launch file and launch script. If not, I would recommend removing pyttsx3.
  • When an utterance is playing, I recommend changing the text from "Play" to "Add to Queue" so that the operator knows that they can queue utterances.
  • I recommend hiding the "Stop" button unless something is playing so we can reduce the number of buttons when possible. Also, it will draw the operator's attention when it does appear.
  • Make the delete icon in movement recorder the lighter red and the stop button in TTS the deeper red to be consistent with the color scheme throughout the rest of the interface.

@hello-amal
Collaborator Author

  1. Well, gTTS uses Google's unofficial Google Translate API, which they may stop supporting at any time. So I think it is important to have pyttsx3, even if its voices are not good. I'll add a launchfile flag for that.
  2. Going back to our earlier discussion, changing "Play" to "Add to Queue" and only showing "Stop" when an utterance is playing would require the text-to-speech node to provide feedback to the app, which requires changing it to an action and is a pretty involved change on both the web app and ROS node sides. I have created an issue for this to be done as a separate PR (TTS: Adaptivity based on whether the robot is speaking #73).
  3. Will make the color change.

@hello-vinitha
Collaborator

Ah yes, I completely forgot that we had discussed (2). That sounds good, we can revisit that.

@hello-amal
Collaborator Author

Addressed the changes. Here is a screenshot of the updated color scheme.
Screenshot 2024-07-15 at 1 50 15 PM

@hello-amal hello-amal merged commit c53a7c0 into master Jul 15, 2024
1 check passed
@hello-amal hello-amal deleted the amaln/operator_to_robot_tts branch July 15, 2024 21:05