This repository is a small proof of concept that demonstrates what voice-to-image generation would require for our application, Plus One. Plus One explores computer-mediated conversation and acts as a form of enactive assistance in the context of a design charrette.
The proof of concept also demonstrates the challenges, both technical and in terms of user experience (UX), of applying current paradigms in diffusion-based image generation and speech recognition.
To return more helpful or relevant imagery, it would be necessary to parse and interpret the sentiment, context, and structure of the real-time output of a voice-to-text model.
Our takeaway is that, given the current state of voice-to-text technology, the content of a conversation would need to be analyzed asynchronously in order to approximate a description of an image that represents a summary of that conversation.
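As a rough illustration of the asynchronous approach described above, the sketch below buffers transcript segments and only releases text for further analysis once enough finalized words have accumulated. It is not the code used in this repository: the `TranscriptBuffer` class, its `minWords` threshold, and the `isFinal` flag are hypothetical names chosen for this example, and the actual summarization step (for instance, a call to a language model) is left out.

```typescript
// Hypothetical buffer (not part of this repository): collects finalized
// transcript segments and releases them as one block of text once enough
// words have accumulated.
class TranscriptBuffer {
  private segments: string[] = [];
  private minWords: number;

  constructor(minWords: number) {
    this.minWords = minWords;
  }

  // Partial (non-final) segments are ignored because they may still change.
  add(text: string, isFinal: boolean): void {
    const trimmed = text.trim();
    if (isFinal && trimmed.length > 0) {
      this.segments.push(trimmed);
    }
  }

  // Counts whitespace-separated words across all buffered segments.
  private wordCount(): number {
    return this.segments
      .join(" ")
      .split(/\s+/)
      .filter((word) => word.length > 0).length;
  }

  // Returns the buffered text and clears the buffer,
  // or null if fewer than minWords words are available.
  flush(): string | null {
    if (this.wordCount() < this.minWords) {
      return null;
    }
    const text = this.segments.join(" ");
    this.segments = [];
    return text;
  }
}

// Example usage with made-up transcript segments.
const buffer = new TranscriptBuffer(5);
buffer.add("a red brick", false); // partial result, ignored
buffer.add("a red brick courtyard", true);
console.log(buffer.flush()); // null: only 4 finalized words so far
buffer.add("with tall trees", true);
console.log(buffer.flush()); // "a red brick courtyard with tall trees"
```

Decoupling the buffer from the transcription stream means image prompts can be derived from longer, more stable stretches of conversation instead of from every partial result.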
Amelia Gan, Quoc Dang and Blaine Western.
In the project directory, you can run:
This will run the server and the app in development mode.
Open http://localhost:3000 to view it in the browser.
You will need an AssemblyAI access key and an OpenAI access key, stored in a .env file at the root of the project as ASSEMBLYAI_API_KEY=[your key] and OPENAI_API_KEY=[your key].
The page will reload if you make edits.
You will also see any lint errors in the console.
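If the server fails to start or returns authentication errors, a missing or blank key in the .env file is a likely cause. The sketch below shows one way to check for the two keys named above. The `missingKeys` helper is hypothetical and not part of this repository, and it assumes the .env values have already been loaded into an object such as `process.env`; how the project loads the file is not shown here.

```typescript
// The two keys this README asks for in the .env file.
const REQUIRED_KEYS = ["ASSEMBLYAI_API_KEY", "OPENAI_API_KEY"] as const;

// Hypothetical helper: returns the required keys that are absent or blank.
function missingKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

// Example usage with a plain object standing in for process.env.
const missing = missingKeys({ ASSEMBLYAI_API_KEY: "abc", OPENAI_API_KEY: "" });
if (missing.length > 0) {
  console.log("Missing keys in .env: " + missing.join(", "));
}
```

Running a check like this at server startup surfaces a configuration problem before the first request to either service is made.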
