[enhancement] sync ebooks and audiobooks via processing audiobook to text (Pie in the sky idea) #189
Comments
So basically a self-made version of Amazon's Whispersync feature.
While I agree that this would be an incredible feature, it is definitely a very long-term goal, and would require an incredible amount of work.
This project also seems relevant. I haven't tried it out yet but I've been meaning to. I'll report back on what I find if I do end up trying it out in the next couple of months. A huge issue with this feature is going to be incorporating support for a reading experience of some kind. For that we could probably look at porting Epub3 Media Overlay functionality out from minstrel but all of that code is pretty dated and therefore likely not in the best of shape, and it also locks you into requiring users to create an EPUB3 file with a media overlay instead of any other possible format we might choose. I've definitely looked at implementing something like this in the past and then didn't keep up on it because I didn't have anywhere near enough free time to dedicate to something of this scale. I agree though, this would be an absolutely incredible feature. |
andrewls, wow, that makes this seem a lot more possible than the pie in the sky idea I thought it was.
+1, this would be the killer feature
Would love to see this as well!
Just found out about audiobookshelf googling for "Whispersync for Voice open source alternatives". Would be so cool to make this happen somehow.
Came across this on Hacker News this morning, wonder if it's something that could be integrated, or use the epubs that it creates? From their docs: It's a self-hosted platform for taking an audiobook (either as an m4b/mp4 file, or as a zip of mp3 files) and an ebook (as an epub file) and producing a new epub file with synced narration support. This follows the media overlay spec for epubs.
I've been experimenting locally with using whisper.cpp to make transcripts of my audiobooks. The reason transcripts rather than just an epub version is that it includes timestamps, which can be easily used to:
I suspect it wouldn't be terribly hard to build a "whispersync" type of thing on top of this (once it exists of course). If somebody wants to implement this sooner than I have availability, I'm happy to yield it. Let me know and I'll try to knowledge dump what I have. Also happy to brainstorm the idea. I'm @FreedomBen in the Matrix chat |
This is actually how Media Overlays work, as well (I'm the author of Storyteller, the project that @sphars linked to). A Media Overlay is just an XML file that maps XHTML elements to segments of audio files. The Storyteller reader apps can (and do!), for example, highlight the current sentence while it's being read: And they could also allow you to find the written text based on the timestamp (that's essentially the premise that the Storyteller reader apps are predicated on)! For any given timestamp, you can always find the location in the EPUB text that corresponds to it. |
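To make the timestamp-to-text lookup described above concrete, here is a minimal sketch in Python. It assumes a Media Overlay (SMIL) document in which each `par` pairs a `text` element with an `audio` element carrying `clipBegin`/`clipEnd`. The sample document, file names, and helper names (`parse_clock`, `load_clips`, `find_text`) are made up for illustration; this is not Storyteller's code.

```python
import bisect
import xml.etree.ElementTree as ET

SMIL_NS = {"smil": "http://www.w3.org/ns/SMIL"}

# Made-up sample overlay; file names and ids are illustrative only.
SAMPLE_SMIL = """
<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">
  <body>
    <par id="p1">
      <text src="chapter1.xhtml#s1"/>
      <audio src="audio/ch1.mp3" clipBegin="0:00:00.000" clipEnd="0:00:03.250"/>
    </par>
    <par id="p2">
      <text src="chapter1.xhtml#s2"/>
      <audio src="audio/ch1.mp3" clipBegin="3.25s" clipEnd="7.5s"/>
    </par>
  </body>
</smil>
"""


def parse_clock(value):
    """Convert a SMIL clock value ("h:mm:ss.fff", "mm:ss", "3.25s", "500ms") to seconds."""
    value = value.strip()
    if ":" in value:
        seconds = 0.0
        for part in value.split(":"):
            seconds = seconds * 60 + float(part)
        return seconds
    # "ms" must be checked before "s" because both end in "s".
    for suffix, scale in (("ms", 0.001), ("min", 60.0), ("h", 3600.0), ("s", 1.0)):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * scale
    return float(value)


def load_clips(smil_text):
    """Return (begin, end, audio_src, text_src) tuples sorted by begin time."""
    root = ET.fromstring(smil_text.strip())
    clips = []
    for par in root.iterfind(".//smil:par", SMIL_NS):
        text = par.find("smil:text", SMIL_NS)
        audio = par.find("smil:audio", SMIL_NS)
        if text is None or audio is None or audio.get("clipEnd") is None:
            continue  # skip entries this sketch cannot place on a timeline
        clips.append((
            parse_clock(audio.get("clipBegin", "0")),
            parse_clock(audio.get("clipEnd")),
            audio.get("src"),
            text.get("src"),
        ))
    return sorted(clips)


def find_text(clips, audio_src, seconds):
    """Find the text fragment being narrated at `seconds` within one audio file."""
    same_file = [c for c in clips if c[2] == audio_src]
    begins = [c[0] for c in same_file]
    i = bisect.bisect_right(begins, seconds) - 1
    if i >= 0 and seconds < same_file[i][1]:
        return same_file[i][3]
    return None


clips = load_clips(SAMPLE_SMIL)
print(find_text(clips, "audio/ch1.mp3", 4.0))
```

The returned value is the `src` of the `text` element (a file plus fragment id), which a reader could scroll to or highlight. Lookup is scoped to one audio file because clip times are relative to the file they reference.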
Is it also possible to fine-tune the highlighting even more? I think Amazon Whispersync highlights word by word, and I am so used to that by now that I wondered whether it would be possible to do that as well with Storyteller.
It's possible! Storyteller has word-level timestamps available, but its reliance on fuzzy search for alignment (to account for inaccuracies in the transcription) might make word-level highlights challenging to get right. If it's a feature you're interested in, feel free to make an Issue on the Storyteller project! It's on GitLab (gitlab.com/smoores/storyteller), but there's a mirror on GitHub if you don't have a GitLab account; I'll copy any Issues created there over to GitLab. |
I'm thinking through how Storyteller and Audiobookshelf could be fairly tightly integrated to create "whispersync as a service" and combine the library management of ABS, and the media overlay setup of ST. Essentially the flow would look like:
An extension would be to handle conversion of non epubs to epub transparently as well for convenience. Better yet, on top of all this, with a little bit of fuzzy matching the entire library could be ported into ST directly and auto-pair all the audio and ebooks so no manual pairing is necessary. |
That flow sounds excellent to me! I think it would definitely make sense to be able to create a book entity in Storyteller from existing files, in addition to the current upload flow. An automated matching system sounds a little fraught, but I'm open to exploring it; the manual matching system you have laid out here sounds great as a start. |
I was playing around with Storyteller, and it looks amazing for this! Media overlays don't look super easy to access with epub.js, although there's a pull request for that, but something like this snippet, inserted here, can extract the timestamp to CFI mappings from the epubs output from Storyteller
Since the current epub reader needs the whole epub to be sent to the client, it might be a good idea to use either the original epub since the marked up epub includes embedded audio files, or strip the audio files from Storyteller output. If using the existing audio files instead of embedding them, another consideration is that the timestamps generated by Storyteller are relative to the audiobook chapters instead of the whole audio. If going down that path, I'm not sure if it would make more sense to modify Storyteller to include some metadata to map the chapter offsets back to the original file, or have audiobookshelf do some post processing after running Storyteller. |
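For the post-processing option, the offset mapping itself is small. The sketch below assumes the full audiobook is simply the chapter files played back to back in order, and that each chapter's duration is known; both are assumptions, and the function names are hypothetical rather than anything in Storyteller or Audiobookshelf.

```python
import bisect
from itertools import accumulate


def chapter_offsets(durations):
    """Start time (in seconds) of each chapter within the concatenated audiobook."""
    return [0.0] + list(accumulate(durations))[:-1]


def to_absolute(offsets, chapter_index, chapter_seconds):
    """Map a chapter-relative timestamp to a whole-book timestamp."""
    return offsets[chapter_index] + chapter_seconds


def to_chapter(offsets, absolute_seconds):
    """Map a whole-book timestamp back to (chapter index, chapter-relative seconds)."""
    index = max(bisect.bisect_right(offsets, absolute_seconds) - 1, 0)
    return index, absolute_seconds - offsets[index]


# Made-up chapter lengths in seconds.
durations = [600.0, 900.5, 750.0]
offsets = chapter_offsets(durations)
print(to_absolute(offsets, 2, 10.0))
print(to_chapter(offsets, 650.0))
```

If the chapter boundaries used during alignment do not line up with the original files (for example, when one m4b was split by chapter markers), the durations would have to come from those markers instead.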
With the latest iOS 17.4 update, Apple introduced a new transcript feature which is useful and quite intuitive. I know it's not exactly what this issue is about, but there might be interesting ideas, especially in terms of UX.
Have you experimented with live transcription using Whisper? As in, using Whisper to transcribe what is currently being played and "buffering" 30 seconds ahead or so. Even using CPU alone, it sounds like faster-whisper can easily outpace an audiobook playing at original speed (1x). It would essentially be Immersive Reading (and would localize to the individual word as well, rather than just the whole sentence). And I suppose this transcription could be cached for future use and fed into the fuzzy search to attempt to sync with an ebook as well. Basically an on-demand, live transcription version of Storyteller, cutting out the need for pre-processing.
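A rough sketch of the "buffer 30 seconds ahead" bookkeeping, independent of any particular speech-to-text engine. The `transcribe` callable is a hypothetical stand-in for whatever runs the model over a time range (it is not a faster-whisper API), and the chunk size and class name are assumptions chosen for illustration.

```python
import math


class LookaheadTranscriber:
    """Keeps transcripts cached for the audio between the playhead and a lookahead horizon."""

    def __init__(self, transcribe, duration, chunk=10.0, lookahead=30.0):
        self.transcribe = transcribe  # hypothetical: (start, end) -> list of segments
        self.duration = duration      # total audio length in seconds
        self.chunk = chunk            # seconds of audio transcribed per call
        self.lookahead = lookahead    # how far ahead of the playhead to stay
        self.cache = {}               # chunk index -> segments, reusable later

    def advance(self, position):
        """Transcribe any uncached chunk in [position, position + lookahead].

        Returns the indexes of the chunks transcribed by this call.
        """
        first = int(position // self.chunk)
        horizon = min(position + self.lookahead, self.duration)
        last = math.ceil(horizon / self.chunk)  # exclusive upper bound
        fresh = []
        for index in range(first, last):
            if index not in self.cache:
                start = index * self.chunk
                end = min(start + self.chunk, self.duration)
                self.cache[index] = self.transcribe(start, end)
                fresh.append(index)
        return fresh


def fake_transcribe(start, end):
    # Placeholder engine: returns one labelled segment per chunk.
    return [(start, end, f"text for {start:.0f}-{end:.0f}s")]


player = LookaheadTranscriber(fake_transcribe, duration=65.0)
first_call = player.advance(0.0)
second_call = player.advance(12.0)
```

Calling `advance` on every playback tick only triggers work when the horizon crosses into an uncached chunk, and the cache is what would later be fed into fuzzy matching against an ebook.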
This idea would be amazing, and outsourcing the sync to a dedicated tool like Storyteller is a great idea. If you want to go down the route of an internal service, however (I've already mentioned this on Storyteller's project), I think https://github.com/echogarden-project/echogarden is an amazing backend for speech-to-transcript alignment that works with many more languages than English. I did some tests on Swedish and the results were very conclusive; based on their docs, it can go down to word-level alignment with great accuracy. Audiobook/epub alignment is always better than TTS, as narrators often make a great effort to change their tone of voice for each character and do a good job of expressing the characters' feelings. Maybe one day Whisper will reach this stage, but we're not there yet. Lastly, good luck on the player part. It's a nightmare to find a good epub reader with media overlay support, at least on Android. Some don't work with specific file formats (like Ogg Vorbis), and some add a weird delay in playback, making you think the alignment is off when it is in fact perfect when checked on other platforms like Windows.
I have written a local system which transcribes an audiobook to text, converts an epub to text, and then performs matching on the two pieces of text to match timestamps in the audiobook to a "percentage" in the epub text. I do not have an understanding of an accurate way to reference a location in the epub, which is restricting my ability to do anything better than this. On my server - a pretty low powered NUC - it will perform the matching at approx 15x the speed of the audiobook, meaning a 15 hour audiobook would take around an hour to process. I haven't spent any time trying to optimise this, it's just a first pass. I see this being an on demand tool that a user could perform on an item, much like the "Embed Metadata" tool which exists for audiobooks. |
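The matching step described above could look roughly like the sketch below. It swaps in Python's standard `difflib.SequenceMatcher` as the matcher (the commenter's actual method isn't stated), assumes the transcript is a list of `(word, start_seconds)` pairs, and expresses the ebook position as the fraction of words read. The sample text and all function names are illustrative.

```python
import bisect
import difflib
import re


def normalize(word):
    """Lowercase and drop punctuation so transcript and book words compare cleanly."""
    return re.sub(r"[^\w']", "", word.lower())


def align(transcript, book_text):
    """Return (seconds, fraction_of_book) anchors for words found in both sources."""
    book_words = [w for w in (normalize(x) for x in book_text.split()) if w]
    heard = [normalize(word) for word, _ in transcript]
    matcher = difflib.SequenceMatcher(a=heard, b=book_words, autojunk=False)
    anchors = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            seconds = transcript[block.a + k][1]
            anchors.append((seconds, (block.b + k) / len(book_words)))
    return anchors


def position_at(anchors, seconds):
    """Fraction of the book reached at `seconds`, using the latest anchor at or before it."""
    times = [t for t, _ in anchors]
    i = bisect.bisect_right(times, seconds) - 1
    return anchors[i][1] if i >= 0 else None


# Made-up data: the transcript mishears "mind" as "mine".
book_text = "Call me Ishmael. Some years ago, never mind how long precisely"
transcript = [
    ("call", 0.0), ("me", 0.4), ("ishmael", 0.7), ("some", 1.5),
    ("years", 1.8), ("ago", 2.2), ("never", 2.9), ("mine", 3.2),
    ("how", 3.5), ("long", 3.8), ("precisely", 4.1),
]
anchors = align(transcript, book_text)
print(position_at(anchors, 3.3))
```

Misheard words simply produce no anchor, so the position falls back to the last word that did match. For a more precise location than a percentage, the same word index could be mapped back to the file and element it came from while extracting the epub text; `SequenceMatcher` is also quadratic in the worst case, so a full book would likely need to be aligned in windows.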
Chiming in here because this flow is exactly what I'm looking for. What are the steps needed to make this work?
Snipd just released audiobook transcriptions. Would love to see this in ABS.
+1
Once ebooks are a lot more mature, it would be awesome to be able to identify when an ebook and an audiobook are the same book and automagically run speech-to-text on the audiobook so that the audiobook and the ebook can be kept in sync.