Fix Via's default YouTube transcript selection #1110

seanh · 2023-07-20T15:39:32Z

Problem

Via uses a third-party library (https://github.com/jdepoix/youtube-transcript-api) both to select which transcript to use for a given YouTube video and to download the selected transcript. The library does a poor job of transcript selection when the video has multiple English transcripts. Its code incorrectly assumes that a YouTube video can have only one transcript for a given language code (e.g. en), but in fact videos can have more than one transcript with the same language code. See jdepoix/youtube-transcript-api#150.
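To make the failure mode concrete, here is a minimal sketch of how an index keyed only by language code loses transcripts. It is an illustration, not the library's actual code: the track names and the (language_code, name) tuple shape are hypothetical.

```python
from collections import defaultdict

# Hypothetical (language_code, name) pairs for one video.
tracks = [
    ("en", "Track A"),
    ("en", "Track B"),
    ("fr", "Track C"),
]

# Keying by language code alone keeps only the last track seen per code,
# so "Track A" is silently dropped.
by_code = {code: name for code, name in tracks}

# Grouping by language code keeps every track, so a later step can
# deliberately choose between the tracks that share a code.
grouped = defaultdict(list)
for code, name in tracks:
    grouped[code].append(name)
```

With a one-to-one mapping, which transcript survives depends on iteration order rather than on any deliberate preference.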

Because of this faulty assumption, which transcript the library ends up picking is effectively accidental. For example, this video (https://www.youtube.com/watch?v=rSTqpRYzJbo) has four English transcripts:

English
English - CC1
English - DTVCC1
English (United States)

The best transcript to choose by default is clearly either English or English (United States). YouTube's own UI chooses English (United States) by default. But Via currently chooses English - DTVCC1 for this video.

We'd like to replace this transcript selection algorithm with one that we can design ourselves, for example so that we can make it prefer the English (United States) transcript over English - CC1 and English - DTVCC1.
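One possible shape for such an algorithm is sketched below. It is only an illustration under stated assumptions, not the implementation that was eventually written: the `CaptionTrack` class, the `select_default` function, the language codes assigned to each track, and the heuristic of treating a " - " suffix in the name as a caption-channel variant are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CaptionTrack:
    language_code: str  # assumed values such as "en" or "en-US"
    name: str


def select_default(tracks: Sequence[CaptionTrack]) -> Optional[CaptionTrack]:
    """Pick a default English track, or None if there is no English track."""
    english = [t for t in tracks if t.language_code.split("-")[0] == "en"]
    if not english:
        return None
    # Assumed heuristic: names like "English - CC1" are caption-channel
    # variants and rank below plain names. min() returns the first of the
    # best-ranked tracks, so ties are broken by the original order.
    return min(english, key=lambda t: " - " in t.name)


# The four tracks from the example video; the language codes are guesses.
tracks = [
    CaptionTrack("en", "English"),
    CaptionTrack("en", "English - CC1"),
    CaptionTrack("en", "English - DTVCC1"),
    CaptionTrack("en-US", "English (United States)"),
]
default = select_default(tracks)
```

Under these assumptions `default` is the plain English track. Preferring English (United States) instead, to match YouTube's own UI, would only require a different ranking key.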

In addition, https://github.com/jdepoix/youtube-transcript-api looks to be a poor-quality dependency, so we'd like to replace it with our own code that will be easier for us to maintain. The library also scrapes the HTML of YouTube pages in order to discover a JSON blob that contains the list of transcripts, whereas we think we can do this better by calling a (legacy, undocumented) YouTube/Google API to get the same JSON.

Solution

We'd like to replace the body of Via's get_transcript() method with one that uses our own code to get the list of transcripts for a video, select a default transcript to use, and download that transcript. This should allow us to remove https://github.com/jdepoix/youtube-transcript-api from Via's Python requirements (dependencies).

Scope

This issue is just to replace the body of Via's get_transcript() method with one that uses our own code instead of https://github.com/jdepoix/youtube-transcript-api to select and return a default English transcript, without making any other changes.

Later on we're going to need other things, such as a method for returning the list of transcripts, and possibly extracting our YouTube API code into a library so that both Via and LMS can use it. LMS will also need a way of persisting users' transcript selections in its DB and of communicating those selections from LMS to Via. These other changes are out of scope for this issue.

seanh added the Backend label Jul 25, 2023

seanh changed the title ~~Fix Via's default YouTube transcript selection algorithm~~ Fix Via's default YouTube transcript selection Jul 25, 2023

This was referenced Jul 25, 2023

Replace Via's youtube-transcript-api dependency #1011

Closed

Improve Via's automatic YouTube transcript language selection #1013

Closed

seanh assigned seanh and jon-betts Aug 2, 2023

leedenison unassigned jon-betts Aug 8, 2023

seanh mentioned this issue Aug 15, 2023

Replace the third-party youtube-transcript-api library with our own code #1162

Merged

marcospri closed this as completed in #1162 Aug 22, 2023
