Fix Via's default YouTube transcript selection #1110

Closed
seanh opened this issue Jul 20, 2023 · 0 comments · Fixed by #1162

Problem

Via uses a third-party library (https://github.com/jdepoix/youtube-transcript-api) to select which transcript to use for a given YouTube video, and then to download the selected transcript. The library does a poor job of transcript selection when the video has multiple English transcripts. Its code incorrectly assumes that a YouTube video can have only one transcript for a given language code (e.g. en), when in fact a video can have more than one transcript with the same language code. See jdepoix/youtube-transcript-api#150.

As a result, transcript selection is effectively accidental: the library's faulty code ends up picking one of the transcripts without any deliberate preference. For example, this video (https://www.youtube.com/watch?v=rSTqpRYzJbo) has four English transcripts:

  1. English
  2. English - CC1
  3. English - DTVCC1
  4. English (United States)

The best transcript to choose by default is clearly either English or English (United States). YouTube's own UI chooses English (United States) by default. But Via currently chooses English - DTVCC1 for this video.
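
To illustrate how a one-transcript-per-language assumption can lead to an apparently arbitrary choice, here is a minimal sketch. It is not the library's actual code: the dict keyed by language code, the track dicts, and the `"en"` codes are all assumptions made for illustration.

```python
# Hypothetical track list: three transcripts that share the language code "en".
tracks = [
    {"name": "English", "language_code": "en"},
    {"name": "English - CC1", "language_code": "en"},
    {"name": "English - DTVCC1", "language_code": "en"},
]

# Indexing by language code silently keeps only the last track seen per code.
by_language = {track["language_code"]: track for track in tracks}

print(len(by_language))           # one entry survives out of three
print(by_language["en"]["name"])  # whichever track happened to come last
```

In a structure like this, which transcript survives depends only on the order in which the tracks are listed, not on which one is the best default.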

We'd like to replace this transcript selection algorithm with one that we can design ourselves, for example so that we can make it prefer the English (United States) transcript over English - CC1 and English - DTVCC1.
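
One possible shape for such an algorithm, as a rough sketch only: rank the English tracks and take the best one. The `Track` class, the `rank()` and `pick_default_transcript()` names, and the rule that a name containing " - " marks a less-preferred caption channel are all assumptions for illustration, not a settled design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    """Hypothetical stand-in for one entry in a video's transcript list."""

    name: str           # display name, e.g. "English (United States)"
    language_code: str  # e.g. "en" or "en-US" (assumed values)


def rank(track):
    """Return a sort key; a lower key means a more preferred track."""
    # Assumption: names such as "English - CC1" carry a " - " suffix,
    # while plain names such as "English" do not.
    has_suffix = " - " in track.name
    # The name is used only as a deterministic tie-breaker.
    return (has_suffix, track.name)


def pick_default_transcript(tracks):
    """Return the preferred English track, or None if there is none."""
    english = [
        track
        for track in tracks
        if track.language_code.split("-")[0].lower() == "en"
    ]
    return min(english, key=rank, default=None)
```

With tracks named as in the example video, this sketch would rank English and English (United States) ahead of English - CC1 and English - DTVCC1. The tie between the two plain names is broken alphabetically here, which is arbitrary; matching YouTube's own default would need a different tie-break.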

In addition, https://github.com/jdepoix/youtube-transcript-api looks to be a poor-quality dependency, so we'd like to replace it with our own code, which will be easier for us to maintain. The library also scrapes the HTML of YouTube pages to find a JSON blob that contains the list of transcripts, whereas we think we can do better by calling a (legacy, undocumented) YouTube/Google API to get the same JSON.

Solution

We'd like to replace the body of Via's get_transcript() method with one that uses our own code to get the list of transcripts for a video, select a default transcript to use, and download that transcript. This should allow us to remove https://github.com/jdepoix/youtube-transcript-api from Via's Python requirements (dependencies).
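
The three steps could be composed roughly as follows. This is only a sketch of the control flow: the real signature of get_transcript() is not shown in this issue, and the `list_transcripts`, `select_default`, and `download` callables are hypothetical placeholders for code that does not exist yet.

```python
def get_transcript(video_id, list_transcripts, select_default, download):
    """Return the default transcript for a video.

    The three callables are hypothetical placeholders for our own code:
    list_transcripts(video_id) returns the available tracks,
    select_default(tracks) returns one track or None,
    download(track) returns the transcript content.
    """
    tracks = list_transcripts(video_id)
    track = select_default(tracks)
    if track is None:
        raise LookupError(f"No suitable transcript for video {video_id!r}")
    return download(track)
```

Passing the steps in as callables keeps the sketch free of any network code and makes each step easy to replace or test on its own.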

Scope

This issue is just to replace the body of Via's get_transcript() method with one that uses our own code instead of https://github.com/jdepoix/youtube-transcript-api to select and return a default English transcript, without making any other changes.

Later on we're going to need other things, such as a method for returning the list of transcripts, and possibly extracting our YouTube API code into a library so that both Via and LMS can use it. LMS will also need a way of persisting users' transcript selections in its DB and of communicating transcript selections from LMS to Via. These other changes are out of scope for this issue.

@seanh seanh added the Backend label Jul 25, 2023
@seanh seanh changed the title Fix Via's default YouTube transcript selection algorithm Fix Via's default YouTube transcript selection Jul 25, 2023