Problem
The third-party library (https://github.com/jdepoix/youtube-transcript-api) that Via uses to select a transcript for a given YouTube video, and then to download it, selects poorly when the video has multiple English transcripts. The library's code incorrectly assumes that a YouTube video can have only one transcript per language code (e.g. en), but in fact a video can have more than one transcript with the same language code. See jdepoix/youtube-transcript-api#150.
As a result, transcript selection is effectively accidental: the library's faulty code ends up picking one of the same-language transcripts without any deliberate preference. For example, this video (https://www.youtube.com/watch?v=rSTqpRYzJbo) has four English transcripts:
English
English - CC1
English - DTVCC1
English (United States)
The best transcript to choose by default is clearly either English or English (United States). YouTube's own UI chooses English (United States) by default. But Via currently chooses English - DTVCC1 for this video.
We'd like to replace this transcript selection algorithm with one that we can design ourselves, for example so that we can make it prefer the English (United States) transcript over English - CC1 and English - DTVCC1.
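A minimal sketch of what such a selection rule might look like, assuming each transcript is represented by a display name and a language code. The `Transcript` record, its field names, the marker list, and the language codes in the usage note are all hypothetical and chosen for illustration; they are not taken from Via or from YouTube's API. Ties between equally ranked transcripts are broken by list order, which is also an assumption.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record for one transcript; field names are assumptions."""

    name: str           # Display name, e.g. "English (United States)"
    language_code: str  # e.g. "en" or "en-US"


# Assumed suffixes identifying caption-channel transcripts, which rank last.
CAPTION_CHANNEL_SUFFIXES = ("- CC1", "- DTVCC1")


def sort_key(transcript: Transcript) -> Tuple[bool, bool]:
    """Return a key where smaller values mean "more preferred"."""
    is_english = transcript.language_code.lower().split("-")[0] == "en"
    is_caption_channel = transcript.name.endswith(CAPTION_CHANNEL_SUFFIXES)
    # False sorts before True: English first, then non-caption-channel names.
    return (not is_english, is_caption_channel)


def select_default_transcript(transcripts: Sequence[Transcript]) -> Transcript:
    """Pick the most preferred transcript; ties go to the earliest in the list."""
    if not transcripts:
        raise ValueError("video has no transcripts")
    # min() returns the first minimal item, so list order breaks ties.
    return min(transcripts, key=sort_key)
```

With the four transcripts listed above passed in that order (and assuming illustrative language codes of en for the first three and en-US for the last), this picks English; if the plain English transcript is absent it picks English (United States) rather than either caption-channel transcript. Preferring English (United States) outright, as YouTube's UI does, would only need one more element in the key.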
In addition, https://github.com/jdepoix/youtube-transcript-api looks to be a poor-quality dependency, so we'd like to replace it with our own code that will be easier for us to maintain. The library also scrapes the HTML of YouTube pages to discover a JSON blob containing the list of transcripts, whereas we think we can do better by calling a (legacy, undocumented) YouTube/Google API to get the same JSON.
Solution
We'd like to replace the body of Via's get_transcript() method with one that uses our own code to get the list of transcripts for a video, select a default transcript to use, and download that transcript. This should allow us to remove https://github.com/jdepoix/youtube-transcript-api from Via's Python requirements (dependencies).
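The three steps described above could be composed roughly as follows. This is a sketch under stated assumptions, not Via's actual code: the real signature and return type of get_transcript() are not given in this issue, and the `list_transcripts`, `select_default`, and `download` callables, as well as `NoTranscriptError`, are hypothetical names introduced here so the steps can be shown (and tested) without calling YouTube.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")  # Whatever type represents one entry in the transcript list.


class NoTranscriptError(Exception):
    """Hypothetical error raised when a video has no transcripts at all."""


def get_transcript(
    video_id: str,
    list_transcripts: Callable[[str], Sequence[T]],
    select_default: Callable[[Sequence[T]], T],
    download: Callable[[T], str],
) -> str:
    """List the video's transcripts, choose a default, and download it."""
    # Step 1: get the list of transcripts for the video.
    transcripts = list_transcripts(video_id)
    if not transcripts:
        raise NoTranscriptError(video_id)
    # Step 2: select the default transcript using our own rule.
    chosen = select_default(transcripts)
    # Step 3: download the selected transcript.
    return download(chosen)
```

Passing the three steps in as callables keeps the selection rule independent of how the list is fetched, so the rule can be unit-tested with plain data.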
Scope
This issue is just to replace the body of Via's get_transcript() method with one that uses our own code instead of https://github.com/jdepoix/youtube-transcript-api to select and return a default English transcript, without making any other changes.
Later on we're going to need other things, such as a method for returning the list of transcripts, and possibly extracting our YouTube API code into a library so that both Via and LMS can use it. LMS will also need a way of persisting users' transcript selections in its DB and of communicating transcript selections from LMS to Via. These other changes are out of scope for this issue.