-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Call the real YouTube transcript API #1010
Conversation
715d532
to
78d7217
Compare
@pytest.fixture(autouse=True) | ||
def YouTubeTranscriptApi(patch): | ||
return patch("via.services.youtube.YouTubeTranscriptApi") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mocking a third-party library isn't ideal but I think this one might be a pain to test in integration with: it sends multiple requests to YouTube for HTML pages that it parses, pulls JSON out of the HTML and parses, re-requests pages with cookies, makes requests to XML APIs, parses the XML... We'd have to fake all those responses from Google which would be a pain to do and largely nullifies the "at least we know it really works" benefit of testing in integration.
I think we're okay here at least for now because:
- We have autospeccing
- We're only making one trivial call to the library's first-advertised public method:
YouTubeTranscriptApi.get_transcript(video_id)
It seems unlikely that the contract of that method call are going to change in a way that would break us without the method signature changing at all (which autospeccing would catch).
We also depend on the format of the transcripts that this method returns, which we pass directly to the frontend, so if that format changed it'd break our frontend. I'm not sure there's a whole lot we can do about that either (there seems little point in the backend validating the shape of the data that we get from YouTube: we could detect the change and crash, but that seems no better than letting the frontend crash).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine. It's not like it's a super complex interaction surface like Postgres. The API seems pretty clean and I think this is a good place to break for us. It's kind of a black box from here on in.
78d7217
to
c36f5e0
Compare
Call the real YouTube transcript API to get video transcripts, instead of always returning a hard-coded transcript.
c36f5e0
to
5353b40
Compare
except Exception as exc: # pylint: disable=broad-exception-caught | ||
report_exception_to_sentry(exc) | ||
logger.exception(exc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The youtube-transcript-api doesn't document what types of exceptions it might raise. Regardless, I think it's actually appropriate for us to catch all types of exception here.
It'll all get reported to both Papertrail and Sentry. This may result in some noise in Papertrail and Sentry because if anything requests /api/youtube/transcript/foo
with a non-valid YouTube video ID then it'll trigger an error response and an exception to get sent to Papertrail and Sentry. But I think it's probably best to catch too many exceptions in monitoring than too few, we want to know what's going on with our API calls to YouTube if they're failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree about having a catch all at the end. For example, if you send it real nonsense (like an int or None) it raises assertion errors. Presumably other things + coding errors would also result in unpredictable things.
It doesn't have it in the docs, but the exceptions are exported in the main namespace. The hierarchy is also pretty flat:
Everything "handled" is a child of CouldNotRetrieveTranscript
, although there are lots of sub-cases in there which we might, or might not care. So we could choose to supress all of those and log other things.
The exported errors are:
TranscriptsDisabled,
NoTranscriptFound,
CouldNotRetrieveTranscript,
VideoUnavailable,
TooManyRequests,
NotTranslatable,
TranslationLanguageNotAvailable,
NoTranscriptAvailable,
CookiePathInvalid,
CookiesInvalid,
FailedToCreateConsentCookie,
YouTubeRequestFailed,
InvalidVideoId,
I think being in the main name space is as close as you're going to get to an indication these are for public consumption.
Looking at the above, we might want to pay special attention to TooManyRequests
as this is the one which is likely to get us in trouble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this in Slack but I think I'd like to handle this in the Sentry web UI. For example we can use that UI to filter out exceptions like InvalidVideoId
that we don't want to be alerted for (but still have it alert us if we suddenly start getting a lot of these exceptions), and we can set alarms for exceptions like TooManyRequests
that we want to get more noisily alerted for. So I think we don't need any special handling of different types of youtube-transcript-api
exceptions in the code itself
via/views/api/youtube.py
Outdated
return { | ||
"errors": [ | ||
{ | ||
"status": request.response.status_int, | ||
"code": "failed_to_get_transcript", | ||
"title": "Failed to get transcript from YouTube", | ||
"detail": str(exc), | ||
} | ||
] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #990 for notes on the JSON:API-based error responses. This view is currently doing all the JSON:API stuff itself in the view as it's currently the one and only API view that Via has, if/when we come to add a second API view to Via it might make sense to make some of the JSON:API formatting reusable. For example it'd be easy to move this into an exception view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we could take the SerializableError
approach we have in LMS (ignore the service
part there, that's kind of nonsense).
The way this operates is you can raise it anywhere. If you end up with it in a view the error handler shows you a nice HTML page summarising it. If you get it in an API context the error handler returns JSON.
You are of course free to catch it, or more likely more specific sub-classes, and do something custom.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, so we'd just add a global exception view that applies to all /api/
requests and that sends a valid JSON:API error response, and one of the types of exception that exception view would be able to handle would be youtube-transcript-api
's exceptions, that way nothing in the services or views layers needs to catch these exceptions they can just let the exception view handle them.
I don't think we need a specific SerializableError
class such that something in the services or views would have to catch exceptions from youtube-transcript-api
and re-raise them as SerializableError
's. I think we just make the exception view(s) be able to handle the youtube-transcript-api
exceptions directly (and we'd probably write that in a generic way so that it can in fact handle any exception with a cause
and have a fallback behaviour for when there's no cause
). Then we don't need any exception-handling code in the services or views.
It may even be worth doing that now just to get the code in the correct shape, even though we don't even know if we'll ever add a second API view. Separate PR though
via/views/api/youtube.py
Outdated
"status": request.response.status_int, | ||
"code": "failed_to_get_transcript", | ||
"title": "Failed to get transcript from YouTube", | ||
"detail": str(exc), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know anything about what types of exception object the youtube-transcript-api might raise, so just stringify the whole exception into the error response, leaning on the side of more information rather than less.
Note that I don't expect the frontend to show the detail
string to the user except perhaps hidden in a "Technical Details" box, mostly the frontend is going to either have to use the generic title
from our API response or its own hard-coded generic error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do a bit more than this if we choose to. All the structured errors this library sends are children of CouldNotRetrieveTranscript
which have some predicable features.
We can do something like:
...
except CouldNotRetrieveTranscript as err:
return {
"errors": [
{
"code": err.__class__.__name__,
"title": err.cause,
"detail": str(err).strip()
}
]
}
This results in a message like:
{
"errors": [
{
"code": "TranscriptsDisabled",
"title": "Subtitles are disabled for this video",
"detail": "Could not retrieve a transcript ... already describe your problem!"
}
]
}
The full detail string in this case was:
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=klsdnjklhjsdfjklsdnfjklsdjklsdnf! This is most likely caused by:\n\nSubtitles are disabled for this video\n\nIf you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I think this is worth doing just to provide slightly more informative / dynamic error responses to the frontend: d91fd95
With my two test cases from the PR description this now produces:
$ http GET 'http://localhost:9083/api/youtube/transcript/foo' Authorization:"Bearer eyJ***V-g"
HTTP/1.1 500 Internal Server Error
Connection: keep-alive
Content-Length: 691
Content-Type: application/json
Date: Wed, 28 Jun 2023 11:39:54 GMT
Server: nginx/1.17.6
X-Abuse-Policy: https://web.hypothes.is/abuse-policy/
X-Complaints-To: https://web.hypothes.is/report-abuse/
X-Robots-Tag: noindex, nofollow
{
"errors": [
{
"code": "TranscriptsDisabled",
"detail": "Could not retrieve a transcript for the video https://www.youtube.com/watch?v=foo! This is most likely caused by:\n\nSubtitles are disabled for this video\n\nIf you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!",
"status": 500,
"title": "Subtitles are disabled for this video"
}
]
}
$ http GET 'http://localhost:9083/api/youtube/transcript/TbzZIMQC6vk' Authorization:"Bearer eyJ***V-g"
HTTP/1.1 500 Internal Server Error
Connection: keep-alive
Content-Length: 409
Content-Type: application/json
Date: Wed, 28 Jun 2023 11:40:43 GMT
Server: nginx/1.17.6
X-Abuse-Policy: https://web.hypothes.is/abuse-policy/
X-Complaints-To: https://web.hypothes.is/report-abuse/
X-Robots-Tag: noindex, nofollow
{
"errors": [
{
"code": "ConnectionError",
"detail": "HTTPSConnectionPool(host='www.youtube.com', port=443): Max retries exceeded with url: /watch?v=TbzZIMQC6vk (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f3a043e1eb0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))",
"status": 500,
"title": "Failed to get transcript from YouTube"
}
]
}
Note that the "status": 500
in the error body is required by the JSON:API spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just finished an eyeball of the code and wanted to provide some feedback in case you wanted to work on it while I test it.
I think you're right about having a broad exception to catch things which call out of the call, but I think we can do better than that being the only catching we do.
I think in it's own lo-fi way the library is documenting what it raises by exporting them into the main namespace. There's:
- A single root exception you can catch
CouldNotRetrieveTranscript
- A decent amount of detail and specificity in the sub-types
- A common interface with
err.cause
as a property
I think it's worth thinking about which of those errors you'd like to fully suppress as a normal part of day to day work, instead of sending through to sentry.
At a minimum, I think we should be handling TooManyRequests
and making a big fuss about it. This is the one that seems to me like it could sink the feature if we get throttled. This has happened to us before with Google Drive stuff, and I think it's very likely with this as it's not an official API.
except Exception as exc: # pylint: disable=broad-exception-caught | ||
report_exception_to_sentry(exc) | ||
logger.exception(exc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agree about having a catch all at the end. For example, if you send it real nonsense (like an int or None) it raises assertion errors. Presumably other things + coding errors would also result in unpredictable things.
It doesn't have it in the docs, but the exceptions are exported in the main namespace. The hierarchy is also pretty flat:
Everything "handled" is a child of CouldNotRetrieveTranscript
, although there are lots of sub-cases in there which we might, or might not care. So we could choose to supress all of those and log other things.
The exported errors are:
TranscriptsDisabled,
NoTranscriptFound,
CouldNotRetrieveTranscript,
VideoUnavailable,
TooManyRequests,
NotTranslatable,
TranslationLanguageNotAvailable,
NoTranscriptAvailable,
CookiePathInvalid,
CookiesInvalid,
FailedToCreateConsentCookie,
YouTubeRequestFailed,
InvalidVideoId,
I think being in the main name space is as close as you're going to get to an indication these are for public consumption.
Looking at the above, we might want to pay special attention to TooManyRequests
as this is the one which is likely to get us in trouble.
via/views/api/youtube.py
Outdated
"status": request.response.status_int, | ||
"code": "failed_to_get_transcript", | ||
"title": "Failed to get transcript from YouTube", | ||
"detail": str(exc), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do a bit more than this if we choose to. All the structured errors this library sends are children of CouldNotRetrieveTranscript
which have some predicable features.
We can do something like:
...
except CouldNotRetrieveTranscript as err:
return {
"errors": [
{
"code": err.__class__.__name__,
"title": err.cause,
"detail": str(err).strip()
}
]
}
This results in a message like:
{
"errors": [
{
"code": "TranscriptsDisabled",
"title": "Subtitles are disabled for this video",
"detail": "Could not retrieve a transcript ... already describe your problem!"
}
]
}
The full detail string in this case was:
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=klsdnjklhjsdfjklsdnfjklsdjklsdnf! This is most likely caused by:\n\nSubtitles are disabled for this video\n\nIf you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!
"status": 500, | ||
"code": "failed_to_get_transcript", | ||
"title": "Failed to get transcript from YouTube", | ||
"detail": str(exception), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"detail": str(exception), | |
"detail": str(exception).strip(), |
The exceptions from the library have leading space for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 92a5b8f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested this and it works, but I do have some suggestions.
- I think using active verb phrases for methods is a good idea "get_thing" rather than "thing"
- The lack of any caching around our calls, and no protocol around "TooManyRequests" feels like a phonecall in the middle of the night waiting to happen
- If we need to get this out as a proto-type or to let other people work, I think that's fine, but I think we should revisit this very soon
- The library docs aren't great, but the library does have an exception hierarchy we can make some use of
@pytest.fixture(autouse=True) | ||
def YouTubeTranscriptApi(patch): | ||
return patch("via.services.youtube.YouTubeTranscriptApi") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine. It's not like it's a super complex interaction surface like Postgres. The API seems pretty clean and I think this is a good place to break for us. It's kind of a black box from here on in.
via/views/api/youtube.py
Outdated
from via.services import YouTubeService | ||
|
||
logger = logging.getLogger(__name__) | ||
|
||
|
||
@view_config(route_name="api.youtube.transcript", permission="api", renderer="json") | ||
def transcript(request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think method names are generally best as verb phrases not nouns: e.g. "get_transcript". This is particularly helpful when the noun is also a verb "report" vs. "get_report" which is quite semantically different.
This would also resolve the problem this creates later on in the code where we can't use the obvious name for the noun being operated on ("transcript") because shadows the method name causing us to have to use transcript_
.
If you think about it, this would be extremely common as "get_boo" methods are likely to operate on a "boo" object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 fedb4b9
Yeah I agree with this: the {noun}_{verb}()
method/function naming pattern is nice and clear, and it also makes space for a single URL like /api/youtube/transcript/{video_id}
to have different views for different HTTP methods like get_transcript()
, put_transript()
, patch_trancript()
and delete_transcript()
.
This also reminded me that the current API only supports GET
, so I added a predicate for that so that other HTTP verbs (even DELETE
) now 404 rather than responding by getting and returning the transcript :) a6fe739
via/views/api/youtube.py
Outdated
return { | ||
"errors": [ | ||
{ | ||
"status": request.response.status_int, | ||
"code": "failed_to_get_transcript", | ||
"title": "Failed to get transcript from YouTube", | ||
"detail": str(exc), | ||
} | ||
] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we could take the SerializableError
approach we have in LMS (ignore the service
part there, that's kind of nonsense).
The way this operates is you can raise it anywhere. If you end up with it in a view the error handler shows you a nice HTML page summarising it. If you get it in an API context the error handler returns JSON.
You are of course free to catch it, or more likely more specific sub-classes, and do something custom.
This API currently accepts other HTTP verbs (such as DELETE) and responds as if it were a GET. Make it respond to GETs only and have Pyramid 404 other verbs for us.
a6fe739
to
bdb60d2
Compare
Change Via's new YouTube transcript API endpoint to call the real YouTube transcript API to get video transcripts, instead of always returning a hard-coded transcript.
Todo
This adds the third-party library youtube-transcript-api for calling the YouTube transcript API. Is that the best way to call the YouTube transcript API, or would we rather call it ourselves with
requests
?Let's go with this library for now because it looks like it's not going to be quick and easy to write our own replacement. Longer-term I think we probably want to move away from this library: it doesn't look like a high-quality dependency. Slack thread and another Slack thread. Separate issue to replace the library later: Replace Via's
youtube-transcript-api
dependency #1011Even if we do want to use
youtube-transcript-api
, is the simpleYouTubeTranscriptApi.get_transcript(video_id)
the best way to call it to get a video ID?Yes, according to the first example in the library's README
YouTubeTranscriptApi.get_transcript(video_id)
is the easiest way to just a the transcript for a given video ID (and it defaults to English). I think we're going to have to replace this with something more complicated once we get to selecting transcript languages, but we can deal with that when we get to it.Use our API key when making YouTube transcript API requestsWe're not planning to use an API key to call this API, at least not at first. This definitely seems risky but theyoutube-transcript-api
library doesn't appear to support using an API key, and I don't know whether the API itself supports it.Handle errors from YouTube transcript API requests Done, see testing notes below
How should we choose which transcript language to get?
Note that there's an intention in future to let instructors see a list of the languages and select one when creating an assignment. That's for a future PR, not for this PR. But for now we should at least use some sensible algorithm to automatically choose a language, for example by preferring English first.
If we just call
YouTubeTranscriptApi.get_transcript(video_id)
then it:languages=('en',)
I don't think this is the algorithm that we ultimately want but I think it's good enough for a first pass. I've created an issue to improve the algorithm later: Improve Via's automatic YouTube transcript language selection #1013
Testing
The frontend doesn't actually call the backend's API to get the YouTube transcripts yet (until #1006 is merged, currently it uses a transcript that's hard-coded in the frontend) so you won't see the real transcripts in the UI.
The easiest way to test this is to call the backend's transcript API from a terminal:
First get an authorization token:
Now use the token to make a request to the API, for example using HTTPie:
Testing error responses
The easiest way to test what happens when we get an error from youtube-transcript-api is to call our API with an invalid video ID. You should get a 500 Server Error response with a JSON:API-style error body, and it just blindly stringifies the entire exception from youtube-transcript-api into the
detail
field of the error body:You should also see that the exception gets logged to the terminal with a traceback and the full exception string, so we'll be able to see these in Papertrail.
The exception will also get reported to Sentry. Here's an example from my dev env.
Another thing you can try is disconnecting your internet connection so that it won't be able to reach YouTube, and then trying to request a valid video ID: