-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow sample_rate
parameter to audio decoder
#551
Conversation
frames_44100_native = decoder.get_samples_played_in_range( | ||
start_seconds=start_seconds, stop_seconds=stop_seconds | ||
) | ||
assert frames_44100_native.sample_rate == 44_100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're treating the above frames as the reference. We could check-in their corresponding .pt
reference frames instead, but I prefer doing it this way because it really clearly illustrate that the reference have a sample rate of 44_100 (from their metadata, as opposed to some checked-in value)
sample_rate
parameter to audio decodersample_rate
parameter to audio decoder
@@ -25,10 +25,13 @@ def __init__( | |||
source: Union[str, Path, bytes, Tensor], | |||
*, | |||
stream_index: Optional[int] = None, | |||
sample_rate: Optional[int] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also consider exposing this as desired_sample_rate
parameter? I don't have a strong opinion, the docs would make it clear what this means in any case.
torch::Tensor lastSamples = maybeFlushSwrBuffers(); | ||
if (lastSamples.numel() > 0) { | ||
frames.push_back(lastSamples); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not particularly fond of the above. Maybe we could let maybeFlushSwrBuffers
return a tensor of shape (numChannels, 0)
, which could probably be pushed_back()
unconditionally. Not sure that's better, there are probably nicer patterns I'm not seeing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything I can think of that does things unconditionally in this function is way too cute (returning a vector of tensors; using std::copy()
). It may be more clear about intent if maybeFlushSwrBuffers()
returns an optional so that then we don't need to use an empty tensor to indicate nothing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using optional sounds better, thanks
|
||
std::optional<int> sampleRate; | ||
}; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need the indirection of having an options struct. I know it mirrors the pattern established on the video side, but I also don't think it's a good practice there, either. It's harder to get rid of the video options because we accept a string, and then we do a bunch of work parsing the string. Getting rid of VideoStreamOptions
will mean updating a bunch of callers to pass real arguments instead of a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of the stringy video options either - that's why I didn't implement a string constructor for audio options. I also opened #577 to entirely remove string video options (it's not as hard as we thought).
I don't feel very strongly about this, but if we were to collapse both video options and audio options into the video decoder (or the StreamInfo), then we would have a lot of video-only and audio-only fields within the same struct. I personally find it cleaner to separate those into separate structs. Additionally, using option structs makes it very clear which fields/values come from user-specified parameter, in constrast to e.g. metadata or video properties, which is often useful to immediately understand the source of the values, as e.g. here:
torchcodec/src/torchcodec/decoders/_core/VideoDecoder.cpp
Lines 1395 to 1398 in 7cb2271
int sourceSampleRate = srcAVFrame->sample_rate; | |
int desiredSampleRate = | |
streamInfos_[activeStreamIndex_].audioStreamOptions.sampleRate.value_or( | |
sourceSampleRate); |
LMK you thoughts, I'm fine with collapsing sampleRate as a StreamInfo field if you prefer, but we'd potentially be losing the 2 benefits mentioned above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a good point, we actually store the options. I was thinking just from the addAudioStream()
API perspective. Let's keep this then, and eventually get rid of the stringy options for video.
streamInfo.swrContext.get(), | ||
&lastSamplesData, | ||
numRemainingSamples, | ||
NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/NULL
/nullptr
/g
Towards #549
This PR allows the user to specify a a desired output sample rate. If the source stream isn't already in that sample rate, the sample are converted using swresample, which we already use for format conversion.
The PR addresses both core and public
AudioDecoder
APIs.Converting sample rates isn't as trivial as converting formats, because this changes the number of output samples. And, importantly, we now need to account for libswresample's internal buffers:
(from the docs)