Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow sample_rate parameter to audio decoder #551

Merged
merged 19 commits into from
Mar 20, 2025

Conversation

NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented Mar 12, 2025

Towards #549

This PR allows the user to specify a a desired output sample rate. If the source stream isn't already in that sample rate, the sample are converted using swresample, which we already use for format conversion.

The PR addresses both core and public AudioDecoder APIs.

Converting sample rates isn't as trivial as converting formats, because this changes the number of output samples. And, importantly, we now need to account for libswresample's internal buffers:

Note that the samples may get buffered in swr if you provide insufficient output space or if sample rate conversion is done, which requires "future" samples. [...] At the end of conversion the resampling buffer can be flushed by calling swr_convert() with NULL in and 0 in_count

(from the docs)

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 12, 2025
frames_44100_native = decoder.get_samples_played_in_range(
start_seconds=start_seconds, stop_seconds=stop_seconds
)
assert frames_44100_native.sample_rate == 44_100
Copy link
Member Author

@NicolasHug NicolasHug Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're treating the above frames as the reference. We could check-in their corresponding .pt reference frames instead, but I prefer doing it this way because it really clearly illustrate that the reference have a sample rate of 44_100 (from their metadata, as opposed to some checked-in value)

@NicolasHug NicolasHug mentioned this pull request Mar 19, 2025
7 tasks
@NicolasHug NicolasHug changed the title [WIP] Allow sample_rate parameter to audio decoder Allow sample_rate parameter to audio decoder Mar 19, 2025
@NicolasHug NicolasHug marked this pull request as ready for review March 19, 2025 15:21
@@ -25,10 +25,13 @@ def __init__(
source: Union[str, Path, bytes, Tensor],
*,
stream_index: Optional[int] = None,
sample_rate: Optional[int] = None,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also consider exposing this as desired_sample_rate parameter? I don't have a strong opinion, the docs would make it clear what this means in any case.

@NicolasHug NicolasHug requested a review from scotts March 19, 2025 17:08
torch::Tensor lastSamples = maybeFlushSwrBuffers();
if (lastSamples.numel() > 0) {
frames.push_back(lastSamples);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not particularly fond of the above. Maybe we could let maybeFlushSwrBuffers return a tensor of shape (numChannels, 0), which could probably be pushed_back() unconditionally. Not sure that's better, there are probably nicer patterns I'm not seeing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything I can think of that does things unconditionally in this function is way too cute (returning a vector of tensors; using std::copy()). It may be more clear about intent if maybeFlushSwrBuffers() returns an optional so that then we don't need to use an empty tensor to indicate nothing to do.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using optional sounds better, thanks


std::optional<int> sampleRate;
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the indirection of having an options struct. I know it mirrors the pattern established on the video side, but I also don't think it's a good practice there, either. It's harder to get rid of the video options because we accept a string, and then we do a bunch of work parsing the string. Getting rid of VideoStreamOptions will mean updating a bunch of callers to pass real arguments instead of a string.

Copy link
Member Author

@NicolasHug NicolasHug Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of the stringy video options either - that's why I didn't implement a string constructor for audio options. I also opened #577 to entirely remove string video options (it's not as hard as we thought).

I don't feel very strongly about this, but if we were to collapse both video options and audio options into the video decoder (or the StreamInfo), then we would have a lot of video-only and audio-only fields within the same struct. I personally find it cleaner to separate those into separate structs. Additionally, using option structs makes it very clear which fields/values come from user-specified parameter, in constrast to e.g. metadata or video properties, which is often useful to immediately understand the source of the values, as e.g. here:

int sourceSampleRate = srcAVFrame->sample_rate;
int desiredSampleRate =
streamInfos_[activeStreamIndex_].audioStreamOptions.sampleRate.value_or(
sourceSampleRate);

LMK you thoughts, I'm fine with collapsing sampleRate as a StreamInfo field if you prefer, but we'd potentially be losing the 2 benefits mentioned above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a good point, we actually store the options. I was thinking just from the addAudioStream() API perspective. Let's keep this then, and eventually get rid of the stringy options for video.

streamInfo.swrContext.get(),
&lastSamplesData,
numRemainingSamples,
NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/NULL/nullptr/g

@NicolasHug NicolasHug merged commit 611421e into pytorch:main Mar 20, 2025
46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants