Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: Handle seekhead for cues on mkv files. #2268

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

KingLucius
Copy link

@KingLucius KingLucius commented Mar 25, 2025

Closes #1143

This fixes the issue with non seekable mkv files, by checking the SEEK_HEAD, as media3 does not currently use SEEK_HEAD at all. A test was also added showing a non seekable mkv file, this was generated by mkvpropedit sample.mkv --chapters large_chapters.xml.

This means that when we parse an MKV file what really happens is:

    find KaxSeekHead
    seek to KaxSeekHead position
    find KaxCues
    seek to KaxCues position
    parse cues
    seek back from KaxCues
    seek back from KaxSeekHead
    parse rest normally

What happened before was:

    find KaxCues <- this step fails
    seek to KaxCues position
    parse cues
    seek back from KaxCues
    parse rest normally

All the credit for fire-light42

Copy link

google-cla bot commented Mar 25, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@icbaker icbaker self-assigned this Mar 26, 2025
@icbaker icbaker self-requested a review March 26, 2025 10:24
KingLucius and others added 3 commits March 26, 2025 17:19
This file was crafted by hand in a hex editor by modifying
`sample.mkv`:
1. Adding two SeekHead elements at the end, each with a single Seek
   entry, with the first pointing at the Cues element and the second
   pointing at the Tags element.
2. Update the original SeekHead element so the Seek entries that
   previously pointed to the Cues and Tags element now point out the two
   new SeekHead elements we added.

The resulting file is successfully parsed by `mkvinfo` and playable with
VLC.

The dump files were generated with all changes in MatroskaExtractor
reverted (which is why they're unseekable).

Bringing the MatroskaExtractor changes back and running the test results
in a larger than expected diff in the dump file. Specifically all the
samples are dropped.
@icbaker
Copy link
Collaborator

icbaker commented Mar 26, 2025

I played around a bit with this, and added a new test file with three SeekHead elements (f5a6ed0). When I run the test with this new file without the MatroskaExtractor changes in this PR it is parsed successfully (but not seekable, as expected). However when I run the test with the changes in this PR the dump file now has no samples in it (you can see this by pulling the latest changes in this PR and running MatroskaExtractorTest).

In general I think this PR is an incomplete implementation of the full flexibility that the matroska spec supports with recursive SeekHead elements.

This fixes the seek once behavior by adding a list of pending seekheads that we can visist, if we do not find cues.
@KingLucius
Copy link
Author

This fixes the seek once behavior by adding a list of pending seekheads that we can visist, if we do not find cues.

@icbaker
Copy link
Collaborator

icbaker commented Mar 28, 2025

Thanks for the follow-up fixes. I've re-run MatroskaExtractorTest.mkvSample_threeSeekHeads() with DumpFileAsserts.DUMP_FILE_ACTION = WRITE_TO_LOCAL to update the dump files (and uploaded this commit). Re-running the test with DumpFileAsserts.DUMP_FILE_ACTION = COMPARE_WITH_EXISTING results in the following tests failing:

mkvSample_threeSeekHeads[sniff=true,ioErr=false,unknownLen=true,partRead=false,subtitlesParsedDuringExtraction=true]
mkvSample_threeSeekHeads[sniff=true,ioErr=false,unknownLen=true,partRead=false,subtitlesParsedDuringExtraction=false]
mkvSample_threeSeekHeads[sniff=true,ioErr=false,unknownLen=true,partRead=true,subtitlesParsedDuringExtraction=true]
mkvSample_threeSeekHeads[sniff=true,ioErr=false,unknownLen=true,partRead=true,subtitlesParsedDuringExtraction=false]

And the diff is (i.e. the 'passing' run results in unseekable output, while the 'failing' run is seekable):

     seekMap:
    -  isSeekable = false
    +  isSeekable = true
       duration = 1072000
    -  getPosition(0) = [[timeUs=0, position=0]]
    +  getPosition(0) = [[timeUs=67000, position=5576]]
    +  getPosition(1) = [[timeUs=67000, position=5576]]
    +  getPosition(536000) = [[timeUs=534000, position=84155], [timeUs=547000, position=77334]]
    +  getPosition(1072000) = [[timeUs=1035000, position=106570]]

The pattern is that the tests pass with unknownLen=true and ioErr=true pass, but those with unknownLen=true and ioErr=false fail. This is happening because we re-use the same .unknown_length. dump files for these cases (since the results should be identical). When using WRITE_TO_LOCAL the last configuration to run ends up being persisted on disk (since all previous runs are overwritten). So the fact that it's ioErr=true passing and ioErr=false failing (rather than the other way round) isn't interesting. The interesting point is that the results are not the same.

ioErr=true means the test harness simulates an IOException at every read position, forcing the Extractor to resume where it left off (which is part of the Extractor contract). It seems that in this case, the new code fails to find the Cues element in the ioErr=true case.

The same failure pattern occurs in the mkvSample_withRecursiveSeekHead() test too:

mkvSample_withRecursiveSeekHead[sniff=true,ioErr=false,unknownLen=true,partRead=false,subtitlesParsedDuringExtraction=true]
mkvSample_withRecursiveSeekHead[sniff=true,ioErr=false,unknownLen=true,partRead=false,subtitlesParsedDuringExtraction=false]
mkvSample_withRecursiveSeekHead[sniff=true,ioErr=false,unknownLen=true,partRead=true,subtitlesParsedDuringExtraction=true]
mkvSample_withRecursiveSeekHead[sniff=true,ioErr=false,unknownLen=true,partRead=true,subtitlesParsedDuringExtraction=false]

Please could you take a look?

@fire-light42
Copy link

Please could you take a look?

This makes no sense at all, but the bug is fixed by removing this anti recursion check. I have no idea how and why. Is endMasterElement called twice with the exact same values? Can you please detail what really happends when an IO exception is thrown with an unknown length, does it restart from the beginning?

By removing this all tests return isSeekable = true

if (visitedSeekHeads.add(seekEntryPosition)) {
    pendingSeekHeads.add(seekEntryPosition);
}

->

pendingSeekHeads.add(seekEntryPosition);

@fire-light42
Copy link

does it restart from the beginning

Looks like this is the case, just calling clear will fix it, if we know it has reset. What function is called when the stream resets? We can put it in startMasterElement ID_SEGMENT, but that could be abused for infinite recursion.

visitedSeekHeads.clear();
pendingSeekHeads.clear();

@icbaker
Copy link
Collaborator

icbaker commented Mar 28, 2025

The Extractor javadoc describes what happens when an IO error is encountered:

When this method throws an IOException, extraction may continue by providing an ExtractorInput with an unchanged read position to a subsequent call to this method.

@fire-light42
Copy link

The Extractor javadoc describes what happens when an IO error is encountered:

When this method throws an IOException, extraction may continue by providing an ExtractorInput with an unchanged read position to a subsequent call to this method.

Ye, I read that, but are you sure it does not reset from 0 when unknown length=true? Because that is the only way this makes sense. I need to somehow know when that occurs to reset all the variables keeping track of what is visited, otherwise we cant revisit them due to infinite recursion.

@fire-light42
Copy link

fire-light42 commented Mar 28, 2025

I checked now, the docs are wrong. It definitely restarts if an IOException is thrown and unknown length = true. So the documentation should be something like:

When this method throws an IOException, extraction may continue by providing an ExtractorInput with an unchanged read position, or reset it to a subsequent call to this method.

As just logging

System.out.println("VISITED ID_SEGMENT: ("+contentPosition+") current count = "+ visistedCount++);
VISITED ID_SEGMENT: (52) current count = 0
VISITED ID_SEGMENT: (52) current count = 1
VISITED ID_SEGMENT: (52) current count = 2
VISITED ID_SEGMENT: (52) current count = 3
VISITED ID_SEGMENT: (52) current count = 4
VISITED ID_SEGMENT: (52) current count = 5
VISITED ID_SEGMENT: (52) current count = 6
VISITED ID_SEGMENT: (52) current count = 7
VISITED ID_SEGMENT: (52) current count = 8
VISITED ID_SEGMENT: (52) current count = 9
...

@icbaker
Copy link
Collaborator

icbaker commented Mar 28, 2025

It looks like the 'restart from the beginning' behaviour is happening when retryFromStartIfLive = true [1], and this stream is determined to be 'live' (!isOnDemand) before the Cues element is parsed because ExtractorOutput.seekMap == null:

boolean isOnDemand =
input.getLength() != C.LENGTH_UNSET
|| (output.seekMap != null && output.seekMap.getDurationUs() != C.TIME_UNSET);


[1]

FakeExtractorOutput extractorOutput =
consumeTestData(extractor, input, 0, true, deduplicateConsecutiveFormats);

@fire-light42
Copy link

Well then the docs should be updated. However I still need to know when this happens, as I cant differentiate this behavior from a a malicious recursive mkv. Between resets I have to reset the variables to keep track of current seekheads, otherwise it is impossible to properly parse seekheads. How would you suggest I implement this clearing?

@icbaker
Copy link
Collaborator

icbaker commented Mar 28, 2025

With some logging added locally [1], in the retryFromStartIfLive case, I see MatroskaExtractor.seek(long position, long timeUs) called with position=0 before the read(...), to indicate that the subsequent read(...) will not be at the same position - which seems consistent with the current docs:

Following a call to this method, the ExtractorInput passed to the next invocation of read is required to provide data starting from position in the stream.

[1]

MatroskaExtractor.read() throwing exception, expecting to be called again with input.position=0
MatroskaExtractor.seek(): position=0, timeUs=0
MatroskaExtractor.read(): input.position=0
MatroskaExtractor.read() throwing exception, expecting to be called again with input.position=1
MatroskaExtractor.seek(): position=0, timeUs=0
MatroskaExtractor.read(): input.position=0
MatroskaExtractor.read() throwing exception, expecting to be called again with input.position=4
MatroskaExtractor.seek(): position=0, timeUs=0
MatroskaExtractor.read(): input.position=0

And then later, when we're no longer in the retryFromStartIfLive case, there is no seek call in between the read calls, and the position passed is the same as when the exception was thrown (as expected):

MatroskaExtractor.read(): input.position=0
MatroskaExtractor.read(): input.position=5583
MatroskaExtractor.read() throwing exception, expecting to be called again with input.position=5583
MatroskaExtractor.read(): input.position=5583
MatroskaExtractor.read() throwing exception, expecting to be called again with input.position=5584
MatroskaExtractor.read(): input.position=5584

Between resets I have to reset the variables to keep track of current seekheads, otherwise it is impossible to properly parse seekheads. How would you suggest I implement this clearing?

Perhaps in MatroskaExtractor.seek()?

@KingLucius
Copy link
Author

Please check last commit

//
// Note that we also need to check that we do not jump before or to the segment we are on
// as we do not want to clear our visitedSeekHeads
if (visitedSeekHeads.add(seekEntryPosition) && seekEntryPosition > segmentContentPosition) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seekEntryPosition > segmentContentPosition is needed to not seek above the segment to avoid clearing this visistedlist, however should this also be added to cues, or do we trust cues to be valid?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants