Latest release v0.7.5 does not include the fix for quoted filenames (for non ASCII filenames) #113

jnareb · 2023-11-16T23:25:55Z

I was wondering why unidiff fails on changes to files with filenames that include characters outside 7-bit ASCII, and it turns out that the latest release v0.7.5 does not include commit 2771a87 (Support quoted filenames, 2023-06-02).

Could we please get a new release with this fix included?

Thanks in advance.

There were 5 cases of error when processing a commit (with ChatGPT sharing link in commit message):
- 4 caused by pathname with characters outside 7-bit ASCII, which makes git-diff to use quoted format for pathnames; unidiff library includes the fix, but it is not yet released matiasb/python-unidiff#113
- 1 caused by change being to a submodule rather than to file, or to be more exact moving from one version of subproject to the other (clone was not done using --recursive option)
- 1 UnidiffParseError('Target without source: ...') with a creation diff (source is /dev/null) with quoted destination name (containing spaces)

Lines survival stats: 76.89% lines survived (in 694 commits in 76 projects).

matiasb · 2023-11-30T20:58:41Z

Will prepare a release in the upcoming days 👍

jnareb · 2023-12-13T23:28:57Z

Unfortunately, commit 2771a87 does not fully solve the problem of c-style quoted filenames.

It lets unidiff parse a patch with quoted filenames, but it then reproduces those filenames in their original quoted form. Shouldn't unidiff decode such a filename to str if possible, and to bytes if not (e.g. invalid UTF-8)?

All the commit does is make unidiff able to remove the "a/" or "b/" prefix from filenames even when they are in their c-quoted form.
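
To make the distinction concrete, here is a rough sketch of what a quoted pathname looks like and why stripping the prefix is not the same as decoding. Both `c_quote` and `strip_prefix` are hypothetical helpers written for illustration; they only approximate git's quoting rules (with the default core.quotePath setting) and are not taken from unidiff or git.

```python
def c_quote(path):
    """Approximate git's C-style quoting of a pathname (illustrative only)."""
    # Bytes that get a named escape instead of an octal one.
    simple = {0x07: 'a', 0x08: 'b', 0x09: 't', 0x0a: 'n', 0x0b: 'v',
              0x0c: 'f', 0x0d: 'r', 0x22: '"', 0x5c: '\\'}
    out = []
    needs_quoting = False
    for byte in path.encode('utf-8'):
        if byte in simple:
            out.append('\\' + simple[byte])
            needs_quoting = True
        elif byte < 0x20 or byte >= 0x7f:
            # Control characters and non-ASCII bytes become 3-digit octal escapes.
            out.append('\\%03o' % byte)
            needs_quoting = True
        else:
            out.append(chr(byte))
    return '"' + ''.join(out) + '"' if needs_quoting else path


def strip_prefix(name):
    """Drop a leading a/ or b/ but leave any quoting untouched."""
    if name.startswith('"') and name[1:3] in ('a/', 'b/'):
        return '"' + name[3:]
    if name[:2] in ('a/', 'b/'):
        return name[2:]
    return name


quoted = c_quote('b/zażółć.txt')
print(quoted)                # "b/za\305\274\303\263\305\202\304\207.txt"
print(strip_prefix(quoted))  # "za\305\274\303\263\305\202\304\207.txt"
```

The second line of output is still in quoted form: the prefix is gone, but the caller does not get back `zażółć.txt`.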

jnareb · 2023-12-14T00:29:29Z

Here is some slightly ugly code that actually tries to decode a c-quoted filename; it has not been tested on Python 2:

def decode_c_quoted_str(text):
    """C-style name unquoting

    See unquote_c_style() function in 'quote.c' file in git/git source code
    https://github.com/git/git/blob/master/quote.c#L401

    This is a subset of the escape sequences supported by C and C++
    https://learn.microsoft.com/en-us/cpp/c-language/escape-sequences

    :param str text: string which may be c-quoted
    :return: decoded string
    :rtype: str
    """
    # TODO?: Make it a global variable
    escape_dict = {
        'a': '\a',  # Bell (alert)
        'b': '\b',  # Backspace
        'f': '\f',  # Form feed
        'n': '\n',  # New line
        'r': '\r',  # Carriage return
        't': '\t',  # Horizontal tab
        'v': '\v',  # Vertical tab
    }

    quoted = text.startswith('"') and text.endswith('"')
    if quoted:
        text = text[1:-1]  # remove quotes

        buf = bytearray()
        escaped = False  # TODO?: switch to state = 'NORMAL', 'ESCAPE', 'ESCAPE_OCTAL'
        oct_str = ''

        for ch in text:
            if not escaped:
                if ch != '\\':
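                    # encode() rather than ord(): a literal non-ASCII character
                    # (code point > 255) would not fit into a single byte
                    buf.extend(ch.encode())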
                else:
                    escaped = True
                    oct_str = ''
            else:
                if ch in ('"', '\\'):
                    buf.append(ord(ch))
                    escaped = False
                elif ch in escape_dict:
                    buf.append(ord(escape_dict[ch]))
                    escaped = False
                elif '0' <= ch <= '7':  # octal values with first digit over 3 overflow a byte
                    oct_str += ch
                    if len(oct_str) == 3:
                        byte = int(oct_str, base=8)  # byte in octal notation
                        if byte > 255:  # largest valid escape is \377 (255)
                            raise ValueError(f'Invalid octal escape sequence \\{oct_str} in "{text}"')
                        buf.append(byte)
                        escaped = False
                        oct_str = ''
                else:
                    raise ValueError(f'Unexpected character \'{ch}\' in escape sequence when parsing "{text}"')

        if escaped:
            raise ValueError(f'Unfinished escape sequence when parsing "{text}"')

        text = buf.decode()

    return text
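
A more compact variant of the same idea, as a sketch only: it tokenizes the escapes with a regular expression and falls back to returning bytes when the result is not valid UTF-8, which is the str-if-possible, bytes-if-not behaviour asked about earlier in the thread. The names `unquote_path`, `_SIMPLE` and `_ESCAPE` are hypothetical and not part of unidiff; it assumes Python 3 and that octal escapes always have exactly three digits.

```python
import re

# Byte values for the single-character escapes (assumed set, mirroring the dict above).
_SIMPLE = {'a': 7, 'b': 8, 't': 9, 'n': 10, 'v': 11, 'f': 12, 'r': 13,
           '"': 34, '\\': 92}
# A backslash followed by a 3-digit octal byte (\000-\377) or a simple escape.
_ESCAPE = re.compile(r'\\([0-3][0-7]{2}|[abfnrtv"\\])')


def unquote_path(text):
    """Decode a c-quoted pathname to str, or to bytes if it is not valid UTF-8."""
    if not (len(text) >= 2 and text.startswith('"') and text.endswith('"')):
        return text  # not quoted, nothing to do
    body = text[1:-1]
    buf = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(body):
        literal = body[pos:match.start()]
        if '\\' in literal:
            raise ValueError('Invalid escape sequence in %r' % text)
        buf.extend(literal.encode('utf-8'))
        token = match.group(1)
        buf.append(_SIMPLE[token] if token in _SIMPLE else int(token, 8))
        pos = match.end()
    tail = body[pos:]
    if '\\' in tail:
        raise ValueError('Invalid escape sequence in %r' % text)
    buf.extend(tail.encode('utf-8'))
    try:
        return buf.decode('utf-8')
    except UnicodeDecodeError:
        return bytes(buf)  # keep the raw bytes rather than failing


print(unquote_path(r'"za\305\274\303\263\305\202\304\207.txt"'))  # zażółć.txt
print(unquote_path(r'"bad\377name"'))                             # b'bad\xffname'
```

Returning two different types is awkward for callers, so whether this fallback is preferable to raising an error is a design question for the library.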
