Latest release v0.7.5 does not include the fix for quoted filenames (for non ASCII filenames) #113

Open
jnareb opened this issue Nov 16, 2023 · 3 comments

Comments

@jnareb

jnareb commented Nov 16, 2023

I was wondering why unidiff fails on changes to files with filenames that include characters outside 7-bit ASCII, and it turns out that the latest release v0.7.5 does not include commit 2771a87 (Support quoted filenames, 2023-06-02).

Could we please get a new release with this fix included?

Thanks in advance.

jnareb added a commit to ncusi/MSR_Challenge_2024 that referenced this issue Nov 20, 2023
There were 5 cases of error when processing a commit (with ChatGPT
sharing link in commit message):
- 4 caused by pathnames with characters outside 7-bit ASCII,
  which makes git-diff use the quoted format for pathnames;
  unidiff library includes the fix, but it is not yet released
  matiasb/python-unidiff#113
- 1 caused by change being to a submodule rather than to file,
  or to be more exact moving from one version of subproject
  to the other (clone was not done using --recursive option)
- 1 UnidiffParseError('Target without source: ...')
  with a creation diff (source is /dev/null) with quoted
  destination name (containing spaces)

Lines survival stats: 76.89% lines survived
(in 694 commits in 76 projects).
@matiasb
Owner

matiasb commented Nov 30, 2023

Will prepare a release in the upcoming days 👍

@jnareb
Author

jnareb commented Dec 13, 2023

Unfortunately, commit 2771a87 does not fully solve the problem of c-style quoted filenames.

It lets unidiff parse a patch with quoted filenames, but unidiff then reproduces those filenames in their original quoted form. Shouldn't unidiff decode such a filename to str if possible, and to bytes if not (e.g. invalid UTF-8)?

All the commit does is make unidiff able to remove the "a/" or "b/" prefix from filenames even when they are in their c-quoted form.

@jnareb
Author

jnareb commented Dec 14, 2023

Here is some slightly ugly code that actually tries to decode a c-quoted filename; not tested on Python 2.

def decode_c_quoted_str(text):
    """C-style name unquoting

    See unquote_c_style() function in 'quote.c' file in git/git source code
    https://github.com/git/git/blob/master/quote.c#L401

    This is subset of escape sequences supported by C and C++
    https://learn.microsoft.com/en-us/cpp/c-language/escape-sequences

    :param str text: string which may be c-quoted
    :return: decoded string
    :rtype: str
    """
    # TODO?: Make it a global variable
    escape_dict = {
        'a': '\a',  # Bell (alert)
        'b': '\b',  # Backspace
        'f': '\f',  # Form feed
        'n': '\n',  # New line
        'r': '\r',  # Carriage return
        't': '\t',  # Horizontal tab
        'v': '\v',  # Vertical tab
    }

    quoted = text.startswith('"') and text.endswith('"')
    if quoted:
        text = text[1:-1]  # remove quotes

        buf = bytearray()
        escaped = False  # TODO?: switch to state = 'NORMAL', 'ESCAPE', 'ESCAPE_OCTAL'
        oct_str = ''

        for ch in text:
            if not escaped:
                if ch != '\\':
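                    # encode rather than append(ord(ch)): an unescaped
                    # non-ASCII character does not fit in a single byte
                    buf.extend(ch.encode('utf-8'))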
                else:
                    escaped = True
                    oct_str = ''
            else:
                if ch in ('"', '\\'):
                    buf.append(ord(ch))
                    escaped = False
                elif ch in escape_dict:
                    buf.append(ord(escape_dict[ch]))
                    escaped = False
                elif '0' <= ch <= '7':  # octal values with first digit over 3 overflow
                    oct_str += ch
                    if len(oct_str) == 3:
                        byte = int(oct_str, base=8)  # byte in octal notation
                        if byte > 255:  # \400 and above do not fit in one byte
                            raise ValueError(f'Invalid octal escape sequence \\{oct_str} in "{text}"')
                        buf.append(byte)
                        escaped = False
                        oct_str = ''
                else:
                    raise ValueError(f'Unexpected character \'{ch}\' in escape sequence when parsing "{text}"')

        if escaped:
            raise ValueError(f'Unfinished escape sequence when parsing "{text}"')

        text = buf.decode()

    return text
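
For comparison, here is a more compact variant of the same idea, as a rough sketch only. It adds the "str if possible, bytes if not" fallback asked about earlier in the thread. The name `unquote_path` and the regex-based approach are my own assumptions, not part of unidiff or git. Unlike the function above, it is lenient: a backslash that does not start a recognized escape is passed through unchanged instead of raising.

```python
import re

# Hypothetical helper, NOT part of unidiff: decode a git c-quoted path.
# Single-character escapes mapped to their byte values.
_SIMPLE = {'a': 7, 'b': 8, 'f': 12, 'n': 10, 'r': 13, 't': 9, 'v': 11,
           '"': 34, '\\': 92}
# Three-digit octal escapes (\000-\377) or one of the simple escapes.
_ESCAPE = re.compile(r'\\([0-3][0-7]{2}|[abfnrtv"\\])')


def unquote_path(path):
    """Return str when the decoded bytes are valid UTF-8, bytes otherwise."""
    if not (len(path) >= 2 and path.startswith('"') and path.endswith('"')):
        return path  # not c-quoted, nothing to do
    inner = path[1:-1]
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(inner):
        # copy the literal text between escapes (may contain non-ASCII)
        out += inner[pos:m.start()].encode('utf-8')
        tok = m.group(1)
        out.append(int(tok, 8) if tok[0].isdigit() else _SIMPLE[tok])
        pos = m.end()
    out += inner[pos:].encode('utf-8')
    raw = bytes(out)
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw  # e.g. a Latin-1 encoded filename
```

With this sketch, `unquote_path('"b/\\305\\274\\303\\263\\305\\202w.txt"')` gives `'b/żółw.txt'`, while `unquote_path('"\\377"')` falls back to `b'\xff'` because that byte is not valid UTF-8.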
