utf-8 codec error #5

harsha436 opened this issue May 13, 2021 · 6 comments

@harsha436

We are seeing the issue below with some changes during the migration. We can proceed with the migration by excluding the paths in the YAML file, but this is the third time we have hit this error. Any idea what this is about and how to resolve it?

root@dc2-p4-gl-05:/scm/p4transfer # tail -20 log-P4Transfer-20210510090851.log

'rev': 1,
'time': datetime.datetime(2017, 7, 25, 7, 14, 59),
'type': 'text+kx',
'user': 'amirl'}]}]
2021-05-10 09:16:41,946:P4Transfer:DEBUG: src('sync', '//hmallesh_test_transfer/...@=219564')
2021-05-10 09:16:52,093:P4Transfer:DEBUG: src[]
2021-05-10 09:16:53,552:P4Transfer:ERROR: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
Traceback (most recent call last):
  File "P4Transfer.py", line 2149, in replicate
    num_changes = self.replicate_changes()
  File "P4Transfer.py", line 1965, in replicate_changes
    fileRevs, branchRevs = self.source.getChange(change['change'])
  File "P4Transfer.py", line 1114, in getChange
    chRev.updateDigest()
  File "P4Transfer.py", line 524, in updateDigest
    self.fileSize, self.digest = getKTextDigest(self.fixedLocalFile)
  File "P4Transfer.py", line 425, in getKTextDigest
    contents = contents.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
2021-05-10 09:16:53,552:P4Transfer:INFO: Sleeping on error for 60 minutes
root@dc2-p4-gl-05:/scm/p4transfer #

@th3de3th

th3de3th commented Oct 14, 2023

Got almost the same issue on the recent version.

2023-10-14 15:29:13,115:P4Transfer:DEBUG: filelogs count: 2369
2023-10-14 15:29:13,117:P4Transfer:DEBUG: src('sync', '//<REMOVEDWSNAME>/...@=54236')
2023-10-14 15:29:13,176:P4Transfer:WARNING: src:sync-msg, //<REMOVEDWSNAME>//...@=54236 - file(s) up-to-date.
2023-10-14 15:29:13,176:P4Transfer:DEBUG: src[]
2023-10-14 15:29:13,187:P4Transfer:ERROR: 'utf-8' codec can't decode byte 0x96 in position 9489: invalid start byte
Traceback (most recent call last):
  File "/opt/p4transfer/p4transfer/P4Transfer.py", line 2632, in replicate
    num_changes = self.replicate_changes()
  File "/opt/p4transfer/p4transfer/P4Transfer.py", line 2448, in replicate_changes
    fileRevs, specialMoveRevs, srcFileLogs = self.source.getChange(change['change'])
  File "/opt/p4transfer/p4transfer/P4Transfer.py", line 1291, in getChange
    chRev.updateDigest()
  File "/opt/p4transfer/p4transfer/P4Transfer.py", line 626, in updateDigest
    self.fileSize, self.digest = getKTextDigest(self.fixedLocalFile)
  File "/opt/p4transfer/p4transfer/P4Transfer.py", line 521, in getKTextDigest
    contents = contents.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 9489: invalid start byte

In the p4transfer config file, both the source and target servers are set to no charset (p4charset: is empty), since both servers are non-Unicode.

What I've managed to find out is that the file causing the error is of type text+k (it uses keyword tags such as $File/$Change/$DateTime/$Revision) and contains the byte 0x96 at the reported position 9489 (vim shows it as <96>), alongside otherwise ordinary text. Here is a good explanation of the byte itself: https://unix.stackexchange.com/a/495650. It describes 0x96 as a cp1252 character, which is why decoding it as UTF-8 fails.
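The failure can be reproduced in isolation. This is a minimal sketch, assuming the byte really is a cp1252 en dash as the linked answer describes; the sample text is made up for illustration and is not the real file.

```python
# Hypothetical sample: ASCII text with a single cp1252 en dash (byte 0x96).
raw = b"pages 10\x9620"

try:
    raw.decode()  # bytes.decode() defaults to UTF-8, as in getKTextDigest
    error = None
except UnicodeDecodeError as exc:
    error = exc

# 0x96 has the bit pattern of a UTF-8 continuation byte,
# so it cannot start a character on its own.
print(error.reason)  # invalid start byte
print(error.start)   # 8

# Decoded as cp1252, the same byte becomes U+2013 (en dash).
as_cp1252 = raw.decode("cp1252")
print(as_cp1252 == "pages 10\u201320")  # True
```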

The file itself, on the target server's local filesystem (as well as on the source server), is reported as "unknown-8bit" encoding:

root@perforce:/opt/p4transfer# file -bi  /opt/p4transfer/transfers/<REMOVED>
text/plain; charset=unknown-8bit

Of course, as a very dirty workaround, we could edit or re-encode the file. However, that is not permitted, since we don't own the code and can't predict the consequences of such a change.

Can we just force encoding="ISO-8859-1" (latin1), or even better None? Or, better still, not try to re-encode file contents at all during sync?
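On the ISO-8859-1 idea: Python's latin-1 codec maps each of the 256 byte values to the code point with the same number, so decoding cannot fail and re-encoding returns the original bytes. A quick check, not taken from P4Transfer:

```python
import hashlib

# Every possible byte value, including 0x96 and 0xd1 from the tracebacks.
raw = bytes(range(256))

text = raw.decode("latin-1")         # never raises: one byte -> one code point
round_trip = text.encode("latin-1")  # exact inverse of the decode above

same_bytes = round_trip == raw
same_digest = hashlib.md5(round_trip).digest() == hashlib.md5(raw).digest()
print(same_bytes, same_digest)  # True True
```

The decoded text is not necessarily what the author typed (0x96 becomes the control character U+0096 rather than an en dash), but the bytes fed to the digest are unchanged.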

Here are all the places where a file is opened; only one explicitly uses binary mode:

$ grep -n "open(" p4transfer/P4Transfer.py 
469:    with open(fname, flags) as fh:
479:    with open(fname, flags) as fh:
504:    with open(fname, "rb") as f:
2210:            with open(fpath, "a") as fh:
2232:        with open(fpath, "a") as fh:
2316:            with open(self.options.config) as f:

According to https://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html#the-binary-option, binary mode looks like the only reliable way to avoid corrupting the data.
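To illustrate the binary option, here is a rough sketch of a keyword-skipping digest that never decodes the file. It is not P4Transfer's implementation: the function name ktext_digest_bytes and the keyword pattern are assumptions (the real re_rcs_keywords is not shown in this thread), and the digest format P4Transfer expects may differ.

```python
import hashlib
import os
import re
import tempfile

# Assumed pattern covering only the keywords mentioned above;
# P4Transfer's actual re_rcs_keywords is not shown in this thread.
RE_KEYWORDS = re.compile(rb"\$(?:File|Change|DateTime|Revision)(?::[^$\n]*)?\$")


def ktext_digest_bytes(fname):
    """Return (size, md5 hex digest) over lines that contain no keyword."""
    m = hashlib.md5()
    size = 0
    with open(fname, "rb") as fh:
        for line in fh:  # binary iteration splits on b"\n" and keeps it
            if not RE_KEYWORDS.search(line):
                m.update(line)
                size += len(line)
    return size, m.hexdigest()


# Usage with a made-up file: one keyword line, one line holding byte 0x96.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"// $Change: 54236 $\nna\x96me\n")
size, digest = ktext_digest_bytes(tmp.name)
os.unlink(tmp.name)

print(size)  # 6
print(digest == hashlib.md5(b"na\x96me\n").hexdigest())  # True
```

Because the regular expression and the hash both operate on bytes, no codec is involved and no byte value can raise a UnicodeDecodeError.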

Why is the encoding/decoding still in place? It is something that could lead to data corruption.
Can we introduce an option to disable re-tagging for +k files? That would avoid this kind of issue.

Any thoughts?

@th3de3th

th3de3th commented Oct 14, 2023

This solved my issue:

$ diff -u P4Transfer.py P4Transfer.py.modified.py 
--- P4Transfer.py	2023-09-21 16:31:03.163049658 +0000
+++ P4Transfer.py.modified.py	2023-10-14 21:50:24.936233459 +0000
@@ -518,12 +518,12 @@
     "Special calculation for ktext files - ignores lines with keywords in them"
     contents = readContents(fname)
     if python3:
-        contents = contents.decode()
+        contents = contents.decode(errors='surrogateescape')
     m = hashlib.md5()
     # Optimisation to search on whole file
     if not re_rcs_keywords.search(contents):
         if python3:
-            m.update(contents.encode())
+            m.update(contents.encode(errors='surrogateescape'))
         else:
             m.update(contents)
         fileSize = os.path.getsize(fname)
@@ -533,7 +533,7 @@
     for line in lines:
         if not re_rcs_keywords.search(line):
             if python3:
-                m.update(line.encode())
+                m.update(line.encode(errors='surrogateescape'))
             else:
                 m.update(line)
             fileSize += len(line)
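Why the patch above keeps the digest stable can be checked in a few lines. With the surrogateescape error handler (PEP 383), each undecodable byte is mapped to a lone surrogate in the range U+DC80..U+DCFF on decode and mapped back to the same byte on encode. The sample bytes below are made up, using the two byte values reported in this thread:

```python
import hashlib

# Made-up content containing the two bytes from the tracebacks.
raw = b"text \x96 more \xd1 text\n"

text = raw.decode(errors="surrogateescape")       # bad bytes -> lone surrogates
restored = text.encode(errors="surrogateescape")  # lone surrogates -> same bytes

escaped = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]
print(escaped)          # ['0xdc96', '0xdcd1']
print(restored == raw)  # True
print(hashlib.md5(restored).hexdigest() == hashlib.md5(raw).hexdigest())  # True

# A plain encode() rejects the lone surrogates, which is why every
# encode() call in the diff needs the same error handler.
try:
    text.encode()
    strict_encode_fails = False
except UnicodeEncodeError:
    strict_encode_fails = True
print(strict_encode_fails)  # True
```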

@th3de3th

@rcowham would you mind commenting on that?

@rcowham
Contributor

rcowham commented Oct 15, 2023

Hmmm. Looks OK as a workaround. I wonder if it is better to allow an extra locale setting to be specified?

@th3de3th

I'm not sure that would provide a reliable way to keep data consistent while encoding, since we can never be sure which encoding clients are using unless it is aligned across all developers, which I believe is rare. More commonly, several teams each use their own encoding settings.
However, the ability to configure this would at least give administrators a tool to try to manage it on their own.
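As a sketch of what such a setting could look like: the helpers below take the codec as a parameter and keep surrogateescape as a safety net, so bytes that are invalid in the chosen codec still round-trip. The function names and the idea of an encoding option are hypothetical and do not exist in P4Transfer.

```python
def decode_contents(raw, encoding="utf-8", errors="surrogateescape"):
    """Decode file bytes with an administrator-chosen codec (hypothetical option)."""
    return raw.decode(encoding, errors)


def encode_contents(text, encoding="utf-8", errors="surrogateescape"):
    """Inverse of decode_contents for the same codec and error handler."""
    return text.encode(encoding, errors)


# 0x96 is valid in cp1252 but not in UTF-8; 0x81 is undefined in Python's cp1252.
raw = b"a\x96b\x81"

results = {}
for encoding in ("utf-8", "cp1252", "latin-1"):
    text = decode_contents(raw, encoding)
    results[encoding] = encode_contents(text, encoding) == raw

print(results)  # {'utf-8': True, 'cp1252': True, 'latin-1': True}
```

Whichever codec is configured, the bytes survive the round trip. The codec only changes how the text looks in between, which matters if anything inspects or rewrites the decoded content.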

@th3de3th

Also, in some cases it could be a good idea to make processing of text+k files configurable so that it can be disabled entirely; this would eliminate this kind of issue altogether. I believe some setups do not require updating the values of RCS tags during migration.
