utf-8 codec error #5

harsha436 · 2021-05-13T07:36:01Z

We are seeing the issue below with some changes during the migration. We can proceed with the migration by excluding the paths in the yaml file, but this is the third time we have hit this error. Any idea what causes it and how to resolve it?

root@dc2-p4-gl-05:/scm/p4transfer # tail -20 log-P4Transfer-20210510090851.log

'rev': 1,
'time': datetime.datetime(2017, 7, 25, 7, 14, 59),
'type': 'text+kx',
'user': 'amirl'}]}]
2021-05-10 09:16:41,946:P4Transfer:DEBUG: src('sync', '//hmallesh_test_transfer/...@=219564')
2021-05-10 09:16:52,093:P4Transfer:DEBUG: src[]
2021-05-10 09:16:53,552:P4Transfer:ERROR: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
Traceback (most recent call last):
  File "P4Transfer.py", line 2149, in replicate
    num_changes = self.replicate_changes()
  File "P4Transfer.py", line 1965, in replicate_changes
    fileRevs, branchRevs = self.source.getChange(change['change'])
  File "P4Transfer.py", line 1114, in getChange
    chRev.updateDigest()
  File "P4Transfer.py", line 524, in updateDigest
    self.fileSize, self.digest = getKTextDigest(self.fixedLocalFile)
  File "P4Transfer.py", line 425, in getKTextDigest
    contents = contents.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 0: invalid continuation byte
2021-05-10 09:16:53,552:P4Transfer:INFO: Sleeping on error for 60 minutes
root@dc2-p4-gl-05:/scm/p4transfer #

th3de3th · 2023-10-14T15:41:38Z

Hit almost the same issue on a recent version.

2023-10-14 15:29:13,115:P4Transfer:DEBUG: filelogs count: 2369
2023-10-14 15:29:13,117:P4Transfer:DEBUG: src('sync', '//<REMOVEDWSNAME>/...@=54236')
2023-10-14 15:29:13,176:P4Transfer:WARNING: src:sync-msg, //<REMOVEDWSNAME>//...@=54236 - file(s) up-to-date.
2023-10-14 15:29:13,176:P4Transfer:DEBUG: src[]
2023-10-14 15:29:13,187:P4Transfer:ERROR: 'utf-8' codec can't decode byte 0x96 in position 9489: invalid start byte
Traceback (most recent call last):
  File "/opt/p4transfer/p4transfer/P4Transfer.py", line 2632, in replicate
    num_changes = self.replicate_changes()
  File "/opt/p4transfer/p4transfer/P4Transfer.py", line 2448, in replicate_changes
    fileRevs, specialMoveRevs, srcFileLogs = self.source.getChange(change['change'])
  File "/opt/p4transfer/p4transfer/P4Transfer.py", line 1291, in getChange
    chRev.updateDigest()
  File "/opt/p4transfer/p4transfer/P4Transfer.py", line 626, in updateDigest
    self.fileSize, self.digest = getKTextDigest(self.fixedLocalFile)
  File "/opt/p4transfer/p4transfer/P4Transfer.py", line 521, in getKTextDigest
    contents = contents.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 9489: invalid start byte

In the p4transfer config file, neither server has a charset set (p4charset: is empty), since both servers are non-unicode.

What I've managed to find out is that the file causing the error is of type text+k (it uses tags such as $File/$Change/$DateTime/$Revision) and contains the byte 0x96 at the mentioned position 9489 (vim shows it as <96>) alongside otherwise ordinary text. Here is a good explanation of the symbol itself: https://unix.stackexchange.com/a/495650. It describes the byte as a cp1252 symbol, which is what breaks decoding as utf-8.

The file itself, on the target server's local filesystem (as well as on the source server), is reported as "unknown-8bit" encoding:

root@perforce:/opt/p4transfer# file -bi  /opt/p4transfer/transfers/<REMOVED>
text/plain; charset=unknown-8bit
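Both error messages in this thread can be reproduced in isolation. The snippet below is only an illustrative sketch with made-up sample bytes (not the actual file contents): it shows why a bare `decode()` fails on 0x96 and 0xd1, and what the same bytes look like under cp1252 and latin-1.

```python
# Made-up sample data; the real file contents are not shown in this thread.
raw_96 = b"range 1\x962"   # 0x96 is an en dash in cp1252
raw_d1 = b"\xd1abc"        # 0xd1 starts a 2-byte UTF-8 sequence, but 'a' cannot continue it

errors = {}
for name, raw in (("0x96", raw_96), ("0xd1", raw_d1)):
    try:
        raw.decode()  # default codec is UTF-8, as in getKTextDigest
    except UnicodeDecodeError as exc:
        errors[name] = str(exc)

# cp1252 maps 0x96 to U+2013 (en dash); latin-1 maps every byte to the
# code point with the same number, so it never fails and round-trips.
as_cp1252 = raw_96.decode("cp1252")
as_latin1 = raw_96.decode("latin-1")

print(errors["0x96"])
print(errors["0xd1"])
print(repr(as_cp1252), repr(as_latin1))
```

Note that latin-1 decoding always succeeds but does not tell you what the author meant; it only preserves the bytes.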

As a very dirty solution we could of course change or re-encode the file, but that is not permitted: we don't own the code and can't predict the consequences of such a change.

Can we just force encoding="ISO-8859-1" (latin1), or even better None? Or, better still, not try to re-encode any file contents during sync at all?

Here are all the places where files are opened; only one passes "rb" explicitly:

$ grep -n "open(" p4transfer/P4Transfer.py 
469:    with open(fname, flags) as fh:
479:    with open(fname, flags) as fh:
504:    with open(fname, "rb") as f:
2210:            with open(fpath, "a") as fh:
2232:        with open(fpath, "a") as fh:
2316:            with open(self.options.config) as f:

According to https://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html#the-binary-option, the binary option looks like the only reliable way to avoid corrupting the data.

Why is the encoding/decoding still in place? It is something that could lead to data corruption.
Can we introduce an option to disable re-tagging for +k files? That would avoid this kind of issue.
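To illustrate the binary approach, here is a minimal sketch of a keyword-ignoring digest that never decodes the contents. It is not the P4Transfer implementation: `ktext_digest_bytes` and `RE_KEYWORDS` are hypothetical names, and the pattern is only an assumption about which RCS keywords matter (the real `re_rcs_keywords` may differ).

```python
import hashlib
import re

# Hypothetical bytes pattern for expanded RCS keywords such as "$Change: 123 $".
# Assumption: the real re_rcs_keywords in P4Transfer.py may be different.
RE_KEYWORDS = re.compile(
    rb"\$(Id|Header|Author|Date|DateTime|Change|File|Revision):[^$\n]*\$"
)


def ktext_digest_bytes(data: bytes):
    """Return (size, md5 hex digest) over the lines that contain no keywords.

    Works purely on bytes, so the file's encoding never matters.
    """
    m = hashlib.md5()
    size = 0
    for line in data.splitlines(keepends=True):
        if not RE_KEYWORDS.search(line):
            m.update(line)
            size += len(line)
    return size, m.hexdigest()


# Two revisions differing only in keyword expansion give the same digest,
# even though the content contains a non-UTF-8 byte (0x96).
rev_a = b"a \x96 b\n// $Change: 1 $\nend\n"
rev_b = b"a \x96 b\n// $Change: 2 $\nend\n"
print(ktext_digest_bytes(rev_a) == ktext_digest_bytes(rev_b))
```

The trade-off is that the regular expression has to be a bytes pattern, so any keyword matching that relies on text semantics would need to be revisited.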

Any thoughts?

th3de3th · 2023-10-14T21:44:49Z

This solved my issue:

$ diff -u P4Transfer.py P4Transfer.py.modified.py 
--- P4Transfer.py	2023-09-21 16:31:03.163049658 +0000
+++ P4Transfer.py.modified.py	2023-10-14 21:50:24.936233459 +0000
@@ -518,12 +518,12 @@
     "Special calculation for ktext files - ignores lines with keywords in them"
     contents = readContents(fname)
     if python3:
-        contents = contents.decode()
+        contents = contents.decode(errors='surrogateescape')
     m = hashlib.md5()
     # Optimisation to search on whole file
     if not re_rcs_keywords.search(contents):
         if python3:
-            m.update(contents.encode())
+            m.update(contents.encode(errors='surrogateescape'))
         else:
             m.update(contents)
         fileSize = os.path.getsize(fname)
@@ -533,7 +533,7 @@
     for line in lines:
         if not re_rcs_keywords.search(line):
             if python3:
-                m.update(line.encode())
+                m.update(line.encode(errors='surrogateescape'))
             else:
                 m.update(line)
             fileSize += len(line)
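Why this patch is safe for the digest can be checked on its own: with `errors='surrogateescape'`, undecodable bytes are mapped to lone surrogates (U+DC80 to U+DCFF) on decode and mapped back to the original bytes on encode. A small sketch with made-up sample bytes:

```python
# Made-up sample containing two bytes that are invalid as UTF-8 here.
raw = b"text \x96 more \xd1"

# Decoding no longer raises; each bad byte becomes a lone surrogate.
text = raw.decode(errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly,
# so an MD5 computed over the re-encoded data matches the file on disk.
roundtrip = text.encode(errors="surrogateescape")

# A plain encode() would fail, which is why all three call sites in the
# diff need the handler, not only the decode().
try:
    text.encode()
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

print(roundtrip == raw, repr(text[5]), strict_encode_failed)
```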

th3de3th · 2023-10-14T21:53:38Z

@rcowham would you mind commenting on that?

rcowham · 2023-10-15T06:24:51Z

Hmmm. Looks OK as a workaround. I wonder if it is better to allow an extra locale setting to be specified?

th3de3th · 2023-10-15T11:10:50Z

Not sure that it can provide a reliable way to keep data consistent while encoding, since we can never be sure which encoding clients are using unless it is aligned across all developers, which I believe is rare. More commonly, several teams each use their own encoding settings.
However, the ability to configure this would at least give administrators a tool to try to manage it on their own.
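One way the two ideas could be combined, sketched under assumptions: try an administrator-configured codec first and fall back to `surrogateescape` when it does not fit. `decode_contents` and its `encoding` parameter are hypothetical; no such option exists in P4Transfer as far as this thread shows.

```python
def decode_contents(data: bytes, encoding: str = "utf-8"):
    """Decode with the configured codec, falling back to a lossless escape.

    Hypothetical helper, not part of P4Transfer. Returns (text, errors),
    where errors is the handler to pass to text.encode(encoding, errors=errors)
    in order to recover the original bytes.
    """
    try:
        return data.decode(encoding), "strict"
    except UnicodeDecodeError:
        # The configured codec was wrong for this file; keep the bytes intact.
        return data.decode(encoding, errors="surrogateescape"), "surrogateescape"


sample = b"a\x96b"

# With the right codec configured, the text is decoded properly.
text_cp, errors_cp = decode_contents(sample, "cp1252")

# With the default UTF-8, the fallback keeps the data round-trippable.
text_u8, errors_u8 = decode_contents(sample, "utf-8")

print(repr(text_cp), errors_cp)
print(repr(text_u8), errors_u8)
```

Either way the caller can re-encode to the exact original bytes, so a wrong setting degrades to the workaround above rather than to an exception.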

th3de3th · 2023-10-15T11:15:25Z

Also, for some cases it could be a good idea to make processing of text+k files configurable, so that it can be disabled entirely; this would eliminate this kind of issue altogether. I believe some setups do not need the values of RCS tags updated during migration.

th3de3th mentioned this issue Oct 15, 2023

UnicodeDecodeError during transfer of first changelist #16

Open
