Crash, most likely due to non-unicode characters in file name #9

Open
dmitry-irtegov opened this issue Jul 24, 2024 · 0 comments

Hello!
Thanks for the useful idea!

I tried your program on a large archive under a UTF-8 locale, and it crashed with the following stack trace:
Traceback (most recent call last):
  File "tarindexer.py", line 123, in <module>
    main()
  File "tarindexer.py", line 118, in main
    indextar(dbtarfile,indexfile)
  File "tarindexer.py", line 66, in indextar
    outfile.write(rec)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 40-47: surrogates not allowed
The file name that most likely triggered the crash is
\317\360\356\341\353\345\354\373\ \341\345\347\356\357\340\361\355\356\361\362\350\ \342\ \310\322.pdf
(as output by ls -b), which indeed does not look like valid UTF-8.
Unfortunately, I cannot send you the archive, mainly because this file and the surrounding files are rather large.
Although having this file in the archive is my fault, I think the program should avoid crashing, perhaps by printing ls -b-style output instead.
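
To illustrate the kind of fallback I have in mind, here is a minimal sketch. It assumes the member names come from Python's tarfile module, which by default decodes undecodable bytes with the surrogateescape error handler (that would explain the "surrogates not allowed" message). The helper name escape_name is hypothetical and not part of tarindexer.py, and the escaping only roughly follows ls -b (spaces are left as-is).

```python
def escape_name(name: str) -> str:
    """Return name unchanged if it is encodable as UTF-8.

    Otherwise recover the original bytes and render every non-ASCII or
    control byte as a backslash followed by three octal digits.
    NOTE: hypothetical helper, not part of tarindexer.py.
    """
    try:
        name.encode("utf-8")
        return name
    except UnicodeEncodeError:
        # surrogateescape turns the lone surrogates back into the raw bytes
        raw = name.encode("utf-8", errors="surrogateescape")

    parts = []
    for b in raw:
        if b == 0x5C:
            parts.append("\\\\")        # escape the backslash itself
        elif 0x20 <= b < 0x7F:
            parts.append(chr(b))        # printable ASCII is kept as-is
        else:
            parts.append("\\%03o" % b)  # everything else as octal
    return "".join(parts)


# Simulate a name that tarfile would return for the bytes \317\360\356.pdf
bad = b"\xcf\xf0\xee.pdf".decode("utf-8", errors="surrogateescape")
print(escape_name(bad))
```

Passing each name through something like this before outfile.write(rec) should avoid the exception. Note that in a name that fails to encode, valid multi-byte UTF-8 characters are also octal-escaped, and the index would then hold the escaped form rather than the raw name, so whatever reads the index would need to be aware of that.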
