wrong encoding of national characters in file and folder names + folders disappearing #68
Well, to investigate this issue I have also asked on the Subsonic forum.
The same problem is still being actively investigated in issue #54. I know that a number of people have complained about encoding problems, so I work on it whenever time permits (which, unfortunately, is not too often at the moment). I suggest following #54 as well. This Stack Overflow question describes a case where the page sent from a (XAMPP) web server was not encoded in UTF-8 (while telling the browser that the encoding is UTF-8), thereby leading to the display of those pesky question marks. I am wondering if something similar happens in Supersonic. What makes the issue even more complicated is the fact that Supersonic uses a combination of the directory name and tag data to convey information. What application server are you running? Jetty or something else?
Could you also please provide a link to the Subsonic forum post you mentioned?
Never mind, I think I found your post.
Sorry, I was away for a week. The post on the Subsonic forum is mine; you found it. What I found is this:
It is from the database log. The problem is already visible here, so it exists at least from the point where the information is saved to the database. Couldn't it be a problem with reading (or re-coding, or whatever is done with it) the directory content?
I am by no means certain that there is a UTF-8 encoding problem within the client/server communication, as I haven't had the time to dive into the issue more deeply. What I consider more likely at this moment (because I spent quite some time on the encoding topic during issue #54) is that Supersonic reads the file/directory name off the disk using a specific encoding (Java uses UTF-8 by default) when it may have been written in some other character set in the first place. File systems (at least POSIX ones) store file/directory names as raw binary data using whatever encoding the writing application chose, so there's no way to determine the right character set for decoding when reading the data. This is unlike MP3's ID3, where encoding markers are used (though not always in a canonical manner; see #54). Another possibility: Supersonic's HSQLDB is storing the data in a different encoding. Could you please try the following: verify that you're using a UTF-8 encoding, make a copy of one of the affected video files on the shell (using a slightly different name which still contains special characters), and check afterwards whether it is indexed and displayed correctly in Supersonic. My theory is that if the file name's encoding was flawed before, creating a copy of the file in a correctly set up (UTF-8) locale should force the shell to write a properly encoded file name. I'll try to collect more information on Supersonic's inner mechanics w.r.t. video file indexing/parsing as soon as I can. Cheers, --Timo
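To illustrate the theory above, here is a minimal Java sketch of what happens when file-name bytes written in one character set are decoded with another. The class name, the helper methods, and the sample name are made up for this example; none of this is taken from Supersonic's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FileNameDecodingDemo {

    // Interpret the raw bytes of a file name using the given charset.
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    // True if decoding produced U+FFFD, the replacement character that
    // browsers usually render as a question mark symbol.
    static boolean hasReplacementChar(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // Hypothetical file name containing one accented letter (e with acute),
        // written as an escape so the example does not depend on the
        // encoding of this source file.
        String name = "l\u00E9to";

        byte[] asLatin1 = name.getBytes(StandardCharsets.ISO_8859_1);
        byte[] asUtf8 = name.getBytes(StandardCharsets.UTF_8);

        // Written as ISO-8859-1 but read as UTF-8: the single 0xE9 byte is
        // not a valid UTF-8 sequence, so the decoder substitutes U+FFFD.
        String misread = decode(asLatin1, StandardCharsets.UTF_8);
        System.out.println("replacement char present: " + hasReplacementChar(misread));

        // Written and read as UTF-8: the name survives unchanged.
        String roundTrip = decode(asUtf8, StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + name.equals(roundTrip));
    }
}
```

If the affected names on disk were written in a non-UTF-8 character set, this is the kind of substitution that could produce the question marks shown in the web interface.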
Addendum: It seems that Java isn't using UTF-8 by default when reading files off the disk; instead, the system encoding is used (source). I also verified that in the case of video files, Supersonic simply uses the file name as the title (via the File class). So if anything goes wrong encoding-wise when reading in the file name, it ends up being broken in Supersonic.
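A quick way to see which encoding a given JVM actually picked up is to print the relevant settings from inside the same environment the daemon runs in. The sketch below is only an assumption about what is worth inspecting: the class name is made up, and `sun.jnu.encoding` is an implementation-specific property that may be absent on some JVMs (in which case `null` is printed).

```java
import java.nio.charset.Charset;

class DefaultEncodingReport {

    // Collect the encoding-related settings visible to this JVM.
    static String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("Charset.defaultCharset(): ").append(Charset.defaultCharset()).append('\n');
        sb.append("file.encoding: ").append(System.getProperty("file.encoding")).append('\n');
        // Implementation-specific; may be null on some JVMs.
        sb.append("sun.jnu.encoding: ").append(System.getProperty("sun.jnu.encoding")).append('\n');
        // Locale variables as inherited from the process that started the JVM.
        sb.append("LANG: ").append(System.getenv("LANG")).append('\n');
        sb.append("LC_ALL: ").append(System.getenv("LC_ALL")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Running this from a login shell and again from the daemon's start script would show whether the two environments differ.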
I just realized that you said
Does that mean you had another locale configured when the file names in question were created? If so, that'd explain the encoding mismatch and why Java fails to read in the file names correctly. Re-writing all names in UTF-8 should do the trick then. A lot of speculation on my part. I'll refrain from putting down more theories until I hear from you again. ;)
Well, I included the terminal output of ls for the example file in the first post. And I wrote that I have LANG and LC_* set to en_US.UTF-8 only because this is the first advice anyone gets on the Subsonic forum. So, to make it clear, my system-wide encoding is en_US.UTF-8 (so all daemons have this encoding set). The user session encoding is not explicitly set, so it also uses en_US.UTF-8. After login on any account:
So this information was there so that I would not get advice about encoding settings :) Anyway, I tried renaming something in the shell, with no effect. What is important, in my opinion, is the SQL command I found in the log, which is clearly in the wrong encoding, so Subsonic is getting the name from the File class in the wrong encoding. So I wrote a short Java program to list the directory contents, like this (shortened):
and everything is OK in the terminal output:
So it shows that there could be some problem with directory listing in Subsonic: maybe some old self-written function that resets the locale to C, or I don't know what. I have done all I can up to this point to find the problem; the only remaining step for me is debugging Subsonic on my machine...
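For anyone who wants to repeat this kind of check, a directory-listing test could look roughly like the sketch below. It is not the program referred to above; the class and method names are made up. Besides the name itself it prints the Unicode code points, so a broken name shows up as `U+FFFD` even if the terminal happens to mask the problem.

```java
import java.io.File;
import java.util.Arrays;
import java.util.stream.Collectors;

class ListDirectoryNames {

    // Render a name as a sequence of code points, e.g. "U+0041 U+016F".
    static String toCodePoints(String name) {
        return name.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Directory to list; defaults to the current directory.
        File dir = new File(args.length > 0 ? args[0] : ".");
        String[] names = dir.list();
        if (names == null) {
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        Arrays.sort(names);
        for (String name : names) {
            System.out.println(name + "  [" + toCodePoints(name) + "]");
        }
    }
}
```

To be comparable, it would need to run under the same user and start script as the daemon, since the JVM derives its encoding from the environment it is launched in.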
Thanks for the added information. I guess your directory listing sample code kind of beats my "the filename's encoding must be messed up" argument. I stepped through the relevant code myself yesterday. The method responsible for reading the metadata is MetaDataParser.getMetaData(). Since video files don't contain ID3-like tag information, the video title is guessed by a call to guessTitle(). Its implementation boils down to a call to removeTrackNumberFromTitle(), which tries to strip any track number from the title first. The title, in essence, is provided like this:
(line 129) where file is of type File. So nothing particularly magic here as far as I can see. For what it's worth, my Supersonic instance is having no issues with your filenames. I am using an all-UTF-8 system environment as well.
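As a rough illustration of the approach described above (take the name from the File object, drop the extension, strip a leading track number), here is a hypothetical sketch. It is not Supersonic's implementation: the class, the method names, and the track-number pattern are assumptions made for this example. The point is that no step re-encodes anything, so the title is only as good as the String the JVM produced for the file name.

```java
import java.io.File;

class TitleGuessSketch {

    // File name without its extension; a leading dot (hidden file) is kept.
    static String baseName(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    // Assumed pattern: up to three leading digits followed by separators
    // such as spaces, dots, or hyphens. Falls back to the input if
    // stripping would leave nothing.
    static String stripTrackNumber(String title) {
        String stripped = title.replaceFirst("^\\d{1,3}[\\s.-]+", "");
        return stripped.isEmpty() ? title : stripped;
    }

    // The guessed title is derived purely from the already-decoded name.
    static String guessTitle(File file) {
        return stripTrackNumber(baseName(file));
    }

    public static void main(String[] args) {
        // Hypothetical file name with a Czech character, written as an escape.
        System.out.println(guessTitle(new File("01 - Pr\u016Fvodce.avi")));
    }
}
```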
Well, OK, I cloned the repository and I will try to build and debug it here. I don't have any deep Java experience; in fact, the encoding test program was one of my first Java programs ;) But I had already found the getMetaData method myself before you wrote, so I will give it a try.
My post was meant as documentation for what I managed to figure out so far; I did not intend to push you into debugging things yourself. Sorry if I sounded anything like that. If you feel like trying though, I'm the last person to hold you back. :) For additional debugging-related questions, feel free to use our developers group.
Well, the worst case happened. I debugged it step by step with many problematic files, and everything works. I didn't change any configuration on the server; I only added debug parameters to the java command line so that NetBeans could connect, since Supersonic runs on a headless server. Thanks for all your help.
You didn't upgrade Supersonic (e.g., check out a more recent GitHub master copy) for debugging purposes, which may have fixed some issues, did you? Unless you did, I guess the best explanation is that you have just encountered your first Heisenbug! Congratulations, we've all been down that road. :) If you ever happen to discover what caused the encoding troubles in the first place, please let me know!
Hi all,
I have a problem with national characters. I found that the same problem occurred for some people here, but none of their solutions worked for me, or the thread was abandoned. My version is 4.7.beta1; I'm on Arch Linux, using the package from the AUR (unsupported) repository. Well, the problem is that:
What I tried: I checked all the language variables from which Java should know to use en_US.UTF-8 and display everything normally.
For example, and as proof that my system is set up correctly, this is the terminal output. This file and directory are not displayed at all:
However, in this example it is displayed, but with question marks.
And this is the Subsonic output:
_ceske » A bude h����
Up | Play all | Play random | Add all | Comment
Of course, such files cannot be played, found easily, etc. The file name is clearly malformed in the database, or read incorrectly in the indexing phase (which has the same effect).
Well, now I have LANG and all LC_* variables set to en_US.UTF-8. As I'm on Arch and start the daemon via a script, I'm also forcing LC_ALL=en_US.UTF-8 in the start script.
ags:
It is a very annoying bug, because many files are not included in the index, but I don't know which ones.