Wrong display for non-ASCII tags #54
Comments
Are you referring to the Android app, the web frontend, or both? |
I am seeing something very similar in my collection with Sigur Rós, which appears as Sigur RÃ³s in both the web and Android clients. The song result itself does not show this error, just the directory name and artist search results, leading me to believe this is isolated to portions of the backend's use of character encoding. |
portcharsui's assumption seems correct: I tested this myself by creating an album containing an ó character in the album and artist names, in both the file names and the tags. Result: everything looks fine in both the web and Android clients. My Linux box uses UTF-8 for all locale settings, including LANG and LC_ALL. Could anyone affected by the issue please check their locale and see whether switching to UTF-8 (possibly by using one of these methods) helps? (Don't forget to re-encode the ID3 portion too. I use eyeD3 for that.) |
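For anyone who wants to check this from the Java side as well: below is a minimal sketch that prints the locale-related values a JVM sees when started in the same environment. The class name `LocaleCheck` is made up for illustration, and whether Supersonic's handling of directory names actually depends on these values is an assumption, not something confirmed in this thread.

```java
import java.nio.charset.Charset;

public class LocaleCheck {
    // Name of the charset the JVM falls back to when none is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Environment variables as inherited from the shell (null if unset).
        System.out.println("LANG             = " + System.getenv("LANG"));
        System.out.println("LC_ALL           = " + System.getenv("LC_ALL"));
        System.out.println("default charset  = " + defaultCharsetName());
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        // JDK-specific property; it may be null on other JVM implementations.
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
    }
}
```

Run it as the same user, and from the same startup environment, as the servlet container; a login shell and an init script can easily end up with different locale settings.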
Closing the issue as it seems fixed. Please report back if the problem persists in a reproducible way independent of a correctly configured locale environment. |
http://www.squirrel.nl/pub/xfer/uploads/3CAcxyTsu3l9UgXeWkLkIoyQ.png This is the current (as of yesterday) git version. ID3 data for this track: id3v1 tag info for 01_4th_And_Vine.mp3: |
@sciurius, what OS are you running? If Linux, can you provide the output of the locale command? |
I think Fedora17 qualifies as Linux ;) % cat /etc/redhat-release |
Adding one such affected MP3 file to my Supersonic instance, I can confirm that the artist's name looks borked. However, when I edit the MP3 and re-set the artist's name (by means of copying it off Wikipedia and pasting it into the artist field using an MP3 tag editor on my Mac), it looks perfectly right afterwards in Supersonic: All non-ASCII characters are rendered properly. So, my conclusion is that it is not a Supersonic issue. Apparently, an invalid tag encoding has been applied in the first place. sciurius, can you try to reproduce what I did and see if it confirms my findings? |
Yes, it is relatively easy to avoid this from happening. But this does
Even if this were the case, the same tag is rendered differently in two E.g., when I search for "o'connor" I get the following results: Play Add Download Sinéad O'Connor So there are two distinct but visually identical results for Sinéad The same tracks (files) are processed by the Logitech Media Server |
The same tag isn't rendered differently in Supersonic. Instead, Supersonic uses the directory name for the album name in the breadcrumb navigation, and the tag information for the playlist content. That's why the same label is shown properly in one place and improperly in another. So it's not a bug in Supersonic, just different approaches to achieve the same thing. (Admittedly, one can argue that the implementation should be more consistent.) I'm not too familiar with Squeeze Box so I cannot tell why (and how) it does better. Maybe it reads the artist's name from another well-encoded field (like album artist) and prefers that one over the broken one. Or it uses the file name. Or, if encoding schemes are stored within MP3 tags alongside the payload (which I need to figure out), it may decode the content accordingly. I'll need to dig into the MP3 specs to tell what the best approach for Supersonic would be. |
Timo Reimann writes:
Ah! This explains some of the things I see ... Personally, I'd advise against using directory names for this purpose,
I moved the offending album to another directory and requested a scan For SqueezeboxServer it doesn't make any difference. Apparently it uses For Su*Sonic, the album does, indeed, appear under a new (artist) name. Still, while SqueezeboxServer and other tools(!) derive the correct Also still, it doesn't explain why Sinéad O'Connor is mentioned twice, BTW: The current Su*Sonic approach also means that collection albums -- Johan |
I agree with what you say about not using directory names for information display. In my defense, I didn't implement it this way, and wouldn't do so. :) I'll open a separate issue later to deal with this. In general, Su*sonic lacks the capability to fully leverage tag data, which is a shortcoming IMHO. I took just a rough look at the MP3 specification: Apparently, it does include encoding information, so it should be possible to decode tag data appropriately (except, of course, in those cases where the encoding is completely broken). We'll use this issue to track progress on the problem. |
I dug a little deeper into the Supersonic code: Apparently, it uses the Jaudiotagger library (in a fairly recent version) to decode audio tag data. One interesting bug report (JAUDIOTAGGER-179) deals with a request to allow proper decoding of MP3 tags in local encodings (like code pages). The Jaudiotagger author clearly states that anything but ISO-8859-1, variants of UTF-16, and UTF-8 (depending on the exact ID3v2 version being used) is illegal and won't be supported. It looks like auto-detecting a local encoding is not possible (at least not in a standard-compliant fashion), whereas the supported encodings identify themselves by means of a leading encoding byte (see Wikipedia). I haven't looked into the Jaudiotagger code but strongly assume that decoding standard-compliant encodings works, so the observed behavior is most likely due to one such illegal local encoding. While the Jaudiotagger author announced plans to add an option for specifying the (local) encoding used when decoding an MP3 tag, it wouldn't be a great solution IMHO because it'd only work for people willing and able to specify the employed encoding. Re-writing tag data in a supported encoding seems to be the way to go. So much for what I found out about Jaudiotagger. Next, I'll try to find out what SqueezeBox Server does. |
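To make the "encoding byte" idea concrete, here is a rough sketch of how the body of an ID3v2 text frame could be decoded according to its first byte. This is not Jaudiotagger's code; the class and method names are hypothetical, and the byte-to-charset mapping follows the ID3v2 specification as I read it (00 = ISO-8859-1, 01 = UTF-16 with BOM, 02 = UTF-16BE and 03 = UTF-8, the latter two in ID3v2.4 only). Trailing null terminators are not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Id3TextEncoding {
    // Map the ID3v2 text-encoding byte to a Java charset.
    static Charset charsetFor(int encodingByte) {
        switch (encodingByte) {
            case 0x00: return StandardCharsets.ISO_8859_1;
            case 0x01: return StandardCharsets.UTF_16;   // the BOM decides the byte order
            case 0x02: return StandardCharsets.UTF_16BE; // ID3v2.4 only, no BOM
            case 0x03: return StandardCharsets.UTF_8;    // ID3v2.4 only
            default:
                throw new IllegalArgumentException("unknown encoding byte: " + encodingByte);
        }
    }

    // frameBody is the frame content after the 10-byte frame header:
    // one encoding byte followed by the encoded text.
    static String decodeTextFrameBody(byte[] frameBody) {
        if (frameBody.length == 0) {
            throw new IllegalArgumentException("empty frame body");
        }
        Charset charset = charsetFor(frameBody[0] & 0xFF);
        return new String(frameBody, 1, frameBody.length - 1, charset);
    }

    public static void main(String[] args) {
        // UTF-8 bytes for "Sinéad" labeled as ISO-8859-1 (encoding byte 00).
        byte[] mislabeled = {0x00, 'S', 'i', 'n', (byte) 0xC3, (byte) 0xA9, 'a', 'd'};
        System.out.println(decodeTextFrameBody(mislabeled));
    }
}
```

A decoder that trusts the encoding byte, as sketched here, has no way of noticing that a frame labeled 00 actually carries bytes in some other encoding.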
Timo Reimann writes:
One of the questions that keeps bugging me is what is wrong with the Does Jaudiotagger have a tool to display the 'raw' tag data? -- Johan |
Jaudiotagger does not seem to provide any tools. However, putting together a piece of code to read out the raw tag data should be a matter of a couple lines only; the Jaudiotagger code examples may be helpful with that. Note that UTF-8 is allowed in ID3v2.4 only. It's probably safer to use UTF-16 with ID3v2.3. |
In order to take a closer look at the ID3 data, I built a simple program using Jaudiotagger 2.0.4. It looks like this:

```java
import org.jaudiotagger.audio.AudioFileIO;
import org.jaudiotagger.audio.mp3.MP3File;
import org.jaudiotagger.tag.FieldKey;
import org.jaudiotagger.tag.Tag;
import org.jaudiotagger.tag.id3.*;

import java.io.File;
import java.nio.charset.Charset;

public class ParseMp3 {
    private static String mp3File = "01_4th_And_Vine.mp3";

    public static void main(String[] args) {
        File file = new File(ParseMp3.mp3File);
        MP3File mp3File;
        try {
            mp3File = (MP3File) AudioFileIO.read(file);
        } catch (Exception e) {
            // Without a parsed file there is nothing to inspect.
            e.printStackTrace();
            return;
        }

        // Artist as exposed by the generic tag interface.
        Tag tag = mp3File.getTag();
        System.out.println("artist: " + tag.getFirst(FieldKey.ARTIST));

        // Artist as stored in the ID3v1 tag.
        ID3v1Tag id3v1Tag = mp3File.getID3v1Tag();
        System.out.println("ID3v1 artist: " + id3v1Tag.getFirst(FieldKey.ARTIST));

        // Artist and declared text encoding of the ID3v2.3 frame.
        AbstractID3v2Tag id3v23Tag = mp3File.getID3v2Tag();
        System.out.println("ID3v2.3 artist: " + id3v23Tag.getFirst(ID3v23Frames.FRAME_ID_V3_ARTIST));
        AbstractID3v2Frame id3v23Frame = id3v23Tag.getFirstField(ID3v23Frames.FRAME_ID_V3_ARTIST);
        System.out.println("ID3v2.3 encoding: " + id3v23Frame.getEncoding());

        // The same information after conversion to ID3v2.4.
        ID3v24Tag id3v24Tag = mp3File.getID3v2TagAsv24();
        System.out.println("ID3v2.4 artist: " + id3v24Tag.getFirst(ID3v24Frames.FRAME_ID_ARTIST));
        AbstractID3v2Frame id3v24Frame = id3v24Tag.getFirstField(ID3v24Frames.FRAME_ID_ARTIST);
        System.out.println("ID3v2.4 encoding: " + id3v24Frame.getEncoding());

        // Dump the raw frame bytes, decoded once as ISO-8859-1 and once as UTF-8.
        byte[] rawData = id3v23Frame.getRawContent();
        System.out.println("analyzing " + rawData.length + " bytes of raw ID3v2.3 artist data:");
        String isoText = new String(rawData, Charset.forName("ISO-8859-1"));
        String utfText = new String(rawData, Charset.forName("UTF-8"));
        for (int i = 0; i < rawData.length; ++i) {
            // Mask to 0..255 so that bytes >= 0x80 are not sign-extended.
            int unsignedByte = rawData[i] & 0xFF;
            // The UTF-8 string is shorter than the byte array whenever a
            // multi-byte sequence was decoded, so guard the index.
            char utfChar = i < utfText.length() ? utfText.charAt(i) : '\0';
            System.out.println("ID3v2 artist raw byte #" + String.format("%02d", i + 1) + ": " +
                    String.format("%5d", unsignedByte) + "\t\thex: " +
                    String.format("%8s", Integer.toHexString(unsignedByte)) + "\t\tISO: " +
                    isoText.charAt(i) + "\t\tUTF: " + utfChar);
        }
    }
}
```

The program's output is this:
(I parsed ID3v2.3 data in the example but the results are exactly the same with 2.4.) According to the ID3v2 frame specification, the bytes outlined above carry the following semantics:
As you can see, the artist name was supposedly stored in ISO-8859-1 (the 00 encoding byte that follows the frame header). Taking a closer look at bytes 15 and 16, however, one can observe that UTF-8 was in fact written to the ID3 field: the é character takes up two bytes in UTF-8 (C3 A9) which, if mistakenly interpreted as ISO-8859-1, yield two distinct characters, Ã and ©. (This is a common symptom of UTF-8/ISO-8859-1 mix-ups.) Nor does the frame declare itself as Unicode in any other way: there is no UTF-16 BOM (either FFFE or FEFF). So Jaudiotagger seems to be doing everything right. I can't tell for sure whether those other ID3-reading applications you are using are simply lucky to, say, choose your machine's locale setting for decoding, or sophisticated enough to try some kind of encoding detection. While the latter still seems desirable to have in Supersonic for certain cases, I believe that in this one the ID3 encoding is indeed broken. HTH, --Timo |
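The effect is easy to reproduce without any MP3 file. The following sketch (class and method names are made up for illustration) encodes a string as UTF-8, decodes those bytes as ISO-8859-1 to get the garbled form, and then reverses the mistake. The é is written as a `\u00e9` escape so the source compiles the same way regardless of the compiler's default encoding.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decode UTF-8 bytes as if they were ISO-8859-1 (the mix-up described above).
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mix-up: recover the original bytes via ISO-8859-1, then decode as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String artist = "Sin\u00e9ad O'Connor";
        String garbled = misdecode(artist);
        System.out.println("garbled:  " + garbled);
        System.out.println("repaired: " + repair(garbled));
    }
}
```

The round trip is only safe because ISO-8859-1 maps every byte value to exactly one character. It would not work the same way with a charset that has undefined byte values, and it says nothing about tags written in some other local code page.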
Timo Reimann writes:
Ha! Now I finally know exactly what is wrong. I already made a lot of Most tools that I used apparently assume it's utf8 unless it contains Thanks! Now I can start working on a good solution (which may involve Thanks again. -- Johan |
I was curious and gave juniversalchardet a try, a Java port of Mozilla's Universal Charset Detection implementation. Although the project looks a little neglected, it managed to detect UTF-8 in the malformed files correctly. Since there are so many illegal music files out there, I'd like to incorporate the library in a Supersonic branch and see how it works. For the sake of completeness, I also tried out another Java alternative called jChardet. It failed to detect UTF-8, however. |
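As a point of comparison, a much cruder heuristic needs no extra library: if a field labeled ISO-8859-1 contains non-ASCII bytes that also form strictly valid UTF-8, treat it as UTF-8. The sketch below is not what juniversalchardet does and is not part of Supersonic; it uses only the JDK's strict `CharsetDecoder`, and all class and method names are hypothetical. It can misfire on genuine ISO-8859-1 text that happens to look like valid UTF-8, although such byte sequences are rare in real names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Heuristic {
    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // True if at least one byte is outside the 7-bit ASCII range.
    static boolean hasNonAscii(byte[] data) {
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                return true;
            }
        }
        return false;
    }

    // Decode text that was labeled ISO-8859-1, preferring UTF-8 when it fits.
    static String decodeLabeledLatin1(byte[] text) {
        if (hasNonAscii(text) && isValidUtf8(text)) {
            return new String(text, StandardCharsets.UTF_8);
        }
        return new String(text, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] mislabeled = "Sin\u00e9ad O'Connor".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeLabeledLatin1(mislabeled));
    }
}
```

Both a correctly written ISO-8859-1 field and a mislabeled UTF-8 field should come out as the same string with this approach, which is presumably close to what the more forgiving tools mentioned earlier in the thread do.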
In my collection, I have an album by Sinéad O'Connor.
In the recently added (and recently played, ...) views, this displays as SinÃ©ad O'Connor.
When the album itself is selected, the artist name displays okay.