Wrong display for non-ASCII tags #54

sciurius · 2012-05-29T09:54:46Z

In my collection, I have an album from Sinéad O'Connor.

In the recently added (and recently played, ...) views, this displays as SinÃ©ad O'Connor.
When the album itself is selected, the artist name displays okay.

timoreimann · 2012-05-29T18:06:14Z

Are you referring to the Android app, the web frontend, or both?

porkcharsui · 2012-06-04T01:55:38Z

I am seeing something very similar in my collection with Sigur Rós which appears as Sigur RÃ³s in both the web and Android clients. The song result itself does not show this error, just the directory name and artist search results, leading me to believe this is isolated to portions of the backend's use of character encoding.

timoreimann · 2012-06-08T22:26:24Z

portcharsui's assumption seems correct: I tested things myself by creating an album which contains a ó character in both album and artist file and tag names, respectively. Result: Everything looks fine on both the web and Android client.

My Linux box is using UTF-8 for all locale settings, including LANG and LC_ALL. Could anyone affected by the issue please check his locale and see if switching to UTF-8 (possibly by using one of these methods) helps? (Don't forget to re-encode the ID3 portion too. I use eyeD3 for that.)

timoreimann · 2012-06-17T15:48:00Z

Closing the issue as it seems fixed. Please report back if the problem persists in a reproducible way independent of a correctly configured locale environment.

sciurius · 2012-07-18T13:34:37Z

http://www.squirrel.nl/pub/xfer/uploads/3CAcxyTsu3l9UgXeWkLkIoyQ.png

This is the current (as of yesterday) git version.

ID3 data for this track:

id3v1 tag info for 01_4th_And_Vine.mp3:
Title : 4th And Vine Artist: Sinéad O'Connor
Album : How About I Be Me, And You Be Year: , Genre: Unknown (255)
Comment: Track: 1
id3v2 tag info for 01_4th_And_Vine.mp3:
TALB (Album/Movie/Show title): How About I Be Me, And You Be You
TCON (Content type): (255) (255)
TIT2 (Title/songname/content description): 4th And Vine
TPE1 (Lead performer(s)/Soloist(s)): Sinéad O'Connor
TRCK (Track number/Position in set): 1

timoreimann · 2012-08-09T22:05:01Z

@sciurius, what OS are you running? If Linux, can you provide the output of the locale command?

sciurius · 2012-08-10T06:09:37Z

I think Fedora17 qualifies as Linux ;)

% cat /etc/redhat-release
Fedora release 17 (Beefy Miracle)
% uname -a
Linux phoenix.squirrel.nl 3.5.0-2.fc17.x86_64 #1 SMP Mon Jul 30 14:48:59 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
% locale
LANG=en_US.utf-8
LC_CTYPE="en_US.utf-8"
LC_NUMERIC="en_US.utf-8"
LC_TIME=en_GB
LC_COLLATE="en_US.utf-8"
LC_MONETARY="en_US.utf-8"
LC_MESSAGES="en_US.utf-8"
LC_PAPER=nl_NL
LC_NAME="en_US.utf-8"
LC_ADDRESS="en_US.utf-8"
LC_TELEPHONE="en_US.utf-8"
LC_MEASUREMENT="en_US.utf-8"
LC_IDENTIFICATION="en_US.utf-8"
LC_ALL=

timoreimann · 2012-08-13T19:24:25Z

Adding one such affected MP3 file to my Supersonic instance, I can confirm that the artist's name looks borked. However, when I edit the MP3 and re-set the artist's name (by means of copying it off Wikipedia and pasting it into the artist field using an MP3 tag editor on my Mac), it looks perfectly right afterwards in Supersonic: All non-ASCII characters are rendered properly.

So, my conclusion is that it is not a Supersonic issue. Apparently, an invalid tag encoding has been applied in the first place.

sciurius, can you try to reproduce what I did and see if it confirms my findings?

sciurius · 2012-08-13T21:07:57Z

Adding one such affected MP3 file to my Supersonic instance, I can
confirm that the artist's name looks borked. However, when I edit the
MP3 and re-set the artist's name (by means of copying it off Wikipedia
and pasting it into the artist field using an MP3 tag editor on my
Mac), it looks perfectly right afterwards in Supersonic: All non-ASCII
characters are rendered properly.

Yes, it is relatively easy to avoid this from happening. But this does
not make the problem (= a bug in the Su*Sonic code) go away. Only the
symptoms disappear.

So, my conclusion is that it is not a Supersonic issue. Apparently, an
invalid tag encoding has been applied in the first place.

Even if this were the case, the same tag is rendered differently in two
places. That indicates a coding error somewhere.

E.g., when I search for "o'connor" I get the following results:

Play Add Download Sinéad O'Connor
Play Add Download Tony O'Connor
Play Add Download Sinéad O'Connor
Play Add Download Yo-Yo Ma, Edgar Meyer, Mark O'Connor

So there are two distinct but visually identical results for Sinéad
O'Connor. The first result has 6 CD's, the second result has 3 CD's.

The same tracks (files) are processed by the Logitech Media Server
(a.k.a. SqueezeBoxServer) and this server has no problems with the tags.

timoreimann · 2012-08-13T22:45:01Z

The same tag isn't rendered differently in Supersonic. Instead, Supersonic uses the directory name for the album name in the breadcrumb navigation, and the tag information for the playlist content. That's why the same label is shown properly one time and improperly a second time. So it's not a bug in Supersonic, just different approaches to achieve the same thing. (Admittedly, one can argue that the implementation should be more consistent.)

I'm not too familiar with Squeeze Box so I cannot tell why (and how) it does better. Maybe it reads the artist's name from another well-encoded field (like album artist) and prefers that one over the broken one. Or it uses the file name. Or, if encoding schemas are stored within MP3 tags along the payload (which I need to figure out), it may decode the content accordingly. I'll need to dig somewhat into the MP3 specs to tell what the best approach for Supersonic would be.

sciurius · 2012-08-14T11:29:40Z

Timo Reimann [email protected] writes:

The same tag isn't rendered differently in Supersonic. Instead,
Supersonic uses the directory name for the album name in the
breadcrumb navigation, and the tag information for the playlist
content.

Ah! This explains some of the things I see ...

Personally, I'd advice against using directory names for this purpose,
since file/directory names are essentially binary blobs without
specified encoding. Interpretation depends on how the file system was
mounted, which is hard to take into account.

I'm not too familiar with Squeeze Box so I cannot tell why (and how)
it does better. Maybe it reads the artist's name from another
well-encoded field (like album artist) and prefers that one over the
broken one. Or it uses the file name.

I moved the offending album to another directory and requested a scan
from both SuperSonic and SqueezeboxServer.

For SqueezeboxServer it doesn't make any difference. Apparently it uses
the MP3 tags only.

For Su*Sonic, the album does, indeed, appear under a new (artist) name.

Still, while SqueezeboxServer and other tools(!) derive the correct
information from the album tracks, Su*Sonic is the only one that has a
problem.

Also still, it doesn't explain why Sinéad O'Connor is mentioned twice,
with the albums distributed over both entries.

BTW: The current Su*Sonic approach also means that collection albums
with different artists do not appear with the artist names.
SqueezeboxServer does this right.

-- Johan

timoreimann · 2012-08-14T14:38:26Z

I agree with what you say regarding not to use directory names for information display. To my defense, I didn't implement it this way, and wouldn't do so. :) I'll open a separate issue later to deal with this. In general, Su*sonic lacks capability to fully leverage tag data which is a shortcoming IMHO.

I took just a rough look at the MP3 specification: Apparently, it does include encoding information, so it should be possible to decode tag data appropriately (unless, of course, for those cases where the encoding is completely broken). We'll use this issue to indicate progress on the problem.

timoreimann · 2012-08-22T23:14:20Z

I dug a little deeper into the Supersonic code: Apparently, it uses the Jaudiotagger library (in a fairly recent version) to decode audio tag data. One interesting bug report (JAUDIOTAGGER-179) deals with the request to allow proper decoding of MP3 tags in local encodings (like code pages). The Jaudiotagger author clearly states that anything but ISO-8859-1, variants of UTF-16, and UTF-8 (depending on the exact MP3v2 version being used) is illegal and won't be supported.

It looks like auto-detecting a local encoding is not possible (at least not in a standard-compliant fashion) whereas the supported encodings identify themselves by means of a marking encoding byte (see Wikipedia). I haven't looked into the Jaudiotagger code but strongly assume that decoding standard-compliant encodings works, so the observed behavior is most likely due to one such illegal local encoding. While the Jaudiotagger author announced to add an option to specify the (local) encoding used when decoding an MP3 tag, it wouldn't be a great solution IMHO because it'd only work for people willing and capable of specifying the employed encoding. Re-writing tag data in a supported encoding seems to be the way to go.

So much to what I figured about Jaudiotagger. Next, I'll try to find out what SqueezeBox Server does.

sciurius · 2012-08-23T07:23:30Z

Timo Reimann [email protected] writes:

One interesting bug report deals with the request to allow proper
decoding of MP3 tags in local encodings (like code pages). The
Jaudiotagger author clearly states that anything but ISO-8859-1,
variants of UTF-16, and UTF-8 (depending on the exact MP3v2 version
being used) is illegal and won't be supported.

One of the questions that keeps bugging me is what is wrong with the
offending mp3 files. All tools that I've used so far return consistent
and correct (i.e., as desired) information. AFAIK all tags are UTF8
encoded and no local encodings are used.

Does Jaudiotagger have a tool to display the 'raw' tag data?

-- Johan

timoreimann · 2012-08-23T09:10:28Z

Jaudiotagger does not seem to provide any tools. However, putting together a piece of code to read out the raw tag data should be a matter of a couple lines only; the Jaudiotagger code examples may be helpful with that.

Note that UTF-8 is allowed in ID3v2.4 only. It's probably safer to use UTF-16 with ID3v2.3.

timoreimann · 2012-08-25T00:18:12Z

In order to take a closer look at the ID3 data, I build a simple program using Jaudiotagger 2.0.4. It looks like this:

import org.jaudiotagger.audio.AudioFileIO;
import org.jaudiotagger.audio.mp3.MP3File;
import org.jaudiotagger.tag.FieldKey;
import org.jaudiotagger.tag.Tag;
import org.jaudiotagger.tag.id3.*;

import java.io.File;
import java.nio.charset.Charset;


public class ParseMp3 {
    private static String mp3File = "01_4th_And_Vine.mp3";

    public static void main(String[] args) {
        File file = new File(ParseMp3.mp3File);
        MP3File mp3File = null;
        try {
            mp3File = (MP3File)AudioFileIO.read(file);
        } catch (Exception e) {
            e.printStackTrace();
        }

        Tag tag = mp3File.getTag();
        System.out.println("artist: " + tag.getFirst(FieldKey.ARTIST));

        ID3v1Tag id3v1Tag = mp3File.getID3v1Tag();
        System.out.println("ID3v1 artist: " + id3v1Tag.getFirst(FieldKey.ARTIST));

        AbstractID3v2Tag id3v23Tag = mp3File.getID3v2Tag();
        System.out.println("ID3v2.3 artist: " + id3v23Tag.getFirst(ID3v23Frames.FRAME_ID_V3_ARTIST));
        AbstractID3v2Frame id3v23Frame = id3v23Tag.getFirstField(ID3v23Frames.FRAME_ID_V3_ARTIST);
        System.out.println("ID3v2.3 encoding: " + id3v23Frame.getEncoding());

        ID3v24Tag id3v24Tag = mp3File.getID3v2TagAsv24();
        System.out.println("ID3v2.4 artist: " + id3v24Tag.getFirst(ID3v24Frames.FRAME_ID_ARTIST));
        AbstractID3v2Frame id3v24Frame = id3v24Tag.getFirstField(ID3v24Frames.FRAME_ID_ARTIST);
        System.out.println("ID3v2.4 encoding: " + id3v24Frame.getEncoding());

        byte[] rawData = id3v23Frame.getRawContent();
        System.out.println("analyzing " + rawData.length + " bytes of raw ID3v2.3 artist data:");
        String isoText = new String(rawData, Charset.forName("ISO-8859-1"));
        String utfText = new String(rawData, Charset.forName("UTF-8"));
        for (int i = 0; i < rawData.length; ++i) {
            char singleByte = (char)rawData[i];
            char utfChar = '\0';
            try {
                utfChar = utfText.charAt(i);
            } catch (StringIndexOutOfBoundsException exception) {
            }
            System.out.println("ID3v2 artist raw byte #" + String.format("%02d", i+1) + ": " +
                    String.format("%5d", (int)singleByte) + "\t\thex: " +
                    String.format("%8s", Integer.toHexString(singleByte)) + "\t\tISO: " +
                    isoText.charAt(i) + "\t\tUTF: " + utfChar);
        }
    }
}

The program's output is this:

artist: SinÃ©ad O'Connor
ID3v1 artist: SinÃ©ad O'Connor
ID3v2.3 artist: SinÃ©ad O'Connor
ID3v2.3 encoding: ISO-8859-1
ID3v2.4 artist: SinÃ©ad O'Connor
ID3v2.4 encoding: ISO-8859-1
analyzing 27 bytes of raw ID3v2.3 artist data:
ID3v2 artist raw byte #01:    84        hex:       54       ISO: T      UTF: T
ID3v2 artist raw byte #02:    80        hex:       50       ISO: P      UTF: P
ID3v2 artist raw byte #03:    69        hex:       45       ISO: E      UTF: E
ID3v2 artist raw byte #04:    49        hex:       31       ISO: 1      UTF: 1
ID3v2 artist raw byte #05:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #06:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #07:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #08:    17        hex:       11       ISO: �      UTF: �
ID3v2 artist raw byte #09:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #10:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #11:     0        hex:        0       ISO:        UTF: 
ID3v2 artist raw byte #12:    83        hex:       53       ISO: S      UTF: S
ID3v2 artist raw byte #13:   105        hex:       69       ISO: i      UTF: i
ID3v2 artist raw byte #14:   110        hex:       6e       ISO: n      UTF: n
ID3v2 artist raw byte #15: 65475        hex:     ffc3       ISO: Ã     UTF: é
ID3v2 artist raw byte #16: 65449        hex:     ffa9       ISO: ©     UTF: a
ID3v2 artist raw byte #17:    97        hex:       61       ISO: a      UTF: d
ID3v2 artist raw byte #18:   100        hex:       64       ISO: d      UTF:  
ID3v2 artist raw byte #19:    32        hex:       20       ISO:        UTF: O
ID3v2 artist raw byte #20:    79        hex:       4f       ISO: O      UTF: '
ID3v2 artist raw byte #21:    39        hex:       27       ISO: '      UTF: C
ID3v2 artist raw byte #22:    67        hex:       43       ISO: C      UTF: o
ID3v2 artist raw byte #23:   111        hex:       6f       ISO: o      UTF: n
ID3v2 artist raw byte #24:   110        hex:       6e       ISO: n      UTF: n
ID3v2 artist raw byte #25:   110        hex:       6e       ISO: n      UTF: o
ID3v2 artist raw byte #26:   111        hex:       6f       ISO: o      UTF: r
ID3v2 artist raw byte #27:   114        hex:       72       ISO: r      UTF: 

Process finished with exit code 0

(I parsed ID3v2.3 data in the example but the results are exactly the same with 2.4.)

According to the ID3v2 frame specification, the bytes outlined above carry the following semantics:

bytes 1-4: The frame ID (TPE1, the Lead performer(s)/Soloist(s))
bytes 5-8: The frame size (17)
byte 9: The character encoding, in this case ISO-8859-1 as per the spec's wording: Frames that allow different types of text encoding have a text encoding description byte directly after the frame size. If ISO-8859-1 is used this byte should be $00, if Unicode is used it should be $01.
byte 10-11: flags
remaining bytes: payload (the artist text)

As you can see, the artist name was supposedly stored in ISO-8859-1 (byte 00 after the frame size). Taking a closer look at bytes 15 and 16, however, one can observe that in fact UTF-8 was written to the ID3 field: The é character takes up two bytes in UTF-8 which, if mistakenly interpreted as ISO-8859-1, leads to two distinct characters Ã and ©. (This seems to be a common symptom for UTF-8/ISO-8859-1-related issues.) Another indication for the lack of proper UTF-8 encoding is the fact that the BOM (either FFFE or FEFF) is missing.

So Jaudiotagger seems to be doing everything right. I can't tell for sure whether those other ID3-reading applications you are using are simply lucky to, say, choose your machine's locale setting for decoding, or sophisticated enough to try some kind of encoding detection. While the latter still seems desirable to have in Supersonic for certain cases, I believe that in this one the ID3 encoding is indeed broken.

HTH,

--Timo

sciurius · 2012-08-25T09:22:22Z

Timo Reimann [email protected] writes:

As you can see, the artist name was supposedly stored in ISO-8859-1
(byte 00 after the frame size). Taking a closer look at bytes 15 and
16, however, one can observe that in fact UTF-8 was written to the ID3
field:

Ha! Now I finally know exactly what is wrong. I already made a lot of
guesses and assumptions but your data make it all clear.

Most tools that I used apparently assume it's utf8 unless it contains
erroneous characters, and then fall back to Latin-1.

Thanks! Now I can start working on a goot solution (which may involve
re-tagging all my mp3 files...).

Thanks again.

-- Johan

timoreimann · 2012-08-25T20:01:52Z

I was curious and gave juniversalchardet a try, a Java port of Mozilla's Universal Charset Detection implementation. Although the project looks a little bit neglected it managed to detect UTF-8 in the malformed files correctly. Since there are so many illegal music files out there, I'd like to incorporate the library in a Supersonic branch and see how it works.

For the sake of completeness, I also tried out another Java alternative called jChardet. It failed to detect UTF-8, however.

timoreimann closed this as completed Jun 17, 2012

timoreimann reopened this Jul 18, 2012

ghost assigned timoreimann Jul 18, 2012

timoreimann mentioned this issue Aug 22, 2012

wrong encoding of national characters in file and folder names + folders disappearing #68

Closed

This was referenced Aug 25, 2012

Prefer tag data over directory names to display artist information #70

Open

Support automatic character set detection for tag data #71

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong display for non-ASCII tags #54

Wrong display for non-ASCII tags #54

sciurius commented May 29, 2012

timoreimann commented May 29, 2012

porkcharsui commented Jun 4, 2012

timoreimann commented Jun 8, 2012

timoreimann commented Jun 17, 2012

sciurius commented Jul 18, 2012

timoreimann commented Aug 9, 2012

sciurius commented Aug 10, 2012

timoreimann commented Aug 13, 2012

sciurius commented Aug 13, 2012

timoreimann commented Aug 13, 2012

sciurius commented Aug 14, 2012

timoreimann commented Aug 14, 2012

timoreimann commented Aug 22, 2012

sciurius commented Aug 23, 2012

timoreimann commented Aug 23, 2012

timoreimann commented Aug 25, 2012

sciurius commented Aug 25, 2012

timoreimann commented Aug 25, 2012

Wrong display for non-ASCII tags #54

Wrong display for non-ASCII tags #54

Comments

sciurius commented May 29, 2012

timoreimann commented May 29, 2012

porkcharsui commented Jun 4, 2012

timoreimann commented Jun 8, 2012

timoreimann commented Jun 17, 2012

sciurius commented Jul 18, 2012

timoreimann commented Aug 9, 2012

sciurius commented Aug 10, 2012

timoreimann commented Aug 13, 2012

sciurius commented Aug 13, 2012

timoreimann commented Aug 13, 2012

sciurius commented Aug 14, 2012

timoreimann commented Aug 14, 2012

timoreimann commented Aug 22, 2012

sciurius commented Aug 23, 2012

timoreimann commented Aug 23, 2012

timoreimann commented Aug 25, 2012

sciurius commented Aug 25, 2012

timoreimann commented Aug 25, 2012