This repository has been archived by the owner on Aug 10, 2024. It is now read-only.

RSS feed broken if CDATA contains lower ascii characters #33

Open
relikd opened this issue Mar 5, 2019 · 4 comments
Comments

@relikd

relikd commented Mar 5, 2019

Hi there,

I just stumbled upon a feed that uses characters in the range 0x01 to 0x1F inside a CDATA description.
libxml2 isn't supposed to accept these, but RSParser breaks off early and drops the remaining feed articles. When parsing the RSS below, only the first two items are returned.

It should be enough to find and replace these with a regex; however, I was wondering whether there is a libxml2 flag that could be used instead…

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0">
<channel>
	<title>Feed Title</title>
<item>
		<title>1</title>
		<link>http://someurl.com/1/</link>
		<description><![CDATA[Description of first]]></description>
</item>
<item>
		<title>2</title>
		<link>http://someurl.com/2/</link>
		<description><![CDATA[Description with � \0x04 values]]></description>
</item>
<item>
		<title>3</title>
		<link>http://someurl.com/3/</link>
		<description><![CDATA[Description of third]]></description>
</item>
<item>
		<title>4</title>
		<link>http://someurl.com/4/</link>
		<description><![CDATA[Description of fourth]]></description>
</item>
<item>
		<title>5</title>
		<link>http://someurl.com/5/</link>
		<description><![CDATA[Description of fifth]]></description>
</item>
	</channel>
</rss>
@relikd
Author

relikd commented Mar 6, 2019

Here is the snippet I used:

[_xmlData enumerateByteRangesUsingBlock:^(const void * bytes, NSRange byteRange, BOOL * stop) {
    NSUInteger max = byteRange.location + byteRange.length;
    for (NSUInteger i = byteRange.location; i < max; i++) {
        unsigned char c = ((unsigned char*)bytes)[i];
        if (c < 0x20 && c != 0x9 && c != 0xA && c != 0xD) {
            ((unsigned char*)bytes)[i] = ' '; // replace lower ascii with blank
        }
    }
}];

This could be enabled with, e.g., a class variable or flag, and it can be postponed until the feed is about to be parsed. Let me know if this is a dumb idea, or if it has unforeseeable consequences.

@brentsimmons
Collaborator

That might be the way to go, though I would do performance tests first to make sure it doesn’t have an impact.

It’s also possible that there’s some kind of way to tell libxml2 to ignore these. (I just haven’t looked yet.)
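
A rough way to get a first number for the performance question is to time one pass of the replacement loop over a large buffer. The sketch below is plain C rather than the Objective-C from the thread, and the names `sanitize_in_place` and `time_sanitize_seconds` are made up for illustration. It measures CPU time with `clock()`, which is coarse, so this is only a starting point and not a substitute for profiling the real parser.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Replace every byte below 0x20 except tab, LF, and CR with a space.
 * Returns the number of bytes replaced. */
size_t sanitize_in_place(unsigned char *buf, size_t length) {
    size_t replaced = 0;
    for (size_t i = 0; i < length; i++) {
        unsigned char c = buf[i];
        if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D) {
            buf[i] = ' ';
            replaced++;
        }
    }
    return replaced;
}

/* Hypothetical micro-benchmark: CPU seconds spent on one pass over
 * `length` bytes of mostly clean data, or -1.0 if allocation fails. */
double time_sanitize_seconds(size_t length) {
    unsigned char *buf = malloc(length ? length : 1);
    if (buf == NULL) {
        return -1.0;
    }
    memset(buf, 'x', length);
    if (length > 0) {
        buf[length / 2] = 0x04; /* one offending byte, as in the sample feed */
    }

    clock_t start = clock();
    size_t replaced = sanitize_in_place(buf, length);
    clock_t end = clock();

    (void)replaced; /* only the timing matters here */
    free(buf);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_sanitize_seconds` with the size of a typical feed, and again with a much larger size, should show whether the extra pass is noticeable next to the parse itself.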

@Wevah
Member

Wevah commented Nov 27, 2019

I don't think you want to do most of that range math; AIUI the bytes array always starts at a 0 index, not relative to the whole data (since it's a pointer to an arbitrary location in the data already). So just:

    for (NSUInteger i = 0; i < byteRange.length; i++) {

Other than that, seems fine to me!
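
To illustrate the indexing point, here is a minimal sketch in plain C, assuming each call receives a pointer to the first byte of the current range. `sanitize_chunk` is a hypothetical stand-in for one invocation of the enumeration block, not part of RSParser or Foundation.

```c
#include <stddef.h>

/* XML 1.0 permits only tab, LF, and CR below 0x20. */
static int is_forbidden_control(unsigned char c) {
    return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
}

/* `chunk` points at the first byte of this range, `location` is where the
 * range starts within the whole data, and `length` is the size of the range.
 * Returns the number of bytes replaced. */
size_t sanitize_chunk(unsigned char *chunk, size_t location, size_t length) {
    (void)location; /* an offset into the whole data, not into `chunk` */
    size_t replaced = 0;
    for (size_t i = 0; i < length; i++) { /* zero-based within the chunk */
        if (is_forbidden_control(chunk[i])) {
            chunk[i] = ' ';
            replaced++;
        }
    }
    return replaced;
}
```

Starting the loop at `location` instead would read and write past the end of any chunk that does not begin at offset 0.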

@Wevah
Member

Wevah commented Nov 27, 2019

Oh, you might also be able to specify const unsigned char *bytes as the type in the block declaration to avoid all the casting (if the compiler doesn't complain).

Not sure if the expected constness of the pointed-to bytes will cause issues when mutating, though it might be fine since you're not changing the length. (Copying to a new mutable data should be safer if that's a concern.)
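
The copy-first variant could look roughly like the following, again as a plain C sketch and not the Objective-C it would be in RSParser. `copy_sanitized` is a hypothetical helper: it leaves the source bytes untouched and returns a cleaned copy that the caller must `free`.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of `src` in which every byte below 0x20,
 * except tab, LF, and CR, is replaced by a space. Returns NULL if the
 * allocation fails. The caller owns the result and must free() it. */
unsigned char *copy_sanitized(const unsigned char *src, size_t length) {
    unsigned char *dst = malloc(length ? length : 1);
    if (dst == NULL) {
        return NULL;
    }
    memcpy(dst, src, length);
    for (size_t i = 0; i < length; i++) {
        unsigned char c = dst[i];
        if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D) {
            dst[i] = ' ';
        }
    }
    return dst;
}
```

This trades one extra allocation and copy per feed for never writing through a pointer that was handed out as `const`.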
