Unicode pages does not work anymore on 0.5.0 #71

Open
nengine opened this issue Jul 15, 2015 · 9 comments

Comments

@nengine

nengine commented Jul 15, 2015

I was able to crawl Unicode pages in 0.4.0, but after upgrading to 0.5.0 only the English (ASCII) characters remain in a crawled page. Please let me know if there are any settings I have to change.

@nengine
Author

nengine commented Aug 30, 2015

@taganaka @tmaier For some reason, if I use the code below in 0.5.0, non-English Unicode characters show up properly:

    def doc
      return @doc if @doc
      @doc = Nokogiri::HTML(@body) if @body && html? rescue nil
    end

However, the version below does not. I'm not sure what this function is intended to solve. Any suggestion is appreciated, as I'd like to use 0.5.0 without monkey patching the gem on my server. Thanks a lot.

    def doc
      return @doc if @doc
      @body ||= ''
      @body = @body.encode('utf-8', 'binary', invalid: :replace,
                                              undef: :replace, replace: '')
      @doc = Nokogiri::HTML(@body.toutf8, nil, 'utf-8') if @body && html?
    end
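
One plausible explanation for the difference, shown as a minimal sketch outside the gem: when `String#encode` is told the source encoding is `'binary'`, every byte at or above 0x80 has no UTF-8 mapping, so `undef: :replace` with `replace: ''` silently drops it. The `body` string here is a hypothetical stand-in for a fetched page, not data from the crawler.

```ruby
# Hypothetical stand-in for a fetched UTF-8 page body (not taken from the gem).
body = "<title>လူကုန် - BBC</title>"

# Same call as in the second `doc` method: the bytes are reinterpreted as
# 'binary' (ASCII-8BIT), so every non-ASCII byte counts as undefined and is
# replaced by the empty string.
stripped = body.encode('utf-8', 'binary', invalid: :replace,
                                          undef: :replace, replace: '')

puts stripped               # only the ASCII characters are left
puts stripped.ascii_only?   # true
```

If this is what happens inside the gem, the document handed to Nokogiri has already lost its multi-byte characters before parsing starts.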

@nengine
Author

nengine commented Aug 31, 2015

The text inside <title> appears correctly in 0.4.0:

<title>လူကုန်ကူးခံရသူတွေရဲ့ဘဝခရီး - BBC ပင်မစာမျက်နှာ</title>

The text inside <title> is gone in 0.5.0. Only the English text remains:

<title> - BBC </title>
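
A minimal sketch of one way to keep such a title intact, assuming the fetched body really is UTF-8: relabel the raw bytes as UTF-8 instead of transcoding them from `'binary'`, then drop only the byte sequences that are actually invalid. The variable names are hypothetical and this is not the gem's code. The stray `\xFF` byte is added only to show that invalid input is still cleaned up.

```ruby
# Hypothetical page title, as it would appear in a UTF-8 page.
original = "<title>လူကုန် - BBC</title>"

# Simulate bytes as received from the network: tagged ASCII-8BIT, with one
# stray invalid byte appended for illustration.
raw = original.b + "\xFF".b

# Relabel the bytes as UTF-8 (no transcoding), then remove invalid sequences.
# String#scrub is available from Ruby 2.1 onward.
text = raw.dup.force_encoding(Encoding::UTF_8).scrub('')

puts text
puts text.valid_encoding?
```

The cleaned string could then be passed to `Nokogiri::HTML(text, nil, 'utf-8')` as in the snippets above. Pages served in other encodings would need their declared charset handled first, which this sketch does not attempt.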

@taganaka
Owner

I'll take a look at this soon

Thanks for reporting

@nengine
Author

nengine commented Aug 31, 2015

Thank you.


@nengine
Author

nengine commented Feb 9, 2016

Hi taganaka, please let me know if you have had a chance to look into this.

@tmaier
Contributor

tmaier commented Jan 5, 2017

I am trying to upgrade to 0.5.1 and saw the same issue.
This is a regression of #40.

@nengine
Author

nengine commented Jan 5, 2017

I don't think this project is maintained anymore.

@taganaka
Owner

taganaka commented Jan 5, 2017 via email

@nengine
Author

nengine commented Jan 5, 2017

OK, great. I didn't see any activity for nearly 2 years, so I thought it was not maintained anymore.
