Support for Arabic language in warc-indexer -> Solr fields #291

Open
thomasegense opened this issue Jun 1, 2022 · 2 comments

thomasegense commented Jun 1, 2022

I am not sure if this is a duplicate of an existing issue.

When you harvest this URL:
https://www.youtube.com/watch?v=Hnrdfb6HiK0

The title field in Solr is:
title":"سيدة الصبر - المرأة العراقية - كريم العراقي - احمد الثرواني - YouTube",

Other fields, such as keywords, have the same issue.

anjackson added this to the 3.1.1 Bugfix release milestone Aug 3, 2022

anjackson commented

This appears to be a problem with Apache Tika, as I get the same results using that directly...

12:39 $ tika watch_v_Hnrdfb6HiK0.html
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB">
<head>
<link rel="shortcut icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon.ico" type="image/x-icon"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_32x32.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_48x48.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_96x96.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_144x144.png"/>
<link rel="stylesheet" href="//fonts.googleapis.com/css2?family=Roboto:wght@300;400;500;700&amp;family=YouTube+Sans:[email protected]&amp;display=swap"/>
<link rel="stylesheet" href="/s/player/7a7465f5/www-player.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-main-desktop-watch-page-skeleton.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-main-desktop-player-skeleton.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-onepick.css"/>
<link rel="search" type="application/opensearchdescription+xml" href="https://www.youtube.com/opensearch?locale=en_GB"/>
<link rel="manifest" href="/manifest.webmanifest"/>
<link rel="canonical" href="https://www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" media="handheld" href="https://m.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="shortlinkUrl" href="https://youtu.be/Hnrdfb6HiK0"/>
<link rel="alternate" href="android-app://com.google.android.youtube/http/www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" href="ios-app://544007664/vnd.youtube/www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" type="application/json+oembed" href="https://www.youtube.com/oembed?format=json&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DHnrdfb6HiK0"/>
<link rel="alternate" type="text/xml+oembed" href="https://www.youtube.com/oembed?format=xml&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DHnrdfb6HiK0"/>
<link rel="image_src" href="https://i.ytimg.com/vi/Hnrdfb6HiK0/maxresdefault.jpg"/>






<link href="https://www.youtube.com/embed/Hnrdfb6HiK0"/>



<meta name="og:image" content="https://i.ytimg.com/vi/Hnrdfb6HiK0/maxresdefault.jpg"/>
<meta name="og:image:width" content="1280"/>
<meta name="twitter:card" content="player"/>
<meta name="og:site_name" content="YouTube"/>
<meta name="keywords" content="احمد الثرواني, كريم العراقي, سيدة الصبر, المرأة, شعر, كاظم الساهر, العراق, شعر �صيح, عمل شعري, شعر عربي, Mbc, الشرقية, العراقية, مامون النطاح, شعر شعبي, الثرواني, العراقي, الصبر, سيدة, نساء, جمال, احمد, كريم, شعر جميل, الثرواني احمد, العراقي كريم, الثرواني الثرواني, Bbc, Cnn, دبي, بغداد, مشاهير, صابر الرباعي, ام كلثوم, �يروز, ماجدة الرومي, الشعر, حب, الحب, العشق, عشق"/>
<meta name="twitter:url" content="https://www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<meta name="twitter:app:url:ipad" content="vnd.youtube://www.youtube.com/watch?v=Hnrdfb6HiK0&amp;feature=applinks"/>
<meta name="og:description" content="عمل شعري للمرأة العراقية - الشاعر كريم العراقي و الشاعر احمد الثرواني. تصويرChris Goslig Hans-Ole KirkGotYouBack ApSمونتاجغيث سلمان �كرة وتن�يذ هدى علوانالمو..."/>
<meta name="twitter:player" content="https://www.youtube.com/embed/Hnrdfb6HiK0"/>
<meta name="dc:title" content="سيدة الصبر - المرأة العراقية - كريم العراقي - احمد الثرواني - YouTube"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
...

Note that the reported Content-Encoding (ISO-8859-1) is wrong: the page is UTF-8.
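
For anyone reproducing this: a wrong charset guess of this kind typically shows up as mojibake, where UTF-8 bytes are read as if they were ISO-8859-1. The snippet below is only a minimal sketch of that effect in plain Python. It does not call Tika or warc-indexer, and treating this as the exact failure mode here is an assumption.

```python
# Illustration only: the title fragment is taken from this issue; the decoding
# mistake shown is an assumed cause, not confirmed Tika behaviour.
title = "سيدة الصبر"

utf8_bytes = title.encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 never fails, because every byte value
# maps to some character, so the error only shows up as garbled text.
garbled = utf8_bytes.decode("iso-8859-1")

# As long as nothing else has altered the string, the mistake is reversible.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(recovered)
```

Because the wrong decode succeeds silently, the only visible sign is the garbled field value in the index.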

anjackson commented Aug 3, 2022

I suspect this is down to the buffer size used by the CharsetDetector. There's a lot of gumpf before the UTF-8 shows up.

Not sure if this is really a bug or if we should find a way to configure a larger buffer/markLimit.

Linking the example HTML file: https://gist.github.com/anjackson/5bf6945b8b557ace07f5cd1d64cbcc4f
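
To make the buffer-size suspicion concrete, here is a toy sketch in Python. It is not Tika's CharsetDetector algorithm: `guess_charset`, the 8192-byte limit, the ISO-8859-1 fallback, and the synthetic page are all hypothetical stand-ins chosen for illustration. It only shows how a detector that looks at a limited prefix can miss UTF-8 text that appears after a long ASCII-only head.

```python
import codecs

FALLBACK = "ISO-8859-1"  # assumed default when nothing points to UTF-8


def guess_charset(data: bytes, mark_limit: int) -> str:
    """Toy detector that inspects only the first mark_limit bytes."""
    prefix = data[:mark_limit]
    if all(b < 0x80 for b in prefix):
        # An ASCII-only prefix carries no evidence for UTF-8, so fall back.
        return FALLBACK
    try:
        # final=False tolerates a multi-byte sequence cut off by the limit.
        codecs.getincrementaldecoder("utf-8")().decode(prefix, final=False)
    except UnicodeDecodeError:
        return FALLBACK
    return "UTF-8"


# Synthetic page: a long ASCII-only head followed by a UTF-8 title.
head = b"<link rel='icon' href='/favicon.ico'/>\n" * 300
page = head + "<title>سيدة الصبر</title>".encode("utf-8")

small = guess_charset(page, 8192)       # limit ends inside the ASCII head
large = guess_charset(page, len(page))  # limit covers the whole page

print(small, large)
```

With the small limit the toy detector never sees a non-ASCII byte and falls back to ISO-8859-1. With the limit covering the whole page it reports UTF-8. Whether a larger markLimit fixes the real case would need testing against the linked HTML file.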
