Support for Arabic language in warc-indexer -> Solr fields #291

thomasegense · 2022-06-01T09:44:25Z

I am not sure if this is a duplicate of an existing issue.

When you harvest this URL:
https://www.youtube.com/watch?v=Hnrdfb6HiK0

The title field in Solr is:
title":"Ø³ÙŠØ¯Ø© Ø§Ù„ØµØ¨Ø± - Ø§Ù„Ù…Ø±Ø£Ø© Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠØ© - ÙƒØ±ÙŠÙ… Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠ - Ø§ØÙ…Ø¯ Ø§Ù„Ø«Ø±ÙˆØ§Ù†ÙŠ - YouTube",

Other fields, such as keywords, have the same issue.

anjackson · 2022-08-03T11:42:03Z

This appears to be a problem with Apache Tika, as I get the same results using that directly...

12:39 $ tika watch_v_Hnrdfb6HiK0.html
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB">
<head>
<link rel="shortcut icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon.ico" type="image/x-icon"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_32x32.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_48x48.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_96x96.png"/>
<link rel="icon" href="https://www.youtube.com/s/desktop/1c192ebe/img/favicon_144x144.png"/>
<link rel="stylesheet" href="//fonts.googleapis.com/css2?family=Roboto:wght@300;400;500;700&amp;family=YouTube+Sans:[email protected]&amp;display=swap"/>
<link rel="stylesheet" href="/s/player/7a7465f5/www-player.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-main-desktop-watch-page-skeleton.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-main-desktop-player-skeleton.css"/>
<link rel="stylesheet" href="https://www.youtube.com/s/desktop/1c192ebe/cssbin/www-onepick.css"/>
<link rel="search" type="application/opensearchdescription+xml" href="https://www.youtube.com/opensearch?locale=en_GB"/>
<link rel="manifest" href="/manifest.webmanifest"/>
<link rel="canonical" href="https://www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" media="handheld" href="https://m.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" media="only screen and (max-width: 640px)" href="https://m.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="shortlinkUrl" href="https://youtu.be/Hnrdfb6HiK0"/>
<link rel="alternate" href="android-app://com.google.android.youtube/http/www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" href="ios-app://544007664/vnd.youtube/www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<link rel="alternate" type="application/json+oembed" href="https://www.youtube.com/oembed?format=json&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DHnrdfb6HiK0"/>
<link rel="alternate" type="text/xml+oembed" href="https://www.youtube.com/oembed?format=xml&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DHnrdfb6HiK0"/>
<link rel="image_src" href="https://i.ytimg.com/vi/Hnrdfb6HiK0/maxresdefault.jpg"/>






<link href="https://www.youtube.com/embed/Hnrdfb6HiK0"/>



<meta name="og:image" content="https://i.ytimg.com/vi/Hnrdfb6HiK0/maxresdefault.jpg"/>
<meta name="og:image:width" content="1280"/>
<meta name="twitter:card" content="player"/>
<meta name="og:site_name" content="YouTube"/>
<meta name="keywords" content="Ø§ØÙ…Ø¯ Ø§Ù„Ø«Ø±ÙˆØ§Ù†ÙŠ, ÙƒØ±ÙŠÙ… Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠ, Ø³ÙŠØ¯Ø© Ø§Ù„ØµØ¨Ø±, Ø§Ù„Ù…Ø±Ø£Ø©, Ø´Ø¹Ø±, ÙƒØ§Ø¸Ù… Ø§Ù„Ø³Ø§Ù‡Ø±, Ø§Ù„Ø¹Ø±Ø§Ù‚, Ø´Ø¹Ø± Ù�ØµÙŠØ, Ø¹Ù…Ù„ Ø´Ø¹Ø±ÙŠ, Ø´Ø¹Ø± Ø¹Ø±Ø¨ÙŠ, Mbc, Ø§Ù„Ø´Ø±Ù‚ÙŠØ©, Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠØ©, Ù…Ø§Ù…ÙˆÙ† Ø§Ù„Ù†Ø·Ø§Ø, Ø´Ø¹Ø± Ø´Ø¹Ø¨ÙŠ, Ø§Ù„Ø«Ø±ÙˆØ§Ù†ÙŠ, Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠ, Ø§Ù„ØµØ¨Ø±, Ø³ÙŠØ¯Ø©, Ù†Ø³Ø§Ø¡, Ø¬Ù…Ø§Ù„, Ø§ØÙ…Ø¯, ÙƒØ±ÙŠÙ…, Ø´Ø¹Ø± Ø¬Ù…ÙŠÙ„, Ø§Ù„Ø«Ø±ÙˆØ§Ù†ÙŠ Ø§ØÙ…Ø¯, Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠ ÙƒØ±ÙŠÙ…, Ø§Ù„Ø«Ø±ÙˆØ§Ù†ÙŠ Ø§Ù„Ø«Ø±ÙˆØ§Ù†ÙŠ, Bbc, Cnn, Ø¯Ø¨ÙŠ, Ø¨ØºØ¯Ø§Ø¯, Ù…Ø´Ø§Ù‡ÙŠØ±, ØµØ§Ø¨Ø± Ø§Ù„Ø±Ø¨Ø§Ø¹ÙŠ, Ø§Ù… ÙƒÙ„Ø«ÙˆÙ…, Ù�ÙŠØ±ÙˆØ², Ù…Ø§Ø¬Ø¯Ø© Ø§Ù„Ø±ÙˆÙ…ÙŠ, Ø§Ù„Ø´Ø¹Ø±, ØØ¨, Ø§Ù„ØØ¨, Ø§Ù„Ø¹Ø´Ù‚, Ø¹Ø´Ù‚"/>
<meta name="twitter:url" content="https://www.youtube.com/watch?v=Hnrdfb6HiK0"/>
<meta name="twitter:app:url:ipad" content="vnd.youtube://www.youtube.com/watch?v=Hnrdfb6HiK0&amp;feature=applinks"/>
<meta name="og:description" content="Ø¹Ù…Ù„ Ø´Ø¹Ø±ÙŠ Ù„Ù„Ù…Ø±Ø£Ø© Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠØ© - Ø§Ù„Ø´Ø§Ø¹Ø± ÙƒØ±ÙŠÙ… Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠ Ùˆ Ø§Ù„Ø´Ø§Ø¹Ø± Ø§ØÙ…Ø¯ Ø§Ù„Ø«Ø±ÙˆØ§Ù†ÙŠ. ØªØµÙˆÙŠØ±Chris Goslig Hans-Ole KirkGotYouBack ApSÙ…ÙˆÙ†ØªØ§Ø¬ØºÙŠØ« Ø³Ù„Ù…Ø§Ù† Ù�ÙƒØ±Ø© ÙˆØªÙ†Ù�ÙŠØ° Ù‡Ø¯Ù‰ Ø¹Ù„ÙˆØ§Ù†Ø§Ù„Ù…Ùˆ..."/>
<meta name="twitter:player" content="https://www.youtube.com/embed/Hnrdfb6HiK0"/>
<meta name="dc:title" content="Ø³ÙŠØ¯Ø© Ø§Ù„ØµØ¨Ø± - Ø§Ù„Ù…Ø±Ø£Ø© Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠØ© - ÙƒØ±ÙŠÙ… Ø§Ù„Ø¹Ø±Ø§Ù‚ÙŠ - Ø§ØÙ…Ø¯ Ø§Ù„Ø«Ø±ÙˆØ§Ù†ÙŠ - YouTube"/>
<meta name="Content-Encoding" content="ISO-8859-1"/>
...

Note that the Content-Encoding is wrong: Tika reports ISO-8859-1, but the page is actually UTF-8.
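
The garbled fields are what UTF-8 bytes look like when they are decoded with a single-byte charset such as the ISO-8859-1 reported above. Below is a minimal Python sketch of that effect, and of one possible way to detect and reverse it. The helper names are made up for illustration and this is not code from warc-indexer or Tika. It also assumes the wrong decode was lossless (pure ISO-8859-1), which may not hold for text that has already passed through other tools.

```python
def looks_like_misdecoded_utf8(text: str) -> bool:
    """Heuristic: True if `text` is probably UTF-8 that was decoded as ISO-8859-1."""
    try:
        raw = text.encode("latin-1")  # undo the (assumed) wrong decode
    except UnicodeEncodeError:
        return False  # contains characters outside Latin-1, so not this kind of damage
    if raw.isascii():
        return False  # plain ASCII reads the same in both charsets
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False  # genuine Latin-1 text is rarely valid UTF-8
    return True


def repair(text: str) -> str:
    """Return the re-decoded text if the heuristic fires, else the input unchanged."""
    if looks_like_misdecoded_utf8(text):
        return text.encode("latin-1").decode("utf-8")
    return text


title = "سيدة الصبر - YouTube"
garbled = title.encode("utf-8").decode("latin-1")  # simulate the wrong charset guess
print(repr(garbled))
print(repair(garbled) == title)
```

A check like this only catches the specific UTF-8-read-as-Latin-1 case, and short strings can occasionally pass it by accident, so it is a fallback rather than a replacement for correct detection.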

anjackson · 2022-08-03T11:59:26Z

I suspect this is down to the buffer size used by the CharsetDetector. There's a lot of gumpf before the UTF-8 shows up.

Not sure if this is really a bug or if we should find a way to configure a larger buffer/markLimit.

Linking the example HTML file: https://gist.github.com/anjackson/5bf6945b8b557ace07f5cd1d64cbcc4f

anjackson added this to the 3.1.1 Bugfix release milestone Aug 3, 2022

thomasegense mentioned this issue Mar 27, 2023

Heuristic fix of charset issues #301

Open
