Configuration options explained

Pierre-Louis Gottfrois edited this page May 20, 2014 · 16 revisions

This page explains each configuration option available in LinkThumbnailer.

redirect_limit

Maximum number of HTTP redirects allowed. If LinkThumbnailer cannot resolve the given URL before redirect_limit is reached, it raises a LinkThumbnailer::RedirectLimit exception.

Default is 3.
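To illustrate what a redirect limit means, here is a minimal sketch that does not use LinkThumbnailer or the network. The redirect table, the `resolve` method and the `RedirectLimitReached` class are all hypothetical stand-ins (the latter for LinkThumbnailer::RedirectLimit), and the exact hop at which the real gem raises may differ.

```ruby
# Hypothetical error class standing in for LinkThumbnailer::RedirectLimit.
class RedirectLimitReached < StandardError; end

# Simulated redirect table: each key redirects to its value.
REDIRECTS = {
  'http://a.example/' => 'http://b.example/',
  'http://b.example/' => 'http://c.example/',
  'http://c.example/' => 'http://d.example/'
}.freeze

# Follows redirects until a URL with no further redirect is reached.
# Raises as soon as one more hop would exceed `redirect_limit`.
def resolve(url, redirect_limit: 3)
  hops = 0
  while (target = REDIRECTS[url])
    if hops >= redirect_limit
      raise RedirectLimitReached, "more than #{redirect_limit} redirects"
    end
    url = target
    hops += 1
  end
  url
end
```

In this sketch, resolving `http://a.example/` needs three hops, so it succeeds with a limit of 3 and raises with a limit of 2.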

user_agent

Sets the HTTP user agent used when resolving the given URL.

Default is link_thumbnailer.

verify_ssl

Enables or disables SSL verification for every request LinkThumbnailer makes.

Default is true.

http_timeout

The amount of time in seconds to wait for a connection to be opened. If the HTTP object cannot open a connection in this many seconds, it raises a Net::OpenTimeout exception.

See here for more details.

Default is 5.
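The user_agent, verify_ssl and http_timeout options correspond closely to settings on Ruby's standard Net::HTTP. The sketch below shows that correspondence under stated assumptions: `build_http` is a hypothetical helper, not LinkThumbnailer's actual code, and no connection is opened.

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Hypothetical helper (not part of LinkThumbnailer): builds an unopened
# Net::HTTP object configured from the verify_ssl and http_timeout values.
def build_http(url, verify_ssl: true, http_timeout: 5)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = uri.scheme == 'https'
  http.verify_mode = verify_ssl ? OpenSSL::SSL::VERIFY_PEER : OpenSSL::SSL::VERIFY_NONE
  # Seconds to wait for the connection to open; Net::OpenTimeout is raised
  # when this is exceeded.
  http.open_timeout = http_timeout
  http
end

http = build_http('https://example.com/')

# The user agent travels as a regular request header.
request = Net::HTTP::Get.new('/', 'User-Agent' => 'link_thumbnailer')
```

Actually sending the request would be `http.request(request)`, which is left out here so the sketch runs without network access.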

blacklist_urls

A list of blacklisted URL patterns (regular expressions) to skip when LinkThumbnailer fetches a website's images. Use this option to filter out advertising images.

Defaults are well-known URLs:

^http://ad\.doubleclick\.net/
^http://b\.scorecardresearch\.com/
^http://pixel\.quantserve\.com/
^http://s7\.addthis\.com/
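As a rough illustration of how such patterns filter image URLs, here is a self-contained sketch. The `allowed_images` method and the sample URLs are hypothetical; only the four patterns come from this page.

```ruby
# The default patterns listed above, written as Ruby regular expressions.
BLACKLIST_URLS = [
  %r{^http://ad\.doubleclick\.net/},
  %r{^http://b\.scorecardresearch\.com/},
  %r{^http://pixel\.quantserve\.com/},
  %r{^http://s7\.addthis\.com/}
].freeze

# Hypothetical helper: keeps only the URLs that match no blacklist pattern.
def allowed_images(urls, blacklist = BLACKLIST_URLS)
  urls.reject { |url| blacklist.any? { |pattern| pattern.match?(url) } }
end

images = [
  'http://example.com/photo.jpg',
  'http://ad.doubleclick.net/banner.gif',
  'http://pixel.quantserve.com/pixel.gif'
]
kept = allowed_images(images)
```

Note that the default patterns are anchored to `http://`, so an `https://` URL on the same host would not match them as written.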

attributes

This option, introduced in v2 of LinkThumbnailer, lets you state explicitly which attributes you expect to see.

LinkThumbnailer will do its best to find all given attributes in the provided website using the following scrapers (order matters):

  1. OpenGraph protocol scraper
  2. Homemade custom scraper

See here for more information about scrapers and how to build your own.

Currently only the following attributes are available:

  • title
  • description
  • images

See here for more information about each attribute.

Default is [:title, :images, :description].
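The ordered fallback between scrapers can be pictured with the sketch below. It is only an illustration: the two lambdas, the `scrape` method and the shape of the `page` hash are hypothetical, and real scrapers parse HTML rather than read a hash.

```ruby
# Hypothetical stand-ins for the two scrapers, in priority order.
OPENGRAPH_SCRAPER = ->(page, attribute) { page.dig(:opengraph, attribute) }
CUSTOM_SCRAPER    = ->(page, attribute) { page.dig(:html, attribute) }
SCRAPERS = [OPENGRAPH_SCRAPER, CUSTOM_SCRAPER].freeze

# For each requested attribute, keep the first non-empty value returned
# by a scraper; later scrapers are only consulted as a fallback.
def scrape(page, attributes = [:title, :images, :description])
  attributes.each_with_object({}) do |attribute, result|
    SCRAPERS.each do |scraper|
      value = scraper.call(page, attribute)
      next if value.nil? || value.empty?

      result[attribute] = value
      break
    end
  end
end

page = {
  opengraph: { title: 'OG title' },
  html: { title: 'HTML title', description: 'From the page body' }
}
result = scrape(page)
```

Here the title comes from the OpenGraph data because that scraper runs first, while the description falls back to the custom scraper.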

graders

This is a new option introduced with the v2 of LinkThumbnailer allowing you to customize how LinkThumbnailer selects the best description for a given website.

When fetching all possible description candidates for a given website, LinkThumbnailer scores each of them using the return value of every grader. Each grader returns a score (a positive or negative number) that is added to the overall score of the description being evaluated. LinkThumbnailer does this for every candidate in order to return the description that best describes the website.

See here for more information about graders and how to build your own.

Defaults are:

  • Length grader scores the description's length
  • HtmlAttribute grader scores the HTML node's class attribute
  • HtmlAttribute grader scores the HTML node's id attribute
  • Position grader scores descriptions based on the order in which they appear on the page. The first ones are more likely to be reliable descriptions.
  • LinkDensity grader scores the description's link density
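The additive scoring described above can be sketched as follows. This is not the gem's grader API: the `Candidate` struct, the two lambdas and their weights are hypothetical, chosen only to show scores being summed and the best candidate selected.

```ruby
# Hypothetical candidate: its text and its position on the page (0 = first).
Candidate = Struct.new(:text, :position)

# Hypothetical graders; each returns a positive or negative number.
LENGTH_GRADER   = ->(candidate) { candidate.text.length >= 80 ? 2 : 0 }
POSITION_GRADER = ->(candidate) { -candidate.position }
GRADERS = [LENGTH_GRADER, POSITION_GRADER].freeze

# Sums every grader's score for one candidate.
def total_score(candidate, graders = GRADERS)
  graders.sum { |grader| grader.call(candidate) }
end

# Returns the candidate with the highest total score.
def best_description(candidates, graders = GRADERS)
  candidates.max_by { |candidate| total_score(candidate, graders) }
end

candidates = [
  Candidate.new('Read more', 0),
  Candidate.new('Long description ' * 10, 1),
  Candidate.new('Another long text ' * 10, 3)
]
best = best_description(candidates)
```

With these made-up weights the second candidate wins: the first is too short to earn the length bonus, and the third loses more points for appearing later on the page.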

description_min_length

This option, introduced in v2 of LinkThumbnailer, sets the minimum length a description must have to be considered as a candidate.

Default is 25 characters.

positive_regex

This option, introduced in v2 of LinkThumbnailer, lets you customize the keywords used to score an HTML node's class and id attributes when using the HtmlAttribute grader. These are the positive keywords.

Default is /article|body|content|entry|hentry|main|page|pagination|post|text|blog|story/i.

negative_regex

This option, introduced in v2 of LinkThumbnailer, lets you customize the keywords used to score an HTML node's class and id attributes when using the HtmlAttribute grader. These are the negative keywords.

Default is /combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget|modal/i.
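To show how the two expressions could be applied to a class or id value, here is a small sketch. The `attribute_score` method and its +1/-1 weights are assumptions made for illustration; the real HtmlAttribute grader may weight matches differently.

```ruby
# The default expressions listed on this page.
POSITIVE_REGEX = /article|body|content|entry|hentry|main|page|pagination|post|text|blog|story/i
NEGATIVE_REGEX = /combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget|modal/i

# Hypothetical scoring: +1 for a positive keyword, -1 for a negative one.
# `value` is the content of a node's class or id attribute (may be nil).
def attribute_score(value)
  text = value.to_s
  score = 0
  score += 1 if POSITIVE_REGEX.match?(text)
  score -= 1 if NEGATIVE_REGEX.match?(text)
  score
end

scores = %w[main-content sidebar-widget comment-text].map do |name|
  [name, attribute_score(name)]
end.to_h
```

In this sketch `main-content` scores 1, `sidebar-widget` scores -1, and `comment-text` scores 0 because it matches a keyword from each list.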

image_limit

This option, introduced in v2 of LinkThumbnailer, sets the maximum number of images to fetch for a given website. Since fetching image information has a cost (an HTTP request is performed for each image), you should consider setting a limit here.

Default is 5 images.
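Putting the options on this page together, a configuration might look like the fragment below. It assumes the gem is installed and exposes a `LinkThumbnailer.configure` block, as in an initializer. The values are the defaults stated on this page, and graders is left out because this page does not show the syntax for it.

```ruby
require 'link_thumbnailer'

# Assumed configuration entry point; values mirror the defaults above.
LinkThumbnailer.configure do |config|
  config.redirect_limit = 3
  config.user_agent = 'link_thumbnailer'
  config.verify_ssl = true
  config.http_timeout = 5
  config.blacklist_urls = [
    %r{^http://ad\.doubleclick\.net/},
    %r{^http://b\.scorecardresearch\.com/},
    %r{^http://pixel\.quantserve\.com/},
    %r{^http://s7\.addthis\.com/}
  ]
  config.attributes = [:title, :images, :description]
  config.description_min_length = 25
  config.positive_regex = /article|body|content|entry|hentry|main|page|pagination|post|text|blog|story/i
  config.negative_regex = /combx|comment|com-|contact|foot|footer|footnote|masthead|media|meta|outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|tool|widget|modal/i
  config.image_limit = 5
end
```

Only the options you want to change need to appear in the block; the others keep their defaults.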