Extract publishing date #18

fhamborg · 2017-05-29T14:41:26Z

It would be great if the crawler could additionally extract the date on which an article was published. Currently, this requires parsing the web page with tools such as newspaper3k to get that information. However, at least some websites already expose this information during the crawling process, e.g. as the time stamp within the RSS feed:
<pubDate>Thu, 25 Dec 2014 02:10:00 +0900</pubDate>
or within the sitemap:
<news:publication_date>2016-12-09T16:18:48Z</news:publication_date>
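
To illustrate that both values are easy to normalize, here is a minimal Python sketch that parses the two example time stamps above into UTC datetimes. The function names are hypothetical and this is not code from the crawler; treating a value without a time zone as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _as_utc(parsed: datetime) -> datetime:
    # Assumption: a value without time zone information is treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def parse_feed_pub_date(value: str) -> datetime:
    """Parse an RFC 822 style date as found in an RSS <pubDate> element."""
    return _as_utc(parsedate_to_datetime(value.strip()))


def parse_sitemap_pub_date(value: str) -> datetime:
    """Parse an ISO 8601 style date as found in <news:publication_date>."""
    text = value.strip()
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return _as_utc(datetime.fromisoformat(text))


print(parse_feed_pub_date("Thu, 25 Dec 2014 02:10:00 +0900").isoformat())
print(parse_sitemap_pub_date("2016-12-09T16:18:48Z").isoformat())
```

Converting everything to UTC up front keeps later comparisons (for example against a maximum age) independent of the time zone a publisher happens to use.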


sebastian-nagel · 2018-03-12T12:33:39Z

Status update:

<pubDate> (feeds) and <lastmod> (sitemaps) are now used to reject news articles older than 30 days.
TODO:
- add support for <news:publication_date>
- pass this info from feed/sitemap forward and add it to the WARC record
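
The age check described in this status update can be pictured roughly as follows. This is only a sketch in Python with hypothetical names, not the crawler's actual code; in particular, keeping links that carry no date at all is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# 30 days is the threshold mentioned in this thread.
MAX_AGE = timedelta(days=30)


def is_too_old(published: Optional[datetime], now: datetime,
               max_age: timedelta = MAX_AGE) -> bool:
    """Return True if a link should be rejected because its date is too old."""
    if published is None:
        # Assumption: links without any date are kept rather than rejected.
        return False
    return now - published > max_age


now = datetime(2018, 3, 12, tzinfo=timezone.utc)
print(is_too_old(datetime(2018, 1, 1, tzinfo=timezone.utc), now))
print(is_too_old(datetime(2018, 3, 1, tzinfo=timezone.utc), now))
```

Both datetimes must be time zone aware (or both naive), otherwise the subtraction raises a TypeError.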

sebastian-nagel · 2019-05-22T13:23:19Z

The project now uses crawler-commons 1.0, which brings full support for all sitemap extensions, including news sitemaps. The <news:publication_date> is now used to skip older news articles (older than 30 days with the current configuration).
Next steps to implement would be:

- make FeedParserBolt and NewsSiteMapParserBolt store the information in the metadata of the found links. This should be configurable so that also other details
- write the information from the metadata into a WARC metadata record.
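
As a rough idea of what the second step could look like, the sketch below serializes link metadata into "name: value" lines, the style used by the application/warc-fields content type. It is written in Python for brevity and is not the crawler's implementation; the function name, the multi-valued dict layout, and the `publication-date` key are all hypothetical.

```python
from typing import Dict, List


def to_warc_fields(metadata: Dict[str, List[str]]) -> bytes:
    """Serialize multi-valued metadata as CRLF-terminated "name: value" lines."""
    lines = []
    for name, values in metadata.items():
        for value in values:
            # Line breaks inside a value would corrupt the field list.
            clean = value.replace("\r", " ").replace("\n", " ")
            lines.append(f"{name}: {clean}\r\n")
    return "".join(lines).encode("utf-8")


# Hypothetical key name, the value is the sitemap example from this issue.
payload = to_warc_fields({"publication-date": ["2016-12-09T16:18:48Z"]})
print(payload)
```

The resulting bytes would form the payload of the metadata record; linking that record to the corresponding response record is left out here.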

sebastian-nagel mentioned this issue Mar 21, 2018

Full support for sitemap extensions and namespaces #25

Closed
