GitHub - MingjieQian/RSSFeedCrawler: A Java crawler for multiple RSS feed sites. Both text and images could be scraped via HTML parsing.

RSSFeedCrawler is a crawler for multiple RSS feed sites, written in Java. Both text and images can be scraped via HTML parsing.

Description:
CSS selector expressions are used to specify the DOM locations of the text and image paths.
SHA256 is used instead of MD5 to digest URLs.
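
As an illustration of the URL digest mentioned above, here is a minimal sketch using the JDK's MessageDigest. The class name UrlDigest, the method name sha256Hex, and the sample URL are hypothetical and are not taken from the crawler's source; how the crawler actually encodes or uses the digest is not stated here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of RSSFeedCrawler: digests a URL with SHA-256.
public class UrlDigest {

    // Returns the SHA-256 digest of the URL as a 64-character lowercase hex string.
    static String sha256Hex(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder URL, used only for demonstration.
        System.out.println(sha256Hex("https://example.com/feed/item-1"));
    }
}
```

A fixed-length digest like this is convenient as a database key or file name for a crawled page, since the same URL always maps to the same string.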

SourceForge:
https://sourceforge.net/projects/rssfeedcrawler

GitHub:
https://github.com/MingjieQian/RSSFeedCrawler

Usage:
java -jar path_of_RSSFeedCrawler.jar -sys_conf path_of_sys_conf.txt

e.g.,
java -jar /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/RSSFeedCrawler.jar -sys_conf /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/sys_conf.txt

To clean up the old database and create a new one before crawling, add the -db_clean_up flag:
java -jar /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/RSSFeedCrawler.jar -sys_conf /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/sys_conf.txt -db_clean_up
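
The two command lines above differ only in the optional -db_clean_up flag. As a rough sketch of how such arguments could be handled, assuming -sys_conf takes one value and -db_clean_up takes none, consider the following. The class CrawlerArgs is hypothetical and is not the crawler's actual argument parser.

```java
// Hypothetical sketch, not RSSFeedCrawler's own code: parses the two documented flags.
public class CrawlerArgs {
    final String sysConfPath;  // value following -sys_conf
    final boolean dbCleanUp;   // true when -db_clean_up is present

    CrawlerArgs(String sysConfPath, boolean dbCleanUp) {
        this.sysConfPath = sysConfPath;
        this.dbCleanUp = dbCleanUp;
    }

    static CrawlerArgs parse(String[] args) {
        String conf = null;
        boolean cleanUp = false;
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-sys_conf")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-sys_conf requires a file path");
                }
                conf = args[++i];
            } else if (args[i].equals("-db_clean_up")) {
                cleanUp = true;
            } else {
                throw new IllegalArgumentException("Unknown argument: " + args[i]);
            }
        }
        if (conf == null) {
            throw new IllegalArgumentException("Missing required argument: -sys_conf");
        }
        return new CrawlerArgs(conf, cleanUp);
    }

    public static void main(String[] args) {
        CrawlerArgs parsed = parse(args);
        System.out.println("sys_conf = " + parsed.sysConfPath
                + ", db_clean_up = " + parsed.dbCleanUp);
    }
}
```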

Configurations:
a. All crawler parameters are read from a file named sys_conf.txt, which specifies:
	1. The directory where the crawled data is saved
	2. The path of an XML file containing the URLs of the RSS sites and the CSS selectors for their text and image content
	3. The username for the MySQL database
	4. The password for the MySQL database
b. An XML file must be provided to specify the feed channels and the CSS selectors that locate the text and image content in the DOM tree.

e.g., a sys_conf.txt file:
data_dir = ***/RSSFeedCrawler/data
crawl_conf = ***/RSSFeedCrawler/crawl-sites.xml
db_user = root
db_pass = 1234
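
The "key = value" layout shown above happens to be readable by java.util.Properties. The sketch below reads the four documented keys that way; it is an assumption about one convenient approach, not the crawler's own loading code. The class SysConfReader and the sample values are hypothetical. Note that Properties treats backslashes as escape characters, so Windows-style paths would need forward slashes or doubled backslashes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch, not RSSFeedCrawler's own code: reads sys_conf.txt-style settings.
public class SysConfReader {

    // Parses "key = value" lines; whitespace around '=' is ignored by Properties.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the trimmed value for a key, or fails if it is missing or empty.
    static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required setting: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) {
        // Placeholder values for demonstration only.
        String sample = "data_dir = /tmp/RSSFeedCrawler/data\n"
                + "crawl_conf = /tmp/RSSFeedCrawler/crawl-sites.xml\n"
                + "db_user = root\n"
                + "db_pass = 1234\n";
        Properties props = parse(sample);
        for (String key : new String[]{"data_dir", "crawl_conf", "db_user", "db_pass"}) {
            System.out.println(key + " -> " + require(props, key));
        }
    }
}
```

Failing fast on a missing key surfaces a misconfigured sys_conf.txt before any crawling or database work starts.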

Dependencies:

jsoup-*.*.*.jar
mysql-connector-java-*.*.**-bin.jar

-----------------------------------
Author: Mingjie Qian
Version: 1.0
Date: Dec. 18th, 2012