-
Notifications
You must be signed in to change notification settings - Fork 2
A Java crawler for multiple RSS feed sites. Both text and images could be scraped via HTML parsing.
License
MingjieQian/RSSFeedCrawler
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
RSSFeedCrawler is a crawler for multiple RSS feed sites written in Java. Both text and images could be scraped via HTML parsing. Description: CSS selector expression is used to specify the DOM locations for the text and image path. SHA256 is used instead of MD5 to digest URLs. SourceForge: https://sourceforge.net/projects/rssfeedcrawler GitHup: https://github.com/MingjieQian/RSSFeedCrawler Usage: java -jar path_of_RSSFeedCrawler.jar -sys_conf path_of_sys_conf.txt e.g., java -jar /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/RSSFeedCrawler.jar -sys_conf /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/sys_conf.txt If you want to clean up the old database and create a new database before crawling, please use the following command: java -jar /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/RSSFeedCrawler.jar -sys_conf /home/czhai/mqian2/Toolbox/Java/RSSFeedCrawler/sys_conf.txt -db_clean_up Configurations: a. All the parameters for the crawler are initialized from a file named sys_conf.txt. The sys_conf.txt specifies 1. The saving path for the crawled data 2. File path of an XML file containing the URLs of the RSS sites and XPath for its text and image content 3. Username for mysql database 4. Password for mysql database b. An XML file should be provided to specify the feed channels and the CSS selector syntax for the text and image content in a DOM tree. e.g., data_dir = ***/RSSFeedCrawler/data crawl_conf = ***/RSSFeedCrawler/crawl-sites.xml db_user = root db_pass = 1234 Dependencies: jsoup-*.*.*.jar mysql-connector-java-*.*.**-bin.jar ----------------------------------- Author: Mingjie Qian Version: 1.0 Date: Dec. 18th, 2012
About
A Java crawler for multiple RSS feed sites. Both text and images could be scraped via HTML parsing.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published