Welcome to Jason’s URL word scraper

I accomplished most of the stated objectives: getting the framework up and running with three main pages, adding validation, adding a little CSS to keep things clean, and making the whole thing work about as well as it works.

The algorithm for extracting words from web pages comes out of the Rails stack (a class called TextHelper). After trying some basic XML-stripping approaches of my own, I decided to go with the one that was provided to me, figuring I wasn't going to find a better way in the limited time I had for this little project (there's a rough sketch of the overall flow after the list below). There are some limitations of the current scraping design:

1. It doesn't deal with modern JavaScript (AJAX-y) libraries and complex pages very well. For example,
   my personal site parses correctly (http://jasonnerothin.com/app/index.html), but since it is powered
   by AngularJS, the main content arrives on a second callback request (as JSON).

2. I didn't take the time to chase redirects.

3. Very complex pages (like http://news.google.com) end up counting CSS class names and XML attribute values
   as words. There just isn't a reasonable way for a general-purpose utility to munge its way through that.
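
For the curious, the extraction step boils down to something like the sketch below. This is a rough approximation rather than the exact code in this repo: it assumes Net::HTTP for fetching and Rails' strip_tags helper for markup removal, while the real word-splitting comes from the TextHelper-derived algorithm mentioned above.

```ruby
require 'net/http'

# Roughly what happens per URL: fetch, strip markup, tally words.
# The method name and tokenizer are illustrative, not the repo's actual code.
def word_counts_for(url)
  html = Net::HTTP.get(URI(url))                          # no redirect chasing (limitation 2)
  text = ActionController::Base.helpers.strip_tags(html)  # CSS/JS noise can still leak through (limitation 3)
  text.downcase.scan(/[a-z']+/)
      .each_with_object(Hash.new(0)) { |w, counts| counts[w] += 1 }  # => { "word" => count, ... }
end
```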

Other design issues:

I started out early on (look back in the git history) trying to serialize the word counts to JSON and store them in the page table as a :json text field. This led to a recursive design, and I eventually abandoned it once I realized I wasn't going to switch over to Mongoid to pull it off. So I just added an associated word count table.
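
For reference, the shape of that association is roughly the following; the model and column names here are my shorthand, not a copy of db/schema.rb.

```ruby
# Hypothetical sketch of the Page / WordCount models behind the word count table.
class Page < ActiveRecord::Base
  has_many :word_counts, dependent: :destroy
end

class WordCount < ActiveRecord::Base
  belongs_to :page
  # columns (assumed): word:string, count:integer
end
```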

The code that manages the counts in the page object never really adapted to that change. It changed a lot as I improved my Rails moxie, but it's basically just a 10-member stack that's altered by a push method. I never got back around to pulling the stack out of the Page class (it works directly on the :has_many association field!) and saving the counts off at the end. So all of the words coming back from the webpage processor hit the database one at a time. Big enough pages (news.google) cause 500s when the transaction stack gets too tall, but little static pages go just fine. The main cost is that we burn through a lot of ids on the WordCount objects.
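
If I ever did pull the stack out of the Page class, it would look something like this: tally everything in memory first, keep the top ten, and write those in one transaction at the end instead of hitting the database for every word. The method and column names below are assumptions, not code from this repo.

```ruby
class Page < ActiveRecord::Base
  has_many :word_counts

  # Sketch of the never-implemented fix: count in memory, persist only the top 10 once.
  def record_words(words)
    tallies = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
    top_ten = tallies.sort_by { |_, n| -n }.first(10)
    transaction do
      top_ten.each { |word, n| word_counts.create!(word: word, count: n) }  # 10 inserts, not one per word
    end
  end
end
```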

Let's see, anything else? I didn't go all script-crazy. I tried to leave an easter egg in the CSS, but it doesn't render in Safari. (It's a long-ago deprecated feature of HTML 3 that I sort of miss.) Almost as much as MIDI and animated GIFs…

Okay, that’s it. See you in a bit.
