Welcome to Jason’s URL word scraper

I accomplished most of the stated objectives: getting the framework up and running with three main pages, adding validation, adding a little CSS to keep things clean, and making the whole thing work about as well as it works.

The algorithm for extracting words from web pages comes out of the Rails stack (a class called TextHelper). After trying some basic XML-stripping approaches of my own, I decided to go with the one that was provided to me, figuring I wasn't going to find a better way in the limited time I had for this little project (there's a rough sketch of the overall flow after the list below). There are some limitations of the current scraping design:

1. It doesn't deal with modern JavaScript (AJAX-y) libraries and complex pages very well. For example,
   my personal site parses correctly (http://jasonnerothin.com/app/index.html), but since it is powered
   by AngularJS, the main content arrives on a second callback request (as JSON).

2. I didn't take the time to chase redirects.

3. Very complex pages (like http://news.google.com) end up counting CSS class names and XML attribute values
   as words. There just isn't a reasonable way for a general-purpose utility to munge its way through that.
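
For the curious, the extraction step boils down to something like the sketch below. This is a rough approximation rather than the exact code in this repo: it assumes Net::HTTP for fetching and Rails' strip_tags helper for markup removal, while the real word-splitting comes from the TextHelper-derived algorithm mentioned above.

```ruby
require 'net/http'

# Roughly what happens per URL: fetch, strip markup, tally words.
# The method name and tokenizer are illustrative, not the repo's actual code.
def word_counts_for(url)
  html = Net::HTTP.get(URI(url))                          # no redirect chasing (limitation 2)
  text = ActionController::Base.helpers.strip_tags(html)  # CSS/JS noise can still leak through (limitation 3)
  text.downcase.scan(/[a-z']+/)
      .each_with_object(Hash.new(0)) { |w, counts| counts[w] += 1 }  # => { "word" => count, ... }
end
```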

Other design issues:

I started out early on (look back in the git history) trying to serialize the word counts to JSON and store them in the page table as a :json text field. This led to a recursive design, and I eventually abandoned it once I realized I wasn't going to switch over to Mongoid to pull it off. So I just added an associated word count table.
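
For reference, the shape of that association is roughly the following; the model and column names here are my shorthand, not a copy of db/schema.rb.

```ruby
# Hypothetical sketch of the Page / WordCount models behind the word count table.
class Page < ActiveRecord::Base
  has_many :word_counts, dependent: :destroy
end

class WordCount < ActiveRecord::Base
  belongs_to :page
  # columns (assumed): word:string, count:integer
end
```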

The code that manages the counts in the page object never really adapted to that change. It changed a lot as I improved my Rails moxie, but it's basically just a 10-member stack that's altered by a push method. I never got back around to pulling the stack out of the Page class (it works directly on the :has_many association field!) and saving the counts off at the end. So all of the words coming back from the webpage processor hit the database one at a time. Big enough pages (news.google) cause 500s when the transaction stack gets too tall, but little static pages go just fine. The main cost is that we burn through a lot of ids on the WordCount objects.
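
If I ever did pull the stack out of the Page class, it would look something like this: tally everything in memory first, keep the top ten, and write those in one transaction at the end instead of hitting the database for every word. The method and column names below are assumptions, not code from this repo.

```ruby
class Page < ActiveRecord::Base
  has_many :word_counts

  # Sketch of the never-implemented fix: count in memory, persist only the top 10 once.
  def record_words(words)
    tallies = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
    top_ten = tallies.sort_by { |_, n| -n }.first(10)
    transaction do
      top_ten.each { |word, n| word_counts.create!(word: word, count: n) }  # 10 inserts, not one per word
    end
  end
end
```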

Let's see, anything else? I didn't go all script-crazy. I tried to leave an easter egg in the CSS, but it doesn't render in Safari. (It's a long-ago deprecated feature of HTML 3 that I sort of miss.) Almost as much as MIDI and animated GIFs…

Okay, that’s it. See you in a bit.
