Skip to content

A web scraper using Webrat::MechanizeSession - does acceptance-based web scraping relying on the great webrat behavior description language.

License

Notifications You must be signed in to change notification settings

dennisjbell/webrat-scraper

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

webrat-scraper

A web scraper built on Webrat::Mechanize that traverses the web and allows access to the webrat session through Mechanize and Nokogiri objects

How to use

Install

gem sources -a http://gems.github.com
gem install jtzemp-webrat-scraper

Use

require 'webrat_scraper'

class MyScraper < WebratScraper
  def initialize
    @url = "http://www.google.com"
    super
  end

  def first_result_for(search_term)
    visit @url

    fill_in "q", :with => search_term
    click_button

    first_link = (doc/"li.g a.l").first

    link = {}
    link[:text] = first_link.inner_text
    link[:url]  = first_link.attributes["href"].to_s
    link
  end
end

m = MyScraper.new
result = m.first_result_for("webrat-mechanize")
puts result.inspect

For more info check out documentation for:

  • Webrat

  • Mechanize

  • Nokogiri

    • CSS Selectors

    • XPath Selectors

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add specs for it. This is important so I don’t break it in a future version unintentionally.

  • Commit, do not mess with rakefile, version, or history. (if you want to have your own version, that is fine but bump version in a commit by itself I can ignore when I pull)

  • Send me a pull request. Bonus points for topic branches.

Copyright © 2009 JT Zemp. See LICENSE for details.

About

A web scraper using Webrat::MechanizeSession - does acceptance-based web scraping relying on the great webrat behavior description language.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published