ContentUrls

Find and rewrite URLs in different types of content.

ContentUrls was developed to address two use cases:

- Find each URL in content retrieved from a website, in order to spider the site and find all of its content.
- Rewrite each URL in content retrieved from a website, in order to make a working local copy of the site.

Features

- Three types of content: HTML, CSS and JavaScript
  - HTML content
    - <a> tag href attribute
    - <area> tag href attribute
    - <body> tag background attribute
    - <embed> tag src attribute
    - <frame> tag src attribute
    - <iframe> tag src attribute
    - <img> tag src attribute
    - <link> tag href attribute
    - <meta> tag content attribute containing a URL
    - <object> tag data attribute
    - <script> tag src attribute
    - style attribute of any tag (parsed as CSS content)
    - body of <style> tag (parsed as CSS content)
    - body of <script> tag when the type or language attribute identifies JavaScript (parsed as JavaScript content)
  - CSS content
    - url() notation
  - JavaScript content
    - URLs matched by the URI module’s REGEXP
- Can convert relative URLs to absolute URLs when the resource URL is provided
- Can convert relative URLs to absolute URLs when a base URL is found in the HTML content
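The CSS, JavaScript and relative-URL items above can be illustrated with Ruby's standard library alone. This is a minimal sketch, not the gem's implementation: the helper names css_urls, js_urls and absolutize are hypothetical, the url() regular expression is a simplification, and the gem lists css_parser and rkelly as requirements, so its real parsing likely differs.

```ruby
require 'uri'

# Stand-in for CSS handling: pull the argument out of each url(...) token.
# The regex is a simplification for illustration, not a full CSS parser.
def css_urls(css)
  pattern = /url\(\s*(?:"([^"]*)"|'([^']*)'|([^)\s]*))\s*\)/i
  css.scan(pattern).map { |groups| groups.compact.first }
end

# Stand-in for JavaScript handling: find absolute URLs using the regular
# expression that Ruby's URI library builds for the given schemes.
def js_urls(js)
  URI::RFC2396_Parser.new.extract(js, %w[http https])
end

# Resolve a possibly relative URL against the URL of the resource it came
# from (or against a <base href> value found in the HTML).
def absolutize(url, base)
  URI.join(base, url).to_s
end

css = 'body { background: url("img/bg.png") } a { cursor: url(/cur/hand.cur), auto }'
js  = 'var api = "http://example.com/api/v1";'

puts css_urls(css).inspect
puts js_urls(js).inspect
puts absolutize('img/bg.png', 'http://example.com/styles/site.css')
```

Resolution follows the usual rules: a path without a leading slash is resolved relative to the directory of the base URL, while a path with a leading slash replaces the whole path.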

Examples

Find URLs in an HTML document

Provide the HTML content and its content type to obtain an array of unique URLs.

ContentUrls.urls(html, 'text/html').each do |url|
  puts "Found URL: #{url}"
end
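The spidering use case can be sketched as a breadth-first crawl wrapped around a URL extractor. Everything here is an assumption made for illustration: spider, fetch and extract are hypothetical names, the website is an in-memory hash so the example runs without network access, and the href regex only stands in for a real extractor. With the gem installed, extract could instead wrap ContentUrls.urls.

```ruby
require 'set'
require 'uri'

# Breadth-first crawl limited to the start URL's host.
# fetch:   callable, url -> [content, content_type] or nil   (hypothetical)
# extract: callable, (content, content_type) -> list of URLs (hypothetical)
def spider(start_url, fetch:, extract:)
  host  = URI(start_url).host
  seen  = Set.new([start_url])
  queue = [start_url]
  until queue.empty?
    current = queue.shift
    content, type = fetch.call(current)
    next if content.nil? # nothing retrievable at this URL
    extract.call(content, type).each do |found|
      absolute = URI.join(current, found.to_s).to_s
      next unless URI(absolute).host == host # stay on the same site
      queue << absolute if seen.add?(absolute) # add? returns nil if already seen
    end
  end
  seen.to_a
end

# In-memory stand-in for a website.
site = {
  'http://example.com/' =>
    ['<a href="about.html">About</a>', 'text/html'],
  'http://example.com/about.html' =>
    ['<a href="/">Home</a> <a href="http://other.example/">Elsewhere</a>', 'text/html']
}
fetch   = ->(url) { site[url] }
extract = ->(content, _type) { content.scan(/href="([^"]*)"/).flatten }

puts spider('http://example.com/', fetch: fetch, extract: extract)
```

The seen set is what keeps the crawl finite: the link from the about page back to the home page resolves to a URL that was already visited, so it is not queued again.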

Rewrite URLs in an HTML document

Provide the HTML content, the content type, and a block to rewrite each URL’s extension.

rewritten_html = ContentUrls.rewrite_each_url(html, 'text/html') { |url| url.sub(/\.htm\z/, '.html') }
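For the local-copy use case, the rewrite block needs to map each URL on the site to a path inside the mirror. The following is a minimal sketch of such a mapping, assuming the block receives a value that responds to to_s. The helper name local_path_for and the index.html convention for directory URLs are hypothetical choices, not part of the gem.

```ruby
require 'uri'

# Map a URL to a path relative to the local mirror root.
# URLs on other hosts, and relative URLs with no host, are returned unchanged.
# Query strings and fragments are dropped in this simplified version.
def local_path_for(url, site_host)
  uri = URI(url.to_s)
  return url.to_s unless uri.host == site_host

  path = uri.path.to_s
  path = '/' if path.empty?
  path += 'index.html' if path.end_with?('/') # directory URL -> index file
  path.sub(%r{\A/}, '')                       # strip the leading slash
end

puts local_path_for('http://example.com/', 'example.com')
puts local_path_for('http://example.com/docs/', 'example.com')
puts local_path_for('http://example.com/css/site.css', 'example.com')
puts local_path_for('http://other.example/logo.png', 'example.com')
```

Under that assumption, the helper could be called from the block passed to ContentUrls.rewrite_each_url, with 'example.com' replaced by the host being mirrored.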

Requirements

nokogiri
css_parser
rkelly

Development

To test and develop this gem, additional requirements are:

bundler
jeweler
rake
rdoc
rspec
yard

Goals for ContentUrls

- Include support for:
  - Acrobat (.pdf)
  - Flash (.swf)
  - Microsoft Office (.doc, .xls, .ppt)
  - text (regular expression for URLs)
- Capture links retrieved from a headless web browser that executes the page's code (JavaScript, etc.)

Contributing to content_urls

Check out the latest master to make sure the feature hasn’t been implemented or the bug hasn’t been fixed yet.
Check out the issue tracker to make sure someone hasn’t already requested it and/or contributed it.
Fork the project.
Start a feature/bugfix branch.
Commit and push until you are happy with your contribution.
Make sure to add tests for it. This is important so I don’t unintentionally break it in a future version.
Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or a change is otherwise necessary, that is fine, but please isolate it in its own commit so I can cherry-pick around it.
