Add script to save static pages after JS has run
#15639

We want to be able to generate some pages as part of the static site build after JavaScript has run on them. This allows those pages to be populated with data by JavaScript while still appearing to progressively enhance when deployed as part of our static site.

This script fetches a list of URLs from an endpoint on the application that should be scraped with JS enabled. It then uses capybara-webkit to fetch those pages, run their JavaScript, and save the results to disk. The target files land in the same locations `wget` uses, which allows us to overwrite the static files already generated by `wget` with these JS-enhanced versions (see the path-mapping sketch below).

Subsequent commits will need to:

* Create the endpoint that this script uses to determine which URLs to scrape
* Update the build script to run this after `wget` has run
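For illustration, here is a minimal sketch of that URL-to-file mapping (the URL is hypothetical; the real logic lives in the `url_path` and `filename` methods in the script below):

require 'uri'

# Hypothetical input URL: a page on the app being scraped.
url = 'https://example.com/teams/operations'

# Drop the leading slash so the file is written relative to the current
# working directory, mirroring wget's on-disk layout.
path = URI.parse(url).path[1..-1]   # => "teams/operations"

# Append ".html" unless the path already ends in it.
file = path.end_with?('.html') ? path : "#{path}.html"
# => "teams/operations.html" -- the same file wget would have written,
# so saving here overwrites the non-JS version.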
Showing 2 changed files with 79 additions and 1 deletion.
save_javascript_pages.rb (new file)
#!/usr/bin/env ruby
# frozen_string_literal: true

# Usage: save_javascript_pages.rb [URL with text list of URLs to scrape]
#
# This script accepts one argument, which is a URL that returns a text list of
# URLs that this script should scrape and save to disk after jQuery has finished
# running AJAX calls. Files are saved to a filesystem path matching the URL
# path, relative to where you run the script from. This matches `wget` behavior
# and allows this script to overwrite files previously scraped by `wget`.

require 'capybara-webkit'
require 'fileutils'
require 'open-uri'
require 'timeout'

class PageAfterAJAX
  attr_reader :page, :url

  def initialize
    Capybara::Webkit.configure(&:allow_unknown_urls)
    @page = Capybara::Session.new(:webkit)
  end

  def save(url)
    @url = url
    visit_and_wait
    restore_pre_js_page_classes
    write_page_to_disk
    puts "Saved #{filename}"
  end

  private

  def write_page_to_disk
    create_parent_directories
    File.write(filename, page.body)
  end

  def visit_and_wait
    page.visit(url)
    wait_for_ajax
  end

  # Restores page classes modified by running JS on the page
  def restore_pre_js_page_classes
    page.execute_script("$('html').addClass('no-js')")
    page.execute_script("$('html').removeClass('flexwrap')")
  end

  def create_parent_directories
    FileUtils.mkdir_p(File.dirname(filename))
  end

  # Mirrors wget's naming: append ".html" unless the path already ends in it.
  def filename
    url_path[-5..-1] == '.html' ? url_path : "#{url_path}.html"
  end

  # URL path with the leading slash dropped, so files are written relative
  # to the current working directory.
  def url_path
    URI.parse(url).path[1..-1]
  end

  # Polls until jQuery reports no in-flight AJAX requests, bounded by
  # Capybara's default wait time.
  def wait_for_ajax
    Timeout.timeout(Capybara.default_max_wait_time) do
      loop until finished_all_ajax_requests?
    end
  end

  def finished_all_ajax_requests?
    page.evaluate_script('jQuery.active').zero?
  end
end

page = PageAfterAJAX.new
javascript_pages_to_scrape = open(ARGV[0]).read.split

javascript_pages_to_scrape.each do |url|
  page.save(url)
end
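For reference, a hypothetical invocation (the endpoint URL and its response are illustrative; the real endpoint arrives in a subsequent commit):

# Hypothetical invocation; the endpoint does not exist yet.
#   ruby save_javascript_pages.rb http://localhost:3000/javascript_pages.txt
#
# The endpoint body is expected to be a whitespace-separated list of URLs,
# for example:
#   http://localhost:3000/teams
#   http://localhost:3000/locations
#
# which this script would save to teams.html and locations.html.

The wait loop leans on `jQuery.active`, jQuery's count of in-flight XHR requests, so a page is only written once every AJAX call has settled; `Timeout.timeout` bounds the wait so a page with a stuck request raises an error and fails the build rather than hanging it.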