Java middleware and direct invocation api for web page snapshotting services used to make javascript rendered web pages crawlable / parseable by bots.

cainaf/webapp-snapshot-java

 
 

Webapp Snapshot Java

This project contains a Java middleware implementation and a direct invocation API for existing web page snapshotting service providers. A snapshotting service captures a web page after executing its JavaScript, so that search engines and bots that do not execute JavaScript can parse the rendered page. This is helpful if you have a JavaScript web app (Backbone, Angular, Ember.js, etc.).

There are two parts to this code:

  1. Java middleware, implemented as a servlet filter, that detects whether a search engine / bot is making a request and, if so, uses the web page snapshotting service to return a response.
  2. An API to explicitly snapshot your web pages.

The code is based upon https://github.com/greengerong/prerender-java, though it differs from that project in some respects.

Note: If you are using a # in your URLs, make sure to change it to #!. See Google's AJAX crawling protocol for details.
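To illustrate why the #! form matters: under Google's AJAX crawling protocol, a crawler requests a #! URL with the fragment moved into an _escaped_fragment_ query parameter, which is what the filter looks for. The following is a minimal sketch of that mapping. The class and method names are hypothetical and are not part of this library, and the escaping is simplified to the characters the protocol singles out.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of webapp-snapshot-java. */
final class EscapedFragmentUrls {

    /** Rewrites "base#!fragment" into "base?_escaped_fragment_=fragment". */
    static String toEscapedFragmentUrl(String prettyUrl) {
        int idx = prettyUrl.indexOf("#!");
        if (idx < 0) {
            return prettyUrl; // no hash-bang, nothing to map
        }
        String base = prettyUrl.substring(0, idx);
        String fragment = prettyUrl.substring(idx + 2);
        // Append to an existing query string if one is already present.
        String separator = base.contains("?") ? "&" : "?";
        return base + separator + "_escaped_fragment_=" + escape(fragment);
    }

    /** Percent-encodes control/space, non-ASCII bytes, and '#', '%', '&', '+'. */
    private static String escape(String fragment) {
        StringBuilder out = new StringBuilder();
        for (byte b : fragment.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean mustEscape = c <= 0x20 || c >= 0x7F
                    || c == '#' || c == '%' || c == '&' || c == '+';
            if (mustEscape) {
                out.append(String.format("%%%02X", c));
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }
}
```

For example, http://example.com/#!/users/1 would be requested by a crawler as http://example.com/?_escaped_fragment_=/users/1.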

Note: Make sure you have more than one webserver thread/process running because the snapshotting service will make a request to your server to render the HTML.

Middleware / Servlet Filter

How the servlet filter works:

  1. Check if a webpage snapshot is required
    1. Check if the request is from a crawler (_escaped_fragment_ query parameter or user agent string)
    2. Check to make sure we aren't requesting a resource (js, css, etc...)
    3. (optional) Check to make sure the url is in the whitelist
    4. (optional) Check to make sure the url isn't in the blacklist
  2. If a snapshot is required
    1. (optional) Invoke SeoFilterEventHandler.beforeSnapshot to check if a snapshot is available. If so, use this as the snapshot and skip the remaining steps.
    2. Make a request to the snapshotting service to get a snapshot.
    3. (optional) Invoke SeoFilterEventHandler.afterSnapshot with the snapshot (for persistence / logging)
    4. Return the snapshot result to the crawler.
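The first step of the list above (deciding whether a snapshot is required) can be sketched roughly as follows. This is not the filter's actual code: the class name, the default crawler and resource lists, and the substring-based whitelist / blacklist matching are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the "is a snapshot required?" checks; not the real SeoFilter. */
final class SnapshotDecision {

    // Assumed example values; the real filter ships its own defaults.
    static final List<String> CRAWLER_AGENTS = List.of("googlebot", "bingbot", "yandex");
    static final List<String> RESOURCE_EXTENSIONS =
            List.of(".js", ".css", ".png", ".jpg", ".gif", ".ico");

    static boolean isSnapshotRequired(String path, String queryString, String userAgent,
                                      List<String> whitelist, List<String> blacklist) {
        if (!isCrawler(queryString, userAgent)) {
            return false; // 1.1: not a crawler
        }
        if (isResource(path)) {
            return false; // 1.2: static resource
        }
        if (whitelist != null && !whitelist.isEmpty() && !matchesAny(path, whitelist)) {
            return false; // 1.3: whitelist set, but path not in it
        }
        if (blacklist != null && !blacklist.isEmpty() && matchesAny(path, blacklist)) {
            return false; // 1.4: path is blacklisted
        }
        return true;
    }

    private static boolean isCrawler(String queryString, String userAgent) {
        if (queryString != null && queryString.contains("_escaped_fragment_")) {
            return true;
        }
        if (userAgent == null) {
            return false;
        }
        String agent = userAgent.toLowerCase(Locale.ROOT);
        return CRAWLER_AGENTS.stream().anyMatch(agent::contains);
    }

    private static boolean isResource(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        return RESOURCE_EXTENSIONS.stream().anyMatch(lower::endsWith);
    }

    // Assumption: entries are matched as plain substrings of the path.
    private static boolean matchesAny(String path, List<String> patterns) {
        return patterns.stream().anyMatch(path::contains);
    }
}
```

Running the cheap checks (crawler detection, resource extension) before the list lookups keeps the cost for ordinary browser requests low, since those make up most traffic.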

Installing the servlet filter

[1] Modify your pom.xml

<dependency>
  <groupId>com.github.avaliani.snapshot</groupId>
  <artifactId>webapp-snapshot-java</artifactId>
  <version>1.0</version>
</dependency>

[2] Modify your web.xml (you will probably want to add this filter prior to all other filters)

<filter>
    <filter-name>SeoFilter</filter-name>
    <filter-class>com.github.avaliani.snapshot.SeoFilter</filter-class>
    <init-param>
        <param-name>snapshotService</param-name>
        <param-value>com.github.avaliani.snapshot.AjaxSnapshotsSnapshotService</param-value>
    </init-param>
    <init-param>
        <param-name>snapshotServiceToken</param-name>
        <param-value>{token}</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>SeoFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>

Servlet filter initialization parameters:

All parameters are optional except the parameter used to specify the snapshot service token: snapshotServiceToken or snapshotServiceTokenProvider. By default AjaxSnapshotsSnapshotService is used as the snapshotService.

Snapshot service parameters:

  • snapshotService - the snapshotting service. Two built-in services are available: (1) com.github.avaliani.snapshot.AjaxSnapshotsSnapshotService and (2) com.github.avaliani.snapshot.PrerenderSnapshotService. Or you can implement your own.
  • snapshotServiceHeaders - headers to use when making a request to the snapshotting service. Specified as semicolon-separated headerName and headerValue pairs, e.g. "X-AJS-SNAP-TIME=2000;{headerName2}={headerValue2}"
  • snapshotServiceToken - specifies the snapshot service token
  • snapshotServiceTokenProvider - used if you want to generate your snapshot service token from a class and not from web.xml. The class must implement com.github.avaliani.snapshot.SnapshotServiceTokenProvider
  • snapshotServiceUrl - used to specify an explicit url for the snapshotting service. If not specified the default url for the snapshotting service will be used.
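As an illustration of the snapshotServiceHeaders format described above, here is a minimal sketch of how such a value could be split into header name / value pairs. The class and method are hypothetical; the library's own parsing may differ, for example in how it treats whitespace or malformed entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for "name1=value1;name2=value2"; not the library's code. */
final class SnapshotHeaderParser {

    static Map<String, String> parse(String raw) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (raw == null || raw.trim().isEmpty()) {
            return headers;
        }
        for (String pair : raw.split(";")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing or doubled semicolon
            }
            // Split on the first '=' only, so values may themselves contain '='.
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException(
                        "Expected headerName=headerValue but got: " + trimmed);
            }
            headers.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
        }
        return headers;
    }
}
```

With this sketch, "X-AJS-SNAP-TIME=2000;X-Other=abc" yields two entries, keyed by X-AJS-SNAP-TIME and X-Other (the second header name is made up for the example).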

Request selection parameters:

  • crawlerUserAgents - additional user agents to check for
  • whitelist - if set and the request url is not in the whitelist it is not snapshotted
  • blacklist - if set and the request url is in the blacklist it is not snapshotted

Other parameters:

  • loggingLevel - the java.util.logging Level at which to write logs. The default logging level is FINE.
  • seoFilterEventHandler - event handler to be invoked before and after taking snapshots.

Snapshot API

See com.github.avaliani.snapshot.SnapshotService for the API. Two built-in services are available: (1) com.github.avaliani.snapshot.AjaxSnapshotsSnapshotService and (2) com.github.avaliani.snapshot.PrerenderSnapshotService.

Testing

If you want to make sure your pages are rendering correctly:

  1. Open the Developer Tools in Chrome (Cmd + Alt + J)
  2. Click the Settings gear in the bottom right corner.
  3. Click "Overrides" on the left side of the settings panel.
  4. Check the "User Agent" checkbox.
  5. Choose "Other..." from the User Agent dropdown.
  6. Type googlebot into the input box.
  7. Refresh the page (make sure to keep the developer tools open).
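The same check can also be done without a browser by sending a request with a crawler user agent and inspecting the returned HTML. The following is a minimal sketch using the JDK's java.net.http client (Java 11+). The class name is hypothetical, and the URL you pass in is whatever address your own server runs on.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical test helper; not part of webapp-snapshot-java. */
final class CrawlerRequestCheck {

    /** Builds a GET request that identifies itself as googlebot. */
    static HttpRequest buildCrawlerRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "googlebot")
                .GET()
                .build();
    }

    /** Sends the request and returns the response body as a string. */
    static String fetchAsCrawler(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildCrawlerRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Pass the page to check as the first argument.
        System.out.println(fetchAsCrawler(args[0]));
    }
}
```

If the filter is active, the printed HTML should be the rendered snapshot rather than the bare page shell that loads your JavaScript.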

License

The MIT License (MIT)
