Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Request Body Canonicalization specification #149

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions cdxj/0.1.0/index.html
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
latestVersion: "https://specs.webrecorder.net/cdxj/latest/",
shortName: "cdxj",
includePermalinks: true,
group: "CDX"
editors: [
{
name: "Ilya Kreymer",
Expand Down
1 change: 1 addition & 0 deletions index.html
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,7 @@
* [WACZ Signing and Verification](wacz-auth/latest/): the mechanics for signing and verifying WACZ files for proof of authenticity
* [Crawl Index JSON (CDXJ)](cdxj/latest/): an extensible format for WARC index files
* [IPFS Custom File Chunking for WARC and WACZ](wacz-ipfs/latest): a specialized content-aware chunking strategy for adding composite web archive files to IPFS
* [Request Body Canonicalization](request-body-canonicalization/latest/): a format for canonicalizing non-GET HTTP requests as GET requests for indexing and replay

</section>
<section data-format="markdown">
Expand Down
69 changes: 69 additions & 0 deletions request-body-canonicalization/latest/index.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,69 @@
{
"specStatus": "DRAFT",
"respec_js": "../../assets/js/respec-webrecorder.js",
"publishDate": "2024-04-08",
"license": "cc-by",
"thisVersion": "https://specs.webrecorder.net/request-body-canonicalization/latest/",
"latestVersion": "https://specs.webrecorder.net/request-body-canonicalization/latest/",
"shortName": "request-body-canonicalization",
"group": "CDX",
"includePermalinks": true,
"authors": [],
"editors": [
{
"name": "Alex Osborne",
"url": "https://github.com/ato"
},
{
"name": "Tessa Walsh",
"url": "https://bitarchivist.net"
},
{
"name": "Ilya Kreymer",
"url": "https://github.com/ikreymer"
}
],
"group": {
"name": "WACZ Editors",
"url": "https://webrecorder.net"
},
"otherLinks": [
{
"key": "Repository",
"data": [
{
"value": "Github",
"href": "https://github.com/webrecorder/specs"
},
{
"value": "Issues",
"href": "https://github.com/webrecorder/specs/issues"
},
{
"value": "Commits",
"href": "https://github.com/webrecorder/specs/commits"
}
]
}
],
"maxTocLevel": 3,
"logos": [
{
"src": "../../assets/images/webrecorder.svg",
"alt": "Webrecorder Logo",
"height": 100
}
],
"localBiblio": {
"PYWB-CDXJ": {
"title": "pywb Indexing: CDXJ Format",
"publisher": "Webrecorder",
"href": "https://pywb.readthedocs.io/en/latest/manual/indexing.html#cdxj-index"
}
},
"lint": {
"privsec-section": false,
"no-http-props": false,
"no-headingless-sections": false
}
}
279 changes: 279 additions & 0 deletions request-body-canonicalization/latest/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,279 @@
# Request Body Canonicalization

## Abstract

Originally, CDX files were only used to index web archives containing GET requests. As browser-based capture methods can record non-GET requests such as those generated by JavaScript, a way for CDX/CDXJ index records to differentiate based on request method and request body is needed. This document describes the mechanism used for encoding the request method and body in the CDX/CDXJ key by appending additional query parameters, as originally implemented by pywb.

## Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY and MUST in this document are to be interpreted as described in BCP 14 [RFC2119][1] [RFC8174][2] when, and only when, they appear in all capitals, as shown here.

## Terminology

- CDX
- CDXJ
- WACZ
- WARC

## Introduction

### Web Archive Formats (WARC and WACZ)

Web archiving data is often stored in specialized formats, which include a full record of the HTTP network traffic as well as additional metadata. The archived data is often accessed via random-access, loading the appropriate chunks of data based on URLs requested by end users.

Web archiving data is often stored in two key file formats:

1. WARC — A widely accepted [ISO standard][3] used by many institutions around the world for storing web archive data.
2. WACZ — A new format [developed by Webrecorder][4] for packaging WARCs with other web archive data enabling efficient random-access reads.

Both formats are 'composite' formats, containing smaller amounts of data interspersed with metadata. In the case of WARC, the format consists of concatenated records which are appended one after the other, eg. `cat A.warc B.warc > C.warc`. The WARCs MAY be gzipped, in which case the result is a multi-member gzip.

WACZ files use the ZIP format, which contains a specialized file and directory layout. ZIP is also a composite format, containing the raw (sometimes compressed) data as well as header data which contains the location files and directories within the ZIP file.

## Web Archive Index Formats (CDX and CDXJ)

Web archive search and retrieval is frequently intermediated by index files of WARC data in the CDX or CDXJ formats. WACZ files contain CDXJ indices, which MAY be gzipped, within the ZIP file that comprises the WACZ.

### CDX

CDX is a web archive index format developed as part of the Internet Archive's Wayback Machine, where CDX may have been an acronym for Crawl (or Capture) inDeX. A CDX file consists of plain text, with the first line being a legend and each line afterwards describing a web document. More information about how the format works can be found in the [CDX specification][5].

CDX was the precursor to the CDXJ index format.

### Crawl Index JSON (CDXJ)

Crawl Index JSON or [CDXJ](4) provides a standardized way of representing an index to one or more WARC files. It allows applications to quickly locate a given page in a set of archived web content, as well as metadata associated with that page. Each CDXJ entry can be looked up by URL, and contains a JSON payload that can be used for representing information about that URL. It is used in the [WACZ specification][4].

A CDXJ file is a sorted, line oriented plain-text file (optionally GZIP compressed) where each line represents information about a single captured URL in a web archive collection.

Each line MUST have three components that are separated by single spaces (0x20):

1. a Searchable URL
2. an Integer Timestamp
3. a JSON Block

The Searchable URL is a normalized form of the archived URL that allows a CDXJ file to be sorted and efficiently scanned using a binary search algorithm. The Searchable URL is sometimes referred to as Sort-friendly URI Reordering Transform (SURT).

The JSON Block contains a serialized [JSON][7] object with newlines escaped so that it fits completely on one line. The object MUST contain the following properties:

* url: The URL that was archived
Shrinks99 marked this conversation as resolved.
Show resolved Hide resolved
* digest: A cryptographic hash for the HTTP response payload
* mime: The media type for the response payload
* filename: the WARC file where the WARC record is located
* offset: the byte offset for the WARC record
* length: the length in bytes of the WARC record
* status: the HTTP status code for the HTTP response

## Indexing non-GET HTTP requests

### Motivation

Request body canonicalization provides a standardized way of representing a non-GET HTTP request as a GET request for indexing and playback in web archives. The original HTTP request type as well as the encoded request body are appended to the original URL and included in CDX/CDXJ indices as the Searchable URL. This allows web archive playback engines to then reconstruct the original non-GET requests for use in playback with their original HTTP method and request body.

### Encoding the request method

If the request method is not `GET` it MUST be appended as the value of query parameter `__wb_method`.

If the URL does not have a query string a `?` MUST be added:

http://example.org/ => http://example.org/?__wb_method=POST

If the URL already has a query string the `__wb_method` parameter MUST be added at the end after a `&` separator:

http://example.org/?page=1 => http://example.org/?page=1&__wb_method=POST

Even if the query string already ends in `&` another separator MUST still be added:

http://example.org/?foo& => http://example.org/?foo&&__wb_method=POST

### Encoding the request body

Encoding the request body depends on the content-type.

| Content-Type | Primary Encoding | Fallback Encoding |
|-----------------------------------|------------------|-------------------|
| application/json | JSON | |
| application/x-amf | AMF | |
| application/x-www-form-urlencoded | urlencoded form | binary |
| multipart/* | multipart form | binary |
| text/plain | JSON | binary |
| * | binary | |

#### AMF (Action Message Format) request body encoding

AMF request body encoding is considered experimental and is only supported in pywb. It is possible this feature will be deprecated in the future.

The current ([pywb implementation of AMF request body encoding][7]) and ([associated tests][8]) are available in the pywb repository.

#### Binary request body encoding

The request body is encoded as Base64 ([RFC 4648][9]) and appended to the query string as the `__wb_post_data` parameter.

> **Example**
>
> Original request:
>
> POST /chat HTTP/1.0
> Host: example.org
> Content-Length: 5
>
> hello
>
> Encoded URL:
>
> http://example.org/chat?__wb_method=POST&__wb_post_data=aGVsbG8=

#### Encoding a urlencoded form request body

Decode the body to a string using UTF-8, percent decoded the string, **percent plus encode** it and then append the result to the output.

If a UTF-8 decoding error occurs then the binary encoding method MUST be used instead.

> **Example**
>
> Original request:
>
> POST / HTTP/1.0
> Host: example.org
> Content-Type: application/x-www-form-urlencoded
> Content-Length: 13
>
> say=Hi&to=Mom
>
> Encoded URL:
>
> http://example.org/?__wb_method=POST&__wb_post_data=say%3DHi%26to%3DMom

#### Encoding a multipart form request body

The body MUST be decoded as form data per [RFC 2388][10] and then percent plus encoded. If the body is not a valid multipart/form-data message then the binary encoding method MUST be used instead.

> **Example**
>
> Original request:
>
> POST / HTTP/1.1
> Host: example.org
> Content-Type: multipart/form-data; boundary=AaB03x
> Content-Length: Content-Length: 437
>
> --AaB03x
> Content-Disposition: form-data; name="submit-value"
>
> Example
> --AaB03x
> Content-Disposition: form-data; name="files"
> Content-Type: multipart/mixed; boundary=BbC04y
>
> --BbC04y
> Content-Disposition: file; filename="file1.txt"
> Content-Type: text/plain
>
> Content of file1.txt.
>
> --BbC04y
> Content-Disposition: file; filename="file2.html"
> Content-Type: text/html
>
> <!DOCTYPE html><title>Content of file2.html.</title>
>
> --BbC04y--
> --AaB03x--
>
>
> Encoded URL:
>
> http://example.org/?__wb_method=POST&__wb_post_data=--AaB03x%0AContent-Disposition%3A%20form-data%3B%20name%3D%22submit-name%22%0A%0AExample%0A--AaB03x%0AContent-Disposition%3A%20form-data%3B%20name%3D%22files%22%0AContent-Type%3A%20multipart%2Fmixed%3B%20boundary%3DBbC04y%0A%0A--BbC04y%0AContent-Disposition%3A%20file%3B%20filename%3D%22file1.txt%22%0AContent-Type%3A%20text%2Fplain%0A%0AContent%20of%20file1.txt.%0A%0A--BbC04y%0AContent-Disposition%3A%20file%3B%20filename%3D%22file2.html%22%0AContent-Type%3A%20text%2Fhtml%0A%0A%3C%21DOCTYPE%20html%3E%3Ctitle%3EContent%20of%20file2.html.%3C%2Ftitle%3E%0A%0A--BbC04y--%0A--AaB03x--%0A

#### Encoding a JSON request body

The request MUST be parsed as JSON ([RFC 8259][11]) and then apply the following algorithm with an empty string as the initial value of *name*.

To **encode a JSON *value***, given a *name* and an initially-empty map *nameCounts* of strings to integers:

1. If *value* is a JSON object:
1. Recursively encode each member of the object passing member's name as *name* and the member's value as *value*.
2. If *value* is a JSON array:
1. Recursively encode each element of the array passing the current value of *name* as
*name* and the value of the element as *value*.
3. Otherwise:
1. Define the string *encodedValue* as:
1. If *value* is JSON true then the string "true".
2. If *value* is JSON false then the string "false".
3. If *value* is JSON null then the string "null".
4. If *value* is a JSON string then the result of **percent plus encoding** the string.
5. If *value* is a JSON number then the number as a string consistent with the output of JavaScript's toString() method for the number.
2. If *nameCounts* contains the integer *count* for *name*:
1. Increment *count* by 1.
2. Store *count* as the new count for *name* in *nameCounts*.
3. Append the string "&*name*.*count*_=*encodedValue*" to the output.
3. Otherwise, if *nameCounts* does not contain *name*:
1. Store the integer 1 in *nameCounts* for *name*.
2. Append the string "&*name*=*encodedValue*" to the output.

The resulting query string will contain encoded key/value pairs of each leaf node of the JSON body.

> **Example**
>
> Original request:
>
> POST /events HTTP/1.0
> Host: example.org
> Content-Type: application/json
>
> {
> "type": "event",
> "id": 44.0,
> "float": 35.7
> "values": [true, false, null],
> "source": {
> "type": "component",
> "id": "a+b&c= d",
> "values": [3, 4]
> }
> }
>
> Encoded URL:
>
> http://example.org/events?__wb_method=POST&type=event&id=44&float=35.7&values=true
> &values.2_=false&values.3_=null&type.2_=component&id.2_=a%2Bb%26c%3D+d
> &values.4_=3&values.5_=4

## Appendix

### Percent plus encoding

To **percent plus encode a string**, first encode it as UTF-8 and then **percent plus encode** the resulting byte sequence.

To **percent plus encode a byte sequence**, for each byte in the input sequence:

1. If the byte falls within the following ASCII character ranges, append it to the output as is.

`'0'-'9', 'a'-'z', 'A'-'Z', '-', '.', '_', '~'`

2. If the byte is the ASCII space character (' '), append the ASCII plus character ('+') to the output.

3. Otherwise, append ASCII percent character ('%') to the output and followed by the value of the byte formatted as two uppercase hexadecimal digits.

> **Compatibility Note**
>
> Prior to Python 3.7 the character "~" was percent encoded.
>
> Older versions of ([pywb][12]) and ([warcio.js][13]) had slight discrepencies in the query strings they output for the same request data. For instance, pywb wrote Pythonic values for some values (`True`, `False`, `None`) rather than native JSON values (`true`, `false`, `null`), and warcio.js handled nested JSON differently than pywb. As of the publication of this specification, all current versions of Webrecorder software should behave identically.


[1]: https://www.rfc-editor.org/rfc/rfc2119
[2]: https://www.rfc-editor.org/rfc/rfc8174
[3]: https://iipc.github.io/warc-specifications/
[4]: https://specs.webrecorder.net/wacz/latest/
[5]: https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/
[6]: https://specs.webrecorder.net/cdxj/0.1.0/
[7]: https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/amf.py
[8]: https://github.com/webrecorder/pywb/blob/main/pywb/warcserver/test/test_amf.py
[9]: https://tools.ietf.org/html/rfc4648
[10]: https://datatracker.ietf.org/doc/html/rfc2388
[11]: https://www.rfc-editor.org/rfc/rfc8259
[12]: https://github.com/webrecorder/pywb
[13]: https://github.com/webrecorder/warcio.js