-
Notifications
You must be signed in to change notification settings - Fork 1
/
high-fidelity.html
387 lines (361 loc) Β· 18.7 KB
/
high-fidelity.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>High Fidelity Web Archiving and pywb 2.x</title>
<meta name="description" content="High Fidelity Web Archiving and pywb 2.x">
<meta name="author" content="Ilya Kreymer">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="css/theme/white_styled.css" id="theme">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,500" rel="stylesheet">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="lib/css/zenburn.css">
<!-- CUSTOM STYLE -->
<link rel="stylesheet" href="css/theme/white_styled.css" id="theme">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,500" rel="stylesheet">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? 'css/print/pdf.css' : 'css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<style>
ul {
text-align: left !important;
display: block !important;
}
.clear-both {
clear: both;
}
.footer-links {
display: inline-block;
margin-left: 20px;
vertical-align: middle;
line-height: 1.5em;
font-family: sans-serif;
font-weight: bold;
}
.reveal pre {
line-height: 0.9em;
box-shadow: none;
}
.redh {
/*text-transform: uppercase;*/
text-align: left;
color: crimson !important;
}
.greenh {
/*text-transform: uppercase;*/
color: green !important;
}
.left {
text-align: left;
font-size: 80% !important;
}
.url {
font-family: courier, monospace !important;
font-size: 90%;
}
.pad-top {
margin-top: 60px !important;
}
.footer-text {
font-size: 28px;
margin-left: 50px;
font-family: sans-serif;
display: inline;
}
</style>
<!--[if lt IE 9]>
<script src="lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="slides">
<section>
<h3>pywb 2.x technical overview</h3>
<h5>Or, everything you wanted to know about high fidelity web archiving, but were afraid to ask!</h5>
<p style="margin-top: 100px">Ilya Kreymer, Webrecorder Lead Developer</p>
</section>
<section>
<h3>What is High Fidelity?</h3>
<section>
<p class="fragment">(Not the movie)<br/>
<img src="img/high-fidelity-movie.jpeg" style="width: 25%; height: 25%">
</p>
</section>
<section>
<h4>High Fidelity Audio</h4>
<p>"The <i>reproduction</i> of sound with little distortion, giving a result very similar to the original."</p>
<p style="text-align: right">from <i>merriam-webster.com</i></p>
</section>
<section>
<h4>High Fidelity Web Archiving</h4>
<p class="fragment">The <i>reproduction</i> of a web page with little distortion, giving a result very similar to the <i>original</i>?</p>
<p class="fragment">The <i>reproduction</i> of a web page as it was originally designed to be experienced?</p>
<p class="fragment">What exactly is original?<br/>What is distortion of a web page?</p>
</section>
</section>
<section>
<h3>What is pywb?</h3>
<ul>
<li class="fragment">Core "engine" powering Webrecorder</li>
<li class="fragment">Fully <a href="https://github.com/webrecorder/pywb">open source suite of web archiving tools</a></li>
<li class="fragment">Originally designed as a new wayback machine, now on 2.1 release</li>
<li class="fragment">Includes High Fidelity Replay</li>
<li class="fragment">and High Fidelity Capture (Record Mode)</li>
<li class="fragment">Two developers at Rhizome, ~20 contributors over the years</li>
<li class="fragment">Docs: <a href="https://pywb.readthedocs.io">pywb.readthedocs.io</a></li>
<li class="fragment">Code: <a href="https://github.com/webrecorder/pywb">github.com/webrecorder/pywb</a></li>
</ul>
</section>
<section>
<h3>Why is "high fidelity" so hard?</h3>
<p style="text-align: left">A modern web site is generally:</p>
<ul>
<li class="fragment">Not a static document!</li>
<li class="fragment">Executes complex Javascript</li>
<li class="fragment">Renders differently at different sizes...</li>
<li class="fragment">...and on different browsers with different plugins</li>
<li class="fragment">Includes cacheing and cache-busting</li>
<li class="fragment">Responds to dynamic user input</li>
<li class="fragment">Change URL at any time dynamically</li>
<li class="fragment">Adapts to network latency and speed</li>
</ul>
<p class="fragment">Many web sites (applications) are <i>non-deterministic</i> == <br/>no exact original to archive!</i></p>
</section>
<section>
<h3>What can we do?</h3>
<p class="fragment">Try to reduce the non-determinism as much as we can!</p>
<p class="fragment">A few "tricks":</p>
<ul>
<li class="fragment">Capture pages using a web browser...</li>
<li class="fragment">...and Capture the browser!</li>
<li class="fragment">Rewrite Urls and Sandbox Javascript Environment</li>
<li class="fragment">Fuzzy Match Url Lookup</li>
<li class="fragment">Adapt to Adaptive Rendering and Streaming</li>
<li class="fragment">Preserve dynamic history changes</li>
</ul>
</section>
<section>
<h3>Browser Based Capture</h3>
<div class="fragment">Use same browser and tools for capture and replay:
<p class="left">Replay:</p><span>http://myarchive.example.com/coll/http://example.com/</span>
<p class="left">Capture:</p><span>http://myarchive.example.com/coll/<span class="redh">record</span>/http://example.com/</span>
</div>
<h4 class="fragment left pad-top">Capture the browser itself and run in containers and emulators<br/>(ex: <a href="http://oldweb.today">oldweb.today</a>)</h4>
</section>
<section>
<h3>Url Rewriting and JS Sandboxing</h3>
<section>
<p class="fragment">Traditional url rewriting:</p>
<p class="fragment">http://example.com/ -> http://myarchive.institution.org/replay/http://example.com</p>
<p class="fragment">But Javascript expects original url <i>http://example.com/</i> !</p>
</section>
<section>
<p class="fragment">"Trick" the browser or "sandbox" the original JS environment</p>
<ul>
<li class="fragment">Override many system JS functions to custom versions which:</li>
<li class="fragment"><i>Unrewrite</i> urls before passing them to user-defined functions</li>
<li class="fragment"><i>Rewrite</i> urls before passing to any network-related function</li>
</ul>
</section>
<section>
<p>A partial list of overrides:</p>
<ul class="fragment" style="font-family: courier, monospace !important">
<li>setAttribute, getAttribute</li>
<li>DOM Element setter/getter</li>
<li>AJAX Request/fetch()</li>
<li>DOM insert/append functions</li>
<li>Document/Window properties</li>
<li>HTML insertion/document.write</li>
<li>Web and Service workers</li>
<li>History and hash change</li>
<li>Cookie functions</li>
<li>postMessage</li>
</ul>
</section>
<section>
<p class="fragment">But some Javascript objects can't be overriden at all!</p>
<p class="fragment">Ex: <code>window</code>, <code>location</code>, <code>top</code></p>
<p class="fragment">Fortunately, they can be "masked" by replacing:</p>
<pre class="left fragment">
location.href = "http://example.com/"
</pre>
<p class="fragment">With:</p>
<pre class="left fragment">
// add a fake scope
{
let location = ...get fake location object
...
location.href = "http://example.com/"
}
</pre>
<p class="fragment">(Discovered and added to pywb by <i>@johnaberlin</i>)</p>
</section>
</section>
<section>
<h3>HTTP/S Proxy Mode</h3>
<p class="fragment">pywb can also be used as an HTTP and HTTPS proxy<br/>(for both capture and replay)</p>
<p class="fragment">More info in pywb docs: <a href="https://pywb.readthedocs.io/en/latest/manual/configuring.html#http-s-proxy-mode">HTTP/S Proxy Mode Setup</a></p>
<p class="fragment left">π Running in proxy mode avoids the need for url rewriting and JS sandboxing tricks! π</p>
<p class="fragment left">π But not any of the other replay/capture fidelity issues... π</p>
<p class="fragment left">Proxy Mode alone is no longer sufficient for high-fidelity capture and replay!</p>
</section>
<section>
<h3>Fuzzy Match Url Lookup</h3>
<p>Only match "significant" portion of URL, ignore remainder</p>
<section>
<p class="fragment"><span class="left">Captured: </span>http://example.com/some/path?<span class="redh">_=12345</span></p>
<p class="fragment"><span class="left">Replayed: </span>http://example.com/some/path?<span class="redh">_=67890</span></p>
<p class="fragment">Need to fuzzy match, ignore the <span class="redh">_=</span> (a timestamp param)</p>
<p class="fragment">How do we know what query params to ignore?</p>
</section>
<section>
<p class="fragment"><span class="left">Captured: </span>http://example.com/some/path?<span class="greenh">id=ABC</span>&__otherparam=abc&other=test</p>
<p class="fragment"><span class="left">Replayed: </span>http://example.com/some/path?__temp=testid&<span class="greenh">id=ABC</span>&__foo=234</p>
<p class="fragment">Only match the <span class="greenh">id=</span> query parameter</p>
<p class="fragment">pywb maintains a list of fuzzy matching rules</p>
</section>
</section>
<section>
<h3>Adaptive Rendering (Static)</h3>
<p>Browsers can load different resources based on screen size, pixel density, window size, and general device capabilities</p>
<section>
<p class="fragment">HTML and CSS can be added dynamcially by Javascript</p>
</section>
<section>
<p>CSS @media queries</p>
<pre>
/* If screen dimension is upto 1000px */
@media screen and (max-width: 1000px) {
body {
background-image: <span class="redh">large.png</span>
}
}
/* If screen dimension is 500px or less */
@media screen and (max-width: 500px) {
body {
background-color: <span class="redh">small.png</span>
}
}
</pre>
</section>
<section>
<p><img> srcset tag</p>
<pre>
<img src="image-src.png"
srcset="image-1x.png 1x,
image-2x.png 2x,
image-3x.png 3x,
image-4x.png 4x">
</pre>
</section>
<section>
<p class="fragment">Solution: <i>Try to load all options dynamically</i></p>
<p class="fragment">pywb 2.1 (and Webrecorder) have a new 'Auto Fetch' system for capturing all versions of a resource automatically during browser-based capture.</p>
<p class="fragment">More details in an upcoming blog post<br/> by John Berlin, our Senior Backend Developer</p>
</section>
</section>
<section>
<h3>Adaptive Video Streaming</h3>
<section>
<ul>
<li class="fragment">A video is split into discreet chunks on the server</li>
<li class="fragment">Server sends manifest of chunks, and possibly resolutions</li>
<li class="fragment">Aftre each chunk, browser picks one of many options based on network speed/latency</li>
<li class="fragment">DASH and HLS are two standards for this approach</li>
<li class="fragment">Next url is determined by current network conditions!</li>
<li class="fragment">Non-deterministic by design</li>
</ul>
</section>
<section>
Possible options:
<ol>
<li>Trick browser into not using adaptive streaming at all!</li>
<li>Use some other system to get video (eg. youtube-dl)</li>
<li>Rewrite the HLS or DASH manifest to include only one resolution</li>
</ol>
</section>
<section>
<p>DASH / HLS manifest Rewriting</p>
<ul>
<li class="fragment">For each video chunk, pick only one resolution (not necessarily highest)</li>
<li class="fragment">Store choice in WARC as extra WARC header</li>
<li class="fragment">Use same rewriting for capture and replay</li>
<li class="fragment">One resolution only == deterministic behavior!</li>
<li class="fragment">Allows capture and replay of streaming media</li>
</ul>
</section>
</section>
<section>
<h3>Dynamic History Changes</h3>
<p>History <i>pushState/replaceState</i> API allows manipulating currently displayed url
</p>
<ul>
<li class="fragment">Page can change URL to any other on same domain</li>
<li class="fragment">A URL shown in browser does not mean a URL was loaded over HTTP</li>
<li class="fragment">No way to save in WARC (yet)</li>
</ul>
<div class="fragment">Possible solution: <i>Store history changes in WARC?</i>
<p class="pad-top">(More info at WARC File Format workshop)</p></div>
</section>
<section>
<h3>Conclusions</h3>
<p class="fragment">Many "tricks" needed to reproduce a modern web page at <i>high-fidelity</i></p>
<p class="fragment">pywb solves some of these problems,<br/>but many more remain</p>
<p class="fragment">We need your help in working on these difficult issues!</p>
</section>
<section>
<h3>Next Steps</h3>
<p class="fragment">Continuous work on fidelity improvements is necessary.</p>
<p class="fragment">Automation of high fidelity web archiving!</p>
<p class="fragment">Explore overlap between software preservation and web archives.</p>
<p class="fragment left">Prototype combining web archives + browsers + sofware preservation: <br/>See: <a href="https://scalar.webrecorder.net">scalar.webrecorder.net</a>
</p>
</section>
<section>
<h2 style="margin-top: 100px">Webrecorder Automation Demo!</h2>
</section>
<section>
<h2>Thank you</h2>
<h3>Q & A</h3>
</section>
</div>
</div>
<!-- Footer -->
<div style="position:absolute; bottom:1%; left:2%; width:100%; vertical-align: middle">
<a href="https://webrecorder.io/"><img style="vertical-align: middle; width: auto; height: 90px" src="img/webrecorder-logo-vector.svg"></a>
</div>
<script src="lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: true,
progress: true,
history: true,
center: false,
margin: 0.05,
transition: 'slide', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: 'lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: 'plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: 'plugin/zoom-js/zoom.js', async: true },
{ src: 'plugin/notes/notes.js', async: true }
]
});
</script>
</body>
</html>