<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Webrecorder Open Source Tools</title>
<meta name="description" content="Webrecorder Open Source Tools">
<meta name="author" content="Ilya Kreymer">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui">
<link rel="stylesheet" href="css/reveal.css">
<link rel="stylesheet" href="css/theme/white_styled.css" id="theme">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,500" rel="stylesheet">
<!-- Code syntax highlighting -->
<link rel="stylesheet" href="lib/css/zenburn.css">
<!-- Printing and PDF exports -->
<script>
var link = document.createElement( 'link' );
link.rel = 'stylesheet';
link.type = 'text/css';
link.href = window.location.search.match( /print-pdf/gi ) ? 'css/print/pdf.css' : 'css/print/paper.css';
document.getElementsByTagName( 'head' )[0].appendChild( link );
</script>
<style>
ul {
text-align: left !important;
display: block !important;
}
.clear-both {
clear: both;
}
.footer-links {
display: inline-block;
margin-left: 20px;
vertical-align: middle;
line-height: 1.5em;
font-family: sans-serif;
font-weight: bold;
}
.reveal pre {
line-height: 0.9em;
box-shadow: none;
}
.redh {
/*text-transform: uppercase;*/
text-align: left;
color: crimson !important;
}
.greenh {
/*text-transform: uppercase;*/
color: green !important;
}
.imgsqr {
width: 128px;
height: 128px;
}
.left {
text-align: left;
font-size: 80% !important;
margin: 0px;
left: -300px;
position: absolute;
margin: 0px;
}
.url {
font-family: courier, monospace !important;
font-size: 90%;
}
.pad-top {
margin-top: 60px !important;
}
.footer-text {
font-size: 28px;
margin-left: 50px;
font-family: sans-serif;
display: inline;
}
</style>
<!--[if lt IE 9]>
<script src="lib/js/html5shiv.js"></script>
<![endif]-->
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="slides">
<section>
<h3>Webrecorder: Open Source Web Archiving Toolset</h3>
<h4>Code4Lib, 2019</h4>
<p style="margin-top: 100px">Ilya Kreymer, Webrecorder Lead Developer</p>
<p>@IlyaKreymer @webrecorder_io</p>
</section>
<section>
<h3>What is Webrecorder?</h3>
<ul>
<li class="fragment">A set of FOSS tools for creating and viewing web archives</li>
<li class="fragment">A free, hosted service running on <a href="https://webrecorder.io/">https://webrecorder.io/</a></li>
<li class="fragment">Supports anonymous capture and user account system</li>
<li class="fragment">Browser-based capture and access, focused on high fidelity</li>
<li class="fragment">Goal is web archiving for all!</li>
<li class="fragment">Stewarded by Rhizome, an arts non-profit in NYC</li>
<li class="fragment">Team of six working on Webrecorder</li>
<li class="fragment">Supported by two grants from the Mellon Foundation</li>
</ul>
</section>
<section>
<h3>How is Webrecorder different?</h3>
<ul>
<li class="fragment">Traditional web archiving is crawler based</li>
<p><img class="fragment" style="float: right" src="img/heritrix.gif"/></p>
<li class="fragment">Crawler loads URLs, starting with a list of 'seeds'</li>
<li class="fragment">Parses HTML to find more URLs to crawl</li>
<li class="fragment">Store HTTP traffic in lossless format (WARC)</li>
<li class="fragment">Easy to parse HTML == Fast!</li>
<li class="fragment">Easy to crawl lots of content in bulk!</li>
<li class="fragment redh">Mostly inadequate for modern websites, because</li>
<p><img class="fragment imgsqr" src="img/JavaScript-logo.png"/></p>
</ul>
</section>
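<section data-markdown>
<script type="text/template">
### Sketch: what a link-following crawler sees

A minimal sketch, not how any particular crawler is implemented, of the "parse HTML to find more URLs" step. The `LinkExtractor` class, the `extract_links` helper, and the sample page are made up for illustration. The second link only becomes navigable once JavaScript runs, so a parser that reads static markup never finds it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href> attributes in static HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page: one plain link, one target that only a script can follow.
PAGE = """
<a href="/about">About</a>
<div data-target="/gallery" onclick="go(this)">Gallery</div>
"""

links = extract_links("http://example.com/", PAGE)
print(links)
```

Only `/about` is discovered; `/gallery` needs a real browser to click through.
</script>
</section>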
<section>
<h3>What's a WARC?</h3>
<ul>
<li class="fragment">Standardized (ISO) file format for web archives</li>
<li class="fragment">Concatenated byte-level capture of each HTTP (1.x) request and response</li>
<li class="fragment">Optional metadata records (no set standard)</li>
<li class="fragment">Webrecorder produces standard WARCs</li>
</ul>
</section>
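<section data-markdown>
<script type="text/template">
### Sketch: WARC records by hand

A rough, standard-library-only sketch of the record layout described above: a version line, named headers, a blank line, then the raw HTTP bytes. The `make_record` and `read_records` helpers, the date, and the sample traffic are assumptions for illustration; real tools such as warcio also handle digests, extra record types, and other details omitted here.

```python
import gzip
import uuid


def make_record(rec_type, uri, date, block):
    """Serialize one WARC record: version line, headers, blank line, block."""
    headers = [
        ("WARC-Type", rec_type),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date", date),
        ("WARC-Target-URI", uri),
        ("Content-Type", "application/http; msgtype=%s" % rec_type),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.1\r\n" + "".join("%s: %s\r\n" % h for h in headers) + "\r\n"
    # Every record is terminated by two CRLFs after its block.
    return head.encode("utf-8") + block + b"\r\n\r\n"


def read_records(data):
    """Yield (headers, block) for each record in uncompressed WARC bytes."""
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        headers = dict(line.split(": ", 1) for line in lines[1:])
        start = head_end + 4
        length = int(headers["Content-Length"])
        yield headers, data[start:start + length]
        pos = start + length + 4  # skip the trailing CRLF CRLF


# Made-up HTTP/1.1 traffic for a single request/response pair.
request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
response = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 2\r\n\r\nhi"

# A .warc.gz is conventionally one gzip member per record, concatenated.
warc_gz = b"".join(
    gzip.compress(make_record(kind, "http://example.com/", "2019-02-20T00:00:00Z", body))
    for kind, body in [("request", request), ("response", response)]
)

records = list(read_records(gzip.decompress(warc_gz)))
for headers, block in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(block))
```

The captured bytes come back unchanged, which is what makes the format lossless.
</script>
</section>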
<section>
<h2><a href="https://webrecorder.io/" target="_blank">Webrecorder Demo!</a></h2>
<p class="fragment">Remote Browser Example Links:</p>
<ul>
<li class="fragment"><a target="_blank" href="http://www.smithsonianeducation.org/idealabs/collecting/flashdetect.html">Flash Page live</a></li>
<li class="fragment"><a target="_blank" href="https://webrecorder.io/record/$br:firefox:57/http://www.smithsonianeducation.org/idealabs/collecting/flashdetect.html">Flash page in Webrecorder using Firefox 57</a></li>
<li class="fragment"><a target="_blank" href="https://webrecorder.io/record/$br:firefox:57/homestarrunner.com/">Another Flash Example</a></li>
<li class="fragment"><a target="_blank" href="https://webrecorder.io/demo/java/list/bookmarks/b1/20170505193641$br:firefox:49/http://sites.math.rutgers.edu/~sontag/336/brownian-applet.html">Java Applet!</a></li>
</ul>
</section>
<section>
<h3>Web archiving != Archiving the entire web</h3>
<ul>
<li class="fragment">Web archives can be small</li>
<li class="fragment">Web archives can contain bounded objects</li>
<li class="fragment">Quality over quantity</li>
<li class="fragment">You can run Webrecorder at your institution today</li>
</ul>
</section>
<section>
<h3>The Webrecorder Stack</h3>
<ul>
<li class="fragment">Componentized Architecture</li>
<li class="fragment">Python and JS</li>
<li class="fragment">Lots of tools for developers</li>
<li class="fragment"><a href="https://github.com/webrecorder">https://github.com/webrecorder</a></li>
</ul>
</section>
<section>
<h3>What if I just want to read/write WARC files?</h3>
<h4 class="fragment">warcio</h4>
<ul>
<li class="fragment">package for creating and reading WARC files</li>
<li class="fragment">Make a WARC in 4 lines of Python:</li>
<pre class="left fragment">
from warcio.capture_http import capture_http
import requests
with capture_http('example.warc.gz', warc_version='1.1'):
    requests.get('https://example.com/')
</pre>
<li class="fragment">Code: <a href="https://github.com/webrecorder/warcio">warcio</a></li>
</ul>
</section>
<section>
<h3>pywb</h3>
<h4>Python Wayback / Web Archive Toolkit</h4>
<ul>
<li class="fragment">Core "engine" powering Webrecorder</li>
<li class="fragment">Create and view WARCs through the browser, via rewritten URLs or an HTTP/S proxy</li>
<li class="fragment">Docs: <a href="https://pywb.readthedocs.io">pywb.readthedocs.io</a></li>
<li class="fragment">Code: <a href="https://github.com/webrecorder/pywb">pywb</a></li>
</ul>
</section>
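<section data-markdown>
<script type="text/template">
### Sketch: rewritten URLs

A small sketch of the "rewritten URL" idea: the original URL is nested inside an archive path, so links on a replayed page keep pointing back into the archive. The `archive_url` and `split_archive_url` helpers are hypothetical, not pywb functions. The prefix, collection name, and `/record/` segment are taken from this deck's examples, and the 14-digit timestamp segment is an assumption modeled on the timestamp in the demo link.

```python
import re

# Assumed local server address, as used in the command-line examples.
PREFIX = "http://localhost:8080"


def archive_url(collection, url, timestamp=None, record=False):
    """Wrap an original URL in an archive path (record, timestamped, or latest)."""
    parts = [PREFIX, collection]
    if record:
        parts.append("record")
    elif timestamp:
        parts.append(timestamp)
    return "/".join(parts) + "/" + url


def split_archive_url(rewritten):
    """Recover (collection, mode, original URL) from a rewritten URL."""
    pattern = re.escape(PREFIX) + r"/([^/]+)/(?:(record|\d{14})/)?(https?://.*)"
    match = re.fullmatch(pattern, rewritten)
    if match is None:
        raise ValueError("not a rewritten archive URL: %r" % rewritten)
    collection, mode, original = match.groups()
    return collection, mode or "latest", original


record_url = archive_url("my-web-archive", "http://example.com/", record=True)
replay_url = archive_url("my-web-archive", "http://example.com/", timestamp="20170505193641")
print(record_url)
print(replay_url)
print(split_archive_url(replay_url))
```

Proxy mode avoids this nesting: the browser requests the original URL and the proxy serves it from the archive.
</script>
</section>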
<section>
<h3>What if I want to archive through the browser?</h3>
<ul>
<li>Create a web archive of a page with a 4-line script:</li>
<pre class="left fragment">
pip install pywb
wb-manager init my-web-archive
wayback --proxy my-web-archive --proxy-record --live
google-chrome http://localhost:8080/my-web-archive<span class="redh">/record/</span>http://example.com/
</pre>
<li class="fragment">OR</li>
<pre class="left fragment">
google-chrome --proxy-server=http://localhost:8080 https://example.com/
</pre>
</ul>
</section>
<section>
<h3>What if I want to host a wayback machine/provide access?</h3>
<ul>
<li>View an archive with a 4-line script:</li>
<pre class="left fragment">
pip install pywb
wb-manager init my-web-archive
wayback --proxy my-web-archive
google-chrome http://localhost:8080/my-web-archive/http://example.com/
</pre>
<li class="fragment">OR</li>
<pre class="left fragment">
google-chrome --proxy-server=http://localhost:8080 https://example.com/
</pre>
</ul>
</section>
<section>
<h3>What if I want a simple desktop app for users to browse a web archive?</h3>
<h4 class="fragment">Webrecorder Player</h4>
<img class="fragment imgsqr" src="img/WRlogo.png">
<ul>
<li class="fragment">Electron desktop app for macOS, Windows, Linux</li>
<li class="fragment">Open and browse any WARC file locally, offline</li>
<li class="fragment">UI consistent with webrecorder.io</li>
<li class="fragment"><a href="https://github.com/webrecorder/webrecorder-player/releases">Released via GitHub</a></li>
<li class="fragment">Code: <a href="https://github.com/webrecorder/webrecorder-player">webrecorder-player</a></li>
</ul>
</section>
<section>
<h3>What if I want a specific browser, e.g. with Flash?</h3>
<h4>Remote Browser System</h4>
<ul>
<li class="fragment">Docker containers each containing a web browser</li>
<li class="fragment">Originally developed for <a href="http://oldweb.today/">oldweb.today</a></li>
<li class="fragment">Preserving browsers with Flash, even Java</li>
<li class="fragment">Several versions of Chrome, Firefox</li>
<li class="fragment">Access via VNC + WebRTC</li>
<li class="fragment">Lots of Code: <a href="https://github.com/oldweb-today">github.com/oldweb-today</a></li>
<li class="fragment">Docs still needed</li>
<li class="fragment">Another Browser: <a href="http://54.164.112.170:9020/view/www/http://info.cern.ch/hypertext/WWW/TheProject.html" target="_blank">demo</a></li>
</ul>
</section>
<section>
<h3>What if I want to make a custom behavior?</h3>
<h4 class="fragment">Webrecorder Behaviors</h4>
<ul>
<li class="fragment">Working on an extensible per-site behavior system</li>
<li class="fragment">Will provide a JS library for building behaviors</li>
<li class="fragment">API and documentation coming soon</li>
<li class="fragment">Talk to us about beta-testing!</li>
<li class="fragment">Code: <a href="https://github.com/webrecorder/wr-behaviors">wr-behaviors</a></li>
</ul>
</section>
<section>
<h3>What if I want to try it all?</h3>
<h4 class="fragment">webrecorder/webrecorder</h4>
<ul>
<li class="fragment">Full system running on webrecorder.io</li>
<li class="fragment">Containerized deployment with Docker Compose</li>
<li class="fragment">Adds user, collection management, friendly UI</li>
<li class="fragment">Integrates Remote Browser System</li>
<li class="fragment">API Backend, React Frontend</li>
<li class="fragment">Code: <a href="https://github.com/webrecorder/webrecorder">webrecorder</a></li>
</ul>
</section>
<section>
<h3>Any other tools?</h3>
<ul>
<li class="fragment"><a href="https://github.com/webrecorder/webrecorder-deploy">webrecorder-deploy</a> -- Ansible playbook for Webrecorder deployment</li>
<li class="fragment"><a href="https://github.com/webrecorder/warcit">warcit</a> -- turn files on disk into WARCs</li>
<li class="fragment"><a href="https://github.com/webrecorder/har2warc">har2warc</a> -- convert HAR files into WARCs</li>
</ul>
</section>
<section>
<h3>Have we solved web archiving?</h3>
<p style="text-align: left" class="fragment">Unfortunately, many challenges remain:</p>
<ul>
<li class="fragment">Websockets</li>
<li class="fragment">HTTP/2</li>
<li class="fragment">Dynamic History / Single Page Apps</li>
<li class="fragment">Non-deterministic behaviors (e.g. time-dependent scripts)</li>
</ul>
<p class="greenh fragment">Want to help? Contributions Welcome!</p>
</section>
<section>
<h2>Thank you</h2>
<h3>Q &amp; A</h3>
<br/>
<h5>Contact:<br/><a href="mailto:[email protected]">[email protected]</a></h5>
</section>
</div>
</div>
<!-- Footer -->
<div style="position:absolute; bottom:1%; left:2%; width:100%; vertical-align: middle">
<a href="https://webrecorder.io/"><img style="vertical-align: middle; width: auto; height: 90px" src="img/webrecorder-logo-vector.svg"></a>
</div>
<script src="lib/js/head.min.js"></script>
<script src="js/reveal.js"></script>
<script>
// Full list of configuration options available at:
// https://github.com/hakimel/reveal.js#configuration
Reveal.initialize({
controls: true,
progress: true,
history: true,
center: false,
margin: 0.05,
transition: 'slide', // none/fade/slide/convex/concave/zoom
// Optional reveal.js plugins
dependencies: [
{ src: 'lib/js/classList.js', condition: function() { return !document.body.classList; } },
{ src: 'plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
{ src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } },
{ src: 'plugin/zoom-js/zoom.js', async: true },
{ src: 'plugin/notes/notes.js', async: true }
]
});
</script>
</body>
</html>