
Intertwingler — An Engine for Dense Hypermedia

This software is an engine for creating dense hypermedia networks. Dense hypermedia is what the Web, out of the box, is not. The Web is sparse hypermedia: big, long documents, with few links aside from things like navigation and footers. Dense hypermedia is all about short resources connected by lots of links. One example of dense hypermedia is the personal knowledge management system, colloquially known as a tool for thought. Another is the knowledge graph. The goal of this product is to support those categories of application, and others — perhaps even art or literature.

One important success criterion is to eliminate the mundane aspects of "building a website", and otherwise get out of the way.

This is very much a speaking artifact: Since the ultimate goal is to create better conditions for developing dense hypermedia on the Web by retrofitting it with the capabilities of systems that preceded it (real and imagined), there are a number of subsidiary problems that need to be solved, and this system implements concrete ways to solve them.

Links are first-class citizens

Before we can do anything related to dense hypermedia, we have to solve for link rot. The median URL has a lifespan that can be measured in weeks. If you have orders of magnitude more addressable resources under management than the median website, this kind of performance is a non-starter. Link rot doesn't need to happen (at least as long as one keeps paying the domain name bill), but what does need to happen in order to fix it is a radical rethinking of how Web-based software is made. This system shows how to do it.

My own site, which admittedly has only been on this system since I made the latter in 2018, nevertheless still serves every URL it has ever exposed, dating back to the summer of 2008. I also use it for my client extranets, and my book project, The Nature of Software.

One ongoing criticism of the Web by Ted Nelson, who coined the term hypertext in the 1960s (not to mention what it means to intertwingle), is that links only go in one direction: without extra apparatus, you can't see what links to you. It turns out the apparatus for displaying backlinks is the same apparatus as the one for eliminating link rot.

Links (and resources) have different species

The Web has three kinds of links: the conventional arc that, when activated, (typically) completely replaces the representational state (both <a> and forms), what I would characterize as a "naïve embed" — images, A/V, and iframe documents — and non-printing metadata. Earlier systems had all kinds of other links besides, like stretchtext, conditional display, and proper, seamless transclusion. These of course can all be done on the Web, but the solutions are suboptimal. In particular, the embedded metadata that drives these capabilities tends to be ad hoc and mutually incompatible, making it single-purpose for some particular UI framework or other. Many content management systems, moreover, have a concept of content type, but few systems — even sophisticated PKM systems — have a concept of link type (as in, precisely what the link means). It's the link types in conjunction with the content types that make it possible to derive how they ought to be rendered in the user interface.

Don't copy what you can reference

One perennial problem of informational content, whether on the Web or even digital at all, is keeping it up to date. A necessary condition for keeping content up to date is ensuring that there is precisely one authoritative copy of it.

The key word here of course is authoritative. We will invariably need multiple copies for things like cache and backups, but having exactly one copy that drives all the others is absolutely indispensable.

This principle can be extended to resources which can be modeled as functions of other resources, for example the HTML that corresponds to a Markdown document, or a cropped and/or resized image. Explicitly modeling these as transformations shrinks the footprint of original content to be managed.
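
As a back-of-the-envelope illustration of that principle (a sketch only, assuming the kramdown gem; none of this is Intertwingler's actual machinery), the HTML rendering of a Markdown document can be computed on demand and cached against a digest of its source, so the Markdown remains the single authoritative copy:

```ruby
# Sketch only: treat the HTML as a pure function of the (authoritative) Markdown.
# The cache key is a digest of the source, so a stale derivative can never
# masquerade as a current one. Names here are illustrative, not Intertwingler's.
require 'kramdown' # assumed Markdown processor; any renderer would do
require 'digest'

DERIVED = {} # stand-in for a real content-addressable cache

def html_for(markdown)
  key = Digest::SHA256.hexdigest(markdown)
  DERIVED[key] ||= Kramdown::Document.new(markdown).to_html
end

puts html_for("# One copy\n\nEverything else is *derived*.")
```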

Finally, for content to be reusable it must be finely addressable, with durable addresses at both the document and sub-document level.

Standard interfaces & data transparency

With this system we're trying to imagine what it means to be a "model citizen" on the Web: a reliable source of clear, actionable information. This not only entails everything already discussed, but also requires that:

  • structured, machine-actionable data is available for every resource,
  • interfaces are standard, in data semantics as well as syntax, so as not to require custom API adapters,
  • a user (with sufficient authority) can export 100% of the system's instance data, and that data means something to other systems.

Layered system, clear development targets

This system anticipates being situated in a heterogeneous operating environment, sharing space with other programming languages and frameworks. Indeed, this engine can be thought of as a "language bus" that marshals all things Ruby. The design is intended to be copied to other programming languages, and these systems are expected to interoperate in a daisy-chain-like configuration.

Every component in this system, including the central piece that does the routing, is implemented as a Rack handler, which could ultimately be run as a stand-alone microservice. The handlers further subdivide into two subspecies:

  • content handlers, which either originate information resources or proxy them from somewhere else, and
  • transforms, which manipulate HTTP requests or responses in transit.

Since every building block in the system is a potentially stand-alone Rack component, the language spoken between them is nominally HTTP. This not only makes for extremely well-defined development targets — you get a request and return a response — but also means the system anticipates future segmentation, including, as mentioned, across different programming languages, machines, and runtimes.
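
To make that development target concrete, here is a minimal sketch of a content handler written directly against Rack. The class name and content are invented for illustration; this is not one of Intertwingler's actual handlers, only the request-in, response-out contract they all share:

```ruby
# Minimal sketch of a content handler as a plain Rack endpoint: take a
# request in, hand a response back. Class and content are illustrative.
require 'rack'

class HelloHandler
  def call(env)
    req = Rack::Request.new(env)
    res = Rack::Response.new
    res['Content-Type'] = 'text/html'
    res.write("<h1>hello from #{Rack::Utils.escape_html(req.path)}</h1>")
    res.finish # => the [status, headers, body] triple Rack expects
  end
end

# run HelloHandler.new  # e.g. in a config.ru, then serve it with `rackup`
```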

I should note that HTTP communication within the process space of a particular runtime is simulated, so we don't waste resources unnecessarily re-parsing and serializing. I also have a rudimentary sub-protocol in the works for specific constraints on how these components, particularly the transforms, are expected to behave.

History

This module began life as a thing called RDF::SAK, or the Content Swiss Army Knife. After positing the notion of a content management meta-system, I made an initial cut in 2018, to support some work I was doing for a client. It quickly became a breadboard and/or test environment for developing what I just referred to as "good ideas about Web content", which I ultimately realized as a static website generator, in the same vein as Jekyll or 11ty. Since most of my work was around durable addressing and embedded metadata, a live engine was not a high priority. Priorities have since changed.

Five years prior to creating RDF::SAK, in 2013, I designed a protocol to aid in the development of Semantic Web applications called RDF-KV. It provides an extraordinarily simple mechanism for getting RDF statement deltas (i.e., commands to add and/or remove statements) from a Web client to a graph database on the server, with a minimum of moving parts (i.e., no JavaScript). To test the implementation, I needed a complete vocabulary, so I used the IBIS vocabulary I had written a year earlier, and created a tool called App::IBIS. This tool turned out to be useful, but limited in its capacity for expansion, because it was written in Perl, which does not have an RDF reasoner, a piece of software that is both highly abstract and difficult to write (Ruby happens to possess a rudimentary yet satisficing one). Without a reasoner, App::IBIS was much too sclerotic to develop very far past the initial prototype.

App::IBIS is unambiguously a dense hypermedia application, and developing it meant generating a lot of markup that was thick with embedded RDFa metadata. This led me to create a family of terse markup generators (Perl, Ruby, JavaScript) with some nice advantages over their incumbents. Working extensively with RDFa helped develop technique for reusing the embedded metadata for directing presentation markup, as well as providing the basis for CSS selectors in both HTML and SVG.

The plan for RDF::SAK was always to turn it into a live engine that could be accessed and updated online. Nevertheless, due to its decidedly organic origins, it was (and still very much is) a huge mess that needed (and still needs) several rounds of intense refactoring. I began this work in December of 2021 but suspended it a few weeks later due to an injury, and this refactor had to take a back seat to other priorities for most of 2022. I decided early this year (2023) I was going to complete the overhaul no matter what, which, it later turned out, the Summer of Protocols organizers have graciously elected to sponsor.

I have also gotten some interest, beginning last year, in the use of IBIS as a planning tool. Part of the impetus for getting RDF::SAK to a state where it can take over from the torpid App::IBIS is that I have an entire project planning framework based on IBIS for which any tooling will need a more flexible substrate. I am also grateful for my clients who support this development.

This project also represents a confluence of over two decades of work on the Web. What is now called Intertwingler closely tracks a design I sketched out back in 2006 for a "Web substrate", with the intent of decoupling functionality that generates content from that which manipulates it, on the premise that separating the two would result in both ending up markedly simpler. This design drew on technique I had developed at my first tech job back in 1999.

During my night shifts in 1999-2000 as a baby system administrator, I had a lot of time to mess around with mod_perl, the Perl bindings for the Apache API. One thing you learn when you work directly with a server API is that almost all Web development happens in a tiny corner (the response handler or content handler) of what you are able to address. It turns out there are several other places one can manipulate both the request and response (header twiddling, URL rewriting, access control, filters — though filters came a couple of years later) that are orthogonal to the actual application. Indeed, many Web application frameworks recapitulate this structure within their own confines, and the result is undoubtedly a whole lot of redundant code.

I had had a personal site from 1998 to about 2003, and by 2008 I was ready to put one up again. It was around this time I had realized that one could use XSLT (which I had picked up in 2001) to transform (X)HTML into itself, meaning it could be used in the browser as an extremely lazy Web template engine that does its page composition at the network level. This means you can mix content sources on the server side, which can be any mixture of static or dynamic content written in any programming language or framework you like, since all communication happens using standard protocols and data formats. This is a technique I have used and expanded on for the last 15 years.

Specifically, I have written an RDFa query engine (2016) and seamless transclusion mechanism (2018). While XSLT is still actively developed and used in publishing outside the Web, I am somewhat concerned about its future as a native capability in the browser. XML is irredeemably out of fashion in mainstream Web circles (despite ostensibly having been reinvented as "custom elements"), but in my opinion XSLT is, for reasons too numerous to articulate here, unparalleled in its ability to manipulate markup — which is why I continue to use it. Indeed, a compact, easier-to-type XSLT syntax similar to RelaxNG may be enough to renew interest in it. I should note that the use of XSLT is not strictly necessary; you could probably accomplish the same effect using (a lot more) JavaScript.

When I went to put the site up in 2008, I was keenly interested in creating dense hypermedia (though I would coin that term much later). I wanted to convey information without forcing the audience to read any more than they had to. The constraints were:

  1. that no page should be so long that it scrolls (on an average desktop monitor),
  2. any digressions, footnotes or parenthetical remarks would be hived off to their own page and linked,
  3. no 404 errors — URLs do not get exposed to the wild until there is something at that location.

These constraints made it very difficult to operate. For one, having to stop and think up a URL for a page because you happened to digress a bit in the page you were just writing (which, since URLs tend to track with titles, ultimately meant coming up with a title) is a jarring context switch of considerable cognitive overhead. Moreover, this would mean an exponential jump in workload, because the digressions would invariably generate their own digressions, and since nothing could ship until all of it was complete (or at least roughed in), it would take forever to do anything. Notwithstanding, I got about 40 pages into what I called a Resource Handling and Representation Policy done in this style, before I gave up and decided to just write essays.

This policy manual actually worked out a number of design decisions that are still perfectly valid fifteen years later, and have made their way into the Intertwingler.

The experience of writing this policy promptly moved me to start thinking about a mechanism that would enable information resources to be stored under canonical identifiers (specifically UUIDs and cryptographic hashes) that traded off legibility for being durable, and overlay human-friendly addresses on top. It would likewise track changes to these addresses, try to fix errors, and ensure all URLs on a domain that have ever been exposed to the wild route to something. I got this subsystem to finally work in RDF::SAK in 2019, and it remains present in Intertwingler.

The need to solve the same problem for fragment identifiers led me (apparently back in 2012) to invent a compact UUID representation which I am (slowly) trying to get graduated into an RFC.

The state mechanism for this URL naming history is a content inventory vocabulary I began roughing in around 2010. This was originally conceived as a data storage and exchange format for website content inventories, but has since become a catch-all, including a structure for holding quantitative metrics to help content strategists apprehend the contours of websites (developed in 2011), and a sophisticated set-theoretic mechanism for modeling audiences, and pairing (or anti-pairing) them with content (2019).

As I mentioned above, the Intertwingler engenders an ultimately simpler system by decoupling the generation of content from its subsequent downstream manipulation. I had sketched out how this was going to work as far back as 2008, along with a couple ill-fated prototypes. It wasn't until a project in 2020 though that I completed a Transformation Functions Ontology (started in 2014) and concomitant infrastructure that would resolve transformation functions, apply them to content, and cache their results. This infrastructure depends on earlier work on content-addressable stores (Perl in 2013, Ruby in 2019), that use RFC6920 ni: URIs, making them compatible with RDF.

On a similar tack, I also explored creating a registry for query parameters (2015), with the triple purpose of parsing and validating input, generating round-trip-stable query strings, and facilitating the creation of organization-wide policy for the names, types, and semantics of query parameters. I do not currently have a Ruby port of this particular software, but I will probably eventually make one, along with an RDF vocabulary as an extension to the transformation one for expressing the configuration.

And so, the Intertwingler is an odyssey spanning over two decades, which is fitting, since its ultimate goal is to retrofit the Web with the capabilities of its hypermedia predecessors.

Anatomy

As I have hopefully communicated, the Intertwingler is in a state of absolute disarray, still undergoing its metamorphosis from the less-ambitious and much more organic RDF::SAK. I have tried to outline some of the more important modules; those I have left out are either not very interesting (such as the generated vocabularies under Intertwingler::Vocab) or slated for removal. Checkmarks on the bullet points indicate the modules are complete enough to use.

Essential components

  • Intertwingler::GraphOps is a mix-in that extends RDF::Queryable with the all-important inferencing operations.
  • Intertwingler::Resolver is the also-all-important URI resolver.
  • Intertwingler::Representation is a cheap knockoff of a monad-like structure that enables parsed, in-memory representations of content to persist across successive transformations, so they don't get unnecessarily serialized and reparsed.
  • Intertwingler::Document houses (mostly) context-free markup generation (though it may be dissipated into other modules).

Engine components

Everything in the engine, including the engine itself, is an Intertwingler::Handler that accepts a Rack::Request and returns a Rack::Response, plus an embedded adapter so it can be used directly as a stand-alone Rack application.

Handler

Transforms

An Intertwingler::Transform is a specialized Intertwingler::Handler that responds to POST requests to a single URI. I am still working out the details of a protocol but the general sense is you POST a payload and it returns the transformed payload back. When a transform is in the engine, this happens automatically by subrequest. Shortcuts are in place (via Intertwingler::Representation) for transformations that happen in the same process space.
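
As a rough sketch of what that could look like on the wire (the URI, port, and media types below are invented for illustration, since the protocol itself is still in flux), invoking a transform amounts to an ordinary HTTP POST:

```ruby
# Hypothetical invocation of a transform over HTTP: POST the payload to the
# transform's URI and read the transformed payload back. URI, port, and media
# types are made up for illustration; the real protocol is still being worked out.
require 'net/http'
require 'uri'

uri = URI('http://localhost:9292/transform/markdown-to-html') # hypothetical
req = Net::HTTP::Post.new(uri, 'Content-Type' => 'text/markdown')
req.body = "# Hello\n\nSome *Markdown* to transform."

res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
puts res['Content-Type'] # e.g. text/html
puts res.body            # the transformed representation
```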

  • Intertwingler::Transform is the base, which also includes Intertwingler::Transform::Harness; the latter probably needs some work to bring it up to par with the rest of the system.
  • Intertwingler::Transform::Tidy for sanitizing/normalizing HTML via tidy.
  • Intertwingler::Transform::Nokogiri for transforming HTML/XML via Nokogiri;
    • Transform Markdown into HTML
    • Turn HTML into XHTML and vice versa
    • Strip comments
    • Reindent markup
    • Repair/"rehydrate" RDFa
      • Normalize RDFa prefixes
    • Mangle mailto: addresses (by whatever house style) to prevent spam
    • Insert stylesheet references
    • Rewrite links
    • Add backlinks
    • Add secondary links (e.g. glossary entries)
    • Add (e.g.) Amazon affiliate codes to amazon.com links
    • Add social media metadata (Google, Facebook, Twitter, whoever…)
  • Intertwingler::Transform::Vips for images via Vips.
    • Crop images
    • Resize (downward only due to potential denial of resources)
    • Desaturate
    • Posterize
    • etc…

"Offline" components

  • Intertwingler::CLI is the command line harness.
  • Intertwingler::Static is (to be) an "end cap" on the engine that performs the legacy static site generator function.
  • Intertwingler::DocStats gathers statistics about a corpus of documents.
  • Intertwingler::NLP is a very rudimentary natural language processor for extracting terminology (jargon, acronyms, proper nouns etc.) from a corpus of documents.
  • Intertwingler::URLRunner is planned as a generic crawler (eventually with some kind of Handler interface) for resolving link previews.

Documentation

API documentation, for what it's worth at the moment, can be found in the usual place.

Installation

For now I recommend just running the library out of its source tree:

```
~$ git clone git@github.com:doriantaylor/rb-intertwingler.git intertwingler
~$ cd intertwingler
~/intertwingler$ bundle install
```

Contributing

Bug reports and pull requests are welcome at the GitHub repository.

Copyright & License

©2018-2023 Dorian Taylor

This software is provided under the Apache License, 2.0.