Slaw is a lightweight library for generating Akoma Ntoso 3.0 Act XML from plain text documents. It is used to power Indigo and uses grammars developed for the legal tradition in South Africa, although other traditions are supported.
Slaw allows you to:
- parse plain text and transform it into an Akoma Ntoso Act XML document
- unparse Akoma Ntoso XML into a plain-text format suitable for re-parsing
Slaw is lightweight because it wraps around a Nokogiri XML representation of the parsed document. It provides some support methods for manipulating these documents, but anything advanced must manipulate the XML directly.
Add this line to your application's Gemfile:
gem 'slaw'
And then execute:
$ bundle
Or install it with:
$ gem install slaw
The simplest way to use Slaw is via the command line:
$ slaw parse myfile.text --grammar za
Slaw generates Acts in the Akoma Ntoso 3.0 XML standard for legislative documents. It first parses plain text using a grammar and then generates XML from the resulting syntax tree.
Most by-laws in South Africa are available as PDF documents. You will therefore need to extract the text from the PDF first, using a tool like pdftotext. PDFs can produce oddities (such as oddly wrapped lines) and Slaw has a number of rules-of-thumb for correcting these. These rules are based on South African by-laws and may not be suitable for all regions.
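To make the idea of a rule-of-thumb concrete, here is a minimal sketch of one such heuristic for re-joining oddly wrapped lines. It is an illustration only, not Slaw's own code: the method name `unwrap_lines` and the specific rule (join when the previous line does not end in `.`, `:` or `;` and the next line starts with a lowercase letter) are assumptions made for this example.

```ruby
# Hypothetical helper, not part of Slaw: re-joins lines that look like they
# were wrapped in the middle of a sentence during PDF text extraction.
def unwrap_lines(text)
  joined = []
  text.each_line do |raw|
    line = raw.rstrip
    previous = joined.last
    # Assumed rule: the previous line has no closing punctuation and the
    # current line carries on in lowercase, so treat it as a continuation.
    continues = previous && !previous.empty? &&
                previous !~ /[.:;]\z/ && line =~ /\A[a-z]/
    if continues
      joined[-1] = "#{previous} #{line}"
    else
      joined << line
    end
  end
  joined.join("\n")
end

sample = "The owner of a dog must\nkeep it on a leash.\n(2) No person may\nfeed stray animals."
puts unwrap_lines(sample)
```

A rule like this is tuned to one style of document, which is why heuristics of this kind may need adjusting for other regions.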
The grammar is expressed as a Treetop grammar and has been developed specifically for the format of South African acts and by-laws. Grammars for other regions could be developed depending on the complexity of a region's formats.
The grammar cannot catch some subtleties of an act or by-law -- such as nested list numbering -- so Slaw performs some post-processing on the XML produced by the parser. In particular, it nests lists correctly.
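The nesting step can be pictured with a small sketch. This is not Slaw's implementation, which works on the generated XML; it only shows the general technique on plain Ruby data. The names `number_format`, `nest_items` and `print_tree` are hypothetical, and the classification is deliberately simplified: `(i)`, `(v)` and `(x)` are always treated as roman numerals, never as letters.

```ruby
# Classify a list number by its format. Roman numerals are checked first,
# which is a simplification: "(i)" is never read as the letter i.
def number_format(num)
  case num
  when /\A\([ivx]+\)\z/ then :roman
  when /\A\([a-z]\)\z/ then :alpha
  when /\A\([a-z]{2}\)\z/ then :double_alpha
  else :other
  end
end

# Turn a flat array of [num, text] pairs into a nested tree of hashes.
def nest_items(items)
  root = { children: [] }
  # Each entry pairs a numbering format with the node whose children use it.
  stack = []
  items.each do |num, text|
    kind = number_format(num)
    node = { num: num, text: text, children: [] }
    index = stack.index { |fmt, _| fmt == kind }
    if index
      # Format already open at some level: close any deeper levels.
      stack = stack[0..index]
    else
      # New format: open a level under the most recent item (or the root).
      parent = stack.empty? ? root : stack.last[1][:children].last
      stack << [kind, parent]
    end
    stack.last[1][:children] << node
  end
  root[:children]
end

# Print the tree with two spaces of indentation per level.
def print_tree(nodes, depth = 0)
  nodes.each do |node|
    puts "#{'  ' * depth}#{node[:num]} #{node[:text]}"
    print_tree(node[:children], depth + 1)
  end
end

flat = [["(a)", "first"], ["(i)", "sub one"], ["(ii)", "sub two"], ["(b)", "second"]]
print_tree(nest_items(flat))
```

A change of numbering format opens a deeper level, and a return to an earlier format closes the levels below it. A context-free grammar cannot easily express this, which is why it is done as a post-processing pass.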
Slaw uses Treetop to compile a grammar into a backtracking parser. The parser builds a parse tree, the nodes of which know how to serialize themselves in XML format.
Supporting formats from other countries' legal traditions probably requires creating a new grammar and parser.
Slaw can dynamically load your custom Treetop grammars. When called with --grammar xy, Slaw tries to require slaw/grammars/xy/act and instantiate the parser class Slaw::Grammars::XY::ActParser. Slaw always uses the rule act as the root of the parser.
You can create your own grammar by creating a gem that provides these files and classes.
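The naming convention described above can be sketched as follows. The helpers `grammar_location` and `load_parser` are hypothetical names used for illustration, not Slaw's API. They only show how a grammar code such as xy maps to a require path and a parser class name, and how a lookup along those lines might work.

```ruby
# Hypothetical helper: maps a grammar code to the file and class names
# that the convention above expects a grammar gem to provide.
def grammar_location(code)
  {
    require_path: "slaw/grammars/#{code.downcase}/act",
    parser_class: "Slaw::Grammars::#{code.upcase}::ActParser"
  }
end

# Sketch of the lookup itself (an assumption about how such loading could
# work). Raises LoadError or NameError when no installed gem provides the
# grammar, so it is defined here but not called.
def load_parser(code)
  location = grammar_location(code)
  require location[:require_path]
  Object.const_get(location[:parser_class]).new
end

p grammar_location('xy')
```

A gem providing a grammar for xy would therefore ship a file at lib/slaw/grammars/xy/act.rb that defines Slaw::Grammars::XY::ActParser with a root rule named act.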
- Fork it at http://github.com/longhotsummer/slaw/fork
- Install dependencies:
bundle install
- Create your feature branch:
git checkout -b my-new-feature
- Write great code!
- Run tests:
rspec
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Create a new Pull Request
- Update lib/slaw/version.rb
- Run rake release
- Generate correct .../!main in FRBR URIs.
- Use <br/> for newlines in tables, rather than <eol/>, since it's more semantically correct.
- Prefix eId attributes in attachments with the attachment's eId
- Use crossHeading element for crossheadings
- Support underlines with __text__
- Handle sup and sub when extracting from HTML.
- Handle escaping inlines when unparsing.
- Handle escaping in inlines, so that forward slashes in link text are unescaped correctly, eg [https:\/\/example.com](https://example.com)
- Remove dependency on mimemagic. Guess file type based on filename instead.
- Strip ascii, unicode general and unicode supplemental punctuation from num elements when building eIds
- support inline superscript ^^text^^
- support inline subscript _^text^_
- hcontainer elements have name attributes, to be compliant with AKN 3.0
- BREAKING: Create XML with AKN 3 namespace (http://docs.oasis-open.org/legaldocml/ns/akn/3.0), AKN2 is no longer supported
- BREAKING: replace id attributes with eId attributes
- BREAKING: serialize schedules as attachments to act, not as components as peers of the act
- BREAKING: anonymous blocks are serialized as hcontainers, not paragraphs
- BREAKING: crossheading hcontainer IDs correctly use hcontainer
- Remove unnecessary schemaLocation header in root element
- Subpart numbers are optional
- Subsections can have numbers such as 1.1A and 1.1bis
- Support SUBPART
- Fix bug with id prefix on schedules container
- Obey --id-prefix for group nodes
- Ensure that schedules prefix their children, for those that require it (parts and chapters)
- List ids are now numbered sequentially, rather than by tree position
- New Slaw::Grammars::Counters helper module
- Better support for ol, ul and li when importing from HTML
- Support Chapters inside Parts
- Give grammars the opportunity to post-process generated XML
- Move blocklist handling into postprocessing for ZA grammar
- ZA grammar rewrites schedule aliases to include full text content of headings
- Schedules have a new grammar to make it easier for users to understand headings and subheadings.
- The way schedule IDs are generated has been simplified.
- BODY is allowed to be empty
- BODY marks start of body
- Preserve whitespace for mixed content nodes
- Don't pretty-print XML, as this can introduce meaningful whitespace
- Restructure subsections to support generic block elements, starting with an inline block element
- FIX bug where unparse was returning XML, not text
- Internal adjustments to make rules easier to override
- Crossheadings at start of body (ending preface and preamble)
- Only renest annotated blocklists
- Table grammar uses additional rules and permits whitespace
- Permit inline content in chapter, part and section headings
- FIX don't error when a line is just a backslash
- Add --ascii flag to %-encode utf-8 strings into US-ASCII for speed. See cjheath/treetop#31
- Inline bold and italics
- Support for CROSSHEADING elements using an empty hcontainer until we support AKN 3.0
- Support for LONGTITLE in PREFACE
- Remarks and references support nested inline elements
- BREAKING: clauses rule renamed to inline_elements so as not to clash with real AKN clauses
- BREAKING: block_paragraphs rule renamed to generic_container and adjusted to be singular to be simpler to understand
- BREAKING: un-numbered paragraph elements have new ids, that should not clash with numbered paragraphs from other grammars
- Schedules use hcontainer, not article
- Schedules allow rich content in title and heading
- Make subclassing preface statements easier
- Remove support for PDFs. Do text extraction from PDFs outside of this library.
- Support dynamically loading grammars from other gems.
- Don't change ALL CAPS headings to Sentence Case.
- SECURITY require Nokogiri 1.8.5 or greater to address https://nvd.nist.gov/vuln/detail/CVE-2018-14404
- FIX bug in all grammars that dropped less-than symbols < from input text.
- FIX bug in ZA grammar when parsing dotted numbered subsections ending with a newline
- Improved support for other legal traditions / grammars.
- Add Polish legal tradition grammar.
- Slaw no longer does too much introspection of a parsed document, since that can be so tradition-dependent.
- Move reformatting out of Slaw since it's tradition-dependent.
- Remove definition linking, Slaw no longer supports it.
- Remove unused code for interacting with the internals of acts.
- Match defined terms in 'definition' section.
- Updated nokogiri dependency to 1.8.2
- Support links and images inside tables, by parsing tables natively.
- Support --crop for PDFs. Requires poppler pdftotext, not xpdf.
- Update nokogiri to ~> 1.8.1
- Ignore non-AKN compatible table attributes
- Support tables in many non-PDF documents (eg. Word documents) by converting to HTML and then to Akoma Ntoso
- Convert non-breaking space (\xA0) to space
- Support links in remarks
- Support inline image tags, using Markdown syntax: ![alt text](image url)
- Smarter un-break lines
- FIX allow Schedule, Part and other headings at the start of blocklist and subsections
- FIX replace empty CONTENT elements with empty P tags so XML validates
- Better handling of empty subsections and blocklist items
- Support links/references using Markdown-like [text](href) syntax.
- FIX allow remarks in blocklist items
- Support newlines in table cells as EOL (or BR in HTML)
- FIX unparsing of remarks, introduced in 0.10.0
- Ensure backslash escaping handles listIntroductions and partial words correctly
- New command unparse FILE which transforms an Akoma Ntoso XML document into plain text, suitable for re-parsing
- Support escaping special words with a backslash
- This release makes reasonably significant changes to generated XML, particularly for sections without explicit subsections.
- Blocklist items numbered (aa) that follow (z) now use the same numbering format.
- Change how blockList listIntroduction elements are created to be more generic
- Support for sections that dive straight into lists without subsections
- Simplify grammar
- Fix elements with potentially duplicate ids
- During cleanup, break lines on section titles that don't have a space after the number, eg: "New section title 4.(1) The content..."
- Schedules can be empty (#10)
- Schedules can have both a title and a heading, permitting schedules titled "First Schedule" and not just "Schedule 1"
- FEATURE: parse command only reformats input for PDFs or when --reformat is given
- FIX: don't error on defn tags without link to defined term
- use refersTo to identify blocks containing term definitions, rather than setting an (invalid) ID
- add link-definitions command to find and extract defined terms and link them to their definitions
- exit with non-zero exit code on failure (see rails/thor#244)
- add --section-number-position argument to slaw command
- grammar supports empty chapters and parts
- major changes to grammar to permit chapters, parts, sections etc. in schedules