American Legal Parser : Known Issues #4

Open
krusynth opened this issue Apr 23, 2014 · 1 comment

@krusynth

This issue is to document everything we know about issues that our parser has with American Legal's data.

SubChapter Titles may be hidden, numbering may be confusing.

A good example here is Charter, Article XVI (0-0-0-1327.xml), where the "THE ARTS, MUSIC, SPORTS, AND PRE-SCHOOL FOR EVERY CHILD AMENDMENT OF 2003" section shows up after SEC. 16.123 and is numbered in the SEC. 16.123.X pattern, even though it has nothing to do with SEC. 16.123.

Expired sections not displayed as sections

In the Administrative Code, Chapter 5, Article II (0-0-0-1708.xml), Article II is expired, but rather than preserving the sections listed within it, the data outputs the whole article as a single chunk of content, like a list. Reading these in as sections fails as a result.

Building Code

All parts of the building code (Plumbing, Mechanical, etc.) are grouped within the building code; only the title in each file differentiates them. Currently, we're scraping that title to create one building code with substructures for each part; this is less than ideal.

Inconsistent Naming

Some sections begin with "Sec.", some with "Section", and some with "Secs." (in cases of multiples), and these may be all caps or natural case, interchangeably within a file. The Charter and Fire Code don't always start titles with "Section", so we use custom parsers that can handle these.
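As a rough illustration of the prefix handling described above, here is a minimal Python sketch. It is not the parser's actual code; the pattern, the `section_number` helper, and the assumed shape of section numbers are all hypothetical.

```python
import re

# Hypothetical pattern: accepts "Section", "Sec." or "Secs." in any case,
# followed by a section number such as 16.123 or 2.1-2.5.
SECTION_PREFIX = re.compile(
    r"^\s*(?:SECTION\s+|SECS?\.\s*)(?P<number>[0-9A-Za-z][0-9A-Za-z.\-]*)",
    re.IGNORECASE,
)

def section_number(heading):
    """Return the section number from a heading, or None if no prefix matches."""
    match = SECTION_PREFIX.match(heading)
    if match is None:
        return None
    # Drop a trailing period that ends the number rather than belonging to it.
    return match.group("number").rstrip(".")
```

Headings that return `None` (as some Charter and Fire Code titles would) still need the custom handling mentioned above.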

Similar problems exist for Structure names, and structure types may include Chapter, Division, Part, Section (where the actual sections are SubSections), or Appendix.

Subparagraphs

The nested subparagraphs, subsubparagraphs, etc, are not actually nested in the data that we receive. As a result, we have no way of knowing where the nesting should be performed for text sections. These sections generally begin with <TAB tab-count="1"/>#<TAB tab-count="1"/> where # is the letter of the paragraph.
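A minimal Python sketch of locating those markers, assuming the raw section text is available as a string. The label syntax (letters or digits, optionally parenthesized) and the `split_paragraphs` helper are assumptions for illustration. The result is a flat list, since the data gives no reliable nesting information.

```python
import re

# Matches the marker described above: a tab, a paragraph label, another tab.
# The accepted label forms, e.g. "a", "(a)", "1.", are an assumption.
MARKER = re.compile(
    r'<TAB tab-count="1"\s*/>\s*(?P<label>\(?[0-9A-Za-z]+\)?\.?)\s*<TAB tab-count="1"\s*/>'
)

def split_paragraphs(raw):
    """Split raw section text into flat (label, text) pairs."""
    matches = list(MARKER.finditer(raw))
    pieces = []
    for i, m in enumerate(matches):
        # Each paragraph runs until the next marker or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(raw)
        pieces.append((m.group("label"), raw[m.end():end].strip()))
    return pieces
```

Any text before the first marker is dropped by this sketch; a real parser would need to keep it as the section's lead paragraph.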

Table of Contents

In most files, the first sub LEVEL encountered is a table of contents. In some cases, this may even be the first two LEVELs. We skip the first one by default, and ignore content that is only a big table with nothing else in the section. Note this may be problematic later. Strangely, these sections always have the toc-section="false" flag set, as do normal sections.

Tables

Tables with heading rows are displayed as two different tables - one for the head, and one for the body. We deal with this by checking for two consecutive tables where the first table only has one row - when this is encountered, we make it all into one table.
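The merge rule above can be sketched in Python as follows. The data shape is hypothetical: each table is a list of rows, and any non-table content is represented as a string.

```python
def merge_split_tables(blocks):
    """Merge a one-row table into the table that immediately follows it.

    `blocks` is a hypothetical flat list in document order: tables are
    lists of rows, anything else is a string.
    """
    merged = []
    i = 0
    while i < len(blocks):
        current = blocks[i]
        nxt = blocks[i + 1] if i + 1 < len(blocks) else None
        if isinstance(current, list) and len(current) == 1 and isinstance(nxt, list):
            # Treat the single row as the heading row of the following table.
            merged.append(current + nxt)
            i += 2
        else:
            merged.append(current)
            i += 1
    return merged
```

A one-row table that is not followed directly by another table is left untouched, so a genuine single-row table is only merged when it sits immediately before another table.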

Images

Images are all exported as JPEGs, regardless of the original format. We cannot legally show the Seal of SF, and we probably do not want to show the ALP Icon where we're not authorized to do so, so we skip over images that match either one.

@krusynth

krusynth commented May 9, 2014

Empty Structures

There are quite a few "Reserved" structures in Chicago's code that have no children. We need to be able to handle these. The obvious solution is to check and see if the section parsing was successful, and if not attempt to parse it as a structure.
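A minimal Python sketch of that fallback, assuming the section parser signals failure with an exception. `ParseError`, `parse_section`, `parse_structure`, and the dict-shaped nodes are stand-ins for illustration, not the parser's real interfaces.

```python
class ParseError(Exception):
    """Raised by the hypothetical parsers when a node does not fit."""

def parse_section(node):
    # Stand-in rule: a section must have body text.
    if not node.get("text"):
        raise ParseError("no section text")
    return {"type": "section", "heading": node["heading"], "text": node["text"]}

def parse_structure(node):
    # Stand-in rule: a structure only needs a heading; children may be empty.
    return {
        "type": "structure",
        "heading": node["heading"],
        "children": node.get("children", []),
    }

def parse_node(node):
    """Try the section parser first, then fall back to a structure."""
    try:
        return parse_section(node)
    except ParseError:
        return parse_structure(node)
```

With this ordering, a "Reserved" entry with no text and no children comes out as an empty structure instead of failing the import.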
