This issue is to document everything we know about issues that our parser has with American Legal's data.
SubChapter Titles may be hidden, numbering may be confusing.
A good example here is Charter, Article XVI (0-0-0-1327.xml), where the "THE ARTS, MUSIC, SPORTS, AND PRE-SCHOOL FOR EVERY CHILD AMENDMENT OF 2003" section shows up after SEC. 16.123 with section numbering of the SEC. 16.123.X pattern, even though it has nothing to do with SEC. 16.123.
Expired sections not displayed as sections
In the Administrative Code, Chapter 5, Article II (0-0-0-1708.xml), Article II is expired, but rather than preserving the sections listed within it, the export outputs it as a single chunk of content, like a list. Reading these in as sections fails as a result.
Building Code
All parts of the building code (Plumbing, Mechanical, etc.) are grouped within the building code - only the title in each file differentiates them. Currently, we're scraping that title to create one building code with substructures for each part; this is less than ideal.
Inconsistent Naming
Some sections begin with "Sec.", some with "Section" and some with "Secs." (in cases of multiples), and these may be all caps or natural case, interchangeably within a file. The Charter and Fire code don't always start titles with "Section", so we use custom parsers that can handle these.
Similar problems exist for Structure names, and structure types may include Chapter, Division, Part, Section (where the actual sections are SubSections), or Appendix.
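As a rough illustration of the prefix variations described above, a case-insensitive pattern can accept "Sec.", "Secs." and "Section" in one pass. This is only a sketch, not the parser we actually use: the function name `parse_section_heading` and the assumed shape of a section number (digits, letters, dots and hyphens) are hypothetical.

```python
import re

# Hypothetical pattern: accepts "Sec.", "Secs.", "Section" or "Sections"
# in any letter case, followed by a section number or range.
SECTION_HEADING = re.compile(
    r"^\s*(?P<prefix>secs?\.|sections?)\s+(?P<number>[0-9A-Za-z][0-9A-Za-z.\-]*)",
    re.IGNORECASE,
)


def parse_section_heading(title):
    """Return (number, is_multiple) for a section title, or None if no match."""
    match = SECTION_HEADING.match(title)
    if match is None:
        return None
    # Titles often end the number with a period ("SEC. 16.123."); drop it.
    number = match.group("number").rstrip(".")
    is_multiple = match.group("prefix").lower() in ("secs.", "sections")
    return number, is_multiple
```

Titles that do not start with one of these prefixes, as in parts of the Charter and Fire Code, return `None` here and would still need their own handling.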
Subparagraphs
The nested subparagraphs, subsubparagraphs, etc, are not actually nested in the data that we receive. As a result, we have no way of knowing where the nesting should be performed for text sections. These sections generally begin with <TAB tab-count="1"/>#<TAB tab-count="1"/> where # is the letter of the paragraph.
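The marker described above can at least be detected and separated from the text, even though it does not tell us the nesting depth. The sketch below assumes the raw line is available as a string and that the label contains no whitespace or angle brackets; `split_subparagraph` is a hypothetical helper name.

```python
import re

# Matches <TAB tab-count="1"/>LABEL<TAB tab-count="1"/> at the start of a line.
# Assumption: the label is a short token such as "a" or "(a)".
SUBPARAGRAPH_MARKER = re.compile(
    r'<TAB tab-count="1"\s*/>\s*(?P<label>[^<\s]+)\s*<TAB tab-count="1"\s*/>'
)


def split_subparagraph(raw_line):
    """Return (label, text); label is None when the line has no marker."""
    line = raw_line.strip()
    match = SUBPARAGRAPH_MARKER.match(line)
    if match is None:
        return None, line
    return match.group("label"), line[match.end():].strip()
```

This yields a flat sequence of labelled paragraphs. Deciding which labels are children of which would still need a separate heuristic, since the data itself carries no nesting.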
Table of Contents
In most files, the first sub LEVEL encountered is a table of contents. In some cases, this may even be the first two LEVELs. We skip the first one by default, and ignore content that is only a big table with nothing else in the section. Note this may be problematic later. Strangely, these sections always have the toc-section="false" flag set, as do normal sections.
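The skipping rule above might look roughly like the following. Only the `LEVEL` element name comes from the data as described; the `TABLE` tag name, the helper names, and the `skip_first` parameter are assumptions made for the sake of the example.

```python
import xml.etree.ElementTree as ET


def is_table_only(level):
    """True when a LEVEL holds nothing but TABLE children (assumed tag name)."""
    children = list(level)
    has_text = bool((level.text or "").strip()) or any(
        (child.tail or "").strip() for child in children
    )
    return bool(children) and not has_text and all(
        child.tag == "TABLE" for child in children
    )


def content_levels(parent, skip_first=1):
    """Drop the leading table-of-contents LEVEL(s) and any table-only LEVELs."""
    levels = parent.findall("LEVEL")[skip_first:]
    return [level for level in levels if not is_table_only(level)]
```

Files where the first two LEVELs are both tables of contents would need `skip_first=2`, and a real section consisting of a single table would be dropped by this check, which is the risk noted above.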
Tables
Tables with heading rows are displayed as two different tables - one for the head, and one for the body. We deal with this by checking for two consecutive tables where the first table only has one row - when this is encountered, we make it all into one table.
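A minimal sketch of that merge, assuming the content has already been reduced to a list of blocks where each table is a dict with a `rows` list. The block shape and the name `merge_split_tables` are hypothetical, not the parser's real data model.

```python
def merge_split_tables(blocks):
    """Merge a one-row table with the table that immediately follows it."""
    merged = []
    i = 0
    while i < len(blocks):
        block = blocks[i]
        nxt = blocks[i + 1] if i + 1 < len(blocks) else None
        if (
            block["type"] == "table"
            and len(block["rows"]) == 1
            and nxt is not None
            and nxt["type"] == "table"
        ):
            # Treat the single-row table as the header of the next table.
            merged.append(
                {"type": "table", "header": block["rows"][0], "rows": list(nxt["rows"])}
            )
            i += 2
        else:
            merged.append(block)
            i += 1
    return merged
```

A genuine one-row table that happens to sit directly before another table would be merged as well, so the heuristic can produce false positives.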
Images
Images are all exported as JPEGs, regardless of the original format. We cannot show the Seal of SF legally and we probably do not want to show the ALP Icon where we're not authorized to do so, so we skip over images that match.
There are quite a few "Reserved" structures in Chicago's code that have no children. We need to be able to handle these. The obvious solution is to check whether the section parsing was successful and, if not, attempt to parse it as a structure.
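The fallback could be sketched like this. `ParseError`, `parse_node`, and the two parser callables passed in are all hypothetical stand-ins for whatever the real parser exposes.

```python
class ParseError(Exception):
    """Raised by a parser that cannot make sense of a node (hypothetical)."""


def parse_node(node, parse_section, parse_structure):
    """Try the section parser first; fall back to the structure parser."""
    try:
        return "section", parse_section(node)
    except ParseError:
        # A childless "Reserved" entry fails as a section, so retry it
        # as a structure instead of giving up on the node.
        return "structure", parse_structure(node)
```

If the structure parser also raises, the error propagates to the caller, which seems preferable to silently dropping the node.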