
ProcessingGuidelines

Clifford Wulfman edited this page Feb 13, 2014 · 4 revisions

Town Topics Processing Guidelines

1 Identifiers and Naming Conventions

PUL assigns a URI to each newspaper, newspaper issue, METS, and MODS record it maintains:

<PREFIX> ::= "urn:PUL:newspapers"
<DATESTRING> ::= CCYY-MM-DD | CCYY-MM | CCYY
<ISSUEINDEX> ::= <0-9><0-9>
<PAPERID>     ::= "TownTopics"
<ISSUEID>     ::= <PAPERID> "_" <DATESTRING> "_" <ISSUEINDEX>

<TITLEURI>    ::= <PREFIX> ":" <PAPERID>
<ISSUEURI>    ::= <PREFIX> ":" <ISSUEID>


<TITLEMETSURI> ::= <PREFIX> ":td:" <PAPERID>
<ISSUEMETSURI> ::= <PREFIX> ":td:" <ISSUEID>

<TITLEMODSURI> ::= <PREFIX> ":dmd:" <PAPERID>
<ISSUEMODSURI> ::= <PREFIX> ":dmd:" <ISSUEID>

This syntax is explained below.

1.1 Identifiers (paperids)

PUL adopts the Uniform Resource Name (URN) conventions to compose unique identifiers for titles and issues, as well as for the metadata records (METS and MODS) used to encode information about them.

A Newspaper Title is the paper as a whole, as opposed to discrete volumes or issues – the entire run of the newspaper. PUL assigns an acronym for each paper in its digital collection.

The paperid for the Town Topics newspaper is TownTopics.

1.2 Issuance Strings

Each newspaper issue will be assigned an issue identifier or issueid of the form paperid_issuanceString, where issuanceString corresponds to the date of issuance and takes the form CCYY-MM-DD_II, defined as follows:

CCYY
A four-digit number representing the year of publication (e.g., 1912)
MM
A two-digit number representing the month of publication, where January = 01, February = 02, etc.
DD
A two-digit number representing the day of publication (e.g., 01, 02 .. 30, 31).
II
A two-digit index of daily issuance (e.g., the first issue of the day is 01, the second is 02, and so on). This convention is adopted from the issuance of newspapers, which not infrequently issued a morning edition and an evening edition on the same day.

1.2.1 How to Compose an Issuance String

Issueids, like ISO 8601 dates, are organized from most significant to least significant chronological unit (i.e., from year to day). This format has two advantages: it allows ids to be sorted naturally, and it enables variable precision: the representation of daily, monthly, or yearly issuance.

  • If you know the year, month, and day of publication (e.g., you know that the issue was published on January 5th, 1912): the issuance string is 1912-01-05_01.
  • If you have two issues published on the same day (e.g., an issue and a special supplement were both issued on January 5th, 1912): the issuance strings are 1912-01-05_01 and 1912-01-05_02.
  • If you know only the year and month of publication (e.g., you know it was published in January, 1912, but you do not know on what day): the issuance string is 1912-01_01.
  • If you know that the newspaper published two issues monthly, but you do not know the dates of publication (e.g., you have two issues published in January, 1912): the issuance strings are 1912-01_01 and 1912-01_02.
  • If you know only the year of publication (e.g., you know that the issue was published in 1912, but you do not know the month or, therefore, the day of publication): the issuance string is 1912_01.
  • If you have several issues published in the same year, but you know neither the month nor the day of publication (e.g., you know the newspaper published two issues in 1912, but you do not know the months or days of publication): the issuance strings are 1912_01 and 1912_02.
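The rules above can be sketched in a few lines of Python. This is an illustration only, not part of any PUL tooling; the function name issuance_string and its parameters are assumptions made for this example.

```python
from typing import Optional


def issuance_string(year: int, month: Optional[int] = None,
                    day: Optional[int] = None, index: int = 1) -> str:
    """Build CCYY[-MM[-DD]]_II from whichever date parts are known."""
    if day is not None and month is None:
        raise ValueError("a day cannot be given without a month")
    if not 1 <= index <= 99:
        raise ValueError("the issue index must fit in two digits")
    parts = [f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")
    if day is not None:
        parts.append(f"{day:02d}")
    # Date parts are joined with hyphens; the index follows an underscore.
    return "-".join(parts) + f"_{index:02d}"


print(issuance_string(1912, 1, 5))        # 1912-01-05_01
print(issuance_string(1912, 1, index=2))  # 1912-01_02
print(issuance_string(1912))              # 1912_01
```

Because every field is zero-padded and ordered from year to day, sorting the resulting strings as plain text also sorts them chronologically.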

1.2.1.1 A Real Example

For the issue of Town Topics published on April 7, 2010:

titleid urn:PUL:newspapers:TownTopics
title METS id urn:PUL:newspapers:td:TownTopics
title MODS id urn:PUL:newspapers:dmd:TownTopics
issueid urn:PUL:newspapers:TownTopics_2010-04-07_01
issue METS id urn:PUL:newspapers:td:TownTopics_2010-04-07_01
issue MODS id urn:PUL:newspapers:dmd:TownTopics_2010-04-07_01
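As a minimal sketch of how the grammar in section 1 produces these six identifiers, the following Python function composes them from a paperid, a date string, and an issue index. The function name and the dictionary keys are assumptions chosen to mirror the table above.

```python
PREFIX = "urn:PUL:newspapers"


def issue_identifiers(paper_id: str, date_string: str, issue_index: int = 1) -> dict:
    """Return the title-level and issue-level URNs for one issue."""
    issue_id = f"{paper_id}_{date_string}_{issue_index:02d}"
    return {
        "titleid": f"{PREFIX}:{paper_id}",
        "title METS id": f"{PREFIX}:td:{paper_id}",
        "title MODS id": f"{PREFIX}:dmd:{paper_id}",
        "issueid": f"{PREFIX}:{issue_id}",
        "issue METS id": f"{PREFIX}:td:{issue_id}",
        "issue MODS id": f"{PREFIX}:dmd:{issue_id}",
    }


for label, urn in issue_identifiers("TownTopics", "2010-04-07").items():
    print(label, urn)
```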

1.3 File Names

Names of Town Topics files will be constructed using the naming convention described above.

1.3.1 Image File Names

<EXTENSION> ::= "tif" | "jp2"
<IMGINDEX>  ::= <0-9><0-9><0-9>
<FILENAME>  ::= <ISSUEID> "_" <IMGINDEX> "." <EXTENSION>

Image files shall be named issueid_nnn.jp2 or issueid_nnn.tif, where

  • issueid is the identifier of the issue;
  • nnn is a three-digit number indicating the location of the image file in the sequence of image files (not necessarily the number printed on the page that has been photographed);
  • jp2 is the conventional file extension for JPEG2000 files.
  • tif is the conventional file extension for TIFF files.

For example,

TownTopics_2010-04-07_01_001.jp2
TownTopics_2010-04-07_01_002.jp2
...

1.3.2 ALTO File Names

<EXTENSION> ::= "alto.xml"
<IMGINDEX>  ::= <0-9><0-9><0-9>
<FILENAME>  ::= <ISSUEID> "_" <IMGINDEX> "." <EXTENSION>

ALTO files shall be named issueid_nnn.alto.xml, where

  • issueid is the identifier of the issue
  • nnn is a three-digit number corresponding to the sequence number of the image file to which this ALTO file corresponds
  • alto indicates the schema used to encode the document
  • xml indicates the format of the file.

For example,

TownTopics_2010-07-01_01_001.alto.xml
TownTopics_2010-07-01_01_002.alto.xml
...

1.3.3 METS File Names

<EXTENSION> ::= "mets.xml"
<FILENAME>  ::= <ISSUEID>  "." <EXTENSION>

METS files shall be named issueid.mets.xml, where

  • issueid is the identifier of the issue
  • mets indicates the schema used to encode the document
  • xml indicates the format of the file.

For example,

TownTopics_2010-07-01_01.mets.xml

1.3.4 PDF File Names

<EXTENSION> ::= "pdf"
<FILENAME>  ::= <ISSUEID> "." <EXTENSION>

PDF files shall be named issueid.pdf, where

  • issueid is the identifier of the issue
  • pdf indicates the format of the file.

For example,

TownTopics_2010-07-01_01.pdf

1.3.5 Extensions

For each issue, four types of files will be produced:

  1. An ALTO XML file for each page;
  2. A JPEG2000 image file for each page;
  3. A METS XML file for the entire issue;
  4. A PDF file for the entire issue

Specifications for each of these file types appear below; the file-naming conventions are as follows:

type      name
ALTO      basename_NNN.alto.xml
JPEG2000  basename_NNN.jp2
METS      basename.mets.xml
PDF       basename.pdf

Where basename is the issueid and NNN is the left-padded sequential number of the page (e.g., 001, 002, etc.).
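To make the naming table concrete, here is a hedged Python sketch that lists the file names expected for one issue. It assumes that basename is the issueid defined in section 1 and that pages are numbered from 001; the function name is hypothetical.

```python
def issue_file_names(issue_id: str, page_count: int) -> list:
    """List the four kinds of files expected for an issue with `page_count` pages."""
    names = [f"{issue_id}.mets.xml", f"{issue_id}.pdf"]
    for page in range(1, page_count + 1):
        nnn = f"{page:03d}"  # left-padded, three-digit sequence number
        names.append(f"{issue_id}_{nnn}.alto.xml")
        names.append(f"{issue_id}_{nnn}.jp2")
    return names


for name in issue_file_names("TownTopics_2010-04-07_01", 2):
    print(name)
```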

1.4 Directory Names

The files for the Town Topics shall be stored in a directory hierarchy whose structure is as follows:

 - TownTopics/
   - CCYY/
     - MM/
	 - DD_II/

Where CCYY, MM, and DD_II are as defined above. Thus the files for the April 7, 2010 issue would be arranged as follows:

   - TownTopics/
     - 2010/
	 - 04/
	   - 07_01/
	     - TownTopics_2010-04-07_01.mets.xml
	     - TownTopics_2010-04-07_01.pdf
	     - TownTopics_2010-04-07_01_001.alto.xml
	     - TownTopics_2010-04-07_01_002.alto.xml
	       ...
	     - TownTopics_2010-04-07_01_001.jp2
	     - TownTopics_2010-04-07_01_002.jp2
	       ...
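The mapping from an issueid to its directory can be sketched as follows. This is an illustrative helper, not project code: the function name is an assumption, and it handles only issueids with a full CCYY-MM-DD date, since the hierarchy above does not say where issues with partial dates are stored.

```python
from pathlib import PurePosixPath


def issue_directory(issue_id: str) -> PurePosixPath:
    """Map an issueid such as TownTopics_2010-04-07_01 to its storage directory."""
    # Split from the right so a paperid containing underscores would still work.
    paper_id, date_string, index = issue_id.rsplit("_", 2)
    date_parts = date_string.split("-")
    if len(date_parts) != 3:
        raise ValueError("only full CCYY-MM-DD dates are covered by this sketch")
    year, month, day = date_parts
    return PurePosixPath(paper_id, year, month, f"{day}_{index}")


print(issue_directory("TownTopics_2010-04-07_01"))  # TownTopics/2010/04/07_01
```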

2 File Profiles

2.1 METS XML

The root element <mets> contains these attributes:

TYPE
the fixed value Newspaper
OBJID
the URN for the issue
LABEL
the issueid

2.1.1 <metsHdr>

The <metsHdr> element shall contain two elements:

2.1.1.1 <agent>

A constant value for all records:

<agent ROLE="CREATOR" TYPE="ORGANIZATION">
 <name>Princeton University Library, Digital Initiatives</name>
</agent>

2.1.1.2 <metsDocumentID TYPE="URN">

Contains a string whose content is composed as follows:

PREFIX:ISSUEID

Where PREFIX is the following fixed value:

urn:PUL:newspapers:td

And ISSUEID is the issue identifier, computed using the rules above. Thus for the April 7, 2010 issue:

<metsDocumentID TYPE="URN">urn:PUL:newspapers:td:TownTopics_2010-04-07_01</metsDocumentID>
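A minimal sketch of generating this header with Python's standard xml.etree.ElementTree module follows. The function name and the "mets" namespace prefix are assumptions for illustration; only the element names, attributes, and fixed values come from the text above.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)  # prefix chosen for this example


def build_mets_hdr(issue_id: str) -> ET.Element:
    """Build a <metsHdr> holding the constant <agent> and the issue's <metsDocumentID>."""
    hdr = ET.Element(f"{{{METS_NS}}}metsHdr")
    agent = ET.SubElement(hdr, f"{{{METS_NS}}}agent",
                          {"ROLE": "CREATOR", "TYPE": "ORGANIZATION"})
    name = ET.SubElement(agent, f"{{{METS_NS}}}name")
    name.text = "Princeton University Library, Digital Initiatives"
    doc_id = ET.SubElement(hdr, f"{{{METS_NS}}}metsDocumentID", {"TYPE": "URN"})
    doc_id.text = f"urn:PUL:newspapers:td:{issue_id}"
    return hdr


header = build_mets_hdr("TownTopics_2010-04-07_01")
print(ET.tostring(header, encoding="unicode"))
```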

2.1.2 <dmdSec>

The record contains a single <dmdSec> element with an ID attribute of “dmd01”; it contains an embedded MODS record for the issue (described below).

2.1.3 <amdSec ID="amdSecN"> elements

The <amdSec> contains a <techMD> element for each image file (a <mix> record).

There are five <amdSec>s in an issue-level METS file, one for the files in each of the following groups:

  1. Delivery-level jp2s
  2. Generative jp2s
  3. Preservation-level tifs
  4. low-resolution issue-level PDF
  5. high-resolution issue-level PDF

2.1.4 <fileSec>

The fileSec comprises two <fileGrp> elements: one for the JP2 images and one for the ALTO records.

2.1.4.1 <fileGrp ID="IMGGRP" USE="Images">

The Images file group contains <file> elements that indicate the location of each image file, with attributes linking the file to the corresponding technical metadata and to the corresponding ALTO file.

2.1.4.1.1 <file>
ID
a unique XML id
AMDID
the ID of the <techMD> element corresponding to the image file
GROUPID
an ID that links an image file to an ALTO file. The image file for a page and the ALTO file containing the OCR output for that page share an id (conventionally named pageN, where N is a sequence number).
MIMETYPE
the constant "image/jp2" for JPEG2000 images
CHECKSUM
the checksum of the file, according to the algorithm specified in CHECKSUMTYPE
CHECKSUMTYPE
the algorithm used to compute the checksum; usually SHA-1.

2.1.4.1.1.0.1 <FLocat>

The METS element indicating the actual file location.

LOCTYPE
the constant URL
xlink:href
the path to the file. For this project, it will be a local path. For example:

file://./TownTopics_1952-01-06_01_001.jp2
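For the CHECKSUM attribute of each <file> element, a SHA-1 digest can be computed with Python's standard hashlib module. The sketch below is one possible way to do it, assuming CHECKSUMTYPE is SHA-1; the function name and chunk size are arbitrary choices for this example.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal SHA-1 digest of the file at `path`."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so that large image files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string would go in CHECKSUM, with CHECKSUMTYPE set to SHA-1.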
      

2.1.4.2 <fileGrp ID="PDFGRP1" USE="PDF">

Contains a single <file> element corresponding to the composite PDF for the issue.

2.1.5 <structMap>

The <structMap> element describes a hierarchical arrangement of the parts (<div>s) making up the digital object described by the METS. For this project, there are two kinds: a physical structMap, which delineates the pages of the newspaper issue in reading order, and a logical structMap, which functions as an outline of the newspaper’s contents. Both of these are assembled by docWorks, using configuration rules.

2.1.5.1 <structMap TYPE="PHYSICAL">

This <structMap> outlines the physical layout of pages in the following hierarchy of div types:

  • Newspaper
    • Page+
2.1.5.1.1 <div TYPE="Newspaper">

The root <div> of the physical structMap is <div TYPE="Newspaper">. It must contain one or more <div TYPE="PAGE"> elements.

Attributes:

ID
an xml id unique within the document
TYPE
must be “Newspaper”
LABEL
“Town Topics, {DATE}”, where {DATE} is the date of publication.
DMDID
The ID of the <dmdSec> for the issue.
2.1.5.1.2 <div TYPE="PAGE">

Attributes:

ID
an xml id unique within the document
ORDER
the index of the page in the containing div
ORDERLABEL
identical with the ORDER attribute
LABEL
also identical with the ORDER attribute

Each PAGE div contains an <fptr> element that associates the image file with the corresponding ALTO file via parallel <area> elements in a <par> element, like this:

<div ID="DIVP2" ORDER="1" ORDERLABEL="1" TYPE="PAGE">
  <fptr>
    <par>
      <area FILEID="IMG00001"/>
      <area FILEID="ALTO00001" BETYPE="IDREF" BEGIN="P1"/>
    </par>
  </fptr>
</div>
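A PAGE div like the one above could be generated as in the following sketch, which uses Python's standard xml.etree.ElementTree module. The function name, its parameters, and the "mets" prefix are assumptions for illustration; the ID values are passed in by the caller because this section does not define how they are assigned. LABEL is set to the ORDER value, as the attribute list above specifies.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)  # prefix chosen for this example


def page_div(div_id: str, order: int, image_file_id: str,
             alto_file_id: str, alto_page_id: str) -> ET.Element:
    """Build a PAGE <div> pairing an image file with its ALTO file."""
    def tag(name: str) -> str:
        return f"{{{METS_NS}}}{name}"

    div = ET.Element(tag("div"), {
        "ID": div_id,
        "ORDER": str(order),
        "ORDERLABEL": str(order),
        "LABEL": str(order),
        "TYPE": "PAGE",
    })
    par = ET.SubElement(ET.SubElement(div, tag("fptr")), tag("par"))
    # The two parallel areas point at the image and at the ALTO page element.
    ET.SubElement(par, tag("area"), {"FILEID": image_file_id})
    ET.SubElement(par, tag("area"), {"FILEID": alto_file_id,
                                     "BETYPE": "IDREF",
                                     "BEGIN": alto_page_id})
    return div


div = page_div("DIVP2", 1, "IMG00001", "ALTO00001", "P1")
print(ET.tostring(div, encoding="unicode"))
```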

2.1.5.2 <structMap TYPE="LOGICAL">

2.1.5.2.1 The <div> hierarchy

The outlines below show the hierarchical relationship among the <div> elements in the logical structMap.

  • Newspaper
    • Volume+
      • Issue+
        • Title_Section
          • Headline
          • Textblock+
        • Content
          • { Article* | Section* }

  • Article
    • Heading
      • Title
    • Body

  • Section
    • Body

  • Body
    • Body_Content
      • { Paragraph* | Advertisement* }

  • Paragraph
    • TextBlock+

2.2 MODS XML

Princeton encodes descriptive metadata for the contents of each newspaper issue so that it may be searched and analyzed.

<mods>
  <recordInfo>
    <recordIdentifier>urn:PUL:newspapers:dmd:TownTopics_1952-01-06_01</recordIdentifier>
  </recordInfo>
  <identifier type="PUL">urn:PUL:newspapers:TownTopics_1952-01-06_01</identifier>
  <titleInfo>
    <title>Town Topics</title>
  </titleInfo>
  <language>
    <languageTerm type="code" authority="rfc3066">en</languageTerm>
  </language>
  <part>
    <detail type="volume">
      <number>6</number>
      <caption>Vol. 6</caption>
    </detail>
    <detail type="issue">
      <number>43</number>
      <caption>No. 43</caption>
    </detail>
  </part>
  <originInfo>
    <dateIssued>Jan. 6-12, 1952</dateIssued>
    <dateIssued keyDate="yes" encoding="iso8601">1952-01-06</dateIssued>
  </originInfo>
  <relatedItem type="host" xlink:href="urn:PUL:newspapers:TownTopics">
    <recordInfo>
      <recordIdentifier>urn:PUL:newspapers:dmd:TownTopics</recordIdentifier>
    </recordInfo>
  </relatedItem>
     
  <relatedItem type="constituent" ID="c01">
    <titleInfo>
      <title>WE NOMINATE</title>
    </titleInfo>
    <typeOfResource>text</typeOfResource>
    <part>
      <extent unit="pages">
        <list>1</list>
      </extent>
    </part>
  </relatedItem>

  <!-- The remaining constituents go here. -->

</mods>

The MODS elements are detailed below.

2.2.1 <mods>

The root element of the document. When output as a stand-alone document, it has fixed attributes, as illustrated below:

 <mods xmlns="http://www.loc.gov/mods/v3"
	  xmlns:xlink="http://www.w3.org/1999/xlink"
	  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	  xsi:schemaLocation="http://www.loc.gov/mods/v3 
	    http://www.loc.gov/mods/v3/mods-3-4.xsd">

2.2.2 <recordInfo>

The <recordInfo> element contains information about the MODS record itself. It shall contain a <recordIdentifier> element, as below.

2.2.3 <recordIdentifier>

The <recordIdentifier> element contains the ISSUEMODSURI, the unique identifier for the MODS record itself, as described above.

2.2.4 <identifier type="PUL">

The <identifier> element is used to identify the resource the MODS record describes (the newspaper issue). Its value is the resource’s ISSUEURI, as described above.

<identifier type="PUL">urn:PUL:newspapers:TownTopics_1952-01-06_01</identifier>

2.2.5 <relatedItem type="host">

Each issue-level MODS record is related to the title-level record via a <relatedItem type="host"> element.

   <relatedItem type="host" xlink:type="simple" xlink:href="urn:PUL:newspapers:TownTopics">
     <recordInfo>
	 <recordIdentifier>urn:PUL:newspapers:dmd:TownTopics</recordIdentifier>
     </recordInfo>
   </relatedItem>

where TownTopics is the paperid of the title.

The xlink:href shows the semantic relation between the issue and its host; the <recordIdentifier> is a specific key to the title-level record.

2.2.6 <language>

The <language> element(s) indicates the language of the resource (the newspaper issue). If the issue contains material written in several languages, the record should include a <language> element for each one. The value of the <languageTerm> element must be drawn from iso639-2b. For example:

<language>
  <languageTerm type="code" authority="iso639-2b">eng</languageTerm>
</language>

2.2.7 <titleInfo>

The <titleInfo> element shall be determined by standard cataloging rules.

<titleInfo>
  <nonSort>Le</nonSort>
  <title>coeur à barbe</title>
  <subTitle>journal transparent</subTitle>
</titleInfo>

2.2.8 <part>

The <part> element shall take the following form:

<part>
 <detail type="volume">...</detail>
 <detail type="issue">...</detail>
</part>

2.2.8.1 <detail type="volume">

<detail type="volume">
  <number>ARABICVOL</number>
  <caption>Vol. MASTHEADVOL</caption>
</detail>

Where

  • ARABICVOL is the volume number expressed as a non-formatted arabic numeral (e.g., 1, 2, 3, … 10, 11, …)
  • MASTHEADVOL is the volume number as it appears in the masthead.

2.2.8.2 <detail type="issue">

The <detail type="issue"> element shall take one of two possible forms:

  • For “normal” issues (i.e., those following the recorded sequence of publication), record both the sequential number of the issue as an arabic numeral and the issue number as it appears in the masthead:

    <detail type="issue">
      <number>ARABICISSUE</number>
      <caption>No. MASTHEADISSUE</caption>
    </detail>

    Where

    • ARABICISSUE is the issue number expressed as a non-formatted arabic numeral (e.g., 1, 2, 3, …, 10, 11, …)
    • MASTHEADISSUE is the issue number as it appears in the masthead.

  • For “special” issues (e.g., supplements, etc.), for which there is no sequential number for the issue, the <detail type="issue"> element should take the following form:

    <detail type="issue">
      <caption>CAPTIONTEXT</caption>
    </detail>

    Where CAPTIONTEXT is determined using standard cataloging rules.
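The two forms can be produced by one small helper, sketched here with Python's standard xml.etree.ElementTree module. The function name, the "mods" prefix, and the sample caption "Supplement" are assumptions for illustration; a real caption would follow standard cataloging rules.

```python
import xml.etree.ElementTree as ET
from typing import Optional

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)  # prefix chosen for this example


def issue_detail(caption: str, number: Optional[int] = None) -> ET.Element:
    """Build <detail type="issue">; omit <number> for special issues."""
    detail = ET.Element(f"{{{MODS_NS}}}detail", {"type": "issue"})
    if number is not None:
        # Normal issues record the sequential number as a plain arabic numeral.
        ET.SubElement(detail, f"{{{MODS_NS}}}number").text = str(number)
    ET.SubElement(detail, f"{{{MODS_NS}}}caption").text = caption
    return detail


normal = issue_detail("No. 43", number=43)
special = issue_detail("Supplement")  # hypothetical caption
print(ET.tostring(normal, encoding="unicode"))
print(ET.tostring(special, encoding="unicode"))
```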

2.2.9 <originInfo>

The <originInfo> element shall be used to record the date of issuance, as follows:

<originInfo>
 <dateIssued>PRINTEDDATE</dateIssued>
 <dateIssued encoding="iso8601" keyDate="yes">ISODATE</dateIssued>
</originInfo>

Where

  • PRINTEDDATE is the date as it appears in the cover page FolioLine, or in the Masthead.
  • ISODATE is the value of the date in the masthead, expressed in iso8601 format (YYYY-MM-DD) – see http://www.w3.org/TR/NOTE-datetime for details.
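As a hedged illustration, the two <dateIssued> elements can be built from the printed date and a Python date object, whose isoformat() method yields the YYYY-MM-DD form. The function name and the "mods" prefix are assumptions made for this sketch.

```python
import datetime
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)  # prefix chosen for this example


def origin_info(printed_date: str, issued: datetime.date) -> ET.Element:
    """Build <originInfo> with the printed date and the ISO 8601 key date."""
    origin = ET.Element(f"{{{MODS_NS}}}originInfo")
    ET.SubElement(origin, f"{{{MODS_NS}}}dateIssued").text = printed_date
    key_date = ET.SubElement(origin, f"{{{MODS_NS}}}dateIssued",
                             {"encoding": "iso8601", "keyDate": "yes"})
    key_date.text = issued.isoformat()  # YYYY-MM-DD
    return origin


element = origin_info("Jan. 6-12, 1952", datetime.date(1952, 1, 6))
print(ET.tostring(element, encoding="unicode"))
```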

2.3 ALTO Profile

For each page, an encoded representation of the layout and the machine-readable text on the page shall be provided, using the ALTO schema, version 2.0 or higher, with the following specifications, adopted from the NDNP:

  • The text shall be encoded in the natural reading order of the language in which the text is written;
  • Point size and font data to at least the word level shall be included;
  • The ALTO file shall include bounding-box coordinates to at least the word level;
  • Non-rectangular blocks shall not be used. Some illustrations may format as “tight” in the document.

2.4 Image Profiles

2.4.1 JPEG2000: Image Description

2.4.1.1 Delivery Derivatives

To generate a JPEG2000 file using Kakadu, use the following recipe (taken from The National Digital Newspaper Program (NDNP) Technical Guidelines for Applicants):

kdu_compress -i YOURINPUT.tif -o YOUROUTPUT.jp2 \
  -rate 1,0.84,0.7,0.6,0.5,0.4,0.35,0.3,0.25,0.21,0.18,0.15,0.125,0.1,0.088,0.075,0.0625,0.05,0.04419,0.03716,0.03125,0.025,0.0221,0.01858,0.015625 \
  Clevels=6 "Stiles={1024,1024}" Corder=RLCP
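When the recipe is run from a script, passing the arguments as a list avoids shell quoting problems with the braces in the Stiles parameter. The following Python sketch assumes kdu_compress is on the PATH and that the rate value split across lines in the recipe is 0.075; the function names are hypothetical.

```python
import subprocess

# Rate list from the recipe above, written as one comma-separated argument.
RATES = ("1,0.84,0.7,0.6,0.5,0.4,0.35,0.3,0.25,0.21,0.18,0.15,0.125,0.1,"
         "0.088,0.075,0.0625,0.05,0.04419,0.03716,0.03125,0.025,0.0221,"
         "0.01858,0.015625")


def kdu_compress_args(tiff_path: str, jp2_path: str) -> list:
    """Return the argument list for the delivery-derivative recipe."""
    return ["kdu_compress", "-i", tiff_path, "-o", jp2_path,
            "-rate", RATES,
            "Clevels=6", "Stiles={1024,1024}", "Corder=RLCP"]


def make_delivery_jp2(tiff_path: str, jp2_path: str) -> None:
    """Run kdu_compress; raises CalledProcessError if the tool reports failure."""
    subprocess.run(kdu_compress_args(tiff_path, jp2_path), check=True)
```

For example, make_delivery_jp2("YOURINPUT.tif", "YOUROUTPUT.jp2") would run the same command as the recipe.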