ProcessingGuidelines
PUL assigns a URI to each newspaper, newspaper issue, METS, and MODS record it maintains:
<PREFIX> ::= "urn:PUL:newspapers"
<DATESTRING> ::= CCYY-MM-DD | CCYY-MM | CCYY
<ISSUEINDEX> ::= <0-9><0-9>
<PAPERID> ::= "TownTopics"
<ISSUEID> ::= <PAPERID> "_" <DATESTRING> "_" <ISSUEINDEX>
<TITLEURI> ::= <PREFIX> ":" <PAPERID>
<ISSUEURI> ::= <PREFIX> ":" <ISSUEID>
<TITLEMETSURI> ::= <PREFIX> ":td:" <PAPERID>
<ISSUEMETSURI> ::= <PREFIX> ":td:" <ISSUEID>
<TITLEMODSURI> ::= <PREFIX> ":dmd:" <PAPERID>
<ISSUEMODSURI> ::= <PREFIX> ":dmd:" <ISSUEID>
This syntax is explained below.
PUL adopts the Uniform Resource Name (URN) conventions to compose unique identifiers for titles and issues, as well as for the metadata records (METS and MODS) used to encode information about them.
A Newspaper Title is the paper as a whole, as opposed to discrete volumes or issues – the entire run of the newspaper. PUL assigns an acronym for each paper in its digital collection.
The paperid for the Town Topics newspaper is TownTopics.
Each newspaper issue will be assigned an issue identifier or issueid of the form paperid_issuanceString, where issuanceString corresponds to the date of issuance and takes the form CCYY-MM-DD_II, defined as follows:
- CCYY
- A four-digit number representing the year of publication (e.g., 1912)
- MM
- A two-digit number representing the month of publication, where January = 01, February = 02, etc.
- DD
- A two-digit number representing the day of publication (e.g., 01, 02 .. 30, 31).
- II
- A two-digit index of daily issuance (e.g., the first issue of the day is 01, the second is 02, and so on). This convention is adopted from the issuance of newspapers, which not infrequently issued a morning edition and an evening edition on the same day.
Issueids, like ISO 8601 dates, are organized from most significant to least significant chronological unit (i.e., from year to day). This format has two advantages: it allows ids to be sorted naturally, and it enables variable precision: the representation of daily, monthly, or yearly issuance.
- If you know the year, month, and day of publication (e.g., you know that the issue was published on January 5th, 1912): the issuance string is 1912-01-05_01.
- If you have two issues published on the same day (e.g., an issue and a special supplement were both issued on January 5th, 1912): the issuance strings are 1912-01-05_01 and 1912-01-05_02.
- If you know only the year and month of publication (e.g., you know it was published in January, 1912, but you do not know on what day): the issuance string is 1912-01_01.
- If you know that the newspaper published two issues monthly, but you do not know the dates of publication (e.g., you have two issues published in January, 1912): the issuance strings are 1912-01_01 and 1912-01_02.
- If you know only the year of publication (e.g., you know that the issue was published in 1912, but you do not know the month or, therefore, the day of publication): the issuance string is 1912_01.
- If you have several issues published in the same year, but you know neither the month nor the day of publication (e.g., you know the newspaper published two issues in 1912, but you do not know the months or days of publication): the issuance strings are 1912_01 and 1912_02.
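The cases above can be sketched in Python. This is only an illustration of the rules, not part of any PUL tooling; the function names issuance_string and issue_id are hypothetical.

```python
from typing import Optional


def issuance_string(year: int, month: Optional[int] = None,
                    day: Optional[int] = None, index: int = 1) -> str:
    """Build CCYY[-MM[-DD]]_II from a possibly partial date (hypothetical helper)."""
    if day is not None and month is None:
        raise ValueError("a day cannot be given without a month")
    if not 1 <= index <= 99:
        raise ValueError("the issuance index must fit in two digits")
    parts = [f"{year:04d}"]
    if month is not None:
        parts.append(f"{month:02d}")
    if day is not None:
        parts.append(f"{day:02d}")
    # Date components are joined with hyphens; the index follows an underscore.
    return "-".join(parts) + f"_{index:02d}"


def issue_id(paper_id: str, year: int, month: Optional[int] = None,
             day: Optional[int] = None, index: int = 1) -> str:
    """Build paperid_issuanceString (hypothetical helper)."""
    return f"{paper_id}_{issuance_string(year, month, day, index)}"


print(issue_id("TownTopics", 2010, 4, 7))
```

Omitting the day, or both the month and the day, yields the lower-precision forms described in the list.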
For the issue of Town Topics published on April 7, 2010:
identifier | URN |
---|---|
titleid | urn:PUL:newspapers:TownTopics |
title METS id | urn:PUL:newspapers:td:TownTopics |
title MODS id | urn:PUL:newspapers:dmd:TownTopics |
issueid | urn:PUL:newspapers:TownTopics_2010-04-07_01 |
issue METS id | urn:PUL:newspapers:td:TownTopics_2010-04-07_01 |
issue MODS id | urn:PUL:newspapers:dmd:TownTopics_2010-04-07_01 |
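As a rough sketch, the six identifiers can be derived from a paperid and an issueid by string concatenation alone. The function names and dictionary keys below are hypothetical; only the prefix and the td and dmd infixes come from the grammar above.

```python
PREFIX = "urn:PUL:newspapers"


def title_urns(paper_id: str) -> dict:
    """Title-level URNs for a paper (hypothetical helper)."""
    return {
        "title": f"{PREFIX}:{paper_id}",
        "title_mets": f"{PREFIX}:td:{paper_id}",
        "title_mods": f"{PREFIX}:dmd:{paper_id}",
    }


def issue_urns(issue_id: str) -> dict:
    """Issue-level URNs for an issue (hypothetical helper)."""
    return {
        "issue": f"{PREFIX}:{issue_id}",
        "issue_mets": f"{PREFIX}:td:{issue_id}",
        "issue_mods": f"{PREFIX}:dmd:{issue_id}",
    }


for name, urn in issue_urns("TownTopics_2010-04-07_01").items():
    print(name, urn)
```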
Names of Blue Mountain files will be constructed using the naming convention described above.
<EXTENSION> ::= "tif" | "jp2"
<IMGINDEX> ::= <0-9><0-9><0-9>
<FILENAME> ::= <ISSUEID> "_" <IMGINDEX> "." <EXTENSION>
Image files shall be named issueid_nnn.jp2 or issueid_nnn.tif, where
- issueid is the identifier of the issue;
- nnn is a three-digit number indicating the location of the image file in the sequence of image files (not necessarily the number printed on the page that has been photographed);
- jp2 is the conventional file extension for JPEG2000 files.
- tif is the conventional file extension for TIFF files.
For example,
TownTopics_2010-04-07_01_001.jp2 TownTopics_2010-04-07_01_002.jp2 ...
<EXTENSION> ::= "alto.xml"
<IMGINDEX> ::= <0-9><0-9><0-9>
<FILENAME> ::= <ISSUEID> "_" <IMGINDEX> "." <EXTENSION>
ALTO files shall be named issueid_nnn.alto.xml, where
- issueid is the identifier of the issue
- nnn is a three-digit number corresponding to the sequence number of the image file to which this ALTO file corresponds
- alto indicates the schema used to encode the document
- xml indicates the format of the file.
For example,
TownTopics_2010-07-01_01_001.alto.xml TownTopics_2010-07-01_01_002.alto.xml ...
<EXTENSION> ::= "mets.xml"
<FILENAME> ::= <ISSUEID> "." <EXTENSION>
METS files shall be named issueid.mets.xml, where
- issueid is the identifier of the issue
- mets indicates the schema used to encode the document
- xml indicates the format of the file.
For example,
TownTopics_2010-07-01_01.mets.xml
<EXTENSION> ::= "pdf"
<FILENAME> ::= <ISSUEID> "." <EXTENSION>
PDF files shall be named issueid.pdf, where
- issueid is the identifier of the issue
- pdf indicates the format of the file.
For example,
TownTopics_2010-07-01_01.pdf
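The file-name grammars above can be checked mechanically. The sketch below combines them into one regular expression; the pattern, the function name parse_file_name, and the assumption that a paperid consists only of ASCII letters are illustrative choices, not part of the specification.

```python
import re

# One pattern covering page-level files (tif, jp2, alto.xml) and
# issue-level files (mets.xml, pdf). Assumes the paperid is ASCII letters only.
FILE_NAME = re.compile(
    r"(?P<issueid>[A-Za-z]+_\d{4}(?:-\d{2}){0,2}_\d{2})"
    r"(?:_(?P<page>\d{3})\.(?P<page_ext>tif|jp2|alto\.xml)"
    r"|\.(?P<issue_ext>mets\.xml|pdf))"
)


def parse_file_name(name: str):
    """Return the parts of a conforming file name, or None if it does not conform."""
    m = FILE_NAME.fullmatch(name)
    if m is None:
        return None
    page = int(m.group("page")) if m.group("page") else None
    extension = m.group("page_ext") or m.group("issue_ext")
    return {"issueid": m.group("issueid"), "page": page, "extension": extension}


print(parse_file_name("TownTopics_2010-04-07_01_001.jp2"))
print(parse_file_name("TownTopics_2010-07-01_01.mets.xml"))
```

A name with a two-digit image index, such as TownTopics_2010-04-07_01_01.jp2, is rejected because the grammar requires three digits.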
For each issue, four types of files will be produced:
- An ALTO XML file for each page;
- A JPEG2000 image file for each page;
- A METS XML file for the entire issue;
- A PDF file for the entire issue
Specifications for each of these file types appear below; the file-naming conventions are as follows:
Where NNN is the left-padded sequential number of the page (e.g., 001, 002, etc.)
type | name |
---|---|
ALTO | basename_NNN.alto.xml |
jpeg2000 | basename_NNN.jp2 |
METS | basename.mets.xml |
PDF | basename.pdf |
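For illustration, the full set of expected file names for an issue can be generated from its issueid (the basename) and its page count. The function name expected_files is hypothetical.

```python
from typing import List


def expected_files(basename: str, page_count: int) -> List[str]:
    """List the file names expected for one issue (hypothetical helper)."""
    names = [f"{basename}.mets.xml", f"{basename}.pdf"]
    for n in range(1, page_count + 1):
        nnn = f"{n:03d}"  # left-padded sequential page number
        names.append(f"{basename}_{nnn}.alto.xml")
        names.append(f"{basename}_{nnn}.jp2")
    return names


for name in expected_files("TownTopics_2010-04-07_01", 2):
    print(name)
```

Comparing this list with the contents of a delivered directory is one simple way to spot missing pages.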
The files for the Town Topics shall be stored in a directory hierarchy whose structure is as follows:
- TownTopics/
  - CCYY/
    - MM/
      - DD_II/
Where CCYY, MM, and DD_II are as defined above. Thus the files for the April 7, 2010 issue would be arranged as follows:
- TownTopics/
  - 2010/
    - 04/
      - 07_01/
        - TownTopics_2010-04-07_01.mets.xml
        - TownTopics_2010-04-07_01.pdf
        - TownTopics_2010-04-07_01_001.alto.xml
        - TownTopics_2010-04-07_01_002.alto.xml
        - ...
        - TownTopics_2010-04-07_01_001.jp2
        - TownTopics_2010-04-07_01_002.jp2
        - ...
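The directory for an issue follows directly from its issueid. The sketch below assumes an issueid with a full CCYY-MM-DD date, since the hierarchy is only defined for that case; the function name issue_directory is hypothetical.

```python
from pathlib import PurePosixPath


def issue_directory(issue_id: str) -> PurePosixPath:
    """Map paperid_CCYY-MM-DD_II to paperid/CCYY/MM/DD_II (hypothetical helper)."""
    # Split from the right so the date and index are isolated from the paperid.
    paper_id, date_string, index = issue_id.rsplit("_", 2)
    date_parts = date_string.split("-")
    if len(date_parts) != 3:
        raise ValueError("this sketch only handles full CCYY-MM-DD dates")
    year, month, day = date_parts
    return PurePosixPath(paper_id, year, month, f"{day}_{index}")


directory = issue_directory("TownTopics_2010-04-07_01")
print(directory / "TownTopics_2010-04-07_01.mets.xml")
```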
The root element <mets> contains these attributes:
- TYPE
- the fixed value Newspaper
- OBJID
- the URN for the issue
- LABEL
- the issueid
The <metsHdr> element shall contain two elements:
An <agent> element, which is constant for all records:
<agent ROLE="CREATOR" TYPE="ORGANIZATION">
<name>Princeton University Library, Digital Initiatives</name>
</agent>
A <metsDocumentID> element, which contains a string composed as follows:
PREFIX:ISSUEID
Where PREFIX is the following fixed value:
urn:PUL:newspapers:td
And ISSUEID is the issue identifier, computed using the rules above. Thus for the April 7, 2010 issue:
<metsDocumentID TYPE="URN">urn:PUL:newspapers:td:TownTopics_2010-04-07_01</metsDocumentID>
The record contains a single <dmdSec> element with an ID attribute of “dmd01”; it contains an embedded MODS record for the issue (described below).
The <amdSec> contains a <techMD> element for each image file (a <mix> record).
There are five <amdSec>s in an issue-level METS file, one for the files in each of the following groups:
- Delivery-level jp2s
- Generative jp2s
- Preservation-level tifs
- low-resolution issue-level PDF
- high-resolution issue-level PDF
The fileSec comprises two <fileGrp> elements: one for the JP2 images and one for the ALTO records.
The Images file group contains <file> elements that indicate the location of each image file, with attributes linking the file to the corresponding technical metadata and to the corresponding ALTO file.
- ID
- a unique XML id
- AMDID
- the ID of the <techMD> element corresponding to the image file
- GROUPID
- an ID that links an image file to an ALTO file.
The image file for a page and the ALTO file containing the OCR output for that page share an id (conventionally named pageN, where N is a sequence number).
- MIMETYPE
- the constant “image/jp2” for jpeg2000 images
- CHECKSUM
- the checksum of the file, according to the algorithm specified in CHECKSUMTYPE
- CHECKSUMTYPE
- the algorithm used to compute the checksum; usually SHA-1.
- LOCTYPE
- the constant URL
- xlink:href
- the path to the file. For this project, it will be a local path. For example:
file://./TownTopics_1952-01-06_01_001.jp2
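As a minimal sketch, the CHECKSUM and CHECKSUMTYPE values for a file can be computed with Python's standard hashlib module. The function names are hypothetical, and SHA-1 is used because the text names it as the usual algorithm.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_attributes(path: str) -> dict:
    """Attribute values for a METS <file> element (hypothetical helper)."""
    return {"CHECKSUM": sha1_of_file(path), "CHECKSUMTYPE": "SHA-1"}
```

Reading in chunks keeps memory use flat, which matters for large image files.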
The PDF file group contains a single <file> element corresponding to the composite PDF for the issue.
The <structMap> element describes a hierarchical arrangement of the parts (<div>s) making up the digital object described by the METS. For this project, there are two kinds: a physical structMap, which delineates the pages of the newspaper issue in reading order, and a logical structMap, which functions as an outline of the newspaper’s contents. Both of these are assembled by docWorks, using configuration rules.
This <structMap> outlines the physical layout of pages in the following hierarchy of div types:
- Newspaper
- Page+
The root <div> of the physical structMap is <div TYPE="Newspaper">. It must contain one or more <div TYPE="PAGE"> elements.
Attributes of the Newspaper <div>:
- ID
- an xml id unique within the document
- TYPE
- must be “Newspaper”
- LABEL
- “Town Topics, {DATE}”, where {DATE} is the date of publication.
- DMDID
- The ID of the <dmdSec> for the issue.
Attributes of each PAGE <div>:
- ID
- an xml id unique within the document
- ORDER
- the index of the page in the containing div
- ORDERLABEL
- identical with the ORDER attribute
- LABEL
- also identical with the ORDER attribute
Each PAGE div contains an <fptr> element that associates the image file with the corresponding ALTO file via parallel <area> elements in a <par> element, like this:
<div ID="DIVP2" ORDER="1" ORDERLABEL="1" TYPE="PAGE">
<fptr>
<par>
<area FILEID="IMG00001"/>
<area FILEID="ALTO00001" BETYPE="IDREF" BEGIN="P1"/>
</par>
</fptr>
</div>
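In this project the structMaps are assembled by docWorks, so the following is only a sketch of the structure, built with Python's standard xml.etree.ElementTree. The function name page_div is hypothetical, the IDs are passed in because the ID scheme is not specified here, and the METS namespace is omitted for brevity. The LABEL attribute is included because the attribute list above says it is identical with ORDER.

```python
import xml.etree.ElementTree as ET


def page_div(order: int, div_id: str, image_file_id: str,
             alto_file_id: str, alto_page_id: str) -> ET.Element:
    """Build a PAGE <div> pairing an image file with its ALTO file (sketch)."""
    div = ET.Element("div", ID=div_id, ORDER=str(order),
                     ORDERLABEL=str(order), LABEL=str(order), TYPE="PAGE")
    fptr = ET.SubElement(div, "fptr")
    par = ET.SubElement(fptr, "par")
    # The two parallel areas point at the image and at the page inside the ALTO file.
    ET.SubElement(par, "area", FILEID=image_file_id)
    ET.SubElement(par, "area", FILEID=alto_file_id,
                  BETYPE="IDREF", BEGIN=alto_page_id)
    return div


div = page_div(1, "DIVP2", "IMG00001", "ALTO00001", "P1")
print(ET.tostring(div, encoding="unicode"))
```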
The outlines below show the hierarchical relationship among the <div> elements in the logical structMap.
- Newspaper
- Volume+
- Issue+
- Title_Section
- Headline
- Textblock+
- Content
- { Article* | Section* }
- Article
- Title_Section
- Issue+
- Heading
- Title
- Body
- Section
- Body
- Body
- Body_Content
- { Paragraph* | Advertisement* }
- Paragraph
- { Paragraph* | Advertisement* }
- TextBlock+
Princeton encodes descriptive metadata for the contents of each newspaper issue so that it may be searched and analyzed.
<mods>
<recordInfo>
<recordIdentifier>urn:PUL:newspapers:dmd:TownTopics_1952-01-06_01</recordIdentifier>
</recordInfo>
<identifier type="PUL">urn:PUL:newspapers:TownTopics_1952-01-06_01</identifier>
<titleInfo>
<title>Town Topics</title>
</titleInfo>
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
</language>
<part>
<detail type="volume">
<number>6</number>
<caption>Vol. 6</caption>
</detail>
<detail type="issue">
<number>43</number>
<caption>No. 43</caption>
</detail>
</part>
<originInfo>
<dateIssued>Jan. 6-12, 1952</dateIssued>
<dateIssued keyDate="yes" encoding="iso8601">1952-01-06</dateIssued>
</originInfo>
<relatedItem type="host" xlink:href="urn:PUL:newspapers:TownTopics">
<recordInfo>
<recordIdentifier>urn:PUL:newspapers:dmd:TownTopics</recordIdentifier>
</recordInfo>
</relatedItem>
<relatedItem type="constituent" ID="c01">
<titleInfo>
<title>WE NOMINATE</title>
</titleInfo>
<typeOfResource>text</typeOfResource>
<part>
<extent unit="pages">
<list>1</list>
</extent>
</part>
</relatedItem>
<!-- The remaining constituents go here. -->
</mods>
The MODS elements are detailed below.
The root element of the document. When output as a stand-alone document, it has fixed attributes, as illustrated below:
<mods xmlns="http://www.loc.gov/mods/v3"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/mods/v3
http://www.loc.gov/mods/v3/mods-3-4.xsd">
The <recordInfo> element contains information about the MODS record itself. It shall contain a <recordIdentifier> element, as below.
The <recordIdentifier> element contains the ISSUEMODSURI, the unique identifier for the MODS record itself, as described above.
The <identifier> element is used to identify the resource the MODS record describes (the newspaper issue). Its value is the resource’s ISSUEURI, as described above.
<identifier type="PUL">urn:PUL:newspapers:TownTopics_1952-01-06_01</identifier>
Each issue-level MODS record is related to the title-level record via a <relatedItem type="host"> element.
<relatedItem type="host" xlink:type="simple" xlink:href="urn:PUL:newspapers:TownTopics">
<recordInfo>
<recordIdentifier>urn:PUL:newspapers:dmd:TownTopics</recordIdentifier>
</recordInfo>
</relatedItem>
where TownTopics is the paperid of the title.
The xlink:href shows the semantic relation between the issue and its host; the <recordIdentifier> is a specific key to the title-level record.
The <language> element(s) indicates the language of the resource (the newspaper issue). If the issue contains material written in several languages, the record should include a <language> element for each one. The value of the <languageTerm> element must be drawn from iso639-2b. For example:
<language>
<languageTerm type="code" authority="iso639-2b">eng</languageTerm>
</language>
The <titleInfo> element shall be determined by standard cataloging rules.
<titleInfo>
<nonSort>Le</nonSort>
<title>coeur à barbe</title>
<subTitle>journal transparent</subTitle>
</titleInfo>
The <part> element shall take the following form:
<part>
<detail type="volume">...</detail>
<detail type="issue">...</detail>
</part>
<detail type="volume">
<number>ARABICVOL</number>
<caption>Vol. MASTHEADVOL</caption>
</detail>
Where
- ARABICVOL is the volume number expressed as a non-formatted arabic numeral (e.g., 1, 2, 3, … 10, 11, …)
- MASTHEADVOL is the volume number as it appears in the masthead.
The <detail type="issue"> element shall take one of two possible forms:
- For “normal” issues (i.e., those following the recorded sequence of publication), record both the sequential number of the issue as an arabic numeral and the issue number as it appears in the masthead:
<detail type="issue">
<number>ARABICISSUE</number>
<caption>No. MASTHEADISSUE</caption>
</detail>
Where
- ARABICISSUE is the issue number expressed as a non-formatted arabic numeral (e.g., 1, 2, 3, …, 10, 11, …)
- MASTHEADISSUE is the issue number as it appears in the masthead.
- For “special” issues (e.g., supplements), for which there is no sequential number for the issue, the <detail type="issue"> element should take the following form:
<detail type="issue">
<caption>CAPTIONTEXT</caption>
</detail>
Where CAPTIONTEXT is determined using standard cataloging rules.
The <originInfo> element shall be used to record the date of issuance, as follows:
<originInfo>
<dateIssued>PRINTEDDATE</dateIssued>
<dateIssued encoding="iso8601" keyDate="yes">ISODATE</dateIssued>
</originInfo>
Where
- PRINTEDDATE is the date as it appears in the cover page FolioLine, or in the Masthead.
- ISODATE is the value of the date in the masthead, expressed in iso8601 format (YYYY-MM-DD) – see http://www.w3.org/TR/NOTE-datetime for details.
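The pairing of a printed date with a machine-readable key date can be sketched with Python's standard library. The function name origin_info is hypothetical, and the MODS namespace is omitted for brevity. Using datetime.date guarantees that the second <dateIssued> is a valid calendar date in YYYY-MM-DD form.

```python
import xml.etree.ElementTree as ET
from datetime import date


def origin_info(printed_date: str, issued: date) -> ET.Element:
    """Build an <originInfo> element with printed and ISO 8601 dates (sketch)."""
    origin = ET.Element("originInfo")
    # First <dateIssued>: the date exactly as printed on the issue.
    ET.SubElement(origin, "dateIssued").text = printed_date
    # Second <dateIssued>: the normalized key date.
    key_date = ET.SubElement(origin, "dateIssued",
                             encoding="iso8601", keyDate="yes")
    key_date.text = issued.isoformat()
    return origin


element = origin_info("Jan. 6-12, 1952", date(1952, 1, 6))
print(ET.tostring(element, encoding="unicode"))
```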
For each page, an encoded representation of the layout and the machine-readable text on the page shall be provided, using the ALTO schema, version 2.0 or higher, with the following specifications, adopted from the NDNP:
- The text shall be encoded in the natural reading order of the language in which the text is written;
- Point size and font data to at least the word level shall be included;
- The ALTO file shall include bounding-box coordinates to at least the word level;
- Non-rectangular blocks shall not be used. Some illustrations may format as “tight” in the document.
To generate a JPEG2000 file using Kakadu, use the following recipe (taken from The National Digital Newspaper Program (NDNP) Technical Guidelines for Applicants):
kdu_compress -i YOURINPUT.tif -o YOUROUTPUT.jp2 -rate 1,0.84,0.7,0.6,0.5,0.4,0.35,0.3,0.25,0.21,0.18,0.15,0.125,0.1,0.088,0.075,0.0625,0.05,0.04419,0.03716,0.03125,0.025,0.0221,0.01858,0.015625 Clevels=6 Stiles={1024,1024} Corder=RLCP
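For batch conversion, the recipe can be wrapped in a small Python function. This is a sketch that assumes kdu_compress is installed and on the PATH; the function names are hypothetical, and the rate list is copied from the recipe above (reading the sixteenth value as 0.075).

```python
import subprocess
from typing import List

# Rate list copied from the NDNP recipe quoted above.
NDNP_RATES = (
    "1,0.84,0.7,0.6,0.5,0.4,0.35,0.3,0.25,0.21,0.18,0.15,0.125,0.1,0.088,"
    "0.075,0.0625,0.05,0.04419,0.03716,0.03125,0.025,0.0221,0.01858,0.015625"
)


def kdu_compress_args(tif_path: str, jp2_path: str) -> List[str]:
    """Build the argument list for the recipe (hypothetical helper)."""
    return [
        "kdu_compress",
        "-i", tif_path,
        "-o", jp2_path,
        "-rate", NDNP_RATES,
        "Clevels=6",
        "Stiles={1024,1024}",
        "Corder=RLCP",
    ]


def make_jp2(tif_path: str, jp2_path: str) -> None:
    """Run kdu_compress; raises CalledProcessError if it exits with an error."""
    subprocess.run(kdu_compress_args(tif_path, jp2_path), check=True)
```

Passing the arguments as a list bypasses the shell, so the braces in Stiles={1024,1024} reach kdu_compress literally rather than being subject to brace expansion.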