This repository has been archived by the owner on Dec 20, 2019. It is now read-only.

Output Format

Cora Johnson-Roberson edited this page Apr 9, 2017 · 4 revisions

The format of the Zotero output is constrained by what Voyant is able to import. Performing the full-text extraction ourselves (as in Paper Machines) would complicate matters considerably, so it should be avoided if possible.

As of 2017-04-08, the only Voyant archive format that appears to support metadata stored separately from the documents themselves is BagIt, which is handled by the BagIt extractor. This extractor is presently intended to process output from the Canadian Writing Research Collaboratory (CWRC), which seems to be using the Islandora BagIt module.

The example file "BagIt-Multiple-Documents.zip" serves as a model of this format.

The initial check that a zip file is in BagIt format happens in ArchiveExpander.java. It looks for a bagit.txt, bag-info.txt, or CWRC.bin file but does not otherwise verify that the zip is a valid bag.
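To make the detection rule concrete, here is a minimal Python sketch of the same kind of check. It is not Voyant's code: the function name `looks_like_bagit` is hypothetical, and matching the marker names on the base name anywhere in the archive is an assumption, since this page does not say how ArchiveExpander.java compares paths.

```python
import zipfile
from pathlib import PurePosixPath

# Marker names taken from the description above. Matching on the base name
# anywhere in the archive is an assumption of this sketch.
BAGIT_MARKERS = {"bagit.txt", "bag-info.txt", "CWRC.bin"}


def looks_like_bagit(zip_path):
    """Return True if any entry in the zip has one of the marker file names."""
    with zipfile.ZipFile(zip_path) as archive:
        return any(
            PurePosixPath(name).name in BAGIT_MARKERS
            for name in archive.namelist()
        )
```

Like the check described above, this says nothing about whether the archive is a valid bag. It only reports that one of the marker files is present.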

The BagItExtractor expects three files in each subfolder: MODS.bin, DC.xml, and CWRC.bin. If any of them is missing, the folder is ignored. The minimal required file structure looks like this:

BagIt-Multiple-Documents/
├── bagit.txt
└── data
    ├── doc1
    │   ├── CWRC.bin
    │   ├── DC.xml
    │   └── MODS.bin
    └── doc2
        ├── CWRC.bin
        ├── DC.xml
        └── MODS.bin
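
Because a folder that lacks any of the three files is silently ignored, it can help to check an archive before handing it to Voyant. The sketch below is one way to do that in Python. The function name `incomplete_folders` is hypothetical, and treating every folder directly under a directory named data as a document folder is an assumption based on the layout above.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

REQUIRED_FILES = {"MODS.bin", "DC.xml", "CWRC.bin"}


def incomplete_folders(zip_path):
    """Map each document folder under data/ to the required files it lacks."""
    found = defaultdict(set)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            path = PurePosixPath(name)
            # Keep only files that sit directly inside <...>/data/<folder>/.
            if len(path.parts) >= 3 and path.parts[-3] == "data":
                found[path.parent.as_posix()].add(path.name)
    return {
        folder: sorted(REQUIRED_FILES - names)
        for folder, names in found.items()
        if REQUIRED_FILES - names
    }
```

An empty result means every document folder that contains at least one file has all three required files. Folders with no files at all do not appear in the result.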

Here are the details the extractor reads from each file:

  • MODS.bin
    • title
    • author
  • DC.xml
    • CWRC identifier (not needed)
  • CWRC.bin
    • text content (ideally TEI, but it can be other XML or anything else Tika can read, such as PDF or DOCX)

Example Content

This section offers an example of each type of required file and its contents (plus any relevant notes).

MODS.bin

<?xml version="1.0" encoding="UTF-8"?>
<mods xmlns="http://www.loc.gov/mods/v3" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/mods.xsd">
  <titleInfo>
    <title>Further Chronicles of Avonlea</title>
  </titleInfo>
  <name type="personal">
    <namePart>Lucy Maud Montgomery</namePart>
  </name>
</mods>
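
A record like this could be generated rather than written by hand. The following Python sketch builds the same titleInfo/title and name/namePart elements with the standard library. The function name `build_mods` is hypothetical, and the sketch omits the xsi:schemaLocation and extra namespace declarations from the example, on the assumption (not verified against Voyant) that the extractor only reads the title and author elements.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def build_mods(title, author):
    """Return a minimal MODS record as UTF-8 bytes, suitable for MODS.bin."""
    # Serialize MODS as the default namespace, as in the example above.
    ET.register_namespace("", MODS_NS)
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    name = ET.SubElement(mods, f"{{{MODS_NS}}}name", {"type": "personal"})
    ET.SubElement(name, f"{{{MODS_NS}}}namePart").text = author
    # ElementTree escapes characters such as & and < in the text values.
    return ET.tostring(mods, encoding="UTF-8", xml_declaration=True)
```

Calling `build_mods("Further Chronicles of Avonlea", "Lucy Maud Montgomery")` yields bytes that can be written directly to a MODS.bin file.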

DC.xml

<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:identifier>[zotero key here?]</dc:identifier>
</oai_dc:dc>

CWRC.bin

[originally TEI but if not XML, this file is processed generically; can be a PDF, plain text, etc.]
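
Putting the three files together, the minimal archive layout described on this page could be written with a short Python script. This is a sketch under stated assumptions, not the plugin's actual export code: the function name `write_bag` and the shape of the `documents` argument are hypothetical, and BagIt version 0.97 is an arbitrary choice for the bagit.txt declaration. No manifest files are written, because the detection step described earlier does not check that the bag is valid.

```python
import zipfile


def write_bag(zip_path, bag_name, documents):
    """Write a zip laid out like the minimal structure described on this page.

    documents maps a folder name to a dict with the keys "mods", "dc" and
    "content", holding the data for MODS.bin, DC.xml and CWRC.bin.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Bag declaration; the version number is an assumption of this sketch.
        archive.writestr(
            f"{bag_name}/bagit.txt",
            "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        )
        for folder, files in documents.items():
            base = f"{bag_name}/data/{folder}"
            archive.writestr(f"{base}/MODS.bin", files["mods"])
            archive.writestr(f"{base}/DC.xml", files["dc"])
            archive.writestr(f"{base}/CWRC.bin", files["content"])
```

Each value may be `bytes` or `str`, so CWRC.bin can hold a PDF's raw bytes just as easily as TEI or plain text.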
