Output Format
The format of the Zotero output is constrained by what Voyant is able to import. Performing the full-text extraction ourselves (as in Paper Machines) would complicate matters considerably, so it should be avoided if possible.
Currently (2017-04-08), it appears that the only Voyant archive format that supports keeping metadata separate from the actual documents is BagIt, which is handled by the BagIt extractor. This extractor is presently intended to process output from the Canadian Writing Research Collaboratory, which seems to be using the Islandora BagIt module.
The example file "BagIt-Multiple-Documents.zip" serves as a model of this format.
The initial detection that the zip file is in BagIt format happens in ArchiveExpander.java. It expects a bagit.txt, bag-info.txt, or CWRC.bin file to be present but does not otherwise check that the zip is a valid bag.
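The detection rule described above can be approximated as follows. This is a minimal Python sketch, not a port of ArchiveExpander.java: matching on an entry's base name anywhere in the archive is an assumption, and looks_like_bag and make_zip are hypothetical helper names.

```python
import io
import zipfile

# Marker file names are taken from the text above. Matching on the base name
# anywhere in the archive is an assumption, not Voyant's confirmed logic.
BAG_MARKERS = {"bagit.txt", "bag-info.txt", "CWRC.bin"}


def looks_like_bag(zip_bytes: bytes) -> bool:
    """Return True if any entry's base name is one of the marker files."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return any(
            name.rsplit("/", 1)[-1] in BAG_MARKERS
            for name in archive.namelist()
        )


def make_zip(names):
    """Build an in-memory zip with empty entries, for demonstration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "")
    return buffer.getvalue()
```

Because only the presence of a marker file is tested, a zip that passes this check is not necessarily a valid bag, which matches the behaviour described above.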
The BagItExtractor expects three files in each subfolder: MODS.bin, DC.xml, and CWRC.bin. If any are missing, the folder will be ignored. The minimal file structure looks like this:
BagIt-Multiple-Documents/
├── bagit.txt
└── data
    ├── doc1
    │   ├── CWRC.bin
    │   ├── DC.xml
    │   └── MODS.bin
    └── doc2
        ├── CWRC.bin
        ├── DC.xml
        └── MODS.bin
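Since a folder missing any of the three files is ignored, it may be worth checking an export before handing it to Voyant. The sketch below is in Python and works on a plain list of zip entry names; complete_document_folders is a hypothetical helper, and grouping entries by their parent folder is an assumption about how the extractor walks the archive.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# The three files the text above says each document folder must contain.
REQUIRED_FILES = {"MODS.bin", "DC.xml", "CWRC.bin"}


def complete_document_folders(entry_names):
    """Return the sorted folders that contain all three required files."""
    files_by_folder = defaultdict(set)
    for name in entry_names:
        path = PurePosixPath(name)
        files_by_folder[str(path.parent)].add(path.name)
    return sorted(
        folder
        for folder, files in files_by_folder.items()
        if REQUIRED_FILES <= files
    )


entries = [
    "BagIt-Multiple-Documents/bagit.txt",
    "BagIt-Multiple-Documents/data/doc1/CWRC.bin",
    "BagIt-Multiple-Documents/data/doc1/DC.xml",
    "BagIt-Multiple-Documents/data/doc1/MODS.bin",
    "BagIt-Multiple-Documents/data/doc2/CWRC.bin",
    "BagIt-Multiple-Documents/data/doc2/DC.xml",  # doc2 lacks MODS.bin
]
print(complete_document_folders(entries))
```

In this example only doc1 is reported, because doc2 has no MODS.bin.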
Here are the details the extractor gets from each file:
- MODS.bin
  - title
  - author
- DC.xml
  - CWRC identifier (not needed)
- CWRC.bin
  - text content (ideally TEI, but it can be other XML or anything Tika can read, such as PDF or DOCX)
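To illustrate the MODS part of the list above, here is a minimal Python sketch that pulls a title and an author out of a MODS record. The element paths (titleInfo/title and name/namePart) are assumptions based on the sample file; the exact paths BagItExtractor queries are not stated here, and read_mods_metadata is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# MODS v3 namespace, as declared in the sample MODS.bin file.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}


def read_mods_metadata(mods_xml: str) -> dict:
    """Return the first title and first name part found in a MODS record."""
    root = ET.fromstring(mods_xml)
    title = root.findtext(
        "mods:titleInfo/mods:title", default="", namespaces=MODS_NS
    )
    author = root.findtext(
        "mods:name/mods:namePart", default="", namespaces=MODS_NS
    )
    return {"title": title.strip(), "author": author.strip()}


sample = (
    '<mods xmlns="http://www.loc.gov/mods/v3">'
    "<titleInfo><title>Further Chronicles of Avonlea</title></titleInfo>"
    '<name type="personal"><namePart>Lucy Maud Montgomery</namePart></name>'
    "</mods>"
)
print(read_mods_metadata(sample))
```

Missing elements come back as empty strings rather than raising an error, which keeps the check usable on incomplete records.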
This section offers an example of each type of required file and its contents (plus any relevant notes).
MODS.bin
<?xml version="1.0" encoding="UTF-8"?>
<mods xmlns="http://www.loc.gov/mods/v3" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/mods.xsd">
<titleInfo>
<title>Further Chronicles of Avonlea</title>
</titleInfo>
<name type="personal">
<namePart>Lucy Maud Montgomery</namePart>
</name>
</mods>
DC.xml
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:identifier>[zotero key here?]</dc:identifier>
</oai_dc:dc>
CWRC.bin
[Originally TEI. If the file is not XML, it is processed generically, so it can be a PDF, plain text, etc.]
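Putting the three files together, an export could be assembled roughly as follows. This is a Python sketch under several assumptions: the XML templates are trimmed versions of the samples and have not been tested against Voyant, using the Zotero key as both the folder name and the DC identifier is only a guess, and build_bag and the dictionary fields are hypothetical names.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# Trimmed versions of the sample files; untested against Voyant.
MODS_TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<mods xmlns="http://www.loc.gov/mods/v3">\n'
    "<titleInfo><title>{title}</title></titleInfo>\n"
    '<name type="personal"><namePart>{author}</namePart></name>\n'
    "</mods>\n"
)
DC_TEMPLATE = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
    "<dc:identifier>{identifier}</dc:identifier>\n"
    "</oai_dc:dc>\n"
)
# Standard two-line bag declaration; the version number is an assumption.
BAGIT_TXT = "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"


def build_bag(documents, bag_name="export"):
    """Return zip bytes with one data/<key>/ folder per document."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(f"{bag_name}/bagit.txt", BAGIT_TXT)
        for doc in documents:
            folder = f"{bag_name}/data/{doc['key']}"
            mods = MODS_TEMPLATE.format(
                title=escape(doc["title"]), author=escape(doc["author"])
            )
            dc = DC_TEMPLATE.format(identifier=escape(doc["key"]))
            archive.writestr(f"{folder}/MODS.bin", mods)
            archive.writestr(f"{folder}/DC.xml", dc)
            # Content is written as-is: TEI, other XML, PDF bytes, plain text.
            archive.writestr(f"{folder}/CWRC.bin", doc["content"])
    return buffer.getvalue()


bag = build_bag([
    {"key": "ABCD1234", "title": "Example", "author": "A. Author",
     "content": "Plain text body."},
])
```

Titles and authors are XML-escaped before being substituted, so characters such as & or < in Zotero metadata do not produce malformed files.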