This repository has been archived by the owner on Jun 27, 2020. It is now read-only.

Batch Ingest Manifest Files (2013 Redesign)

Jim Coble edited this page Apr 14, 2015 · 1 revision

The Hydra-based batch ingest process uses YAML-format "manifest" files to enumerate the objects to be ingested as well as to provide various other pieces of information needed for the ingest process. In a typical, relatively simple case (e.g., the Vica collection), there might be three manifest files -- one for the collection, one for the items, and one for the components (e.g., images).

For convenience, we use the following directive at the top of each manifest file so that we can refer to the "keys" using either strings or symbols. Without it, keys must be given as strings.

--- !map:HashWithIndifferentAccess
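To illustrate why the directive is convenient, here is a minimal sketch of what a manifest looks like when loaded as a plain Ruby Hash. It uses only the standard `yaml` library and deliberately leaves the directive out, so it shows generic Ruby behaviour rather than the ingest code itself; the sample values are taken from the examples on this page.

```ruby
require "yaml"

# A two-line manifest fragment, loaded WITHOUT the indifferent-access directive.
manifest = YAML.safe_load(<<~YAML)
  basepath: /srv/fedora-working/ingest/ABC/collection/
  model: Collection
YAML

# Keys come back as strings, so only string lookups succeed.
string_lookup = manifest["model"]   # finds the value
symbol_lookup = manifest[:model]    # nil on a plain Hash

# One generic workaround (not necessarily what the ingest code does):
# convert the top-level keys to symbols explicitly.
symbolized = manifest.transform_keys(&:to_sym)
```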

The batch ingest process assumes, for the most part, a standard directory structure that holds both the input files used by the process and the files the process creates. Alternate locations and/or file names can be specified for certain categories of files. In that case, paths that begin with a slash ('/') are treated as absolute, and all other paths are treated as relative to the directory from which the ingest process is run (typically, the root of the Rails application).
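The absolute-versus-relative rule can be sketched as follows. The method name `resolve_ingest_path` and the sample paths are hypothetical, chosen only for illustration; the actual ingest code may resolve paths differently.

```ruby
# Hypothetical helper: applies the "leading slash means absolute" rule.
# run_dir stands in for the directory the ingest process is run from
# (typically the root of the Rails application).
def resolve_ingest_path(path, run_dir = Dir.pwd)
  if path.start_with?("/")
    path                      # absolute: use as given
  else
    File.join(run_dir, path)  # relative: anchor at the run directory
  end
end
```

For example, `resolve_ingest_path("ingest/export.xml", "/opt/app")` gives `/opt/app/ingest/export.xml`, while a path that already starts with `/` is returned unchanged.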

Some of the elements noted below can occur either at the "manifest" (i.e., top) level or at the individual object level. Typically, object-level elements override those at the manifest level.
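As a rough sketch of the override behaviour, manifest-level values can be treated as defaults that an object's own values replace. The method name `effective_settings` is hypothetical, and the list of overridable keys is simply the set of elements marked as overridable on this page; the real ingest code may combine the two levels differently.

```ruby
# Elements this page marks as "can be overridden at the object level".
OVERRIDABLE = %w[model adminpolicy label qdcsource parentid content].freeze

# Hypothetical helper: start from the manifest-level defaults and let
# any value given on the object itself win.
def effective_settings(manifest, object)
  manifest.slice(*OVERRIDABLE).merge(object)
end

manifest = { "basepath" => "/srv/fedora-working/ingest/ABC/collection/",
             "model" => "Collection", "label" => "Example Collection" }
object   = { "identifier" => ["test1"], "label" => "Test Object" }

settings = effective_settings(manifest, object)
# settings["label"] comes from the object, settings["model"] from the manifest.
```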

Manifest-Level Elements

basepath: path to the top of the directory structure pertinent to the ingest (e.g., /srv/fedora-working/ingest/ABC/collection/)

model: model class name pertinent to the objects being ingested (e.g., Collection) [can be overridden at the object level]

adminpolicy: pid of the Administrative Policy Object pertinent to the objects being ingested (e.g., duke-apo:abcd) [can be overridden at the object level]

label: label to be used for the object upon ingest (e.g., Example Collection) [can be overridden at the object level]

qdcsource: source metadata to be used for generating Qualified Dublin Core metadata for the descMetadata datastream if the batch ingest process needs to generate this metadata; possible values are contentdm, digitizationguide, and marcxml [can be overridden at the object level]

parentid: if the objects being ingested have a parent object, the key identifier (not pid) of the parent object (e.g., examplecoll) [can be overridden at the object level] [Note: support for using a pid rather than the key identifier for this element has been partially coded but is not complete] [Note: if both parentid and autoparentidlength are specified, the explicit parentid takes precedence]

autoparentidlength: if the objects being ingested have a parent object and the key identifier of that parent object can be determined algorithmically based on the first n characters of the (child) object's key identifier, the value of n (e.g., 7)
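The interplay between parentid and autoparentidlength can be sketched like this. The method name `parent_identifier` and the sample identifier `exm0001003` are hypothetical; only the precedence rule and the "first n characters" rule come from the descriptions above.

```ruby
# Hypothetical helper: work out the parent's key identifier for one object.
def parent_identifier(key_identifier, parentid: nil, autoparentidlength: nil)
  # An explicit parentid always takes precedence.
  return parentid if parentid
  # Otherwise derive it from the first n characters of the child's key identifier.
  return key_identifier[0, autoparentidlength] if autoparentidlength
  nil # no parent information available
end
```

With `autoparentidlength: 7`, the child `exm0001003` would be assigned the parent `exm0001`; adding `parentid: "examplecoll"` would make `examplecoll` the parent instead.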

split: a list designating composite files (multiple objects in one file) that need to be split into individual files (one object per file). Each list entry can contain the elements below:

  • type: indicates the type of metadata file that is being split (e.g., contentdm); given the way in which this value is used in the ingest process, it should match the name of one of the subdirectories of the basepath (e.g., contentdm, digitizationguide, marcxml)
  • source: the composite file to be split, either the absolute path to the file (including file name) or just the file name, in which case it is interpreted as being relative to a directory named with the type value underneath the basepath (e.g., export.xml)
  • xpath: xpath expression indicating how to locate the data for an individual object in the composite file (e.g., /metadata/record)
  • idelement: name of the element within the individual object data that can be used as the basis for the file name of the individual file being created (e.g., localid)
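The split elements above might be applied roughly as in the sketch below, which uses Ruby's REXML library. The method names, the `.xml` suffix on the generated file names, and the sample XML are all assumptions made for illustration; this is not the ingest process's actual splitting code.

```ruby
require "rexml/document"

# Hypothetical helper: locate the composite file. A bare file name is looked
# up in the directory named after the type, underneath the basepath.
def split_source_path(basepath, type, source)
  source.start_with?("/") ? source : File.join(basepath, type, source)
end

# Hypothetical helper: return a hash of { file name => XML for one object }.
def split_composite(xml, xpath, idelement)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath).each_with_object({}) do |record, files|
    # The idelement's text becomes the basis of the individual file name.
    id = record.elements[idelement].text.strip
    files["#{id}.xml"] = record.to_s
  end
end

composite = "<metadata>" \
            "<record><localid>abc001</localid><title>One</title></record>" \
            "<record><localid>abc002</localid><title>Two</title></record>" \
            "</metadata>"
files = split_composite(composite, "/metadata/record", "localid")
```

Returning a hash instead of writing files keeps the sketch free of side effects; a real splitter would write each entry to disk.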

metadata: list of metadata types to be added as datastreams to the object; possible values are contentdm, digitizationguide, dpcmetadata, fmpexport, jhove, marcxml, and tripodmets [can be added to at the object level]

content: presence of this element indicates that a digital content file is available for ingest into the 'content' datastream of the object [can be overridden at the object level]

  • extension: file name extension (including '.') of digital content file (e.g., .tif)
  • location: directory containing digital content file (e.g., /nas/CIFS2/Archived/na_EXM/)

checksum: presence of this element indicates that external checksum data exists for the digital content file associated with the ingested object

  • location: name of the file containing the external checksum data, assumed to be relative to the 'checksum' directory of the basepath (e.g., sha_256_checksums.xml)
  • source: string indicating the source of the external checksum data; current value is dpc

contentstructure: presence of this element indicates that structural metadata regarding content (METS fileSec and structMap elements) can be added to a 'contentMetadata' datastream for the object; this element is currently used only for "afmodel:Item" objects and refers to structural metadata regarding the "afmodel:Component" objects that are part of the object

  • type: indicates the source type of the content structural metadata; current value is generate, meaning that the structural metadata is generated by the ingest process
  • sequencestart: the index position in the content identifier that marks the start of a numeric string that can be used to sequence the content (e.g., 7)
  • sequencelength: the length of the numeric string in the content identifier that can be used to sequence the content (e.g., 3)
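To show how sequencestart and sequencelength could be used to order components, here is a small sketch. It assumes the index position is zero-based, which this page does not state, and the method name and sample identifiers are hypothetical.

```ruby
# Hypothetical helper: pull the sequencing number out of a content identifier.
# ASSUMPTION: sequencestart is a zero-based character index.
def sequence_number(identifier, sequencestart, sequencelength)
  # Base 10 is given explicitly so leading zeros (e.g. "008") parse cleanly.
  Integer(identifier[sequencestart, sequencelength], 10)
end

component_ids = %w[exm0001012 exm0001003 exm0001001]

# Order the components by the 3-digit number starting at index 7.
ordered = component_ids.sort_by { |id| sequence_number(id, 7, 3) }
```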

objects: list of the objects to be ingested; elements that can be specified for each object are noted below

Object-Level Elements

identifier: list of identifiers associated with the object (e.g., test1); the first identifier in the list is considered to be the "key identifier" for the object and is used by the ingest process whenever a single or canonical identifier is needed for the object

label: [see Manifest-Level Elements above]

qdcsource: [see Manifest-Level Elements above]

metadata: [see Manifest-Level Elements above]

digitizationguide: location of Digitization Guide data for object if not standard (e.g., DigitizationGuide.xls)

fmpexport: location of FileMaker Pro data for object if not standard (e.g., fmpExport.xls)

marcxml: location of MarcXML data for object if not standard (e.g., 0823589.xml)

tripodmets: location of Tripod METS data for object if not standard (e.g., /nas/CIFS5/mets/ExampleColl/)
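Putting several of these elements together, a small collection manifest might look something like the YAML embedded in the sketch below. The exact nesting is inferred from the descriptions on this page and should be read as an assumption, not a verified manifest. The directive line is left out so that the sketch runs with only Ruby's standard `yaml` library.

```ruby
require "yaml"

# Illustrative manifest built from the example values on this page.
manifest_yaml = <<~YAML
  basepath: /srv/fedora-working/ingest/ABC/collection/
  model: Collection
  adminpolicy: duke-apo:abcd
  label: Example Collection
  qdcsource: marcxml
  metadata:
    - marcxml
  objects:
    - identifier:
        - test1
      marcxml: 0823589.xml
YAML

manifest = YAML.safe_load(manifest_yaml)

summaries = manifest["objects"].map do |object|
  # The first identifier in the list is the object's key identifier.
  key_identifier = object["identifier"].first
  # Fall back to the manifest-level label when the object has none.
  label = object.fetch("label", manifest["label"])
  "#{key_identifier}: #{label}"
end
```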

Manifest keys

  • :basepath
  • :batch (:id, :name, :description)
  • :datastreams
  • :label
  • :model
  • :objects
  • BatchObjectRelationship::RELATIONSHIPS (:admin_policy, :parent, :collection)

Object keys

  • :datastreams
  • :identifier
  • :label
  • :model
  • BatchObjectDatastreams::DATASTREAMS
  • BatchObjectRelationship::RELATIONSHIPS (:admin_policy, :parent, :collection)