
SCAPE Tool-to-MapReduce Wrapper (ToMaR)

Let your Preservation Tools Scale

About

ToMaR supports the use of legacy applications in a MapReduce environment by providing third-party tools as User Defined Functions. The application specifically addresses the need for processing large volumes of binary content with existing, content-specific applications. ToMaR provides a generic MapReduce wrapper that can be used with command-line and Java applications. It supports tools that read input from local file pointers or stdin/stdout streams. ToMaR implements a custom InputFormat to take advantage of data locality and can be used both as a MapReduce application and as part of an Apache Pig script. More documentation can be found on the SCAPE project web site.

Installation and Use

Installation

Simply git clone the repository and run mvn install. The Hadoop executable jar can be found in target/tomar-*-with-dependencies.jar.
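
For example, the build might look like this (the repository URL and directory name are placeholders):

git clone {tomar-repository-url}
cd {cloned-directory}
mvn install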

Prerequisites

  • A running Hadoop 1.0.x installation (standalone, pseudo-distributed, or fully distributed)
  • SCAPE Toolspecs on HDFS and the corresponding tools installed on each node (see the quick checks below)
  • The control file (see below for details)
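
To verify the Hadoop and toolspec prerequisites you can run, for example (the toolspec directory is a placeholder):

hadoop version
hadoop fs -ls {toolspec-repo-dir}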

How to use

hadoop jar {path-to-jar} 
    -i {control-file} 
    -o {output-dir-for-job} 
    -r {toolspec-repo-dir}
    -n {lines-per-split}
  • path-to-jar is the path to the ToMaR jar file
  • control-file is the control file located on HDFS
  • output-dir-for-job is the directory on HDFS where output files will be written; the default is out/{some random number}
  • toolspec-repo-dir is a directory on HDFS containing the available toolspecs
  • lines-per-split configures the number of lines each mapper (worker node) will receive for processing; the default is 10

Additionally, you can specify generic options for the Hadoop job, e.g. a custom input format or a reducer class, as described in the Hadoop documentation.
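
For example, a complete invocation might look like this (all HDFS paths are placeholders for your own files; adjust the jar name to the version you built):

hadoop jar target/tomar-*-with-dependencies.jar \
    -i hdfs:///user/you/control.txt \
    -o hdfs:///user/you/out \
    -r hdfs:///user/you/toolspecs/ \
    -n 10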

The Control File

ToMaR consumes a plain text control file which describes one tool invocation per line. A control line consists of a toolspec name paired with one of that toolspec's actions. An action is associated with a specific shell command pattern by the toolspec.

In addition to the toolspec-action pair, a control line may contain parameters for the action. These are mapped to the placeholders in the definition of the action. Parameters are specified as a list of --{placeholder}="{value}" strings after the toolspec-action pair. For example:

fancy-tool do-fancy-thing --fancy-parameter="foo" --another-fancy-parameter="bar"

For this control line to work, there must be a toolspec named fancy-tool containing the action do-fancy-thing, which has fancy-parameter and another-fancy-parameter defined in its parameters section. An action's input and output file parameters are specified in the same way. For example:

fancy-tool do-fancy-file-stuff --input="hdfs:///fancy-file.foo" --output="hdfs:///fancy-output-file.bar"

Again, an input parameter input and an output parameter output need to be defined in the corresponding sections of do-fancy-file-stuff.

Do not use the underscore character ('_') in toolspec, action, or key names. It may, however, be used in quoted values.

File redirection and piping

As an action's command may read from standard input and/or write to standard output, a stdin and/or stdout section should be defined for the action. In the control line these properties are mapped using the > character. For example:

"hdfs:///input-file.foo" > fancy-tool do-fancy-streaming > "hdfs:///output-file.bar"

Prior to executing the action, the wrapper opens an input stream on hdfs:///input-file.foo and feeds its contents to the command of do-fancy-streaming. Likewise, the output is redirected to an output stream writing to hdfs:///output-file.bar.

Instead of streaming the command's output to a file, it can be piped to an action of another toolspec, imitating pipes in the UNIX shell. For example:

"hdfs:///input-file.foo" > fancy-tool do-fancy-streaming | funny-tool do-funny-streaming > "hdfs:///output-file.bar"

This control line pipes the output of the do-fancy-streaming command into the command of do-funny-streaming; the output of the latter is then redirected to hdfs:///output-file.bar.

A control line can contain any number of pipes, but file redirection allows only one input file at the beginning and one output file at the end. Independently of this, the piped toolspec-action pairs may contain parameters as explained in the previous section, including input and output file parameters.

If a control line produces standard output and there is no final redirection to an output file, the output is written to Hadoop's default output file part-r-00000. It contains the job's output key-value pairs, where the key is the hash code of the control line.

Example

As an example, the execution of ToMaR on

  1. file identification
  2. streamed file identification
  3. PostScript-to-PDF migration of an input PS file to an output PDF file
  4. streamed-in PostScript-to-PDF migration of an input PS file to a streamed-out PDF file
  5. streamed-in PS-to-PDF migration with a consecutive piped file identification
  6. streamed-in PS-to-PDF migration with two consecutive piped file identifications

is described and demonstrated in this section. The input control file used in each example contains only a single control line; in a production environment one would, of course, have thousands of such control lines.
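
For example, a larger control file for batch file identification might simply repeat one control line per input file (the file names below are hypothetical):

file identify --input="hdfs:///user/you/input/document-0001.ps"
file identify --input="hdfs:///user/you/input/document-0002.ps"
file identify --input="hdfs:///user/you/input/document-0003.ps"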

Prerequisites

  1. Make sure the commands file and ps2pdf are on the PATH of each node.
  2. Copy the toolspecs file.xml and ps2pdf.xml to a directory of your choice on HDFS (e.g. /user/you/toolspecs/).
  3. Copy ps2pdf-input.ps to a directory of your choice on HDFS (e.g. /user/you/input/); example commands for both copy steps are shown below.
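
The copy steps might look like this (the HDFS paths are only examples):

hadoop fs -mkdir /user/you/toolspecs
hadoop fs -put file.xml ps2pdf.xml /user/you/toolspecs/
hadoop fs -mkdir /user/you/input
hadoop fs -put ps2pdf-input.ps /user/you/input/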

File identification

Contents of the control file:

file identify --input="hdfs:///user/you/input/ps2pdf-input.ps"

After running the job, the contents of part-r-00000 in the output directory are:

0      PostScript document text conforming DSC level 3.0, Level 2

Streamed file identification

Contents of the control file:

"hdfs:///user/you/input/ps2pdf-input.ps" > file identify-stdin 

After running the job, the contents of part-r-00000 in the output directory are:

0      PostScript document text conforming DSC level 3.0, Level 2

PostScript to PDF migration

Contents of the control file:

ps2pdf convert --input="hdfs:///user/you/input/ps2pdf-input.ps" --output="hdfs:///user/you/output/ps2pdf-output.pdf"

After running the job, the specified output file location contains the migrated PDF.
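
To inspect the result, the migrated PDF can be copied back from HDFS, e.g.:

hadoop fs -get /user/you/output/ps2pdf-output.pdf .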

Streamed PostScript to PDF migration

Contents of the control file:

"hdfs:///user/you/input/ps2pdf-input.ps" > ps2pdf convert-streamed > "hdfs:///user/you/output/ps2pdf-output.pdf"

After running the job, the specified output file location contains the migrated PDF.

Streamed PostScript to PDF migration with consecutive piped file identification

Contents of the control file:

"hdfs:///user/you/input/ps2pdf-input.ps" > ps2pdf convert-streamed | file identify-stdin > "hdfs:///user/you/output/file-identified.txt" 

After running the job, the contents of file-identified.txt in the output directory are:

PDF document, version 1.4    

Streamed PostScript to PDF migration with two consecutive piped file identifications

Contents of the control file:

"hdfs:///user/you/input/ps2pdf-input.ps" > ps2pdf convert-streamed | file identify-stdin | file identify-stdin

After running the job, the contents of part-r-00000 in the output directory are:

0     ASCII text