Command line utilities to interact with Hadoop HDFS. Think: Hadoop I/O CLI.
"hio" is a shorthand for Hadoop I/O because less typing on the CLI is better! It is also a shout-out to the hadoopio library, which we use behind the scenes.
We strive to mimic existing Unix CLI tools as much as possible, e.g. in terms of names and semantics of hio commands and their respective parameters. That being said, in some cases we may not be able to achieve this goal due to technical reasons or upstream limitations (e.g. HDFS).
Hio is based on hadoopio, which means hio commands typically require only constant memory, i.e. even if you process 1 TB of Avro data your machine should basically never run out of memory.
Once installed, the main entry point is the `hio` executable:
```
$ hio
hio: Command line tools to interact with Hadoop HDFS and other supported file systems

Commands:
  acat  - concatenates and prints Avro files
  ahead - displays the first records of an Avro file
```
Tip: You can tweak many Java/JVM related settings via `hio` CLI options. See `hio -h` for details.
```
# Show the first two lines of the HDFS files `tweets1.avro` and `tweets2.avro`.
$ hio ahead -n 2 /hdfs/path/to/tweets/tweets1.avro /hdfs/path/to/tweets/tweets2.avro
==> /hdfs/path/to/tweets/tweets1.avro <==
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp": 1366154481 }
==> /hdfs/path/to/tweets/tweets2.avro <==
{"username":"Zergling","tweet":"Cthulhu R'lyeh!","timestamp": 1366154399 }
{"username":"miguno","tweet":"4-Gate is the new 6-Pool.","timestamp": 1366150900 }
```
Tip: Use `jq` to pretty-print, color, or otherwise post-process the JSON output.
```
# Extract only the `username` field from the JSON output.
$ hio ahead -2 /hdfs/path/to/tweets/tweets1.avro | jq '.username'
"miguno"
"BlizzardCS"

# Don't like the escape quotes? Use `--raw-output` aka `-r`.
$ hio ahead -2 /hdfs/path/to/tweets/tweets1.avro | jq --raw-output '.username'
miguno
BlizzardCS

# Let's extract the `username` and `timestamp` fields, and separate them with tabs.
$ hio ahead -2 /hdfs/path/to/tweets/tweets1.avro | jq --raw-output '"\(.username)\t\(.timestamp)"'
miguno 1366150681
BlizzardCS 1366154481
```
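The same piping works for `acat`, which concatenates and prints all records of the given Avro files. A sketch, reusing the example files from above (the output shown assumes `acat` emits every record of each input file in order):

```
# Concatenate both files and extract only the `tweet` field.
$ hio acat /hdfs/path/to/tweets/tweets1.avro /hdfs/path/to/tweets/tweets2.avro | jq --raw-output '.tweet'
Rock: Nerf paper, scissors is fine.
Works as intended. Terran is IMBA.
Cthulhu R'lyeh!
4-Gate is the new 6-Pool.
```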
- Java 7, preferably Oracle JRE/JDK 1.7
- RHEL/CentOS (required only because we package exclusively in RPM format at the moment)
- A "compatible" HDFS cluster running Hadoop 2.x. See below for details.
A note on Hadoop versions: We bundle all required libraries (read: jar files) when packaging hio, which means that hio does not require any additional software packages or libraries. Be aware though that you will need to keep the libraries used by hio in sync with the Hadoop version of your cluster.
`build.sbt` lists the exact version of Hadoop that hio has been built against. At the moment we are targeting Cloudera CDH 5.x, which is essentially Hadoop 2.5.x. Feel free to run hio against other Hadoop versions or Hadoop distributions such as Hortonworks HDP, and report back the results.
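If you are unsure which Hadoop version your cluster runs, you can usually check on any cluster node with the stock `hadoop` CLI (a sketch; the version string below is merely example output for a CDH 5.x installation):

```
# Print the Hadoop version of the local installation, then compare it
# against the Hadoop version listed in build.sbt.
$ hadoop version
Hadoop 2.5.0-cdh5.3.1
```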
You can package hio as an RPM (see section below) and then install the package via `yum`, `rpm`, or deployment tools such as Puppet or Ansible.
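For example, installing a locally built RPM might look as follows (a sketch; the filename matches what `./sbt rpm:packageBin` produces, see the packaging instructions below):

```
# Install the package via yum ...
$ sudo yum localinstall hio-<VERSION>.noarch.rpm
# ... or directly via rpm.
$ sudo rpm -Uvh hio-<VERSION>.noarch.rpm
```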
Non-RHEL users: If you need different packaging formats (say, `.deb` for Debian or Ubuntu), please let us know!
The most important configuration aspect is making hio aware of your Hadoop HDFS cluster. Fortunately, in most cases hio will "just work" out of the box because it follows Hadoop best practices.
If hio does not work automagically, then you must tell hio where to find your HDFS cluster. Here, hio expects to find the Hadoop configuration files -- typically named `core-site.xml` and `hdfs-site.xml` -- in the directory specified by the standard `HADOOP_CONF_DIR` environment variable. If this environment variable is not set, hio falls back to `/etc/hadoop/conf/` (see hio.conf, which is installed to `/usr/share/hio/conf/hio.conf` by the RPM).
You can apply standard shell practices if you need to override the `HADOOP_CONF_DIR` variable for some reason:

```
$ HADOOP_CONF_DIR=/my/custom/hadoop/conf hio ...
```
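For instance, if your machine does not have the cluster's configuration files at hand, a minimal client-side `core-site.xml` that points hio at the namenode is often sufficient for read access. A sketch, where the directory, hostname, and port are placeholders for your own setup:

```
# Create a minimal client-side Hadoop configuration ...
$ mkdir -p /my/custom/hadoop/conf
$ cat > /my/custom/hadoop/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF

# ... and point hio at it.
$ HADOOP_CONF_DIR=/my/custom/hadoop/conf hio ahead -n 2 /hdfs/path/to/tweets/tweets1.avro
```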
- Java 7, preferably Oracle JDK 1.7
Compile the code:

```
$ ./sbt compile
```
Run the test suite:

```
$ ./sbt test
```
For details see sbt-native-packager.
Create an RPM (preferred package format):

```
$ ./sbt rpm:packageBin
>>> Creates target/rpm/RPMS/noarch/hio-<VERSION>.noarch.rpm
```
Create a tarball:

```
$ ./sbt universal:packageZipTarball
>>> Creates ./target/universal/hio-<VERSION>.tgz
```
Another helpful task is `stage`, which quickly generates e.g. the shell wrapper scripts but does not put them into an RPM -- so they are easier to inspect while developing:

```
$ ./sbt stage
>>> Creates files under ./target/universal/stage/ (e.g. bin/, conf/, lib/)
```
- Support wildcards/globbing of Hadoop paths (cf. hadoopio's include patterns) so that we understand paths such as `/foo/bar*.avro`.
See CHANGELOG.
Code contributions, bug reports, feature requests etc. are all welcome.
If you are new to GitHub please read Contributing to a project for how to send patches and pull requests to this project.
Copyright © 2015 VeriSign, Inc.
See LICENSE for licensing information.
Alternative ways to parse CLI options for Hadoop-aware apps:
- GenericOptionsParser (for Hadoop 1.x only)
- ToolRunner and Tool (for Hadoop 2.x)