forked from mbulat/zohmg
-
Notifications
You must be signed in to change notification settings - Fork 0
Zohmg is a data store for aggregation of multi-dimensional time series data, built on top of Hadoop, Dumbo and HBase.
License
tidewinds/zohmg
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
INTRODUCING: ZOHMG. ------------------- Zohmg is a data store for aggregation of multi-dimensional time series data. It is built on top of Hadoop, Dumbo and HBase. The core idea is to pre-compute aggregates and store them in a read-efficient manner – Zohmg is wasteful with storage in order to answer queries faster. This README assumes a working installation of zohmg. Please see INSTALL for installation instructions. Zohmg is alpha software. Be gentle. CONTACT ------- IRC: #zohmg at freenode Code: http://github.com/zohmg/zohmg/tree/master User list: http://groups.google.com/group/zohmg-user Dev list: http://groups.google.com/group/zohmg-dev RUN THE ZOHMG ------------- Zohmg is installed as a command line script. Try it out: $> zohmg help AN EXAMPLE: TELEVISION! ----------------------- Imagine that you are the director of operations of a web-based television channel. Your company streams videos, called "clips". Every time a clip is streamed across the intertubes, your software makes a note of it in a log file. The log contains information about the clip (id, length, producer) and about the user watching it (country of origin, player). Your objective is to analyze the logs to find out how many people watch a certain clip over time, broken down by country and player, etc. The rest of this text will show you by example how to make sense of logs using Zohmg. THE ANATOMY OF YOUR DATA SET ---------------------------- Each line of the logfile has the following space-delimited format: timestamp clipid producerid length usercountry player love Examples: 1245775664 1440 201 722 GB VLC 0 1245775680 1394 710 2512 GB VLC 1 1245776010 1440 201 722 DE QUICKTIME 0 The timestamp is a UNIX timestamp, the length is counted in seconds and the id's are all integers. Usercountry is a two-letter ISO standard code while the player is an arbitrary string. The last field -- "love" -- is a boolean that indicates whether the user clicked the heart shaped icon, meaning that she was truly moved by the clip. DO YOU SPEAK ZOHMG? ------------------- In the parlance of Zohmg, clip and producer as well as country and player are dimensions. 1440, 'GB', 'VLC', etc, are called attributes of those dimensions. The length of a clip, whether it was loved, etc, are called measurements. In the configuration, which we will take a look at shortly, we define the units of measurements. Measurements must be integers or floats so that they can be summed or averaged. Simple enough! CREATE YOUR FIRST 'PROJECT' --------------------------- Every Zohmg project lives in its own directory. A project is created like so: $> zohmg create television This creates a project directory named 'television'. Its contents are: config - environment and dataset configuration. lib - eggs or jars that will be automatically included in job jar. mappers - mapreduce mappers (you will write these!) CONFIGURE YOUR 'PROJECT' ------------------------ The next step is to configure environment.py and dataset.yaml. config/environment.py: Define HADOOP_HOME and set the paths for all three jars (hadoop, hadoop-streaming, hbase). You might need to run 'ant package' in $HADOOP_HOME to have the streaming jar built for you. config/dataset.yaml: Defining your data set means defining dimensions, projections, units and aggregations. The dimensions lets Zohmg know what your data looks like while the projections hints at what queries you will want to ask. Once you get more comfortable with Zohmg you will want to optimize your projections for efficiency. For example, if you wish to find out how many times a certain clip has been watched broken down by country, you would want to set up a projection where clip is the first dimension and country is the second one. Aggregations specify how each unit will be aggregated. Currently each unit can be summed or averaged. For the television station, something like the following will do just fine. ## config/dataset.yaml dataset: television dimensions: - clip - producer - country - player projections: - clip - player - country - clip-country - producer-country - producer-clip-player units: - plays - loves - seconds - stars aggregations: plays: sum loves: sum seconds: sum stars: average After you've edited environment.py and dataset.yaml: $> zohmg setup This command creates an HBase table with the same name as your dataset. Verify that the table was created: $> hbase shell hbase(main):001:0> list television 1 row(s) in 0.1337 seconds Brilliant! DATA IMPORT ----------- After the project is created and setup correctly it is time to import some data. Data import is a process in two steps: write a map function that analyzes the data line for line, and run that function over the data. The data is normally stored on HDFS, but it is also possible to run on local data. WRITE A MAPPER ---------------- First we'll write the mapper. It will have this signature: def map(key, value) The 'key' argument defines the line number of the input data and is usually (but not always!) not very interesting. The 'value' argument is a string - it represents a single line of input. Analyzing a single line of input is straightforward: split the line on spaces and collect the debris. ## mappers/mapper.py import time def map(key, value): # split on space; make sure there are 7 parts. parts = value.split(' ') if len(parts) < 7: return # extract values. epoch = parts[0] clipid, producerid, length = parts[1:4] country, player, love = parts[4:7] # format timestamp as yyyymmdd. ymd = "%d%02d%02d" % time.localtime(float(epoch))[0:3] # dimension attributes are strings. dimensions = {} dimensions['clip'] = str(clipid) dimensions['producer'] = str(producerid) dimensions['country'] = country dimensions['player'] = player # measurements are integers. measurements = {} measurements['plays'] = 1 measurements['seconds'] = int(length) measurements['loves'] = int(love) measurements['stars'] = float(stars) yield ymd, dimensions, measurements The output of the mapper is a three-tuple: the first element is a string of format yyyymmdd (i.e. "20090601") and the other two elements are dictionaries. The mapper's output is fed to a reducer that sums the values of the units and passes the data on to the underlying data store. RUN THE MAPPER -------------- Populate a file with a small data sample: $> cat > data/short.log 1245775664 1440 201 722 GB VLC 0 1245775680 1394 710 2512 GB VLC 1 1245776010 1440 201 722 DE QUICKTIME 0 ^D Perform a local test-run: $> zohmg import mappers/mapper.py data/short.log --local The first argument to import is the path to the python file containing the map function, the second is a path on the local file system. The local run will direct its output to a file instead of writing to the data store. Inspect it like so: $> cat /tmp/zohmg-output 'clip-1440-country-DE-20090623' '{"unit:length": {"value": 722}}' 'clip-1440-country-DE-20090623' '{"unit:plays": {"value": 1}}' 'clip-all-country-DE-20090623' '{"unit:length": {"value": 722}}' [..] If Hadoop and HBase are up, run the mapper on the cluster: $> zohmg import mappers/mapper.py /data/television/20090620.log Assuming the mapper finished successfully there is now some data in the HBase table. Verify this by firing up the HBase shell: $> hbase shell hbase(main):001:0> scan 'television' [..] Lots of data scrolling by? Good! (Interrupt with CTRL-C at your leisure.) SERVE DATA ---------- Start the Zohmg server: $> zohmg server The Zohmg server listens on port 8086 at localhost by default. Browse http://localhost:8086/ and have a look! THE API ------- Zohmg's data server exposes the data store through an HTTP API. Every request returns a JSON-formatted string. The JSON looks something like this: [{"20090601": {"DE": 270, "US": 21, "SE": 5547}}, {"20090602": {"DE": 9020, "US": 109, "SE": 11497}}, {"20090603": {"DE": 10091, "US": 186, "SE": 8863}}] The API is extremely simple: it supports a single type of GET request and there is no authentication. There are four required parameters: t0, t1, unit and d0. The parameters t0 and t1 define the time span. They are strings of the format "yyyymmdd", i.e. "t0=20090601&t1=20090630". The unit parameter defines the one unit for which you query, for example 'unit=plays'. The parameter d0 defines the base dimension, for example 'd0=country'. Values for d0 can be defined by setting the parameter d0v to a comma-separated string of values, for example "d0v=US,SE,DE". If d0v is empty, all values of the base dimension will be returned. A typical query string looks like this: http://localhost:8086/?t0=20090601&t1=20090630&unit=plays&d0=country&d0v=DE,SE,US This example query would return the number of clips streamed for the time span between the first and last of June broken down by the countries Sweden, Germany and the United States. The API supports JSONP via the jsonp parameter. By setting this parameter, the returned JSON is wrapped in a function call. PLOTTING THE DATA ----------------- There is an example client bundled with Zohmg. It is served by the Zohmg server at http://localhost:8086/graph/ and is quite useful for exploring your dataset. You may also want to peek at the javascript code to gain some inspiration and insight into how the beast works. The graphing tool uses Javascript to query the data server, and plots graphs with the help of Google Charts. KNOWN ISSUES ------------ Zohmg is alpha software and is still learning how to behave properly in mixed company. You will spot more than a few minor quirks while using it and might perhaps even run into the odd show-stopper. If you do, please let us know! Zohmg currently only supports mappers written in Python. Eventually, you will also be able to write mappers in Java.
About
Zohmg is a data store for aggregation of multi-dimensional time series data, built on top of Hadoop, Dumbo and HBase.
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published