This is the work-in-progress version of the upcoming O’Reilly book, Big Data for Chimps: A Seriously Fun Guide to Hadoop and Terabyte-Scale Data Processing.
Our intent is to provide the best guide for exploratory data analytics using Hadoop — for data science in practice. We use high-level languages (Pig and Ruby) that make Hadoop a tool, not a framework, allowing re-use and rapid development. We’ll cover enough Hadoop internals to save you from diving into the source code, and enough tuning advice to let you know where to drill deep.
In all cases, the focus is on maximizing your time and creativity — on helping you uncover what question to ask and the right way to ask it.
O’Reilly has courageously agreed to release the book under a CC BY-NC-SA license. To buy a physical copy of the book, or a Kindle (.mobi) or iOS/Nook (.epub) version, visit the early-release O’Reilly bookstore (TODO: link to early release page). Buy it now, and you’ll get frequently updated access as the book evolves, plus the final version once it’s available.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Code is Apache licensed unless specifically labeled otherwise.
(TODO: needs updating)
File organization:
    aaNN-name.asciidoc -- preface material: outline, TODO, about
    baNN-name.asciidoc -- basic chapters: the mechanics of working with data at scale
    fuNN-name.asciidoc -- advanced fu: algorithms and methods for specific data challenges
    muNN-name.asciidoc -- musings: how to think at scale and other omphaloskepses. (Later, these will be interleaved with the basic and algorithm sections.)
    prNN-name.asciidoc -- programmer appendices: datasets, code, etc.
    trNN-name.asciidoc -- follow-on material
    xxNN-name.asciidoc -- chopping block: these may not make the final draft of the book.

    book.asciidoc
    aa01-about.asciidoc
    aa02-TODO.asciidoc
    ba01-chimps_and_elephants.asciidoc
    ba02-simple_stream.asciidoc
    ba04-toolset.asciidoc
    ba05-semi_structured_data-airline_flights.asciidoc
    ba05-semi_structured_data-wikipedia_corpus.asciidoc
    ba05-semi_structured_data-wikipedia_other.asciidoc
    ba05-semi_structured_data.asciidoc
    ba06-herding_cats.asciidoc
    ba06-overview_of_datasets.asciidoc
    fu06-overview_of_problems.asciidoc
    fu06-statistics.asciidoc
    fu06-tuning.asciidoc
    fu07-data_formats.asciidoc
    fu07-time_series_data.asciidoc
    fu08-advanced_pig.asciidoc
    fu11-processing_text.asciidoc
    fu12-geographic_data.asciidoc
    fu14-processing_graphs.asciidoc
    fu19-pig_udfs.asciidoc
    ff01-authors.asciidoc
    mu02-why_hadoop.asciidoc
    mu03-how_to_think.asciidoc
    mu04-data_modeling.asciidoc
    mu05-rules_of_scaling.asciidoc
    mu06-best_practices_and_pedantic_points_of_style.asciidoc
    pr01-datasets.asciidoc
    pr03-airline_flights.asciidoc
    xx01-simple_machine_learning.asciidoc
    xx02-hbase_and_databases.asciidoc
    xx03-flume_and_stream_processing.asciidoc
    xx16-operations.asciidoc
- Chimpanzee and Elephant Save Christmas link:<chimpanzee_and_elephant>
  - stream of disordered records
  - group/sort records by their label
  - process each group of records
- Heraclitus and the Stream
  - Simple disordered stream (map-only) in Wukong
  - Simple ordered-group transform (map+reduce) in Wukong -- a Pig sketch of both patterns follows this list
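The book's worked examples here are in Wukong (Ruby). Purely as a hedged companion sketch (the filename and fields below are invented for illustration), the same two job shapes look like this in Pig:

```pig
-- Map-only job: every record is handled on its own, in whatever order it
-- arrives; since nothing is grouped, Pig never triggers a reduce phase.
raw     = LOAD 'events.tsv' AS (user:chararray, action:chararray, at:chararray);
shouts  = FOREACH raw GENERATE user, UPPER(action) AS action, at;

-- Map+reduce job: records get a label (here, the user), Hadoop groups and
-- sorts by that label between the phases, and each group is processed whole.
by_user = GROUP shouts BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(shouts) AS n_actions;
```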
- Musing: Why Hadoop Works
  - the locality problem
  - the Hadoop haiku
  - robots are inexpensive, programmers are not
- Herding `cat`s: the mechanics of wrangling massive data
  - getting data within Hadoop’s reach
  - launching jobs
  - seeing the data
  - seeing the logs
  - clicking to
  - simple debugging
  - wu-lign
- Data Formats
- Semi-structured Data
  - Wikipedia
    - Datasets:
      - Full-text of Articles (`wikipedia_articles`) — TSV
      - Wikipedia Page properties (`wikipedia_pageinfos`) — TSV
      - Wikipedia Pagelinks (`wikipedia_links`) — TSV
      - Pageview Counts (`wikipedia_pageviews`) — TSV
      - (Page Properties from DBpedia) (`wikipedia_dbpedia`) — TSV
    - Munging:
      - `parse_raw_articles` (xml splitter, xml parser)
        - figure out splitter
        - make it be one line per file (by `&#XX;`-ing the newlines)
        - keep any interesting metadata
      - `parse_raw_links` (sql dump)
      - `parse_pageinfos` (sql dump)
      - `parse_raw_pageviews` (simple tsv load)
      - `prepare_articles`
        - add minimal metadata
      - `prepare_links`
        - minimal metadata; label category pages, redirects, etc.
        - adjacency list? labelled low-id-first edge list
      - `prepare_pages`
        - calculate degree (in, out, symmetric) & other simple stats; add to the page metadata table.
  - Airline Flights and Flight Delays
    - Datasets:
      - Airline Flights with delay information (`airline_flights/flights`)
      - Airlines (`airline_flights/airlines`)
      - Airports (`airline_flights/airports`)
      - Airplanes (`airline_flights/airplanes`)
    - Munging:
      - `parse_raw_wikipedia_identifiers`
      - `parse_raw_openflights_airports`
      - `parse_raw_dataexpo_airports`
      - `prepare_timezone_mapping`
      - `parse_dataexpo_flights`
      - `reconcile_airports`
      - `timezoneize_flights`
  - Global Weather
    - Datasets:
      - Daily observations (`weather/daily_observations`)
      - Hourly observations (`weather/hourly_observations`) (we’ll only use one of daily vs hourly)
      - Weather stations (`weather/weather_stations`)
    - Munging:
  - Logs
    - World Cup (`weblogs/worldcup_apachelogs`)
    - Star Wars Kid (`weblogs/starwarskid_apachelogs`)
- Logs
  - figure out apache log parser in pig -- a hedged sketch follows this list
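One plausible starting point, using only Pig built-ins rather than a dedicated loader: read each line whole, then pull the fields out with `REGEX_EXTRACT_ALL`. The regex targets the common log format and is an untested sketch, not the book's settled approach.

```pig
-- Hedged sketch: parse Apache common-log lines with built-in functions only.
raw  = LOAD 'weblogs/worldcup_apachelogs' USING TextLoader() AS (line:chararray);
logs = FOREACH raw GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,
         '^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] "(\\S+) (\\S+)[^"]*" (\\d+) (\\S+)'))
       AS (ip:chararray, ts:chararray, verb:chararray, path:chararray,
           status:chararray, bytes:chararray);
good = FILTER logs BY ip IS NOT NULL;  -- drop lines the regex failed to match
```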
- page links
  - X prepare
- Statistics
  - sum, average, standard deviation, etc (`airline_flights`) -- sketched after this list
  - medians and percentiles
  - construct a histogram -- also sketched below
  - normalize data by mapping to percentile
  - normalize data by mapping to Z-score
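A hedged sketch of the first and third items. The field names are assumptions about the `airline_flights` schema, not the book's actual layout; standard deviation is built from sqrt(E[x^2] - E[x]^2), since older Pig has no stock STDEV.

```pig
-- Hedged sketch: simple aggregates over flight delays (field names assumed).
flights = LOAD 'airline_flights/flights' AS (year:int, carrier:chararray, depdelay:double);
with_sq = FOREACH flights GENERATE depdelay, depdelay * depdelay AS depdelay_sq;

all_fl  = GROUP with_sq ALL;   -- one giant group: fine, combiners do the heavy lifting
stats   = FOREACH all_fl GENERATE
            COUNT(with_sq)        AS n_flights,
            AVG(with_sq.depdelay) AS avg_delay,
            -- stddev from sqrt(E[x^2] - E[x]^2)
            SQRT(AVG(with_sq.depdelay_sq) - AVG(with_sq.depdelay) * AVG(with_sq.depdelay)) AS sd_delay;

-- Histogram: bucket delays into ten-minute bins, count per bin.
binned  = FOREACH flights GENERATE 10 * (int)(depdelay / 10.0) AS bin;
hist    = FOREACH (GROUP binned BY bin) GENERATE group AS bin, COUNT(binned) AS ct;
```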
- Advanced Pig
  - map-side join
  - merge join
  - skew joins -- all three join variants are sketched after this list
  - Performance and efficiency
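In Pig, the three specialized joins differ only in the `USING` clause; the relation and field names below are invented for illustration.

```pig
-- Hedged sketch of Pig's specialized join strategies.
flights  = LOAD 'airline_flights/flights'  AS (flight_id:long, airport_id:chararray);
airports = LOAD 'airline_flights/airports' AS (airport_id:chararray, name:chararray);

-- Map-side (replicated) join: the small relation (listed last) is loaded into
-- memory on every mapper, so no reduce phase is needed.
j1 = JOIN flights BY airport_id, airports BY airport_id USING 'replicated';

-- Merge join: both inputs must already be sorted on the join key; Pig then
-- streams them side by side instead of shuffling.
j2 = JOIN flights BY airport_id, airports BY airport_id USING 'merge';

-- Skew join: Pig samples the keys first, then splits the hot ones across
-- several reducers so one giant key can't stall the job.
j3 = JOIN flights BY airport_id, airports BY airport_id USING 'skewed';
```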
- Processing Text
  - grep’ing for simple matches -- sketched below, along with tokenizing
  - tokenize text
  - simple document analysis
  - minhash clustering
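A hedged sketch of the first two items using Pig built-ins; the dataset schema is assumed.

```pig
-- Hedged sketch: grep-style filtering and word counting with Pig built-ins.
articles  = LOAD 'wikipedia_articles' AS (title:chararray, text:chararray);

-- grep'ing for simple matches: MATCHES applies a Java regex to the whole field.
elephants = FILTER articles BY text MATCHES '.*[Ee]lephant.*';

-- tokenize text: TOKENIZE splits on whitespace into a bag; FLATTEN unrolls it.
tokens = FOREACH articles GENERATE FLATTEN(TOKENIZE(text)) AS word;
counts = FOREACH (GROUP tokens BY word) GENERATE group AS word, COUNT(tokens) AS ct;
```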
- Geo Data
  - quadkeys and grid coordinate system
  - map wikipedia
  - k-means clustering to produce readable summaries
  - partial quad keys for "area" data -- sketched after this list
  - voronoi cells to do "nearby"-ness
  - Scripts:
    - `calculate_voronoi_cells` — use weather station locations to calculate voronoi polygons
    - `voronoi_grid_assignment` — cells that have a piece of border, or the largest grid cell that has no border on it
  - Using polymaps to see results
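A hedged sketch of the partial-quadkey idea. `GetQuadkey` here is a hypothetical UDF (not a Pig built-in) standing in for whatever lng/lat-to-quadkey function the book settles on; everything else is stock Pig.

```pig
-- Hedged sketch: roll point data up by partial quadkey.
REGISTER 'geo_udfs.jar';                       -- hypothetical jar
DEFINE GetQuadkey com.example.geo.GetQuadkey;  -- hypothetical UDF: (lng, lat, zoom) -> quadkey string

stations = LOAD 'weather/weather_stations' AS (id:chararray, lng:double, lat:double);
keyed    = FOREACH stations GENERATE id, GetQuadkey(lng, lat, 16) AS quadkey;

-- A quadkey's prefix names its enclosing coarser tile, so truncating to the
-- first 7 characters buckets stations into zoom-level-7 "area" cells.
by_area  = FOREACH keyed GENERATE id, SUBSTRING(quadkey, 0, 7) AS area_key;
areas    = FOREACH (GROUP by_area BY area_key)
           GENERATE group AS area_key, COUNT(by_area) AS n_stations;
```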
- Processing Graphs
  - subuniverse extraction
  - Pagerank -- one iteration is sketched after this list
  - identify strong links
  - clustering coefficient
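Pagerank in Pig reduces to repeating one step like the following. The adjacency-list layout is an assumption, and a small driver script (shell or Ruby) would re-join the new ranks with the link structure and loop until convergence.

```pig
-- Hedged sketch: a single PageRank iteration over an assumed adjacency list.
pages    = LOAD 'pages' AS (url:chararray, rank:double, links:bag{t:(dest:chararray)});

-- Each page sends rank/degree to every page it links to.
contribs = FOREACH pages GENERATE FLATTEN(links) AS dest,
                                  rank / (double)SIZE(links) AS contrib;

-- Sum arriving contributions and apply the usual 0.85 damping factor.
ranked   = FOREACH (GROUP contribs BY dest)
           GENERATE group AS url, 0.15 + 0.85 * SUM(contribs.contrib) AS rank;
```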
- Black-Box Machine Learning
  - Simple Naive Bayes classification
  - Document clustering
- Flume and Stream Processing
  - sources, sinks and decorators
  - deploying a wukong script as a decorator
  - parse the twitter stream API feed
- Time Series
  - windowing -- a tumbling-window sketch follows this list
  - simple anomaly detection
  - rolling statistics
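True sliding windows are awkward in plain Pig, so a common fallback, sketched here with an assumed schema, is tumbling windows: truncate each timestamp to its bucket, then aggregate per bucket.

```pig
-- Hedged sketch: fixed (tumbling) one-hour windows via timestamp bucketing.
obs      = LOAD 'weather/hourly_observations' AS (station:chararray, epoch:long, temp:double);
windowed = FOREACH obs GENERATE station, (epoch / 3600L) * 3600L AS window_start, temp;
stats    = FOREACH (GROUP windowed BY (station, window_start)) GENERATE
             FLATTEN(group) AS (station, window_start),
             AVG(windowed.temp) AS avg_temp,
             MAX(windowed.temp) AS max_temp;
```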
- Pig UDFs
  - Basic UDF -- the Pig-side wiring is sketched after this list
  - why algebraic is awesome and how to be algebraic
  - Wonderdog: a LoadFunc / StoreFunc for elasticsearch
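From the script's side, a UDF is just a REGISTER plus a DEFINE; the jar and class below are hypothetical stand-ins. (The "algebraic" point: a UDF that implements Pig's Algebraic interface can run partially in the combiner, which is why built-ins like COUNT and SUM stay fast at scale.)

```pig
-- Hedged sketch: wiring an (invented) UDF into a script.
REGISTER 'my_udfs.jar';                          -- hypothetical jar
DEFINE Tidy com.example.pig.TrimAndLowercase();  -- hypothetical EvalFunc<String>

raw  = LOAD 'wikipedia_articles' AS (title:chararray, text:chararray);
tidy = FOREACH raw GENERATE Tidy(title) AS title, text;
```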
- Installing and Operating a Cluster
- Tuning
- HBase and Databases
- How to Scale Dirty and its Influence on People
  - How to think at scale
  - Pedantic Points of Style
  - Best Practices