shoogie/big_data_for_chimps

 
 


Big Data for Chimps: A Seriously Fun guide to Hadoop and Terabyte-scale data processing

This is the work-in-progress version of the upcoming O’Reilly book, Big Data for Chimps: A Seriously Fun guide to Hadoop and Terabyte-scale data processing.

Our intent is to provide the best guide for exploratory data analytics using Hadoop — for data science in practice. We use high-level languages (Pig and Ruby) that make Hadoop a tool, not a framework, allowing re-use and rapid development. We’ll cover enough Hadoop internals to save you from diving into the source code, and enough tuning advice to let you know where to drill deep.

In all cases, the focus is on maximizing your time and creativity — on helping you uncover what question to ask and the right way to ask it.

O’Reilly has courageously agreed to release the book under a CC-BY-NC-SA license. To buy a physical copy of the book, or a Kindle (.mobi) or iOS/Nook (.epub) edition, visit the early release O’Reilly bookstore (TODO: link to early release page). Buy it now, and you’ll get frequently updated access and the final version once available.

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Code is Apache licensed unless specifically labeled otherwise.

Outline

(TODO: needs updating)

File organization:

    aaNN-name.asciidoc  -- preface material: outline, TODO, about
    baNN-name.asciidoc  -- basic chapters: the mechanics of working with data at scale
    fuNN-name.asciidoc  -- advanced fu: algorithms and methods for specific data challenges
    muNN-name.asciidoc  -- musings: how to think at scale and other omphaloskepses. (Later, these will be interleaved with the basic and algorithm sections)
    prNN-name.asciidoc  -- programmer appendices: datasets, code, etc
    trNN-name.asciidoc  -- follow-on material

    xxNN-name.asciidoc  -- chopping block: these may not make the final draft of the book.


	book.asciidoc

	aa01-about.asciidoc
	aa02-TODO.asciidoc

	ba01-chimps_and_elephants.asciidoc
	ba02-simple_stream.asciidoc

	ba04-toolset.asciidoc
	ba05-semi_structured_data-airline_flights.asciidoc
	ba05-semi_structured_data-wikipedia_corpus.asciidoc
	ba05-semi_structured_data-wikipedia_other.asciidoc
	ba05-semi_structured_data.asciidoc
	ba06-herding_cats.asciidoc
	ba06-overview_of_datasets.asciidoc

	fu06-overview_of_problems.asciidoc
	fu06-statistics.asciidoc
	fu06-tuning.asciidoc
	fu07-data_formats.asciidoc
	fu07-time_series_data.asciidoc
	fu08-advanced_pig.asciidoc
	fu11-processing_text.asciidoc
	fu12-geographic_data.asciidoc
	fu14-processing_graphs.asciidoc
	fu19-pig_udfs.asciidoc
	ff01-authors.asciidoc
	mu02-why_hadoop.asciidoc
	mu03-how_to_think.asciidoc
	mu04-data_modeling.asciidoc
	mu05-rules_of_scaling.asciidoc
	mu06-best_practices_and_pedantic_points_of_style.asciidoc
	pr01-datasets.asciidoc
	pr03-airline_flights.asciidoc
	xx01-simple_machine_learning.asciidoc
	xx02-hbase_and_databases.asciidoc
	xx03-flume_and_stream_processing.asciidoc
	xx16-operations.asciidoc
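
As an illustration of the naming convention above, here is a minimal Ruby sketch that sorts chapter files into their sections by two-letter prefix. The `SECTIONS` table is copied from the legend; the `section_for` helper is hypothetical and not part of the book's tooling. Files that do not match the `xxNN-name.asciidoc` pattern, or whose prefix is not in the legend, are reported as unclassified.

```ruby
# Section labels taken from the file-organization legend above.
SECTIONS = {
  'aa' => 'preface material',
  'ba' => 'basic chapters',
  'fu' => 'advanced fu',
  'mu' => 'musings',
  'pr' => 'programmer appendices',
  'tr' => 'follow-on material',
  'xx' => 'chopping block'
}.freeze

# Hypothetical helper: returns the section label for a filename, or nil when
# the name does not follow the prefix + NN + "-name.asciidoc" pattern or the
# prefix is not listed in the legend.
def section_for(filename)
  match = /\A([a-z]{2})\d{2}-.+\.asciidoc\z/.match(filename)
  match && SECTIONS[match[1]]
end

# A few filenames from the listing above.
files = %w[
  aa01-about.asciidoc
  ba02-simple_stream.asciidoc
  fu08-advanced_pig.asciidoc
  xx16-operations.asciidoc
  book.asciidoc
]

# Group the files by section and print one line per section.
files.group_by { |name| section_for(name) || 'unclassified' }.each do |section, names|
  puts "#{section}: #{names.join(', ')}"
end
```
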

Mechanics of Working with Data at Scale

  1. Chimpanzee and Elephant Save Christmas

    • stream of disordered records

    • group/sort records by their label

    • process each group of records

  2. Heraclitus and the Stream

    • Simple disordered stream (map-only) in Wukong

    • Simple ordered-group transform (map+reduce) in Wukong

  3. Musing: Why Hadoop Works

    • the locality problem

    • the Hadoop haiku

    • robots are inexpensive, programmers are not

  4. Herding `cat`s: the mechanics of wrangling massive data

    • getting data within Hadoop’s reach

    • launching jobs

    • seeing the data

    • seeing the logs

    • clicking to

    • simple debugging

    • wu-lign

  5. Data Formats

  6. Semi-structured Data

    • Wikipedia

    • Datasets:

    • Full-text of Articles (wikipedia_articles) — TSV

    • Wikipedia Page properties (wikipedia_pageinfos) — TSV

    • Wikipedia Pagelinks (wikipedia_links) — TSV

    • Pageview Counts (wikipedia_pageviews) — TSV

    • (Page Properties from DBpedia) (wikipedia_dbpedia) — TSV

    • Munging:

    • parse_raw_articles (xml splitter, xml parser)

    • figure out splitter

    • make it be one line per file (by `&#XX;`-encoding the newlines)

    • keep any interesting metadata

    • parse_raw_links (sql dump)

    • parse_pageinfos (sql dump)

    • parse_raw_pageviews (simple tsv load)

    • prepare_articles

    • add minimal metadata

    • prepare_links

    • minimal metadata; label category pages, redirect, etc

    • adjacency list? labelled low-id-first edge list

    • prepare_pages

    • calculate degree (in, out, symmetric) & other simple stats, add to page metadata table.

    • Airline Flights and Flight Delays

    • Datasets:

    • Airline Flights with delay information (airline_flights/flights)

    • Airlines (airline_flights/airlines)

    • Airports (airline_flights/airports)

    • Airplanes (airline_flights/airplanes)

    • Munging:

    • parse_raw_wikipedia_identifiers

    • parse_raw_openflights_airports

    • parse_raw_dataexpo_airports

    • prepare_timezone_mapping

    • parse_dataexpo_flights

    • reconcile_airports

    • timezoneize_flights

    • Global Weather

    • Datasets

    • Daily observations (weather/daily_observations)

    • Hourly observations (weather/hourly_observations) (we’ll only use one of daily vs hourly)

    • Weather stations (weather/weather_stations)

    • Munging:

    • Logs

    • World Cup (weblogs/worldcup_apachelogs)

    • Star Wars Kid (weblogs/starwarskid_apachelogs)

  • Logs

    • figure out apache log parser in pig

  • page links

    • X prepare

  7. Statistics

    • sum, average, standard deviation, etc (airline_flights)

    • medians and percentiles

    • construct a histogram

    • normalize data by mapping to percentile

    • normalize data by mapping to Z-score

  8. Advanced Pig

    • map-side join

    • merge join

    • skew joins

    • Performance and efficiency

  9. Processing Text

    • grep’ing for simple matches

    • tokenize text

    • simple document analysis

    • minhash clustering

  10. Geo Data

    • quadkeys and grid coordinate system

    • skkkkkkkkk -- map wikipedia

    • k-means clustering to produce readable summaries

    • partial quad keys for "area" data

    • voronoi cells to do "nearby"-ness

    • Scripts:

    • calculate_voronoi_cells -- use weather station locations to calculate voronoi polygons

    • voronoi_grid_assignment -- cells that have a piece of border, or the largest grid cell that has no border on it

    • a

    • Using polymaps to see results

  11. Processing Graphs

    • subuniverse extraction

    • Pagerank

    • identify strong links

    • clustering coefficient

  12. Black-Box Machine Learning

    • Simple Naive Bayes classification

    • Document clustering

  13. Flume and Stream Processing

    • sources, sinks and decorators

    • deploying a wukong script as a decorator

    • parse the twitter stream API feed

  14. Time Series

    • windowing

    • simple anomaly detection

    • rolling statistics

  15. Pig UDFs

    • Basic UDF

    • why algebraic is awesome and how to be algebraic

    • Wonderdog: a LoadFunc / StoreFunc for elasticsearch

  16. Installing and Operating a Cluster

  17. Tuning

  18. HBase and Databases

  19. How to Scale Dirty and its Influence on People

    • How to think at scale

    • Pedantic Points of Style

    • Best Practices
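
The three steps listed under the first chapter of the outline (a stream of disordered records, grouping and sorting by label, processing each group) can be sketched in a few lines of plain Ruby. This is only an in-memory illustration under stated assumptions: the records are made up, and nothing here uses the Wukong or Hadoop APIs, which do the same work across many machines.

```ruby
# Made-up records for illustration: [label, value] pairs in no particular order.
records = [
  ['elephant', 3], ['chimp', 1], ['elephant', 4], ['chimp', 5], ['elephant', 2]
]

# Step 1 (map): emit one [label, value] pair per record. Here it is a
# pass-through; a real job would parse or transform each record.
mapped = records.map { |label, value| [label, value] }

# Step 2 (group/sort): bring together all pairs that share a label.
grouped = mapped.sort_by { |label, _value| label }
                .group_by { |label, _value| label }

# Step 3 (reduce): process each group on its own; here, sum the values.
totals = grouped.map do |label, pairs|
  [label, pairs.sum { |_label, value| value }]
end.to_h

# Print one tab-separated line per label.
totals.each { |label, total| puts "#{label}\t#{total}" }
```

Each group depends only on its own label, which is why the third step can run in parallel once the grouping is done.
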

About

A Seriously Fun guide to Big Data Analytics in Practice
