This is the work-in-progress version of the upcoming O’Reilly book, Big Data for Chimps: A Seriously Fun Guide to Hadoop and Terabyte-Scale Data Processing.
Our intent is to provide the best guide for exploratory data analytics using Hadoop — for data science in practice. We use high-level languages (Pig and Ruby) that make Hadoop a tool, not a framework, allowing re-use and rapid development. We’ll cover enough Hadoop internals to save you from diving into the source code, and enough tuning advice to let you know where to drill deep.
In all cases, the focus is on maximizing your time and creativity — on helping you uncover what question to ask and the right way to ask it.
O’Reilly has courageously agreed to release the book under a CC BY-NC-SA license. To buy a physical copy of the book, or a Kindle (.mobi) or iOS/Nook (.epub) version, visit the early-release O’Reilly bookstore (TODO: link to early release page). Buy it now, and you’ll get frequently updated access as the book evolves, plus the final version once it’s available.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Code is Apache licensed unless specifically labeled otherwise.
(TODO: needs updating)
File organization:
    aaNN-name.asciidoc -- preface material: outline, TODO, about
    baNN-name.asciidoc -- basic chapters: the mechanics of working with data at scale
    fuNN-name.asciidoc -- advanced fu: algorithms and methods for specific data challenges
    muNN-name.asciidoc -- musings: how to think at scale and other omphaloskepses. (Later, these will be interleaved with the basic and algorithm sections.)
    prNN-name.asciidoc -- programmer appendices: datasets, code, etc.
    trNN-name.asciidoc -- follow-on material
    xxNN-name.asciidoc -- chopping block: these may not make the final draft of the book.

    book.asciidoc
    aa01-about.asciidoc
    aa02-TODO.asciidoc
    ba01-chimps_and_elephants.asciidoc
    ba02-simple_stream.asciidoc
    ba04-toolset.asciidoc
    ba05-semi_structured_data-airline_flights.asciidoc
    ba05-semi_structured_data-wikipedia_corpus.asciidoc
    ba05-semi_structured_data-wikipedia_other.asciidoc
    ba05-semi_structured_data.asciidoc
    ba06-herding_cats.asciidoc
    ba06-overview_of_datasets.asciidoc
    fu06-overview_of_problems.asciidoc
    fu06-statistics.asciidoc
    fu06-tuning.asciidoc
    fu07-data_formats.asciidoc
    fu07-time_series_data.asciidoc
    fu08-advanced_pig.asciidoc
    fu11-processing_text.asciidoc
    fu12-geographic_data.asciidoc
    fu14-processing_graphs.asciidoc
    fu19-pig_udfs.asciidoc
    ff01-authors.asciidoc
    mu02-why_hadoop.asciidoc
    mu03-how_to_think.asciidoc
    mu04-data_modeling.asciidoc
    mu05-rules_of_scaling.asciidoc
    mu06-best_practices_and_pedantic_points_of_style.asciidoc
    pr01-datasets.asciidoc
    pr03-airline_flights.asciidoc
    xx01-simple_machine_learning.asciidoc
    xx02-hbase_and_databases.asciidoc
    xx03-flume_and_stream_processing.asciidoc
    xx16-operations.asciidoc
- Chimpanzee and Elephant Save Christmas link:<chimpanzee_and_elephant>
  - stream of disordered records
  - group/sort records by their label
  - process each group of records
- Heraclitus and the Stream
  - Simple disordered stream (map-only) in Wukong
  - Simple ordered-group transform (map+reduce) in Wukong -- a Pig sketch of both patterns follows this list
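The book's worked examples here are in Wukong (Ruby). Purely as a hedged companion sketch (the filename and fields below are invented for illustration), the same two job shapes look like this in Pig:

```pig
-- Map-only job: every record is handled on its own, in whatever order it
-- arrives; since nothing is grouped, Pig never triggers a reduce phase.
raw     = LOAD 'events.tsv' AS (user:chararray, action:chararray, at:chararray);
shouts  = FOREACH raw GENERATE user, UPPER(action) AS action, at;

-- Map+reduce job: records get a label (here, the user), Hadoop groups and
-- sorts by that label between the phases, and each group is processed whole.
by_user = GROUP shouts BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(shouts) AS n_actions;
```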
- Musing: Why Hadoop Works
  - the locality problem
  - the Hadoop haiku
  - robots are inexpensive, programmers are not
- Herding `cat`s: the mechanics of wrangling massive data
  - getting data within Hadoop’s reach
  - launching jobs
  - seeing the data
  - seeing the logs
  - clicking to
  - simple debugging
  - wu-lign
- Data Formats
- Semi-structured Data
  - Wikipedia
    - Datasets:
      - Full-text of Articles (`wikipedia_articles`) — TSV
      - Wikipedia Page properties (`wikipedia_pageinfos`) — TSV
      - Wikipedia Pagelinks (`wikipedia_links`) — TSV
      - Pageview Counts (`wikipedia_pageviews`) — TSV
      - (Page Properties from DBpedia) (`wikipedia_dbpedia`) — TSV
    - Munging:
      - `parse_raw_articles` (xml splitter, xml parser)
        - figure out splitter
        - make it be one line per file (by `&#XX;`-ing the newlines)
        - keep any interesting metadata
      - `parse_raw_links` (sql dump)
      - `parse_pageinfos` (sql dump)
      - `parse_raw_pageviews` (simple tsv load)
      - `prepare_articles`
        - add minimal metadata
      - `prepare_links`
        - minimal metadata; label category pages, redirects, etc.
        - adjacency list? labelled low-id-first edge list
      - `prepare_pages`
        - calculate degree (in, out, symmetric) & other simple stats; add to the page metadata table.
  - Airline Flights and Flight Delays
    - Datasets:
      - Airline Flights with delay information (`airline_flights/flights`)
      - Airlines (`airline_flights/airlines`)
      - Airports (`airline_flights/airports`)
      - Airplanes (`airline_flights/airplanes`)
    - Munging:
      - `parse_raw_wikipedia_identifiers`
      - `parse_raw_openflights_airports`
      - `parse_raw_dataexpo_airports`
      - `prepare_timezone_mapping`
      - `parse_dataexpo_flights`
      - `reconcile_airports`
      - `timezoneize_flights`
  - Global Weather
    - Datasets:
      - Daily observations (`weather/daily_observations`)
      - Hourly observations (`weather/hourly_observations`) (we’ll only use one of daily vs hourly)
      - Weather stations (`weather/weather_stations`)
    - Munging:
  - Logs
    - World Cup (`weblogs/worldcup_apachelogs`)
    - Star Wars Kid (`weblogs/starwarskid_apachelogs`)
- Logs
  - figure out apache log parser in pig -- a hedged sketch follows this list
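One plausible starting point, using only Pig built-ins rather than a dedicated loader: read each line whole, then pull the fields out with `REGEX_EXTRACT_ALL`. The regex targets the common log format and is an untested sketch, not the book's settled approach.

```pig
-- Hedged sketch: parse Apache common-log lines with built-in functions only.
raw  = LOAD 'weblogs/worldcup_apachelogs' USING TextLoader() AS (line:chararray);
logs = FOREACH raw GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,
         '^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] "(\\S+) (\\S+)[^"]*" (\\d+) (\\S+)'))
       AS (ip:chararray, ts:chararray, verb:chararray, path:chararray,
           status:chararray, bytes:chararray);
good = FILTER logs BY ip IS NOT NULL;  -- drop lines the regex failed to match
```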
- page links
  - X prepare
- Statistics
  - sum, average, standard deviation, etc (`airline_flights`) -- sketched after this list
  - medians and percentiles
  - construct a histogram -- also sketched below
  - normalize data by mapping to percentile
  - normalize data by mapping to Z-score
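A hedged sketch of the first and third items. The field names are assumptions about the `airline_flights` schema, not the book's actual layout; standard deviation is built from sqrt(E[x^2] - E[x]^2), since older Pig has no stock STDEV.

```pig
-- Hedged sketch: simple aggregates over flight delays (field names assumed).
flights = LOAD 'airline_flights/flights' AS (year:int, carrier:chararray, depdelay:double);
with_sq = FOREACH flights GENERATE depdelay, depdelay * depdelay AS depdelay_sq;

all_fl  = GROUP with_sq ALL;   -- one giant group: fine, combiners do the heavy lifting
stats   = FOREACH all_fl GENERATE
            COUNT(with_sq)        AS n_flights,
            AVG(with_sq.depdelay) AS avg_delay,
            -- stddev from sqrt(E[x^2] - E[x]^2)
            SQRT(AVG(with_sq.depdelay_sq) - AVG(with_sq.depdelay) * AVG(with_sq.depdelay)) AS sd_delay;

-- Histogram: bucket delays into ten-minute bins, count per bin.
binned  = FOREACH flights GENERATE 10 * (int)(depdelay / 10.0) AS bin;
hist    = FOREACH (GROUP binned BY bin) GENERATE group AS bin, COUNT(binned) AS ct;
```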
- Advanced Pig
  - map-side join
  - merge join
  - skew joins -- all three join variants are sketched after this list
  - Performance and efficiency
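In Pig, the three specialized joins differ only in the `USING` clause; the relation and field names below are invented for illustration.

```pig
-- Hedged sketch of Pig's specialized join strategies.
flights  = LOAD 'airline_flights/flights'  AS (flight_id:long, airport_id:chararray);
airports = LOAD 'airline_flights/airports' AS (airport_id:chararray, name:chararray);

-- Map-side (replicated) join: the small relation (listed last) is loaded into
-- memory on every mapper, so no reduce phase is needed.
j1 = JOIN flights BY airport_id, airports BY airport_id USING 'replicated';

-- Merge join: both inputs must already be sorted on the join key; Pig then
-- streams them side by side instead of shuffling.
j2 = JOIN flights BY airport_id, airports BY airport_id USING 'merge';

-- Skew join: Pig samples the keys first, then splits the hot ones across
-- several reducers so one giant key can't stall the job.
j3 = JOIN flights BY airport_id, airports BY airport_id USING 'skewed';
```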
- Processing Text
  - grep’ing for simple matches -- sketched below, along with tokenizing
  - tokenize text
  - simple document analysis
  - minhash clustering
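A hedged sketch of the first two items using Pig built-ins; the dataset schema is assumed.

```pig
-- Hedged sketch: grep-style filtering and word counting with Pig built-ins.
articles  = LOAD 'wikipedia_articles' AS (title:chararray, text:chararray);

-- grep'ing for simple matches: MATCHES applies a Java regex to the whole field.
elephants = FILTER articles BY text MATCHES '.*[Ee]lephant.*';

-- tokenize text: TOKENIZE splits on whitespace into a bag; FLATTEN unrolls it.
tokens = FOREACH articles GENERATE FLATTEN(TOKENIZE(text)) AS word;
counts = FOREACH (GROUP tokens BY word) GENERATE group AS word, COUNT(tokens) AS ct;
```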
- Geo Data
  - quadkeys and grid coordinate system
  - map wikipedia
  - k-means clustering to produce readable summaries
  - partial quad keys for "area" data -- sketched after this list
  - voronoi cells to do "nearby"-ness
  - Scripts:
    - `calculate_voronoi_cells` — use weather station locations to calculate voronoi polygons
    - `voronoi_grid_assignment` — cells that have a piece of border, or the largest grid cell that has no border on it
  - Using polymaps to see results
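A hedged sketch of the partial-quadkey idea. `GetQuadkey` here is a hypothetical UDF (not a Pig built-in) standing in for whatever lng/lat-to-quadkey function the book settles on; everything else is stock Pig.

```pig
-- Hedged sketch: roll point data up by partial quadkey.
REGISTER 'geo_udfs.jar';                       -- hypothetical jar
DEFINE GetQuadkey com.example.geo.GetQuadkey;  -- hypothetical UDF: (lng, lat, zoom) -> quadkey string

stations = LOAD 'weather/weather_stations' AS (id:chararray, lng:double, lat:double);
keyed    = FOREACH stations GENERATE id, GetQuadkey(lng, lat, 16) AS quadkey;

-- A quadkey's prefix names its enclosing coarser tile, so truncating to the
-- first 7 characters buckets stations into zoom-level-7 "area" cells.
by_area  = FOREACH keyed GENERATE id, SUBSTRING(quadkey, 0, 7) AS area_key;
areas    = FOREACH (GROUP by_area BY area_key)
           GENERATE group AS area_key, COUNT(by_area) AS n_stations;
```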
- Processing Graphs
  - subuniverse extraction
  - Pagerank -- one iteration is sketched after this list
  - identify strong links
  - clustering coefficient
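Pagerank in Pig reduces to repeating one step like the following. The adjacency-list layout is an assumption, and a small driver script (shell or Ruby) would re-join the new ranks with the link structure and loop until convergence.

```pig
-- Hedged sketch: a single PageRank iteration over an assumed adjacency list.
pages    = LOAD 'pages' AS (url:chararray, rank:double, links:bag{t:(dest:chararray)});

-- Each page sends rank/degree to every page it links to.
contribs = FOREACH pages GENERATE FLATTEN(links) AS dest,
                                  rank / (double)SIZE(links) AS contrib;

-- Sum arriving contributions and apply the usual 0.85 damping factor.
ranked   = FOREACH (GROUP contribs BY dest)
           GENERATE group AS url, 0.15 + 0.85 * SUM(contribs.contrib) AS rank;
```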
- Black-Box Machine Learning
  - Simple Naive Bayes classification
  - Document clustering
- Flume and Stream Processing
  - sources, sinks and decorators
  - deploying a wukong script as a decorator
  - parse the twitter stream API feed
- Time Series
  - windowing -- a tumbling-window sketch follows this list
  - simple anomaly detection
  - rolling statistics
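True sliding windows are awkward in plain Pig, so a common fallback, sketched here with an assumed schema, is tumbling windows: truncate each timestamp to its bucket, then aggregate per bucket.

```pig
-- Hedged sketch: fixed (tumbling) one-hour windows via timestamp bucketing.
obs      = LOAD 'weather/hourly_observations' AS (station:chararray, epoch:long, temp:double);
windowed = FOREACH obs GENERATE station, (epoch / 3600L) * 3600L AS window_start, temp;
stats    = FOREACH (GROUP windowed BY (station, window_start)) GENERATE
             FLATTEN(group) AS (station, window_start),
             AVG(windowed.temp) AS avg_temp,
             MAX(windowed.temp) AS max_temp;
```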
- Pig UDFs
  - Basic UDF -- the Pig-side wiring is sketched after this list
  - why algebraic is awesome and how to be algebraic
  - Wonderdog: a LoadFunc / StoreFunc for elasticsearch
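From the script's side, a UDF is just a REGISTER plus a DEFINE; the jar and class below are hypothetical stand-ins. (The "algebraic" point: a UDF that implements Pig's Algebraic interface can run partially in the combiner, which is why built-ins like COUNT and SUM stay fast at scale.)

```pig
-- Hedged sketch: wiring an (invented) UDF into a script.
REGISTER 'my_udfs.jar';                          -- hypothetical jar
DEFINE Tidy com.example.pig.TrimAndLowercase();  -- hypothetical EvalFunc<String>

raw  = LOAD 'wikipedia_articles' AS (title:chararray, text:chararray);
tidy = FOREACH raw GENERATE Tidy(title) AS title, text;
```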
- Installing and Operating a Cluster
- Tuning
- HBase and Databases
- How to Scale Dirty and its Influence on People
  - How to think at scale
  - Pedantic Points of Style
  - Best Practices