Hello and Thanks, Courageous and Farsighted Early Released-To’er! I want to make sure the book delivers value to you now, and rewards your early confidence by becoming the book you’re proud to own.
- The rule of thumb I’m using on introductory material is "If it’s well-covered on the internet, leave it out". I hate it when tech books give a topic the bus-tour-of-London ("On your window to the left is the outside of the British Museum!") treatment, but you should never find yourself completely stranded. Please let me know if that’s the case.
- Analogies: We’ll be accompanied on part of our journey by Chimpanzee and Elephant, whose adventures are surprisingly relevant to understanding the internals of Hadoop. I don’t want to waste your time laboriously remapping those adventures back to the problem at hand, but I definitely don’t want to get too cute with the analogy. Again, please let me know if I err on either side.
This is the plan. We’ll roll material out over the next few months. Should we find we need to cut things (I hope not to), I’ve flagged a few chapters as (bubble).
- First Exploration: A walkthrough of a problem you’d use Hadoop to solve, showing the workflow and thought process. Hadoop asks you to write code poems that compose what we’ll call transforms (process records independently) and pivots (restructure data).
- Transform-only Job: Chimpanzee and Elephant are hired to translate the works of Shakespeare into every language; you’ll take over the task of translating text to Pig Latin. This is an "embarrassingly parallel" problem, so we can learn the mechanics of launching a job and a coarse understanding of HDFS without having to think too hard. (A rough sketch of the kind of script involved follows this entry.)
  - Chimpanzee and Elephant start a business
  - Pig Latin translation
  - Your first job: test at the command line
  - Run it on the cluster
  - Input Splits
  - Why Hadoop I: Simple Parallelism
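To give you a taste of what a transform-only job amounts to, here is a framework-free sketch of a Pig Latin "mapper" in plain Ruby. It is not the book’s actual code (the book will use Wukong, and the file name here is just for illustration); it simply shows the shape of the thing: read lines on STDIN, write transformed lines to STDOUT, and let Hadoop worry about everything else.

[source,ruby]
----
#!/usr/bin/env ruby
# pig_latin_mapper.rb -- illustrative sketch of a transform-only "mapper":
# one line in, one (igpay atinlay) line out.

# Translate a single word into Pig Latin; leave anything non-alphabetic alone.
def pig_latin(word)
  return word unless word =~ /\A[a-zA-Z]+\z/
  if word =~ /\A[aeiouAEIOU]/
    word + 'way'                          # vowel-initial words just get a suffix
  else
    m = word.match(/\A([^aeiouAEIOU]+)(.*)\z/)
    m[2] + m[1].downcase + 'ay'           # move the leading consonant cluster to the end
  end
end

# Stream lines from STDIN to STDOUT -- exactly what Hadoop Streaming expects of a mapper.
$stdin.each_line do |line|
  puts line.chomp.split(/(\W+)/).map { |token| pig_latin(token) }.join
end

# Try it locally before going anywhere near a cluster, e.g.:
#   cat shakespeare_sample.txt | ruby pig_latin_mapper.rb | head
----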
- Transform-Pivot Job: C&E help SantaCorp optimize the Christmas toymaking process, demonstrating the essential problem of data locality (the central challenge of Big Data). We’ll follow along with a job requiring both map and reduce, and learn a bit more about Wukong (a Ruby-language framework for Hadoop). (A sketch of the reduce side follows this entry.)
  - Locality: the central challenge of distributed computing
  - The Hadoop Haiku
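The reduce side is where the pivot happens: Hadoop groups and sorts the mappers’ output by key before handing it to your reducer. As another framework-free, illustrative sketch (again, not the book’s actual Wukong code, and the file names are hypothetical), here is a streaming-style reducer that counts records per key. All it relies on is the guarantee that its input arrives sorted by key.

[source,ruby]
----
#!/usr/bin/env ruby
# count_reducer.rb -- illustrative streaming-style reducer.
# Hadoop delivers the map output grouped and sorted by key, so totalling up each
# key's records takes nothing more than "notice when the key changes".

current_key = nil
count       = 0

$stdin.each_line do |line|
  key, _value = line.chomp.split("\t", 2)   # streaming records are tab-separated key/value pairs
  if key != current_key
    puts [current_key, count].join("\t") unless current_key.nil?
    current_key = key
    count       = 0
  end
  count += 1
end
puts [current_key, count].join("\t") unless current_key.nil?

# Simulate the whole map/sort/reduce flow on your laptop:
#   cat records.tsv | ruby some_mapper.rb | sort | ruby count_reducer.rb
----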
- Server Log Processing:
  - Parsing logs and using regular expressions (a sketch follows this entry)
  - Histograms and time series of pageviews
  - Geolocate visitors based on IP
  - (Ab)Using Hadoop to stress-test your web server
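As a preview of the kind of regular expression that chapter leans on, here is a rough sketch (not the book’s code; the file name and field choices are just for illustration) of pulling fields out of an Apache combined-log-format line in Ruby:

[source,ruby]
----
#!/usr/bin/env ruby
# parse_weblog.rb -- illustrative regexp-based parsing of Apache combined-format log lines.

LOG_LINE_RE = %r{\A
  (?<ip>\S+)\s+\S+\s+(?<user>\S+)\s+                    # client IP, identd, remote user
  \[(?<timestamp>[^\]]+)\]\s+                           # e.g. [10/Oct/2000:13:55:36 -0700]
  "(?<http_method>\S+)\s+(?<path>\S+)[^"]*"\s+          # e.g. "GET /index.html HTTP/1.0"
  (?<status>\d+)\s+(?<bytes>\S+)                        # response code and size ('-' if unknown)
  (?:\s+"(?<referer>[^"]*)"\s+"(?<user_agent>[^"]*)")?  # referer and user agent, when present
}x

$stdin.each_line do |line|
  m = LOG_LINE_RE.match(line) or next     # skip malformed lines rather than dying
  # Emit a tab-separated record with just the fields a pageview histogram would need.
  puts [m[:ip], m[:timestamp], m[:path], m[:status]].join("\t")
end
----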
- Semi-Structured Data: The dirty art of data munging. It’s a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We’ll show you street-fighting tactics that lessen the time and pain. Along the way, we’ll prepare the datasets to be used throughout the book:
  - Wikipedia Articles: Every English-language article (12 million) from Wikipedia.
  - Wikipedia Pageviews: Hour-by-hour counts of pageviews for every Wikipedia article since 2007.
  - US Commercial Airline Flights: every commercial airline flight since 1987.
  - Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.
  - "Star Wars Kid" weblogs: a large collection of Apache webserver logs from a popular internet site (Andy Baio’s waxy.org).
- Data Models & Data Formats: How to design your data models, transmit their contents, and organize their whole.
  - How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they’re equivalent; with some experience under your belt it’s worth learning how to fluidly shift among these different models.
  - Geographic Data:
  - C&E
- Why Hadoop: Why and when is Hadoop called for? What changed about the world, and why weren’t we always doing it this way?
- Statistics:
  - Sampling responsibly: it’s harder and more important than you think
  - Statistical aggregates and the danger of large numbers
  - Histograms and Percentiles
- Graph Processing:
  - Community Extraction: Use the page-to-page links in Wikipedia to identify similar documents
  - Pagerank (centrality): Reconstruct pageview paths from web logs, and use them to identify important pages
  - (bubble)
- Text Processing: We’ll show how to combine powerful existing libraries with Hadoop to do effective text handling and Natural Language Processing:
  - Indexing documents
  - Tokenizing documents using Lucene
  - Pointwise Mutual Information
  - (bubble) Minhashing to combat a massive feature space
  - (bubble) How to cheat with Bloom filters
  - (bubble) Topic extraction using (to be determined)
- Machine Learning for Busy Folks with Things to Do: We’ll combine the record of every commercial flight since 1987 with the hour-by-hour weather data to predict flight delays, using:
  - Naive Bayes
  - Logistic Regression
  - Random Forest (using Mahout)
  We’ll equip you with a picture of how they work, but won’t go into the math of how or why. We will show you how to choose a method, and how to cheat to win.
- Hadoop Internals:
  - Tuning for the Wise and Lazy
  - Tuning for the Brave and Foolish
  - The USE Method for understanding performance and diagnosing problems
- Advanced Pig:
  - Specialized joins that can dramatically speed up (or make feasible) your data transformations
  - Pig UDFs (User-Defined Functions) to extend Pig’s capabilities
- Cheatsheets:
  - Regular Expressions
  - Sizes of the Universe
  - Hadoop Tuning & Configuration Variables
I’m not currently planning to cover Hive — I believe the Pig scripts will translate naturally for folks who are already familiar with it. There will be a brief section explaining why you might choose Hive over Pig, and why I chose Pig over Hive. If there’s popular pressure I may add a "translation guide".
Other things I don’t plan to include:
- Installing or maintaining Hadoop
- HBase (or other databases)
- Other map-reduce-like platforms (Disco, Spark, etc), or other frameworks (MrJob, Scalding, Cascading)
- Stream processing with Trident. (A likely sequel, should this go well?)
- At a few points we’ll use Mahout, R, D3.js and Unix text utils (cut/wc/etc), but only as tools for an immediate purpose. I can’t justify going deep into any of them; there are whole O’Reilly books on each.
The source code for the book — all the prose, images, the whole works — is on github at http://github.com/infochimps-labs/big_data_for_chimps.
Contact us! If you have questions, comments or complaints, the issue tracker http://github.com/infochimps-labs/big_data_for_chimps/issues is the best forum for sharing those. If you’d like something more direct, please email [email protected] (the ever-patient editor) and [email protected] (your eager author). Please include both of us.
OK! On to the book. Or, on to the introductory parts of the book and then the book.
'Big Data for Chimps' shows you how to solve hard problems using simple, fun, elegant tools.
It contains:
- Detailed example programs applying Hadoop to interesting problems in context
- Advice and best practices for efficient software development
- How to think at scale — equipping you with a deep understanding of how to break a problem into efficient data transformations, and of how data must flow through the cluster to effect those transformations.
All of the examples use real data, and describe patterns found in many problem domains:
- Statistical summaries
- Identifying patterns and groups in the data
- Searching, filtering and herding records in bulk
- Advanced queries against spatial or time-series data sets
This is not a beginner’s book. The emphasis on simplicity and fun should make it especially appealing to beginners, but this is not an approach you’ll outgrow. The emphasis is on simplicity and fun because it’s the most powerful approach, and generates the most value, for creative analytics: humans are important, robots are cheap. The code you see is adapted from programs we write at Infochimps. There are sections describing how and when to integrate custom components or extend the toolkit, but simple high-level transformations meet almost all of our needs.
Most of the chapters have exercises included. If you’re a beginning user, I highly recommend you work out at least one exercise from each chapter. Deep learning will come less from having the book in front of you as you read it than from having the book next to you while you write code inspired by it. There are sample solutions and result datasets on the book’s website.
Feel free to hop around among chapters; the application chapters don’t have large dependencies on earlier chapters.
You should be familiar with at least one programming language, but it doesn’t have to be Ruby. Ruby is a very readable language, and the code samples provided should map cleanly to languages like Python or Scala. Familiarity with SQL will help a bit, but isn’t essential.
This book picks up where the internet leaves off — apart from cheatsheets at the end of the book, I’m not going to spend any real time on information well-covered by basic tutorials and core documentation.
All of the code in this book will run unmodified on your laptop computer and on an industrial-strength Hadoop cluster (though you will want to use a reduced data set for the laptop). You do need a Hadoop installation of some sort, even if it’s a single machine. While a multi-machine cluster isn’t essential, you’ll learn best by spending some time on a real environment with real data. Appendix (TODO: ref) describes your options for installing Hadoop.
Most importantly, you should have an actual project in mind that requires a big data toolkit to solve — a problem that requires scaling out across multiple machines. If you don’t already have a project in mind but really want to learn about the big data toolkit, take a quick browse through the exercises. At least a few of them should have you jumping up and down with excitement to learn this stuff.
This is not "Hadoop the Definitive Guide" (that’s been written, and well); this is more like "Hadoop: a Highly Opinionated Guide". The only coverage of how to use the bare Hadoop API is to say "In most cases, don’t". We recommend storing your data in one of several highly space-inefficient formats and in many other ways encourage you to willingly trade a small performance hit for a large increase in programmer joy. The book has a relentless emphasis on writing scalable code, but no content on writing performant code beyond the advice that the best path to a 2x speedup is to launch twice as many machines.
That is because for almost everyone, the cost of the cluster is far less than the opportunity cost of the data scientists using it. If you have not just big data but huge data — let’s say somewhere north of 100 terabytes — then you will need to make different tradeoffs for jobs that you expect to run repeatedly in production.
The book does have some content on machine learning with Hadoop, on provisioning and deploying Hadoop, and on a few important settings. But it does not cover advanced algorithms, operations or tuning in any real depth.
I plan to push chapters to the publicly-viewable 'Big Data for Chimps' git repo as they are written, and to post them periodically to the Infochimps blog after minor cleanup.
We really mean it about the git social-coding thing — please comment on the text, file issues and send pull requests. However! We might not use your feedback, no matter how dazzlingly cogent it is; and while we are soliciting comments from readers, we are not seeking content from collaborators.