aa0a-topics.asciidoc

  1. First exploration

    • motivation

    • walkthrough

    • reflection

  2. Stream

    • Why Hadoop I: Simple Parallelism

    • Chimps at typewriters

    • Pig Latin translation

    • Testing it at the command line

    • Running it on a cluster

    • Input Splits

  3. Reshape

    • Locality

    • Elves, part 1

    • Simple Join

    • Elves, part 2

    • Partition key + sort key

  4. Using Hadoop and herding `cat`s

    • overview of Wukong

    • overview of Pig

    • toolset overview

  5. cat herding

    • Simple (!) munging

    • total sort

    • sampling

  6. Data munging (Semi-structured data)

  7. Log Processing

    • First Pig

    • Sessionizing a log

  8. Statistics

    • Average, StdDev, etc. of a huge spreadsheet

    • Exact Percentiles (Median) of a huge spreadsheet

    • Approximate Percentiles (Median) of a huge spreadsheet

    • Histogram

  9. Geographic

    • mechanics of handling geo data

    • Statistics on grid cells

    • Clustering

    • Pointwise mutual information

  10. Text Processing

    • Inverted Index (word count)

    • MinHash

  11. Time Series

    • weather & flight delays for prediction

    • Anomaly detection

    • Wikipedia Pageview

    • Flight delays

    • World Cup

  12. Graph

    • Adjacency List / Edge List conversion

    • Minimal Spanning Tree

    • PageRank

    • Undirecting a graph

    • Assemble a min-index adjacency list

    • Breadth-First Search

    • Min-degree undirected graph

  13. Hadoop Internals

  14. Tuning, for the wise and lazy

  15. Tuning, for the brave and foolish