The split-apply-combine philosophy for data tables scales up to much larger problems. Sometimes the tables themselves grow too big for one computer (or CPU); other times the tables are reasonably sized, but the tasks are numerous. Split-apply-combine at scale goes by names such as map-reduce, Hadoop, and high-throughput computing. The idea is to take a big task, split it into many (thousands or millions of) small tasks, send each of those to a separate CPU, and combine (reduce) the results back together once all are completed. Such projects typically consume thousands of compute hours, and in some cases millions.
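As a small single-machine illustration of this pattern, the R `parallel` package (shipped with base R) can split a table into chunks, apply a summary to each chunk on a separate CPU core, and combine the results. This is only a sketch; the table, column names, and summary below are made up for illustration.

```r
library(parallel)

## Hypothetical table: one million (group, value) measurements.
big_table <- data.frame(group = sample(letters, 1e6, replace = TRUE),
                        value = rnorm(1e6))

## Split: break the table into one chunk per group.
chunks <- split(big_table, big_table$group)

## Apply: summarize each chunk on a separate core.
## (mclapply forks processes; on Windows use mc.cores = 1 or a socket cluster.)
summaries <- mclapply(chunks,
                      function(chunk) data.frame(group = chunk$group[1],
                                                 mean  = mean(chunk$value),
                                                 n     = nrow(chunk)),
                      mc.cores = 4)

## Combine: stack the per-chunk summaries back into one table.
result <- do.call(rbind, summaries)
```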
This approach works best when tasks are "embarrassingly parallel" -- that is, they can be done independently of each other, such as separate simulations from the same model system; a sketch of this case follows below. Unfortunately, some loosely coupled tasks depend on each other and require more complicated arrangements, perhaps even high-performance computing hardware such as graphics processing units (GPUs).
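For the embarrassingly parallel case, each worker runs its own replicate and the only coordination needed is collecting the results at the end. A minimal sketch with a socket cluster (which also works on Windows); the simulation function and its parameters are hypothetical.

```r
library(parallel)

## Hypothetical simulation: draw one sample from the model, return summary statistics.
simulate_once <- function(seed, n = 1000) {
  set.seed(seed)
  x <- rnorm(n, mean = 2, sd = 1)
  c(seed = seed, mean = mean(x), sd = sd(x))
}

## Split: one task per simulation replicate.
seeds <- 1:200

## Apply: farm the replicates out to a small cluster of worker processes.
cl <- makeCluster(4)
runs <- parLapply(cl, seeds, simulate_once)
stopCluster(cl)

## Combine: bind the per-replicate summaries into one matrix.
results <- do.call(rbind, runs)
head(results)
```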
- R/parallel
- Parallel R (Q. Ethan McCallum and Stephen Weston; O'Reilly, 2011)
- Vaughan LK, Srinivasasainagendra V (2013) Where in the genome are we? A cautionary tale of database use in genomics research. Front. Genet. 4:38.
- Data-Intensive Science think-piece from Data-Enabled Life Science Research (DELSA)
- Cloud Computing (Wikipedia)
- Grid Computing (Wikipedia; see list of national projects)
- Open Science Grid
- HTCondor Project | Center for High Throughput Computing, UW-Madison
- IEEE High Performance Computing Conference
- NVIDIA High Performance Computing
- Simulation Based Engineering Lab, UW-Madison
- TurnKeyLinux.org | OwnCloud.org
- OpenData Exchange | OpenData Foundation
- Algorithms, Machines, People (AMP Lab, UC-Berkeley)
- BIGDATA: White House Initiative
- BackBlaze Online Backup (BackBlaze Storage Pod Details)
- Dremel: Interactive Analysis of Web-Scale Datasets (Google)