The split-apply-combine philosophy for data tables scales up to much larger problems. Sometimes the tables themselves grow too big for one computer (or CPU); other times the tables are reasonably sized, but the tasks are numerous. Split-apply-combine at scale goes by names such as map-reduce, Hadoop, and high-throughput computing. The idea is to take a big task, split it into many (thousands or millions of) small tasks, send each of those to a separate CPU, and combine (reduce) the results back together once all are completed. Such projects typically consume thousands of compute hours, and in some cases millions.
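As a small single-machine illustration of this pattern, the R `parallel` package (shipped with base R) can split a table into chunks, apply a summary to each chunk on a separate CPU core, and combine the results. This is only a sketch; the table, column names, and summary below are made up for illustration.

```r
library(parallel)

## Hypothetical table: one million (group, value) measurements.
big_table <- data.frame(group = sample(letters, 1e6, replace = TRUE),
                        value = rnorm(1e6))

## Split: break the table into one chunk per group.
chunks <- split(big_table, big_table$group)

## Apply: summarize each chunk on a separate core.
## (mclapply forks processes; on Windows use mc.cores = 1 or a socket cluster.)
summaries <- mclapply(chunks,
                      function(chunk) data.frame(group = chunk$group[1],
                                                 mean  = mean(chunk$value),
                                                 n     = nrow(chunk)),
                      mc.cores = 4)

## Combine: stack the per-chunk summaries back into one table.
result <- do.call(rbind, summaries)
```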
This approach works best when tasks are "embarrassingly parallel" -- that is, they can be done independently of each other, such as separate simulations from the same model system; a sketch of this case follows below. Unfortunately, some loosely coupled tasks depend on each other and require more complicated arrangements, perhaps even high-performance computing hardware such as graphics processing units (GPUs).
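For the embarrassingly parallel case, each worker runs its own replicate and the only coordination needed is collecting the results at the end. A minimal sketch with a socket cluster (which also works on Windows); the simulation function and its parameters are hypothetical.

```r
library(parallel)

## Hypothetical simulation: draw one sample from the model, return summary statistics.
simulate_once <- function(seed, n = 1000) {
  set.seed(seed)
  x <- rnorm(n, mean = 2, sd = 1)
  c(seed = seed, mean = mean(x), sd = sd(x))
}

## Split: one task per simulation replicate.
seeds <- 1:200

## Apply: farm the replicates out to a small cluster of worker processes.
cl <- makeCluster(4)
runs <- parLapply(cl, seeds, simulate_once)
stopCluster(cl)

## Combine: bind the per-replicate summaries into one matrix.
results <- do.call(rbind, runs)
head(results)
```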
- R/parallel
- Parallel R (Q. Ethan McCallum and Stephen Weston; O'Reilly, 2011)
- Vaughan LK, Srinivasasainagendra V (2013) Where in the genome are we? A cautionary tale of database use in genomics research. Front. Genet. 4:38.
- Data-Intensive Science think-piece from Data-Enabled Life Science Research (DELSA)
- Cloud Computing (Wikipedia)
- Grid Computing (Wikipedia; see list of national projects)
- Open Science Grid
- HTCondor Project | Center for High Throughput Computing, UW-Madison
- IEEE High Performance Computing Conference
- NVIDIA High Performance Computing
- Simulation Based Engineering Lab, UW-Madison
- TurnKeyLinux.org | OwnCloud.org
- OpenData Exchange | OpenData Foundation
- Algorithms, Machines, People (AMP Lab, UC-Berkeley)
- BIGDATA: White House Initiative
- BackBlaze Online Backup (BackBlaze Storage Pod Details)
- Dremel: Interactive Analysis of Web-Scale Datasets (Google)