Latency
In the Parallel Programming course we learned about:
- Data Parallelism in the single machine, multi-core, multiprocessor world.
- Parallel Collections as an implementation of this paradigm.
Here we will learn:
- Data Parallelism in a distributed (multi node) setting
- Distributed collections abstraction from Apache Spark as an implementation of this paradigm.
Because of the distribution, we have 2 new issues:
- Partial Failure: crash failures on a subset of machines in the cluster.
- Latency: network communication makes some operations much slower. Unlike partial failure, latency cannot be masked and is always present; it shapes both the programming model and the code we write, since we must try to reduce network communication.
Apache Spark stands out in the way it handles these issues.
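One reason the transition feels natural is that the collections API carries over almost unchanged to Spark. A sketch of the parallel (plain Scala collections run here; the Spark lines are commented out and assume a `SparkContext` named `sc`, which is not defined in this snippet):

```scala
// Plain Scala collection on a single machine:
val words = List("spark", "hadoop", "latency", "cluster")
val longLengths = words.map(_.length).filter(_ > 6)
println(longLengths) // List(7, 7)

// The hypothetical Spark equivalent looks nearly identical, except the
// data is partitioned across the nodes of a cluster:
// val wordsRdd   = sc.parallelize(words)
// val lengthsRdd = wordsRdd.map(_.length).filter(_ > 6)
// lengthsRdd.collect() // Array(7, 7)
```

The same `map`/`filter` chain works in both worlds; what changes underneath is where the data lives and what each operation costs.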
Latency Comparison Numbers
--------------------------
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns 14x L1 cache
Mutex lock/unlock 25 ns
Main memory reference 100 ns 20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy 3,000 ns 3 us
Send 1K bytes over 1 Gbps network 10,000 ns 10 us
Read 4K randomly from SSD* 150,000 ns 150 us ~1GB/sec SSD
Read 1 MB sequentially from memory 250,000 ns 250 us
Round trip within same datacenter 500,000 ns 500 us
Read 1 MB sequentially from SSD* 1,000,000 ns 1,000 us 1 ms ~1GB/sec SSD, 4X memory
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Read 1 MB sequentially from disk 20,000,000 ns 20,000 us 20 ms 80x memory, 20X SSD
Send packet US -> Europe -> US 150,000,000 ns 150,000 us 150 ms
- Note that reading 1 MB sequentially from disk is 80x more expensive than reading 1 MB sequentially from memory.
- Also, sending a packet from US -> Europe -> US is about 1.5 million times more expensive than a main memory reference.
- The general trend is:
- memory operations = fastest
- disk operations = slow
- network operations = slowest
Let's multiply all these durations by a billion (so nanoseconds become seconds) and map them onto human timescales:
# Minute:
==========
L1 cache reference 0.5 s One heartbeat (0.5 s)
Branch mispredict 5 s Yawn
L2 cache reference 7 s Long yawn
Mutex lock/unlock 25 s Making a coffee
# Hour:
==========
Main memory reference 100 s Brushing your teeth
Compress 1K bytes with Zippy 50 min One episode of a TV show (including ad breaks)
# Day
==========
Send 2K bytes over 1 Gbps network 5.5 hr From lunch to end of work day
# Week
==========
SSD random read 1.7 days A normal weekend
Read 1 MB sequentially from memory 2.9 days A long weekend
Round trip within same datacenter 5.8 days A medium vacation
Read 1 MB sequentially from SSD 11.6 days Waiting for almost 2 weeks for a delivery
# Year
==========
Disk seek 16.5 weeks A semester in university
Read 1 MB sequentially from disk 7.8 months Almost producing a new human being
The above 2 together 1 year
# Decade
==========
Send packet US -> Europe -> US 4.8 years Average time it takes to complete a bachelor's degree
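The scaling above is simple arithmetic: multiplying a duration in nanoseconds by a billion yields the same number in seconds, which we can then convert to days or years. A quick sketch (the operation names and the small sample of latencies are taken from the table above):

```scala
// Latencies in nanoseconds; after multiplying by 1e9 the same numbers
// can be read directly as seconds.
val latenciesNs: Seq[(String, Double)] = Seq(
  "L1 cache reference"        -> 0.5,
  "Main memory reference"     -> 100.0,
  "Disk seek"                 -> 1.0e7,
  "Packet US -> Europe -> US" -> 1.5e8
)

for ((op, ns) <- latenciesNs) {
  val seconds = ns               // ns * 1e9 ns->s scaling: same magnitude
  val days    = seconds / 86400  // 86,400 seconds per day
  println(f"$op%-28s $seconds%.1f s ($days%.2f days)")
}
```

Checking two rows: a disk seek becomes 1.0e7 s ≈ 115.7 days ≈ 16.5 weeks, and the transatlantic round trip becomes 1.5e8 s ≈ 4.8 years, matching the figures above.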
Hadoop is the predecessor of Spark. Hadoop had a simple API and good fault tolerance (tolerance to nodes failing midway through a processing job).
This came at a cost: to be able to recover from potential failures, Hadoop shuffles its data and writes intermediate results to disk and over the network between each map and reduce step, which, as the numbers above show, is expensive.
Spark retains these advantages but handles latency differently, using ideas from functional programming. It keeps all data immutable and in memory; operations on the data are just functional transformations, like those offered by regular Scala collections. Fault tolerance is achieved by keeping track of the chain of transformations applied to a dataset: if a node fails, the transformations can simply be replayed over the original dataset on another node. No intermediate results need to be written; disk operations happen only when the data is finally read or written.
Thus Hadoop leans heavily on disk and network for its operations, whereas Spark keeps as much as possible in memory while minimizing network traffic. This is why Spark is significantly faster than Hadoop.
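The replay idea can be sketched in plain Scala (this is an illustration of the concept, not Spark's actual internals; the names `lineage` and `replay` are invented for this example):

```scala
// The original (immutable) source data for one partition.
val source = Vector(1, 2, 3, 4, 5)

// Instead of storing intermediate results, record the chain of
// transformations that should be applied to the source.
val lineage: List[Vector[Int] => Vector[Int]] =
  List(_.map(_ * 10), _.filter(_ > 20))

// If the node holding a partial result dies, nothing is lost: any other
// node can recompute the result by replaying the lineage over the source.
def replay(data: Vector[Int], ops: List[Vector[Int] => Vector[Int]]): Vector[Int] =
  ops.foldLeft(data)((acc, op) => op(acc))

println(replay(source, lineage)) // Vector(30, 40, 50)
```

Because the source is immutable and each step is a pure function, replaying the chain always reproduces the same result, with no intermediate state written to disk.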
Week 1
- Introduction
- Data Parallel to Distributed Data Parallel
- Latency
- RDDs: Spark's Distributed Collection
- RDDs: Transformation and Action
- Evaluation in Spark: Unlike Scala Collections!
- Cluster Topology Matters!
Week 2
- Reduction Operations (fold, foldLeft, aggregate)
- Pair RDDs
- Pair RDDs: Transformations and Actions
- Pair RDDs: Joins
Week 3
- Shuffling: What it is and why it's important
- Partitioning
- Optimizing with Partitioners
- Wide vs Narrow Dependencies
Week 4