This repository has been archived by the owner on Feb 21, 2024. It is now read-only.
Commit 07b075c (1 parent: 95ce5ac), showing 1 changed file with 27 additions and 0 deletions.
# Chapter 8: The Trouble with Distributed Systems

### Problems that can occur in distributed systems

- Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed. Likewise, the reply may be
  lost or delayed, so if you don’t get a reply, you have no idea whether the message got through.
- A node's clock may be significantly out of sync with other nodes (despite your best efforts to set up NTP), it may
  suddenly jump forward or back in time, and relying on it is dangerous because you most likely don’t have a good
  measure of your clock's error interval.
- A process may pause for a substantial amount of time at any point in its execution (perhaps due to a stop-the-world
  garbage collector), be declared dead by other nodes, and then come back to life again without realizing that it was
  paused.

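To make the clock point concrete, here is a minimal sketch of comparing two timestamps when each carries an error
interval. The `ClockReading` type and the `max_error` values are assumptions made up for illustration; as noted above,
most clocks do not report a trustworthy error bound, so in practice this number would have to be estimated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockReading:
    """A timestamp plus an assumed error bound, both in seconds."""
    timestamp: float
    max_error: float  # hypothetical bound; real clocks rarely expose one

    @property
    def earliest(self) -> float:
        return self.timestamp - self.max_error

    @property
    def latest(self) -> float:
        return self.timestamp + self.max_error


def definitely_before(a: ClockReading, b: ClockReading) -> bool:
    """True only if a's whole interval lies before b's whole interval."""
    return a.latest < b.earliest
```

If the two intervals overlap, `definitely_before` returns `False` in both directions, which signals that the order of
the two events cannot be determined from the clocks alone.
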
Whenever software tries to do anything involving other nodes, the operation may occasionally fail, become
unpredictably slow, or get no response at all (and eventually time out).

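The three outcomes can be sketched as follows. This is only an illustration: `call_remote`, `Outcome`, and the
`request_fn` callable are hypothetical names, and a worker thread stands in for a real network call. The key point is
that a timeout maps to "unknown", not to "failed", because the request may still have been processed.

```python
import concurrent.futures
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    FAILED = "failed"    # the call returned an explicit error
    UNKNOWN = "unknown"  # timed out: the request may or may not have taken effect


def call_remote(request_fn, timeout_s: float):
    """Run request_fn (a stand-in for a network call) and classify the result."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(request_fn)
    try:
        return Outcome.OK, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # No reply in time. We cannot tell a lost request from a lost reply.
        return Outcome.UNKNOWN, None
    except Exception as exc:
        return Outcome.FAILED, exc
    finally:
        # Do not block on the worker; the slow call may still complete later.
        pool.shutdown(wait=False)
```

A caller that receives `Outcome.UNKNOWN` can only retry safely if the operation is idempotent.
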
To tolerate faults, the first step is to _detect_ them.

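One common way to detect faults is a heartbeat timeout. The sketch below is a simplified illustration, not a
production failure detector: the `HeartbeatDetector` class and its method names are invented here, and the caller is
assumed to pass in readings from a monotonic clock such as `time.monotonic()`.

```python
class HeartbeatDetector:
    """Suspects a node once no heartbeat has arrived within timeout_s seconds."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of last heartbeat

    def record_heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        """Return the nodes whose last heartbeat is older than the timeout."""
        return {
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        }


detector = HeartbeatDetector(timeout_s=5.0)
detector.record_heartbeat("node-a", now=0.0)
detector.record_heartbeat("node-b", now=4.0)
late = detector.suspected(now=6.0)  # only node-a has been silent for more than 5 s
```

A suspected node is not necessarily dead. It may only be slow or paused, which is why the result is a suspicion and
not a verdict.
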
Major decisions cannot be safely made by a single node, so we require protocols that enlist help from other
nodes and try to get a quorum to agree.

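As a rough sketch of the quorum idea, assume a quorum means a strict majority of all nodes in the cluster (other
quorum definitions exist). The function names and the `votes` mapping are hypothetical, and this shows only the
counting step, not a full consensus protocol.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def quorum_agrees(votes: dict, cluster_size: int) -> bool:
    """votes maps node name -> True/False; nodes that did not reply are absent."""
    yes_votes = sum(1 for vote in votes.values() if vote)
    return yes_votes >= majority(cluster_size)
```

Nodes that fail to reply are simply missing from `votes`, so they count against the quorum. A decision therefore needs
explicit agreement from more than half of the cluster, and two conflicting decisions cannot both reach a majority.
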
It is possible to give hard real-time response guarantees and bounded delays in networks, but doing so is very
expensive and results in lower utilization of hardware resources.

Supercomputers assume reliable components and thus have to be stopped and restarted entirely when a component fails.
By contrast, distributed systems can run forever without being interrupted at the service level, because all faults and
maintenance can be handled at the node level (at least in theory).