01-Distributed System Foundations
Overview
A distributed system is a collection of autonomous computing nodes that communicate over a network to achieve a common goal. Unlike centralized systems, distributed systems face unique challenges arising from the lack of shared memory, independent node failures, unpredictable network delays, and the fundamental impossibility of perfect coordination. This document establishes the foundational concepts necessary for understanding and building distributed systems: defining characteristics, system models, failure modes, fundamental impossibility results, and key design principles.
Defining Characteristics
A distributed system is characterized by a set of core properties that distinguish it from centralized and parallel systems.
Core Properties
Interconnected Autonomous Nodes
A distributed system is a collection of independent nodes, each with its own processor, memory, and local state, interconnected by a Local Area Network (LAN) or Wide Area Network (WAN). Each node operates autonomously and can make local decisions.
No Shared Physical Memory
Nodes do not share physical memory. All communication and coordination must occur through explicit message passing over the network. This fundamental constraint shapes the design of all distributed algorithms.
Communication Time Dominance
The time required for message communication between nodes (Tm) is significantly greater than the time to execute local computations (Te). This inequality, Tm >> Te, is the defining characteristic of distributed systems according to Lamport's definition:
"A system is distributed if the message transmission time is not negligible compared to the time between events in a single process."
Implications: Because communication dominates, distributed algorithms must minimize message passing and maximize local computation. Even modern data center clusters qualify as distributed systems, as network latency remains orders of magnitude higher than local memory access.
Partial Failure
Individual nodes or network links can fail independently while the rest of the system continues operating. Unlike centralized systems where a single failure halts everything, distributed systems must gracefully handle partial failures.
Concurrency
Multiple nodes execute simultaneously, leading to inherent concurrency. Without perfect synchronization, different nodes may observe events in different orders, creating consistency challenges.
System Models
System models define the assumptions about timing, synchronization, and behavior that algorithms can rely upon. The choice of model profoundly affects what is achievable.
Timing Models
Synchronous System
A system where there exist known upper bounds on:
Message delivery time
Processing time for each step
Clock drift rate
Advantages: Enables timeout-based failure detection. If a node doesn't respond within the bound, it can be declared failed with certainty.
Disadvantages: Difficult to achieve in practice, especially in WANs. Requires conservative bounds, leading to poor performance.
Example: Hard real-time systems, tightly-coupled clusters with reliable networks.
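To make this concrete, the sketch below (Python) shows how known delivery and processing bounds turn a missed heartbeat deadline into a definitive failure verdict. The numeric bounds and heartbeat interval are illustrative assumptions, not values from this document.

```python
import time

# Minimal sketch of timeout-based failure detection under synchronous assumptions.
MAX_MESSAGE_DELAY = 0.050   # assumed known upper bound on one-way delivery (s)
MAX_STEP_TIME = 0.010       # assumed known upper bound on processing a heartbeat (s)
HEARTBEAT_INTERVAL = 0.100  # how often healthy nodes send heartbeats (s)

TIMEOUT = HEARTBEAT_INTERVAL + MAX_MESSAGE_DELAY + MAX_STEP_TIME

last_heartbeat = {}  # node id -> time its last heartbeat was observed

def record_heartbeat(node):
    last_heartbeat[node] = time.monotonic()

def has_crashed(node):
    # Because delivery and processing are bounded, missing the deadline
    # is proof of failure rather than a guess.
    return time.monotonic() - last_heartbeat.get(node, 0.0) > TIMEOUT

record_heartbeat("a")      # node "a" reports in; node "b" never does
time.sleep(0.05)
print(has_crashed("a"), has_crashed("b"))  # False True
```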
Asynchronous System
A system with no timing assumptions. Messages can be arbitrarily delayed, processes can be arbitrarily slow, and there are no clock bounds.
Advantages: Realistic model for the Internet and most distributed systems.
Disadvantages: Fundamentally limits what can be achieved. The FLP impossibility result shows that consensus is impossible in asynchronous systems with even one crash failure.
Key Challenge: Cannot distinguish between a slow node and a crashed node using timeouts alone.
Partially Synchronous System
A middle ground: the system behaves asynchronously during periods of instability but eventually stabilizes to synchronous behavior. Most real-world systems fall into this category.
Characteristics: Bounds exist but are unknown, or the system eventually satisfies bounds after an unknown stabilization period.
Examples: Modern cloud environments, Internet protocols with adaptive timeouts.
Communication Models
Reliable Communication
Messages are eventually delivered without corruption, duplication, or reordering. Lost messages are automatically retransmitted by lower layers (e.g., TCP).
Unreliable Communication
Messages may be lost, duplicated, corrupted, or reordered. The application must handle these scenarios (e.g., UDP).
Network Partitions
The network may split into disconnected components, preventing communication between partitions. Nodes within a partition can communicate, but not across partitions.
Failure Models
Failure models characterize the types of failures that can occur and define the behavior of faulty components.
Crash Failures (Fail-Stop)
A node stops executing and never recovers. Once crashed, it sends no further messages and performs no further computations. This is the simplest and most commonly assumed failure model.
Detection: Difficult in asynchronous systems (indistinguishable from slow nodes) but straightforward in synchronous systems using timeouts.
Recovery: No recovery—the node is permanently failed. However, state may be recovered from persistent storage or replicas.
Omission Failures
A node fails to send or receive messages, but otherwise continues executing correctly. Can be subdivided into:
Send omission: Node fails to send some messages
Receive omission: Node fails to receive some messages
Relationship to Crash: Crash failures are a special case where all subsequent sends and receives are omitted.
Timing Failures
In synchronous systems, a node fails to respond within the expected time bounds. Examples include:
Response arrives too late
Clock drift exceeds allowed bounds
These are critical in real-time systems but less relevant in asynchronous models.
Byzantine Failures (Arbitrary Failures)
The most severe failure model: a faulty node can exhibit arbitrary, even malicious, behavior. It may:
Send incorrect or contradictory messages
Collude with other faulty nodes
Corrupt local state
Impersonate other nodes
Examples: Security attacks, software bugs, hardware corruption, compromised nodes.
Cost: Byzantine fault tolerance requires significantly more resources (typically 3f+1 replicas to tolerate f failures, compared to f+1 for crash failures).
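A trivial worked example of the replica counts stated above (a Python sketch; the function name is ours, used only for illustration):

```python
def replicas_needed(f: int, byzantine: bool) -> int:
    # 3f + 1 replicas tolerate f Byzantine faults; f + 1 suffices for
    # simple crash-fault tolerance, as stated above.
    return 3 * f + 1 if byzantine else f + 1

print(replicas_needed(1, byzantine=True))   # 4
print(replicas_needed(2, byzantine=True))   # 7
print(replicas_needed(1, byzantine=False))  # 2
```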
Byzantine Generals Problem: The classic formulation: how can loyal generals coordinate an attack when some generals may be traitors sending conflicting messages?
Network Partition Failures
The network splits into disconnected subnetworks. Nodes within each partition can communicate, but not across partitions.
Split-Brain Problem: Each partition may independently elect a leader or modify state, leading to inconsistency when partitions heal.
Common Causes: Network cable cuts, router failures, firewall misconfigurations, geographic isolation.
The CAP Theorem
The CAP theorem (also called Brewer's theorem) is a fundamental impossibility result stating that a distributed system cannot simultaneously guarantee all three of the following properties:
The Three Properties
Consistency (C)
Every read receives the most recent write or an error. All nodes see the same data at the same time. Equivalent to linearizability or strong consistency.
Availability (A)
Every request (read or write) receives a non-error response, without guarantee that it contains the most recent write. The system remains operational for all requests.
Partition Tolerance (P)
The system continues to operate despite arbitrary message loss or network partitions. Nodes may be temporarily unable to communicate.
The Impossibility
In the presence of a network partition (P), which is unavoidable in real-world distributed systems, you must choose between consistency (C) and availability (A):
CP Systems (Consistency + Partition Tolerance)
Sacrifice availability during partitions. Some nodes may refuse requests to maintain consistency.
Examples: HBase, MongoDB (with majority writes), Redis (with synchronous replication), distributed databases with strong consistency.
Use Case: Financial systems, inventory management, any system where returning stale data is unacceptable.
AP Systems (Availability + Partition Tolerance)
Sacrifice consistency during partitions. All nodes remain available but may return stale data.
Examples: Cassandra, DynamoDB, Riak, Couchbase, DNS.
Use Case: Social media feeds, content delivery networks, shopping carts, systems where availability trumps consistency.
CA Systems (Consistency + Availability)
Only possible without network partitions, which means the system is not truly distributed or operates within a single data center with reliable networking.
Examples: Traditional RDBMS (PostgreSQL, MySQL) on a single node or with synchronous replication in a tightly-coupled cluster.
Practical Implications
Networks are unreliable: Partitions happen in practice, so P is not optional. The real choice is between C and A during partitions.
Not binary: Real systems often provide tunable consistency levels (e.g., Cassandra's quorum reads/writes) allowing applications to choose their trade-off point.
Eventually consistent systems: Many AP systems provide eventual consistency—if writes stop, all replicas eventually converge to the same state.
Fundamental Problems
Certain problems are fundamental to distributed systems and have been extensively studied. Solutions to these problems form the building blocks of larger systems.
Consensus
The consensus problem requires a group of processes to agree on a single value, despite failures and asynchrony.
Properties:
Agreement: All non-faulty processes decide on the same value
Validity: The decided value was proposed by some process
Termination: All non-faulty processes eventually decide
FLP Impossibility: Fischer, Lynch, and Paterson proved that consensus is impossible in an asynchronous system with even one crash failure if we require deterministic termination.
Practical Solutions: Real systems circumvent FLP through:
Randomization (e.g., Ben-Or algorithm)
Partial synchrony assumptions (e.g., Paxos, Raft)
Failure detectors (detecting suspected failures)
Applications: Leader election, atomic commit (2PC, 3PC), state machine replication.
Leader Election
Select a single node from a group to coordinate actions, preventing conflicts and providing a single point of decision.
Challenges:
Handling simultaneous elections
Dealing with network partitions (split-brain)
Detecting leader failures
Algorithms: Bully algorithm, Ring algorithm, Paxos-based election, Raft leader election.
Use Cases: Distributed lock managers, cluster coordination (e.g., Zookeeper), primary-backup replication.
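To make the election idea concrete, here is a heavily simplified, single-node view of the bully algorithm's decision rule (a Python sketch; the is_alive callback stands in for "did the peer answer the ELECTION message before its timeout?" and is an assumption of this sketch, not a real API):

```python
def run_election(my_id, peer_ids, is_alive):
    """Return the leader id as node `my_id` would eventually learn it."""
    higher = [p for p in peer_ids if p > my_id]
    live_higher = [p for p in higher if is_alive(p)]
    if not live_higher:
        # No higher-id node answered: this node wins and would broadcast
        # a COORDINATOR message to all lower-id peers.
        return my_id
    # Some higher-id node is alive; the highest live id eventually wins
    # its own election and announces itself.
    return max(live_higher)

# Example: node 3 starts an election; node 5 has crashed, node 4 answers.
alive = {1: True, 2: True, 4: True, 5: False}
print(run_election(3, [1, 2, 4, 5], lambda p: alive[p]))  # 4
```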
Distributed Transactions
Ensure atomicity across multiple nodes: either all operations commit or all abort.
Two-Phase Commit (2PC):
Phase 1 (Prepare): Coordinator asks all participants if they can commit
Phase 2 (Commit/Abort): Based on votes, coordinator tells all to commit or abort
Problem: 2PC is a blocking protocol: if the coordinator crashes after participants have voted yes, they must hold their locks and wait until the coordinator recovers before learning the outcome.
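A minimal sketch of the coordinator side of 2PC is shown below (Python; the participant class and its prepare/commit/abort methods are hypothetical stubs, and a real coordinator would also write its decision to a durable log before phase 2):

```python
class ToyParticipant:
    """Hypothetical participant stub used only for illustration."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"
    def prepare(self):
        self.state = "prepared"      # participant now awaits the decision
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1 (Prepare): collect a vote from every participant.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())
        except Exception:
            votes.append(False)      # an unreachable participant counts as "no"
    # Phase 2 (Commit/Abort): commit only if every vote was "yes".
    decision = all(votes)
    for p in participants:
        p.commit() if decision else p.abort()
    return decision

print(two_phase_commit([ToyParticipant(), ToyParticipant()]))                  # True
print(two_phase_commit([ToyParticipant(), ToyParticipant(can_commit=False)]))  # False
```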
Three-Phase Commit (3PC): Non-blocking variant that adds a pre-commit phase, but requires synchrony assumptions.
Modern Approaches: Eventual consistency, CRDTs (Conflict-free Replicated Data Types), consensus-based transactions (e.g., Spanner).
Mutual Exclusion
Ensure that only one process accesses a critical section at a time, without shared memory.
Approaches:
Token-based (circulating token)
Permission-based (Ricart-Agrawala algorithm)
Quorum-based
Coordinator-based (centralized lock manager)
Metrics: Message complexity, latency, fault tolerance.
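As one example from the list above, the coordinator-based approach can be sketched as a tiny centralized lock manager (Python; the REQUEST/GRANT/RELEASE message exchange is abstracted into method calls, which is an assumption of this sketch):

```python
from collections import deque

class LockManager:
    """Minimal centralized lock manager: one holder, FIFO queue of waiters."""
    def __init__(self):
        self.holder = None
        self.waiting = deque()

    def request(self, node):
        if self.holder is None:
            self.holder = node
            return True              # GRANT sent immediately
        self.waiting.append(node)
        return False                 # requester waits for a later GRANT

    def release(self, node):
        assert node == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder           # next node to receive GRANT, if any

lm = LockManager()
print(lm.request("a"), lm.request("b"))  # True False
print(lm.release("a"))                   # b
```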
Consistency Models
Consistency models define the guarantees about the order and visibility of operations in a distributed system. Stronger models are easier to reason about but harder to implement efficiently.
Strong Consistency (Linearizability)
Operations appear to take effect instantaneously at some point between their invocation and completion, as if executed in a single global order consistent with real time.
Guarantee: If operation A completes before operation B begins, then A appears before B in the global order.
Cost: High latency, reduced availability (requires coordination).
Example: Read returns the value of the most recent completed write.
Sequential Consistency
Operations appear to execute in some sequential order, and operations of each process appear in program order.
Weaker than linearizability: Doesn't respect real-time ordering across processes.
Example: All processes see writes in the same order, but not necessarily the real-time order.
Causal Consistency
Operations that are causally related must be seen in causal order by all processes. Concurrent operations may be seen in different orders by different processes.
Based on: Lamport's "Happened Before" relationship.
Example: If write W1 happens before write W2 (causally), all processes see W1 before W2. But if W1 and W2 are concurrent, different processes may see them in different orders.
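Causal order is typically tracked with logical clocks. Below is a minimal vector-clock sketch (Python) assuming a fixed, known set of process ids; it illustrates the happened-before test, not a full causal-consistency protocol:

```python
def new_clock(processes):
    return {p: 0 for p in processes}

def local_event(clock, p):
    clock[p] += 1

def merge_on_receive(clock, p, sender_clock):
    # Take the element-wise maximum, then count the receive as a local event.
    for q in clock:
        clock[q] = max(clock[q], sender_clock[q])
    clock[p] += 1

def happened_before(a, b):
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    return all(a[q] <= b[q] for q in a) and a != b

# W1 on p1 is delivered to p2 before p2 issues W2, so W1 -> W2.
c1, c2 = new_clock(["p1", "p2"]), new_clock(["p1", "p2"])
local_event(c1, "p1")            # W1
w1_stamp = dict(c1)
merge_on_receive(c2, "p2", c1)   # p2 receives W1's message
local_event(c2, "p2")            # W2
print(happened_before(w1_stamp, c2))  # True
```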
Eventual Consistency
If no new updates are made, all replicas eventually converge to the same state.
Weakest useful model: Provides no guarantees about the order or timing of convergence.
Highly available: No coordination required for reads or writes.
Examples: DNS, Amazon's shopping cart, Cassandra with eventual consistency settings.
Challenges: Application must handle stale reads, conflicts, and convergence detection.
Replication and Fault Tolerance
Replication is the primary mechanism for achieving fault tolerance and high availability in distributed systems.
Why Replicate?
Fault Tolerance: If one replica fails, others can continue serving requests.
High Availability: The service keeps responding even while some replicas are down; spreading requests across replicas also reduces latency and increases throughput.
Locality: Place replicas geographically close to users for faster access.
Replication Strategies
Primary-Backup (Master-Slave)
One primary replica handles all writes; backups passively replicate the primary's state.
Advantages: Simple, strong consistency.
Disadvantages: Primary is a bottleneck and single point of failure.
Multi-Primary (Multi-Master)
Multiple replicas can accept writes simultaneously.
Advantages: Higher write throughput, no single point of failure.
Disadvantages: Conflict resolution required, eventual consistency.
State Machine Replication
Replicas start in the same state and deterministically apply the same operations in the same order.
Key Requirement: Consensus on operation order (e.g., via Paxos or Raft).
Advantages: Strong consistency, fault tolerance.
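The core idea fits in a few lines (Python sketch): replicas that apply the same agreed log of deterministic operations in the same order reach identical states. Agreement on that order is the job of the consensus layer and is simply assumed here.

```python
def apply_log(log):
    """Deterministically apply an ordered log of (key, value) writes."""
    state = {}
    for key, value in log:
        state[key] = value
    return state

agreed_log = [("x", 1), ("y", 2), ("x", 3)]   # order fixed by consensus
replica_a = apply_log(agreed_log)
replica_b = apply_log(agreed_log)
print(replica_a == replica_b)  # True: same log, same order, same state
```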
Quorum Systems
Require agreement from a majority (or weighted quorum) of replicas before completing operations.
Read Quorum (R) + Write Quorum (W) > N (the total number of replicas): every read quorum then overlaps the most recent write quorum, so reads see the most recent writes.
Example: In a system with 5 replicas, W=3 and R=3 ensures consistency.
Trade-off: Tune R and W for consistency vs. availability. High W = strong consistency but lower write availability.
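The intersection argument behind R + W > N can be checked exhaustively for a small cluster (Python sketch; the replica ids and quorum sizes are illustrative):

```python
from itertools import combinations

N, W, R = 5, 3, 3
replicas = list(range(N))
write_set = set(replicas[:W])    # the latest write reached some W replicas

# With R + W > N, every possible read set of size R overlaps the write set,
# so at least one replica in any read quorum holds the newest value.
print(R + W > N)  # True
print(all(set(read_set) & write_set
          for read_set in combinations(replicas, R)))  # True
```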
Architectural Patterns
Common architectural patterns shape how distributed systems are structured.
Client-Server
Clients request services from centralized servers. Servers manage state and provide responses.
Advantages: Simple, centralized control, easier to secure.
Disadvantages: Server is a bottleneck and single point of failure.
Examples: Web applications, database systems, traditional file servers.
Peer-to-Peer (P2P)
All nodes are equal peers, both clients and servers. No central coordination.
Advantages: Highly scalable, no single point of failure, self-organizing.
Disadvantages: Complex coordination, challenging to secure, harder to maintain consistency.
Examples: BitTorrent, blockchain networks, distributed hash tables (Chord, Kademlia).
Multi-Tier (N-Tier)
System is divided into layers: presentation tier (UI), application tier (business logic), data tier (storage).
Advantages: Separation of concerns, independent scaling of tiers, easier to maintain.
Disadvantages: Increased complexity, network latency between tiers.
Examples: Enterprise web applications, microservices architectures.
Microservices
System is composed of small, independent services that communicate via APIs.
Advantages: Independent deployment, technology diversity, fault isolation.
Disadvantages: Distributed system complexity (consensus, consistency), operational overhead.
Examples: Netflix, Amazon, Uber, modern cloud-native applications.
Design Principles
The fundamental characteristics of distributed systems lead to several important design principles:
Minimize Communication
Since communication dominates computation time (Tm >> Te), algorithms should maximize local work and minimize message passing. Batch operations when possible.
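A back-of-the-envelope comparison makes the principle concrete (Python sketch; the timing constants are illustrative assumptions, not measurements):

```python
ROUND_TRIP_MS = 10.0   # assumed network round-trip time per request
PER_ITEM_MS = 0.01     # assumed local processing time per item (Tm >> Te)

def cost_per_item_requests(n_items):
    return n_items * (ROUND_TRIP_MS + PER_ITEM_MS)

def cost_single_batched_request(n_items):
    return ROUND_TRIP_MS + n_items * PER_ITEM_MS

print(cost_per_item_requests(1000))       # ~10010 ms
print(cost_single_batched_request(1000))  # ~20 ms
```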
Embrace Asynchrony
Systems cannot rely on global clocks or instantaneous communication. Design for asynchronous message passing with unbounded delays. Use event-driven architectures.
Design for Partial Failure
Individual components will fail independently. Systems must detect failures, isolate faulty components, and continue operating with reduced capacity.
Techniques: Health checks, timeouts, circuit breakers, graceful degradation.
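As one illustration of these techniques, here is a minimal circuit-breaker sketch (Python; the threshold and reset interval are illustrative defaults, not recommendations):

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; allow a trial call after a cooldown."""
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0                  # any success closes the circuit
        return result

breaker = CircuitBreaker(threshold=2, reset_after=5.0)
# breaker.call(some_remote_call, arg)     # wrap remote calls like this
```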
Avoid Distributed State When Possible
Shared mutable state across nodes is expensive to maintain consistently. Prefer:
Immutable data
Local state
Eventual consistency where appropriate
Stateless services
Use Idempotent Operations
Operations that can be safely retried without changing the result. Critical for handling message duplication and failures.
Example: SET x = 5 is idempotent, but INCREMENT x is not.
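One common way to make a naturally non-idempotent operation safe to retry is to deduplicate on a client-supplied request id, as in this sketch (Python; the request-id scheme is an illustrative assumption):

```python
counter = 0
seen_requests = set()   # request ids already applied

def increment(request_id, amount=1):
    """Increment the counter at most once per request id."""
    global counter
    if request_id in seen_requests:   # duplicate delivery or client retry
        return counter
    seen_requests.add(request_id)
    counter += amount
    return counter

increment("req-42")
increment("req-42")     # retried after a timeout; no double count
print(counter)          # 1
```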
Plan for Scalability
Design systems to scale horizontally (add more nodes) rather than vertically (larger nodes). Use partitioning/sharding to distribute data and load.
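A minimal hash-partitioning sketch (Python) shows the basic idea; note that plain modulo sharding reshuffles most keys when the shard count changes, which is why consistent hashing is often preferred in practice:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    # Deterministically map a key to one of num_shards partitions.
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

for k in ("user:1001", "user:1002", "order:77"):
    print(k, "-> shard", shard_for(k, 4))
```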
Observe and Monitor
Distributed systems are inherently complex. Comprehensive logging, metrics, and distributed tracing are essential for understanding behavior and diagnosing issues.
Accept Trade-offs
Perfect solutions don't exist. CAP theorem, FLP impossibility, and latency constraints force trade-offs between consistency, availability, performance, and simplicity. Choose trade-offs that align with application requirements.
Summary
Distributed systems are characterized by autonomous nodes communicating over networks where message delays dominate computation time. The absence of shared memory, inevitability of partial failures, and impossibility of perfect coordination create fundamental challenges. System models define assumptions about timing and communication, while failure models characterize how components can fail, from simple crash failures to Byzantine faults.
The CAP theorem establishes that distributed systems must choose between consistency and availability during network partitions, which are unavoidable in practice. Fundamental problems like consensus, leader election, and distributed transactions form the building blocks of distributed systems, though impossibility results like FLP show that some problems cannot be solved perfectly in asynchronous systems.
Consistency models range from strong linearizability (expensive but simple to reason about) to eventual consistency (highly available but complex to use correctly). Replication is the primary mechanism for fault tolerance, with various strategies trading consistency for availability. Common architectural patterns like client-server, peer-to-peer, and microservices shape system structure, each with distinct trade-offs.
Successful distributed systems embrace these challenges through careful design: minimizing communication, designing for partial failure, avoiding shared state, using idempotent operations, and accepting that perfect solutions don't exist. The art of distributed systems lies in understanding these fundamental principles and making informed trade-offs that align with application requirements.
References
Lamport, L. (1978). "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM, 21(7), 558-565.
Fischer, M. J., Lynch, N. A., & Paterson, M. S. (1985). "Impossibility of Distributed Consensus with One Faulty Process." Journal of the ACM, 32(2), 374-382.
Gilbert, S., & Lynch, N. (2002). "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services." ACM SIGACT News, 33(2), 51-59.
Brewer, E. A. (2000). "Towards Robust Distributed Systems." Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing.
Vogels, W. (2009). "Eventually Consistent." Communications of the ACM, 52(1), 40-44.
Tanenbaum, A. S., & Van Steen, M. (2017). Distributed Systems: Principles and Paradigms (3rd ed.). CreateSpace Independent Publishing Platform.
Graduate course materials from the Georgia Institute of Technology and Columbia University.