01-Distributed System Foundations
Overview
A distributed system is a collection of autonomous computing nodes that communicate over a network to achieve a common goal. Unlike centralized systems, distributed systems face unique challenges arising from the lack of shared memory, independent node failures, unpredictable network delays, and the fundamental impossibility of perfect coordination.
Introduction
Distributed systems are built because single machines hit limits (performance, reliability, geography, isolation). The price is that the network becomes part of your computer: latency, partial failure, and concurrency become the default.
Defining Characteristics
Interconnected Autonomous Nodes
A distributed system is a collection of independent nodes, each with its own processor, memory, and local state, interconnected by a Local Area Network (LAN) or Wide Area Network (WAN). Each node operates autonomously and can make local decisions.
No Shared Physical Memory
Nodes do not share physical memory. All communication and coordination must occur through explicit message passing over the network. This fundamental constraint shapes the design of all distributed algorithms.
Communication Time Dominance
The time required for message communication between nodes (Tm) is significantly greater than the time to execute local computations (Te). This inequality, Tm >> Te, is the defining characteristic of distributed systems according to Lamport's definition:
"A system is distributed if the message transmission time is not negligible compared to the time between events in a single process."
Implications: Because communication dominates, distributed algorithms must minimize message passing and maximize local computation. Even modern data center clusters qualify as distributed systems, as network latency remains orders of magnitude higher than local memory access.
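For a sense of scale: a round trip within a single data center is roughly 0.5 ms, while a local DRAM access is on the order of 100 ns, a ratio of about 5,000; a cross-continent round trip (around 150 ms) stretches that to more than a million-fold.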
Partial Failure
Individual nodes or network links can fail independently while the rest of the system continues operating. Unlike centralized systems where a single failure halts everything, distributed systems must gracefully handle partial failures.
Concurrency
Multiple nodes execute simultaneously, leading to inherent concurrency. Without perfect synchronization, different nodes may observe events in different orders, creating consistency challenges.
Motivations for Building Distributed Systems
Despite their inherent complexity, engineers build distributed systems for four primary reasons:
High Performance: Achieve parallelism by harnessing many CPUs, memories, and disks. Goal is N× speedup from N machines. Examples: MapReduce, Spark, scientific computing clusters.
Fault Tolerance: Survive component failures through redundancy. If one computer fails, another takes over. Examples: Replicated databases, distributed file systems (HDFS, GFS).
Inherent Distribution: Some problems are naturally geographically distributed, requiring distributed coordination. Examples: Interbank transfers, global CDNs, distributed sensor networks.
Security & Isolation: Isolate untrusted components through well-defined network protocols. Examples: Microservices, sandboxed code execution, multi-tenant cloud platforms.
Modern distributed systems design primarily focuses on achieving high performance and fault tolerance.
Core Challenges
Concurrency: Multiple computers executing concurrently face complex interactions, timing-dependent bugs (race conditions), non-deterministic behavior, and need for synchronization without shared memory.
Partial Failure: Unlike a single computer that either works or crashes completely, distributed systems experience partial failures where some components crash while others continue. Components may run slowly rather than fail, making it difficult to distinguish slow from failed. Failure detection is inherently unreliable in asynchronous systems.
Achieving Performance: The goal of N× speedup from N machines is rarely achieved due to communication overhead, bottlenecks (shared resources like databases), load imbalance, and sequential dependencies (Amdahl's Law). Careful architectural design is required to achieve true scalable performance.
System and Failure Models
System models define the assumptions about timing, synchronization, and behavior that algorithms can rely upon. Failure models characterize the types of failures that can occur. Together, these models define the "rules of the game" for distributed systems design.
Safety vs. Liveness
Safety: "nothing bad happens" (e.g., never return two different committed values).
Liveness: "something good eventually happens" (e.g., requests eventually complete).
Many results are phrased as: under certain assumptions you can guarantee safety, but liveness may be impossible (or vice versa).
Timing / System Models
Synchronous
Known upper bounds on message delay, processing time, and clock drift
Reliable timeouts → accurate failure detection
Hard to guarantee in practice (especially WANs)
Asynchronous
No timing bounds (messages can be arbitrarily delayed; processes arbitrarily slow)
The most conservative model of the Internet: assume nothing about timing
No perfect failure detection; the FLP result shows deterministic consensus cannot be guaranteed to terminate if even one process may crash
Partially synchronous
Bounds exist but are unknown, or hold only after some unknown stabilization time
Practical consensus (e.g., Raft) via timeouts + eventual stability
During unstable periods it behaves like async
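To make the partial-synchrony point concrete, below is a minimal sketch of an adaptive, timeout-based failure detector. It assumes heartbeats arrive via some transport layer; the class name and parameter values are illustrative, not taken from any particular system.

```python
import time

class TimeoutFailureDetector:
    """Suspect a peer after a timeout; back off when a suspicion proves wrong.

    Under partial synchrony the delay bound exists but is unknown, so the
    detector adapts its timeout rather than assuming a fixed bound.
    """

    def __init__(self, initial_timeout=0.5, backoff=2.0, max_timeout=10.0):
        self.timeout = initial_timeout
        self.backoff = backoff
        self.max_timeout = max_timeout
        self.last_heartbeat = time.monotonic()
        self.suspected = False

    def on_heartbeat(self):
        # A heartbeat arrived: the peer is alive.
        if self.suspected:
            # We suspected a live peer, so the timeout was too aggressive.
            self.timeout = min(self.timeout * self.backoff, self.max_timeout)
            self.suspected = False
        self.last_heartbeat = time.monotonic()

    def check(self):
        # Suspect the peer if no heartbeat arrived within the current timeout.
        if time.monotonic() - self.last_heartbeat > self.timeout:
            self.suspected = True
        return self.suspected
```

During unstable periods such a detector will make mistakes; it only becomes reliably accurate once the unknown delay bound starts to hold.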
Communication Assumptions
Reliable channel
Messages aren’t lost/corrupted (or are retransmitted)
TCP-like behavior (still can have partitions)
Unreliable channel
Loss/duplication/reordering may occur
UDP-like; app must add retries/dedup
Partition possible
Network can split so some nodes can’t talk
Treat as a first-class fault in design
Failure Models
Crash (fail-stop)
Node halts and stops responding
Common model for consensus/replication
Omission
Node drops some sends/receives
Includes message loss behaviors
Timing
Node misses known deadlines
Mostly relevant under synchronous assumptions
Byzantine
Node behaves arbitrarily/maliciously
Expensive: typically 3f+1 replicas (quorums of 2f+1) to tolerate f faults, so that any two quorums intersect in at least one correct replica
Partition
Network splits into components
Can cause split-brain without coordination
Rule of thumb: most production data systems assume crash failures + partial synchrony, and then engineer around partitions via quorums/consensus.
Time, Ordering, and Coordination
Distributed systems often need to answer: in what order did things happen? On a single machine, a shared clock and shared memory make this feel easy. In a distributed system, it’s not.
Why ordering is hard
No global clock: clocks drift; synchronization reduces error but never eliminates it.
Unpredictable delay: message latency varies and reordering can occur.
Failures look like slowness: timeouts can’t perfectly distinguish a slow node from a crashed/partitioned one.
What systems do about it
Use logical time / causality tools (happened-before, Lamport clocks, vector clocks) to reason about ordering.
Use coordination (e.g., consensus + state machine replication) when replicas must agree on one order.
For the detailed treatment of causality and Lamport clocks, see Distributed System/02-Communication.md.
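As a small taste of those tools, here is a minimal Lamport clock sketch; the class name and the two-process example are illustrative.

```python
class LamportClock:
    """Minimal Lamport logical clock: local events plus message send/receive."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # Increment, then attach the returned timestamp to the outgoing message.
        self.time += 1
        return self.time

    def receive(self, msg_time):
        # Jump ahead of the sender's timestamp, then count the receive event.
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t = a.send()      # a's clock: 1
b.receive(t)      # b's clock: 2, so b's later events are ordered after a's send
```

Lamport timestamps respect happened-before (if event e happened before event f, e's timestamp is smaller), but detecting that two events are concurrent needs extra machinery such as vector clocks.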
CAP Theorem
The CAP theorem (Brewer's theorem) is a fundamental impossibility result: a distributed system cannot simultaneously guarantee all three of the following properties:
The Three Properties
Consistency (C)
All clients observe operations as if there were a single, up-to-date copy of the data (commonly interpreted as linearizability for reads/writes).
Availability (A)
Every request to a non-failed node eventually receives a (non-error) response. The response may be stale if you choose availability over consistency during a partition.
Partition Tolerance (P)
The system continues to operate despite arbitrary message loss or network partitions. Nodes may be temporarily unable to communicate.
The Impossibility
In the presence of a network partition (P), which is unavoidable in real-world distributed systems, you must choose between consistency (C) and availability (A):
CP Systems (Consistency + Partition Tolerance)
Sacrifice availability during partitions. Some nodes may refuse requests to maintain consistency.
Examples: HBase, MongoDB (with majority writes), Redis (with synchronous replication), distributed databases with strong consistency.
Use Case: Financial systems, inventory management, any system where returning stale data is unacceptable.
AP Systems (Availability + Partition Tolerance)
Sacrifice consistency during partitions. All nodes remain available but may return stale data.
Examples: Cassandra, DynamoDB, Riak, Couchbase, DNS.
Use Case: Social media feeds, content delivery networks, shopping carts, systems where availability trumps consistency.
CA Systems (Consistency + Availability)
Only possible if you assume partitions do not occur (or you are willing to treat partitions as failures that break availability/operation). In practice, partitions can happen even within a single data center.
Examples: Traditional RDBMS (PostgreSQL, MySQL) on a single node or with synchronous replication in a tightly-coupled cluster.
Practical Implications
Networks are unreliable: Partitions happen in practice, so P is not optional. The real choice is between C and A during partitions.
Not binary: Real systems often provide tunable consistency levels (e.g., Cassandra's quorum reads/writes) allowing applications to choose their trade-off point.
Eventually consistent systems: Many AP systems provide eventual consistency—if writes stop, all replicas eventually converge to the same state.
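A toy sketch of that choice: a replica that knows (in practice, via timeouts and failed RPCs) how many replicas it can currently reach either refuses writes without a majority (CP) or accepts them locally and reconciles later (AP). The class and mode names are illustrative, not any real database's API.

```python
class Replica:
    """Toy illustration of the CP vs. AP choice during a partition."""

    def __init__(self, n_replicas, mode):
        self.n = n_replicas
        self.mode = mode            # "CP" or "AP"
        self.value = None
        self.pending = []           # AP mode: writes to reconcile after healing

    def write(self, value, reachable):
        majority = self.n // 2 + 1
        if reachable >= majority:
            self.value = value      # safe: a majority of replicas saw the write
            return "ok"
        if self.mode == "CP":
            return "unavailable"    # refuse rather than risk inconsistency
        self.pending.append(value)  # AP: stay available, accept possible conflicts
        return "accepted (may conflict)"

r = Replica(n_replicas=3, mode="CP")
print(r.write("x=1", reachable=1))   # partitioned into a minority: "unavailable"
```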
Consistency
Consistency models define the guarantees about the order and visibility of operations in a distributed system. In a replicated storage system, consistency defines the rules about what a client will see when reading data after it has been written. With multiple replicas of data distributed across different machines, ambiguity can arise about which version of the data is "correct."
The Fundamental Trade-off
Example: A distributed key-value store has two replicas with k1=20. A client updates Replica A to k1=21, but the update doesn't reach Replica B before the client crashes. A subsequent get("k1") could return either 21 or 20, depending on which replica responds.
This leads to a fundamental trade-off:
Strong Consistency: Always returns the most recent write (21). Requires expensive coordination between replicas on every operation. Use for financial transactions and inventory systems.
Weak Consistency: May return stale data (20). No coordination needed, providing low latency and high availability. Use for social media feeds and content delivery networks.
The trade-off becomes acute with geographic distribution. Cross-continent communication latency (100-200ms) makes strong consistency prohibitively expensive for interactive applications.
Strong Consistency (Linearizability)
Operations appear to occur instantaneously at some point between invocation and completion. Equivalent to a single, global, correct order.
Guarantee: If operation A completes before operation B begins, then A appears before B in the global order. A read always returns the value from the most recent completed write.
Cost: High latency, reduced availability (requires coordination between replicas).
Implementation: Typically requires consensus protocols (Paxos, Raft) or quorum-based coordination.
Example: Read returns the value of the most recent completed write, even if replicas are distributed globally.
Sequential Consistency
Operations appear to execute in some sequential order, and operations of each process appear in program order.
Weaker than linearizability: Doesn't respect real-time ordering across processes.
Example: All processes see writes in the same order, but not necessarily the real-time order.
Causal Consistency
Operations that are causally related must be seen in causal order by all processes. Concurrent operations may be seen in different orders by different processes.
Based on: Lamport's "Happened Before" relationship.
Example: If write W1 happens before write W2 (causally), all processes see W1 before W2. But if W1 and W2 are concurrent, different processes may see them in different orders.
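Causality is usually tracked with vector clocks (one counter per process). The sketch below shows only the comparison step, which is where "causally related" versus "concurrent" gets decided; representing clocks as dictionaries is an illustrative choice.

```python
def vc_compare(a, b):
    """Compare two vector clocks given as dicts mapping node -> counter.

    Returns "before", "after", "equal", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened-before b: everyone must see a first
    if b_le_a:
        return "after"
    return "concurrent"       # no causal order: replicas may disagree on order

print(vc_compare({"p1": 2, "p2": 1}, {"p1": 3, "p2": 1}))  # "before"
print(vc_compare({"p1": 2, "p2": 0}, {"p1": 1, "p2": 2}))  # "concurrent"
```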
Eventual Consistency
If no new updates are made, all replicas eventually converge to the same state.
Weakest useful model: Provides no guarantees about the order or timing of convergence.
Highly available: No coordination required for reads or writes.
Examples: DNS, Amazon's shopping cart, Cassandra with eventual consistency settings.
Challenges: Application must handle stale reads, conflicts, and convergence detection.
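One simple (and deliberately lossy) way replicas converge is a last-writer-wins register, sketched below. It assumes loosely synchronized clocks and breaks ties by node ID; stores in the Dynamo tradition often keep conflicting siblings instead of silently discarding one.

```python
class LWWRegister:
    """Last-writer-wins register: each write carries a (timestamp, node_id) tag
    and merging keeps the value with the largest tag."""

    def __init__(self):
        self.value, self.tag = None, (0, "")

    def write(self, value, timestamp, node_id):
        self._merge(value, (timestamp, node_id))

    def merge(self, other):
        # Called when replicas exchange state (anti-entropy / gossip).
        self._merge(other.value, other.tag)

    def _merge(self, value, tag):
        if tag > self.tag:
            self.value, self.tag = value, tag

a, b = LWWRegister(), LWWRegister()
a.write("cart: [book]", timestamp=10, node_id="A")
b.write("cart: [pen]", timestamp=12, node_id="B")
a.merge(b); b.merge(a)
assert a.value == b.value == "cart: [pen]"   # replicas converge
```

Because the merge keeps the maximum tag, it is commutative, associative, and idempotent, so replicas reach the same state regardless of the order in which updates and merges are delivered.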
Fault Tolerance
At scale, component failures become a constant reality rather than exceptional events. While a single server might have a mean time between failures of one year, a system built from 1,000 computers will experience roughly three failures per day (about 1,000 failures per year ÷ 365 days). Failures—crashed machines, faulty network cables, overheated switches, power outages—must be treated as a normal, constant state of operation, not a rare anomaly.
Replication is the primary mechanism for achieving fault tolerance and high availability in distributed systems.
Availability vs. Recoverability
Availability: System continues operating during failures. Achieved through replication—multiple copies of services and data route around failures. Example: A web service with 5 replicas continues serving requests even if 2 crash.
Recoverability: System can restart correctly after failures are repaired. Achieved through non-volatile storage—persisting state to disk via checkpoints or logs. Example: A database restarts from its write-ahead log and recovers all committed transactions.
Two Primary Tools for Fault Tolerance
Non-Volatile Storage: Persists system state using checkpointing (periodic snapshots), write-ahead logging (recording operations before applying), or journaling (maintaining a log of recent changes). Essential for recoverability but does not provide availability.
Replication: Maintains multiple copies of data or services on different machines. If one replica fails, others immediately take over. The central challenge is ensuring replicas stay consistent (see Consistency Models section).
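A minimal sketch of the write-ahead-log idea from the non-volatile storage tool above: append each operation durably before applying it, and rebuild state by replaying the log on restart. The file name and JSON record format are illustrative.

```python
import json, os

class WriteAheadLog:
    """Tiny key-value store that logs every set() before applying it."""

    def __init__(self, path="wal.log"):
        self.path = path
        self.state = {}
        self._recover()

    def _recover(self):
        # Recoverability: replaying the log rebuilds the in-memory state.
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                self.state[record["key"]] = record["value"]

    def set(self, key, value):
        with open(self.path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())      # force the record to disk first
        self.state[key] = value       # apply only after the log write is durable
```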
Why Replicate?
Fault Tolerance: If one replica fails, others can continue serving requests without interruption.
Performance: Distribute load across replicas, reducing latency and increasing throughput.
Locality: Place replicas geographically close to users for faster access (e.g., CDNs with servers on every continent).
Replication Strategies
Primary-Backup (Master-Slave)
One primary replica handles all writes; backups passively replicate the primary's state.
Advantages: Simple, strong consistency.
Disadvantages: Primary can be a bottleneck and, without automated failover, a single point of failure.
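A toy sketch of the synchronous primary-backup write path: the primary acknowledges a write only after every backup has applied it. Class names are illustrative; real systems add log shipping, retries, and automated failover.

```python
class Backup:
    def __init__(self):
        self.state = {}

    def replicate(self, key, value):
        self.state[key] = value
        return True                    # acknowledge the update

class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.state = {}

    def write(self, key, value):
        acks = sum(1 for b in self.backups if b.replicate(key, value))
        if acks < len(self.backups):
            return "failed"            # a real system would retry or fail over
        self.state[key] = value        # apply locally only after all backups ack
        return "ok"

primary = Primary([Backup(), Backup()])
print(primary.write("k1", 21))         # "ok" once both backups have acknowledged
```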
Multi-Primary (Multi-Master)
Multiple replicas can accept writes simultaneously.
Advantages: Higher write throughput, no single point of failure.
Disadvantages: Conflict resolution required, eventual consistency.
State Machine Replication
Replicas start in the same state and deterministically apply the same operations in the same order.
Key Requirement: Consensus on operation order (e.g., via Paxos or Raft).
Advantages: Strong consistency, fault tolerance.
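The heart of state machine replication is deterministic application of an agreed-upon log, as in the sketch below; the log order would come from a consensus protocol such as Raft, and the operations shown are illustrative.

```python
def apply_log(log, state=None):
    """Apply an ordered log of deterministic operations to a state dict.

    Every replica that applies the same log in the same order ends up in
    exactly the same state.
    """
    state = {} if state is None else state
    for op, key, value in log:
        if op == "set":
            state[key] = value
        elif op == "del":
            state.pop(key, None)
    return state

log = [("set", "x", 1), ("set", "y", 2), ("del", "x", None)]
assert apply_log(list(log)) == apply_log(list(log)) == {"y": 2}   # replicas agree
```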
Quorum Systems
Require agreement from a majority (or weighted quorum) of replicas before completing operations.
Read Quorum (R) + Write Quorum (W) > N: Ensures every read quorum overlaps every write quorum (quorum intersection), a building block for stronger consistency guarantees.
Example: In a system with 5 replicas, W=3 and R=3 guarantees the overlap (3 + 3 > 5), so a read always contacts at least one replica holding the latest write.
Trade-off: Tune R and W for consistency vs. availability/latency. Higher W/R increases coordination cost and reduces availability during failures.
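A toy quorum store illustrating the R + W > N rule: writes go to any W replicas, reads query any R, and the guaranteed overlap means a read always sees the latest version. The single client-side version counter is a simplification; real systems use per-key versions or timestamps.

```python
import random

class QuorumKV:
    """Quorum reads/writes over n replicas with r + w > n (toy, single client)."""

    def __init__(self, n=5, w=3, r=3):
        assert r + w > n, "read and write quorums must intersect"
        self.n, self.w, self.r = n, w, r
        self.replicas = [dict() for _ in range(n)]
        self.version = 0

    def put(self, key, value):
        self.version += 1
        for i in random.sample(range(self.n), self.w):   # any w replicas
            self.replicas[i][key] = (self.version, value)

    def get(self, key):
        copies = [self.replicas[i][key]
                  for i in random.sample(range(self.n), self.r)   # any r replicas
                  if key in self.replicas[i]]
        return max(copies)[1] if copies else None        # highest version wins

kv = QuorumKV()
kv.put("k1", 21)
assert kv.get("k1") == 21    # the read quorum must include an up-to-date replica
```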
Scalability
Scalability is the ability of a system to handle increasing load by adding resources. The ultimate goal is scalable speed-up: N× resources yields N× performance or throughput.
Horizontal vs. Vertical Scaling
Horizontal Scaling (Scale Out): Add more nodes to the system. Preferred approach for distributed systems as it's more cost-effective and provides better fault tolerance through redundancy.
Vertical Scaling (Scale Up): Upgrade to more powerful nodes (more CPU, RAM, storage). Limited by hardware constraints and cost. Single point of failure remains.
Bottlenecks and Performance Challenges
Scalability is rarely infinite. Bottlenecks emerge as systems grow:
Common Bottleneck Pattern: A website scales by adding web servers until the shared database becomes the bottleneck. Adding more web servers yields no improvement—further scaling requires re-architecting the database through sharding, replication, or caching. Each resolved bottleneck often reveals the next.
Key Challenges:
Communication Overhead: Network latency dominates, making coordination expensive (Tm >> Te)
Shared Resources: Databases, coordinators, and shared state limit parallelism
Load Imbalance: Work may not distribute evenly across nodes
Amdahl's Law: Sequential dependencies limit maximum speedup
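For reference, Amdahl's Law makes the last point concrete: if a fraction p of the work is parallelizable, Speedup(N) = 1 / ((1 − p) + p / N). Even with p = 0.95, 100 machines yield at most about a 17× speedup, not 100×.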
Techniques for Scalability
Partitioning/Sharding: Divide data across multiple nodes based on a partition key. Enables parallel processing and distributes load (see the sketch after this list).
Caching: Store frequently accessed data in fast storage (memory) closer to clients. Reduces load on backend systems.
Load Balancing: Distribute requests evenly across multiple servers. Prevents hotspots and maximizes resource utilization.
Asynchronous Processing: Decouple request handling from processing. Use message queues to handle bursts and smooth load.
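Here is the sketch for the Partitioning/Sharding technique above: hash the partition key to pick a shard. The modulo scheme is the simplest possible choice; production systems often prefer consistent hashing or range partitioning so that adding nodes moves fewer keys.

```python
import hashlib

def shard_for(key, n_shards):
    """Map a key to one of n_shards using a stable hash.

    A stable hash (not Python's per-process randomized hash()) keeps the
    mapping identical across machines and restarts.
    """
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_shards

shards = {i: [] for i in range(4)}
for user_id in ["alice", "bob", "carol", "dave"]:
    shards[shard_for(user_id, 4)].append(user_id)   # load spreads across shards
print(shards)
```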
Design Principles
Key principles for designing distributed systems:
Design for Partial Failure: Individual components will fail independently. Systems must detect failures, isolate faulty components, and continue operating with reduced capacity. Use health checks, timeouts, circuit breakers, and graceful degradation.
Minimize Communication: Since communication dominates computation time (Tm >> Te), algorithms should maximize local work and minimize message passing. Batch operations when possible. Avoid distributed state—shared mutable state across nodes is expensive to maintain consistently. Prefer immutable data, local state, and stateless services.
Use Idempotent Operations: Design operations that can be safely retried without changing the result. Critical for handling message duplication and failures. Example: SET x = 5 is idempotent, but INCREMENT x is not.
Plan for Retries and Deduplication: At-least-once delivery is common in real systems. Add request IDs, dedupe tables, and careful semantics at boundaries (HTTP, queues, RPC); a sketch follows this list.
Accept Trade-offs: Perfect solutions don't exist. CAP theorem, FLP impossibility, and latency constraints force trade-offs between consistency, availability, performance, and simplicity. Choose trade-offs that align with application requirements.
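Here is the sketch promised above, combining idempotence and deduplication: a non-idempotent operation (increment) is made safe to retry by deduplicating on a client-supplied request ID. The in-memory set stands in for a real dedupe table with expiry.

```python
class DedupingCounter:
    """Make at-least-once delivery safe: remember processed request IDs so
    retries of the same request do not double-count."""

    def __init__(self):
        self.value = 0
        self.seen = set()          # in production: a dedupe table with expiry

    def increment(self, request_id, amount=1):
        if request_id in self.seen:
            return self.value      # duplicate retry: return the prior result
        self.seen.add(request_id)
        self.value += amount
        return self.value

c = DedupingCounter()
c.increment("req-42")
c.increment("req-42")              # the client retried after a timeout
assert c.value == 1                # counted once despite two deliveries
```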
References
CS 6210: Advanced Operating Systems - Georgia Tech OMSCS
MIT 6.824: Distributed Systems