02-Distributed Communication
Overview
This document explores four fundamental aspects of distributed systems: understanding causality and event ordering, establishing temporal ordering through clock mechanisms, optimizing inter-process communication for performance, and the evolution toward programmable networks. The "Happened Before" relationship provides a formal basis for reasoning about causality without global clocks. Lamport's Clocks build on this foundation to assign timestamps to events. Performance in distributed systems is critically dependent on minimizing communication latency, particularly for Remote Procedure Calls (RPCs). Finally, Active Networks and Software Defined Networking (SDN) represent the evolution toward programmable, virtualized network infrastructure that has become essential in modern cloud and data center environments.
Table of Contents
Event Ordering in Distributed Systems
Clocks and Synchronization
Communication and Remote Procedure Calls
Active Networks and Software Defined Networking
Summary
References
Event Ordering in Distributed Systems
In distributed systems where communication time dominates computation time (Tm >> Te, where Tm is the message transmission time and Te is the per-event computation time), establishing a consistent ordering of events across autonomous nodes becomes a fundamental challenge. Without a shared global clock or shared memory, we need formal mechanisms to reason about causality and event relationships.
The "Happened Before" Relationship
To reason about causality and event order without a global clock, Lamport introduced the "Happened Before" relationship, denoted by A -> B ("A happened before B"). This relationship establishes a partial ordering of events based on causal relationships.
Definition
The "Happened Before" relationship is defined by two conditions:
Events within the same process: If events A and B occur in the same process and A occurs before B in the process's sequential execution, then A -> B.
Events connected by communication: If A is the event of sending a message from one process and B is the event of receiving that same message in another process, then A -> B.
Transitivity
This relationship is also transitive: if A -> B and B -> C, then A -> C.
The transitivity property allows us to establish causal chains across multiple processes and multiple message exchanges, enabling reasoning about complex distributed computations.
Concurrent Events and Partial Ordering
Events that are not related by the "Happened Before" relationship (either directly or transitively) are considered concurrent. Understanding concurrency is critical to building correct distributed systems.
Concurrency Definition
If events A and B are concurrent, it is impossible to definitively state that one occurred before the other. In different executions of the same program, the real-time ordering of concurrent events can vary.
Partial Order
This concurrency means the "Happened Before" relationship provides only a partial order of all events in the system. Not all pairs of events can be ordered with respect to each other.
Common Pitfall
Assuming a specific order for concurrent events is a common source of timing bugs in distributed programs. Robust distributed algorithms must explicitly recognize which events are causally related and which are concurrent.
Clocks and Synchronization
Lamport's Logical Clock
The logical clock is a simple, monotonically increasing counter (or "clock"), Ci, maintained by each process Pi. It associates a numerical timestamp C(e) with every event e.
Timestamp Assignment Rules
Local Events: For two successive events a and b within the same process, the timestamp of b must be greater than the timestamp of a (C(b) > C(a)). This is achieved by incrementing the local clock between events.
Communication Events: A message sent from process Pi carries the timestamp of the send event, C(send). When process Pj receives this message, it must assign a timestamp C(receive) to the receive event such that C(receive) > C(send). Specifically, the receiving process Pj sets its local clock Cj to MAX(current Cj, incoming C(send)) + 1 (or some other increment). This ensures the causality principle (a message cannot be received before it is sent) is respected in the logical timeline.
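As a concrete illustration, here is a minimal Python sketch of these two rules; the class and method names are illustrative, not from Lamport's paper:

```python
class LamportClock:
    """A per-process Lamport logical clock: one monotonically
    increasing integer counter."""

    def __init__(self):
        self.time = 0   # local counter C_i

    def tick(self):
        """Local event: increment the clock and timestamp the event."""
        self.time += 1
        return self.time

    def send(self):
        """Send event: tick, and ship the timestamp with the message."""
        return self.tick()

    def receive(self, msg_timestamp):
        """Receive event: jump past the sender, C_j = max(C_j, C(send)) + 1."""
        self.time = max(self.time, msg_timestamp) + 1
        return self.time
```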
Partial Order Implication
If an event a happened before b (a -> b), then it is guaranteed that C(a) < C(b). However, the converse is not true. If C(x) < C(y), it does not necessarily mean that x -> y. The events x and y could be concurrent, and their timestamp ordering might be arbitrary.
Lamport's Logical Clock thus provides a partial order, not a total one.
Example
Consider three processes P1, P2, and P3:
P1: e1(1) -> e2(2) -> send(3) -> e4(4)
                        |
                        v
P2: e5(1) -> e6(2) -> recv(4) -> e7(5) -> send(6)
                                            |
                                            v
P3: e8(1) -> e9(2) -> e10(3) -> recv(7) -> e11(8)

Note how the receive operations update their clocks to be greater than the timestamp carried by the incoming message.
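The example's timestamps can be reproduced with the clock sketch above (hypothetical driver code):

```python
p1, p2, p3 = LamportClock(), LamportClock(), LamportClock()

p1.tick(); p1.tick()                 # e1(1), e2(2)
m1 = p1.send()                       # send(3)

p2.tick(); p2.tick()                 # e5(1), e6(2)
p2.receive(m1)                       # recv: max(2, 3) + 1 = 4
p2.tick()                            # e7(5)
m2 = p2.send()                       # send(6)

p3.tick(); p3.tick(); p3.tick()      # e8(1), e9(2), e10(3)
p3.receive(m2)                       # recv: max(3, 6) + 1 = 7
p3.tick()                            # e11(8)
```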
Achieving Total Order
While partial ordering is sufficient for many applications, some problems (e.g., distributed mutual exclusion) require a definitive, total order of all events. Lamport's total order extends the logical clock by introducing a deterministic tie-breaking rule.
The Need for Total Order
In a car-sharing example, if two family members request the car with the same logical timestamp, a conflict arises. A tie-breaking rule, such as "age wins" or "lower process ID wins," is needed to make an unambiguous decision.
Formulation
An event a in process Pi is said to precede event b in process Pj in the total order if:
The logical timestamp C(a) < C(b), OR
The timestamps are equal (C(a) = C(b)) AND the process ID Pi < Pj.
Any arbitrary, globally-known condition can be used for tie-breaking. This mechanism allows every node to independently derive the exact same total ordering of all events in the system.
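In code, this total order is simply a lexicographic comparison of (timestamp, process ID) pairs, as this small illustrative snippet shows:

```python
# Events as (logical timestamp, process ID) pairs; Python's tuple
# comparison gives exactly the total order defined above.
events = [(3, 2), (3, 1), (2, 3)]   # (C(e), Pi), values illustrative
events.sort()                        # ties on C(e) broken by lower process ID
print(events)                        # [(2, 3), (3, 1), (3, 2)]
```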
Application: Distributed Mutual Exclusion
This algorithm uses Lamport's total order to implement a lock without shared memory, demonstrating the power of logical timestamps in coordination protocols.
Protocol
Request: A process Pi wanting to acquire a lock sends a timestamped request message to all other processes and adds the request to its own local queue.
Receipt: When a process Pj receives a request, it places it in its local queue, ordered by the request's total order (timestamp, then process ID), and sends an acknowledgment (ACK) back to Pi.
Acquire Lock: Process Pi can acquire the lock only when two conditions are met:
Its own request is at the top of its local queue.
It has received messages (either ACKs or later-timestamped requests) from all other processes. This confirms that no other process has an older, outstanding request.
Release Lock: To release the lock, the process removes its request from its own queue and sends an unlock message to all other processes, which then remove the corresponding entry from their queues.
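The following is a minimal Python sketch of one node's bookkeeping under this protocol. Message transport is abstracted away (a real node would broadcast requests, ACKs, and unlocks), and it assumes FIFO channels, so any message arriving after our request carries a later timestamp:

```python
import heapq

class LamportMutexNode:
    def __init__(self, pid, peers):
        self.pid = pid            # this node's process ID
        self.peers = set(peers)   # IDs of all other processes
        self.queue = []           # min-heap of (timestamp, pid): the total order
        self.heard_from = set()   # peers that answered since our last request

    def request(self, ts):
        """Queue our own request; a real node would also broadcast it."""
        heapq.heappush(self.queue, (ts, self.pid))
        self.heard_from.clear()

    def on_request(self, ts, pid):
        """A peer's request arrives: queue it; a real node would ACK it.
        With FIFO channels, a request arriving after ours carries a later
        timestamp, so it doubles as that peer's acknowledgment."""
        heapq.heappush(self.queue, (ts, pid))
        self.heard_from.add(pid)

    def on_ack(self, pid):
        self.heard_from.add(pid)

    def may_enter(self):
        """Lock acquired iff our request heads the queue and all peers answered."""
        return (bool(self.queue)
                and self.queue[0][1] == self.pid
                and self.heard_from == self.peers)

    def release(self):
        """Call only after may_enter() was true: drop our own entry;
        a real node would broadcast an unlock message here."""
        heapq.heappop(self.queue)

    def on_unlock(self, pid):
        """A peer released the lock: remove its entry from our queue."""
        self.queue = [(t, p) for (t, p) in self.queue if p != pid]
        heapq.heapify(self.queue)
```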
Message Complexity
This algorithm requires 3(N-1) messages per lock-unlock cycle:
N-1 request messages
N-1 ACK messages
N-1 unlock messages
Optimization: This can be reduced to 2(N-1) messages by deferring an ACK if the receiving node has an older pending request. Its eventual unlock message then serves as the implicit acknowledgment.
Correctness
The algorithm guarantees mutual exclusion because:
All processes agree on the same total order of requests
A process can only enter the critical section when its request is at the head of this order
The requirement to hear from all other processes ensures no older request exists
Lamport's Physical Clock
Logical clocks are insufficient for real-world scenarios that depend on wall-clock time, and physical timestamps are vulnerable to anomalies caused by clock drift. Lamport's Physical Clock establishes conditions that guarantee that if event a happened before b in real time, their timestamps will reflect this order.
The Clock Drift Problem
Computer clocks are imperfect and can drift, running faster or slower than real time. This drift can be:
Individual: One clock relative to real time
Mutual: Two clocks relative to each other
Real-World Anomaly Example
Consider an email system:
Alice sends an email at 10:00:05 AM (according to her clock)
Bob receives it instantly and replies at 10:00:03 AM (according to his clock)
Alice receives the reply at 10:00:04 AM (according to her clock)
The timestamps suggest Bob replied before Alice sent the original message, creating a causality violation.
Physical Clock Conditions (PC1 & PC2)
To prevent time-related anomalies, two conditions must be met:
PC1 - Bounded Individual Drift: The rate of drift of any clock Ci(t) with respect to real time t must be bounded by a very small constant k:

|dCi(t)/dt - 1| < k

PC2 - Bounded Mutual Drift: The difference in readings between any two clocks Ci(t) and Cj(t) at the same real time t must be bounded by a small constant ε:

|Ci(t) - Cj(t)| < ε
Key Insight
Essentially, these conditions ensure that the inter-process communication time μ is significantly larger than any potential clock drift. Formally:
μ ≥ ε / (1 - k)

If communication time dominates drift, anomalies can be avoided. This relationship shows that physical clocks can be synchronized well enough for distributed systems as long as the network is reasonably fast relative to clock drift rates.
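A quick worked check of the bound, with illustrative numbers:

```python
epsilon = 1e-3   # mutual drift bound ε, in seconds (illustrative)
k = 1e-6         # individual drift rate bound k (illustrative)

mu_min = epsilon / (1 - k)   # smallest safe inter-process communication time
print(f"mu must be at least {mu_min * 1e3:.6f} ms")   # ≈ 1.000001 ms
```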
Practical Implementation
In practice, this is achieved through clock synchronization protocols such as:
Network Time Protocol (NTP): Synchronizes clocks across the Internet
Precision Time Protocol (PTP): Provides microsecond-level synchronization in LANs
GPS-based synchronization: Uses atomic clocks from satellites
Communication and Remote Procedure Calls
Components of RPC Latency
The end-to-end latency of an RPC involves numerous steps, with overhead added by both software and hardware. Understanding each step is crucial for identifying optimization opportunities.
Step 1 - Client Call (Software): The client application sets up arguments and makes a kernel call; the kernel marshals the arguments into a packet.
Step 2 - Controller Latency (Hardware): The network controller DMAs the packet from system memory to its buffer and puts it on the wire.
Step 3 - Time on Wire (Physical): The packet travels across the network to the server.
Step 4 - Interrupt Handling (Software/Hardware): The server's network card interrupts the OS, which moves the packet from the controller into memory.
Step 5 - Server Setup (Software): The OS locates and dispatches the server procedure, unmarshaling the arguments from the packet.
Step 6 - Server Execution (Application): The server procedure executes its logic and prepares a reply.
Steps 7-10 - Return Path (Hardware/Physical/Software): Steps 2, 3, and 4 repeat in reverse as the reply is sent back to the client.
Step 11 - Client Setup (Software): The client OS receives the interrupt for the reply and reactivates the original client process to receive the results.
Dominant Overhead Sources
The key sources of software overhead that OS designers focus on optimizing are:
Marshaling and data copying (Steps 1, 5, 11)
Control transfers/context switches (Steps 1, 5, 11)
Protocol processing (Steps 1-5, 7-11)
Marshaling and Data Copying
This is often the largest source of overhead. A naive implementation can involve three distinct memory copies before a packet is sent.
The Three-Copy Problem
Client Stub Copy: The client stub copies arguments from the process stack into a contiguous RPC message buffer in user space.
Kernel Copy: The kernel copies the RPC message from user space into its own internal buffer.
DMA Copy: The network controller DMAs the message from the kernel's buffer to its own hardware buffer.
Optimization Techniques
The DMA copy is generally unavoidable due to hardware constraints. However, the first two copies can be eliminated or reduced:
Push Stub into Kernel:
The client stub code can be installed directly into the kernel at bind time
The stub can then marshal arguments directly from the client's stack into the kernel buffer
Eliminates the intermediate user-space copy
Reduces data movement from 3 copies to 2 copies
Shared Descriptors:
The user-space stub creates a "shared descriptor" that describes the layout (address and length) of each argument on the stack
The kernel reads this descriptor and performs a "gather" operation
Copies disparate data directly into its network buffer
Again avoids the intermediate copy
More flexible than pushing stub into kernel, maintains user-space programming model
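As a rough user-level analogy for the kernel's gather operation, Python's POSIX sendmsg() accepts a list of separate buffers and gathers them into a single outgoing datagram; the buffers and endpoint below are illustrative:

```python
import socket

# Two argument buffers sitting in different places "in memory"; a shared
# descriptor is, in effect, the list of (address, length) pairs below.
arg1 = b"\x00\x00\x00\x2a"   # a 4-byte integer argument
arg2 = b"hello, rpc"         # a string argument

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# sendmsg() gathers the separate buffers into one outgoing datagram,
# with no intermediate user-space concatenation copy.
sock.sendmsg([arg1, arg2], [], 0, ("127.0.0.1", 9000))
```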
Performance Impact
Traditional approach: Stack -> User buffer -> Kernel buffer -> NIC buffer (3 copies)
Optimized approach: Stack -> Kernel buffer -> NIC buffer (2 copies)
Reduction: ~33% fewer memory operations

Control Transfer (Context Switches)
A full RPC round-trip can involve four context switches, two of which are on the critical path for latency.
The Four Context Switches
Switch 1 (Client): Client calls RPC → OS blocks client → OS switches to another process C1. (Not critical)
Switch 2 (Server): RPC arrives → OS switches from current process S1 to the server process S. (Critical)
Switch 3 (Server): Server replies → OS switches from S to another process S2. (Not critical)
Switch 4 (Client): Reply arrives → OS switches from current process C2 back to the original client C. (Critical)
Optimization Techniques
Reduce to Two Switches: The non-critical switches (1 and 3) can be overlapped with network transmission time, effectively removing them from the latency calculation. The OS can schedule other work while the packet is in transit.
Reduce to One Switch: For very fast RPCs on a LAN, the client can spin-wait instead of being context-switched out:
Eliminates switches 1 and 4
Client remains scheduled, polling for the reply
Underutilizes the client CPU but minimizes latency
Only remaining switch is the critical one on the server side (Switch 2)
Effective when RPC latency < scheduling quantum
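A minimal sketch of client-side spin-waiting on a non-blocking socket (the deadline value is illustrative):

```python
import socket
import time

def spin_wait_for_reply(sock, deadline_s=0.001):
    """Poll a non-blocking socket for the RPC reply instead of blocking.
    Burns CPU on purpose: the client stays scheduled, so no context
    switch is paid when the reply lands."""
    sock.setblocking(False)
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        try:
            return sock.recv(4096)   # reply arrived while we were spinning
        except BlockingIOError:
            pass                     # nothing yet; keep polling
    return None                      # give up: fall back to blocking or resend
```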
Trade-offs
Four switches: Maximum CPU utilization, higher latency
Two switches: Balanced approach, overlap with I/O
One switch: Minimum latency, CPU underutilization
Zero switches: Only possible with dedicated hardware (e.g., RDMA)

Protocol Processing
For reliable LANs, general-purpose protocols like TCP/IP introduce unnecessary overhead. A leaner, RPC-specific protocol can be designed by making certain assumptions about the network and RPC semantics.
Optimization Techniques
Eliminate Low-Level ACKs:
In RPC, the reply from the server serves as an implicit acknowledgment of the request
The client's next action (or timeout) serves as an implicit ACK of the reply
Eliminates the need for separate, low-level ACK packets
Reduces message count by ~40%
Use Hardware Checksum:
Offload checksum calculation from software to the network hardware
Modern NICs support this in hardware at line speed
Frees CPU cycles for application work
Avoid Client-Side Buffering:
Since the client is blocked waiting for a response, if a request is lost, it can be regenerated and resent
No need for the OS to buffer the outgoing packet after transmission
Saves kernel memory and reduces buffer management overhead
Overlap Server-Side Buffering:
The server should buffer the reply in case a retransmission is needed
This buffering can be overlapped with the actual transmission of the reply packet
Hides buffering latency from the critical path
Reply buffer can be freed after timeout period or upon receiving next request from same client
RPC-Specific Protocol Features
A specialized RPC protocol might include:
Connection-less operation for single request-reply exchanges
Idempotency tracking to handle duplicate requests safely
At-most-once semantics through request IDs
Adaptive timeout based on measured RTT
Batching for multiple small RPCs to the same server
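A minimal sketch of such a lean protocol's server side, assuming a hypothetical header layout (request ID, client ID, procedure number) and a stand-in dispatch table:

```python
import struct

# Hypothetical header layout: request ID, client ID, procedure number.
HEADER = struct.Struct("!IIH")

def pack_request(request_id, client_id, proc, payload):
    return HEADER.pack(request_id, client_id, proc) + payload

def dispatch(proc, body):
    return b"ok"   # stand-in for the real procedure table

seen = {}   # (client_id, request_id) -> buffered reply

def handle(packet):
    request_id, client_id, proc = HEADER.unpack_from(packet)
    key = (client_id, request_id)
    if key in seen:
        return seen[key]   # duplicate request: resend reply, don't re-execute
    reply = dispatch(proc, packet[HEADER.size:])
    seen[key] = reply      # buffer the reply in case a retransmission arrives
    # No explicit ACKs anywhere: the reply acknowledges the request, and the
    # client's next request (or a timeout) acknowledges the reply.
    return reply
```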
Active Networks and Software Defined Networking
The Vision of Active Networks
Active Networks is a research paradigm that explores making the network itself programmable and intelligent, representing a visionary approach to network architecture that has influenced modern cloud computing and SDN.
Instead of routers being passive devices that simply forward packets based on static routing tables, Active Networks proposes that packets can carry code that is executed by intermediate routers. This allows for customized, application-specific routing and in-network processing.
Key Concepts
Programmable Routers: Network nodes can execute application-supplied code, enabling dynamic behavior adaptation.
In-Network Processing: Computation can occur at intermediate nodes, not just endpoints.
Application-Specific Behavior: Each application can define custom packet processing logic.
Example Use Case
A single "greeting" message could be sent, and an active router near the destination could execute code to demultiplex it into multiple copies for different recipients, saving bandwidth:
Traditional approach:
Client -> Send N messages -> Router (forward) -> N recipients
Bandwidth: N × message_size
Active Networks approach:
Client -> Send 1 message with demux code -> Router (execute, replicate) -> N recipients
Bandwidth: 1 × message_size + code_size (amortized)

The ANTS Toolkit: A Practical Approach
The Active Node Transfer System (ANTS) toolkit was developed to implement this vision practically, without requiring a complete overhaul of the internet.
Design Principles
Application-Level Toolkit:
ANTS operates as a user-level library
Avoids complex kernel modifications
Can be deployed incrementally
Edge-Focused:
Assumes that intelligence (active nodes) resides at the edges of the network
Leaves the core IP network unchanged
Pragmatic approach to deployment
The ANTS Capsule
The ANTS toolkit wraps the application payload into a "capsule":
Structure: An ANTS packet consists of [IP Header | ANTS Header | Payload].
ANTS Header: Contains two key fields:
type: A cryptographic fingerprint (e.g., MD5 hash) of the code needed to process the capsule
prev: The address of the upstream node that last processed a capsule of this type
Capsule Implementation and Processing
ANTS uses a "code-by-reference" model to avoid sending executable code in every packet:
Code Retrieval: When an active node receives a capsule, it checks its local cache ("soft-store") for the code corresponding to the capsule's type field.
Cache Miss: If the code is not in the cache, the node requests it from the prev node specified in the header. The previous node, having just processed the capsule, likely has the code cached.
Verification: Upon receiving the code, the node computes its hash and verifies that it matches the type fingerprint in the capsule to prevent code spoofing.
Execution and Caching: The node executes the verified code to process the capsule (e.g., forward it, modify it, duplicate it) and caches the code for future packets in the same flow.
Failure: If a node cannot retrieve the code, it simply drops the capsule, relying on higher-level transport protocols for retransmission.
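A minimal sketch of the verification step, assuming MD5 fingerprints as described above (names are illustrative):

```python
import hashlib

soft_store = {}   # type fingerprint -> verified protocol code (the node's cache)

def on_code_received(capsule_type, code):
    """Verify code fetched from the prev node against the capsule's
    type field before caching and executing it."""
    if hashlib.md5(code).digest() != capsule_type:
        return False                  # fingerprint mismatch: spoofed or corrupted
    soft_store[capsule_type] = code   # cache for later capsules of this flow
    return True
```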
Processing Flow:
Capsule arrives at active node
↓
Check local cache for code (type field)
↓
Cache hit? → Execute code
↓
Cache miss? → Request code from prev node
↓
Verify hash matches type
↓
Cache code
↓
Execute code

Feasibility and Evolution to SDN
While a powerful concept, Active Networks faced significant roadblocks in achieving widespread adoption.
Challenges
Vendor Resistance: Reluctance of router vendors to open their hardware platforms to arbitrary code execution.
Performance: Immense challenge of software-based routing in the high-speed internet core where hardware switching dominates.
Security: Concerns about malicious code execution in critical network infrastructure.
Standardization: Difficulty in establishing common frameworks and APIs across vendors.
Evolution to Software Defined Networking
However, the core idea of virtualizing the network and separating the control plane from the data plane has been highly influential. This vision has found new life in Software Defined Networking (SDN), a dominant paradigm in modern data centers and cloud computing.
Key SDN Principles from Active Networks:
Separation of control and data planes
Programmable network behavior
Network virtualization
Centralized control with distributed forwarding
Modern Applications:
Multi-tenant cloud environments
Traffic isolation in shared infrastructure
Dynamic policy enforcement
Network function virtualization (NFV)
Examples:
OpenFlow protocol
Google's B4 wide-area network
Amazon VPC (Virtual Private Cloud)
Azure Virtual Network
Summary
This document explored four fundamental pillars of distributed systems communication:
Event Ordering: The "Happened Before" relationship provides a formal foundation for reasoning about causality in distributed systems without requiring global clocks. It establishes a partial order of events based on process execution order and message passing, explicitly acknowledging the existence of concurrent events that cannot be definitively ordered.
Clocks and Synchronization: Lamport's Clocks build on the event ordering foundation to assign timestamps. Logical Clocks establish a partial order using timestamps, which can be extended to a total order through deterministic tie-breaking rules—a concept essential for algorithms like distributed mutual exclusion. To address anomalies in real-world scenarios, Lamport's Physical Clocks introduce conditions to bound clock drift relative to inter-process communication time, ensuring that timestamps reflect actual causality.
Communication Optimization: Optimizing communication latency requires a systematic approach to reducing overhead at multiple layers. The three key optimization areas are: (1) Marshaling/Data Copying—reduce memory copies from 3 to 2 through kernel-integrated stubs or shared descriptors; (2) Context Switches—reduce critical switches from 4 to 1 through overlapping and spin-waiting techniques; (3) Protocol Processing—design lean, RPC-specific protocols that eliminate unnecessary acknowledgments and leverage hardware offloads.
Programmable Networks: Active Networks proposed making routers intelligent and programmable through in-network code execution. While not widely adopted in its original form, its core concepts—separation of control and data planes, network virtualization, and programmable behavior—have profoundly influenced modern Software Defined Networking (SDN), which dominates cloud and data center networking today. SDN enables dynamic policy enforcement, multi-tenant isolation, and network function virtualization essential for modern cloud infrastructure.
Together, these mechanisms enable building efficient distributed systems that can reduce RPC latency by orders of magnitude, maintain correct temporal ordering and causality tracking, and dynamically adapt network behavior to application requirements.
References
Lamport, L. (1978). "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM, 21(7), 558-565.
Birrell, A. D., & Nelson, B. J. (1984). "Implementing Remote Procedure Calls." ACM Transactions on Computer Systems, 2(1), 39-59.
Schroeder, M. D., & Burrows, M. (1990). "Performance of Firefly RPC." ACM Transactions on Computer Systems, 8(1), 1-17.
Thekkath, C. A., et al. (1993). "Implementing Network Protocols at User Level." IEEE/ACM Transactions on Networking, 1(5), 554-565.
Ricciardi, A., & Birman, K. (1991). "Using Process Groups to Implement Failure Detection in Asynchronous Environments." Proceedings of the 10th Annual ACM Symposium on Principles of Distributed Computing.
Wetherall, D. J., et al. (1998). "ANTS: A Toolkit for Building and Dynamically Deploying Network Protocols." IEEE OPENARCH.
Tennenhouse, D. L., & Wetherall, D. J. (1996). "Towards an Active Network Architecture." ACM SIGCOMM Computer Communication Review, 26(2), 5-17.
McKeown, N., et al. (2008). "OpenFlow: Enabling Innovation in Campus Networks." ACM SIGCOMM Computer Communication Review, 38(2), 69-74.
Graduate courses from Georgia Institute of Technology and Columbia University