03 Synchronization

Overview

Synchronization is a fundamental mechanism in concurrent and parallel programming that coordinates the execution of multiple threads or processes to ensure proper order of execution, prevent race conditions, and avoid data corruption. As modern systems increasingly rely on parallelism across multiple cores and processors, effective synchronization becomes critical for correctness and performance.

Synchronization mechanisms range from basic primitives like mutexes and semaphores used in single-machine multithreading to advanced algorithms like queueing locks and distributed barriers designed for scalable parallel systems. The choice of synchronization mechanism depends on factors including contention levels, architectural characteristics (cache-coherent vs. non-cache-coherent), and scalability requirements.

Key Challenges

Correctness:

  • Preventing race conditions when multiple threads access shared data

  • Ensuring atomicity of compound operations

  • Maintaining data consistency across execution units

Performance:

  • Minimizing latency when acquiring free locks

  • Reducing contention when multiple threads compete for resources

  • Avoiding excessive cache coherence traffic

Scalability:

  • Ensuring performance improves with additional processors

  • Managing overhead that grows with system size

  • Balancing fairness with efficiency

Fundamentals

Race Conditions

A race condition occurs when multiple threads access shared resources simultaneously, and the final result depends on the unpredictable timing or ordering of thread execution. Race conditions lead to non-deterministic behavior and program errors.

Example Scenario:

// Shared variable
int counter = 0;

// Two threads executing simultaneously
Thread 1: counter = counter + 1;  // Read: 0, Write: 1
Thread 2: counter = counter + 1;  // Read: 0, Write: 1
// Expected result: 2, Actual result: 1

Without synchronization, both threads may read the same initial value, perform their increment, and write back the same incremented value, losing one update.

Critical Sections

A critical section is a segment of code that accesses shared resources and must not be executed by more than one thread at a time. Synchronization mechanisms protect critical sections to ensure mutual exclusion.

Properties of Correct Critical Section Implementation:

  1. Mutual Exclusion: Only one thread can execute in the critical section at a time

  2. Progress: If no thread is in the critical section and threads wish to enter, selection of the next thread cannot be postponed indefinitely

  3. Bounded Waiting: A limit exists on the number of times other threads can enter the critical section after a thread requests entry

  4. Performance: Overhead of entering and exiting the critical section should be minimal

Thread Safety

Thread safety refers to code that behaves correctly when accessed by multiple threads simultaneously. Thread-safe code avoids data races and maintains invariants even under concurrent access.

Approaches to Thread Safety:

  1. Synchronization: Use locks, semaphores, or other primitives to serialize access

  2. Immutability: Use immutable data structures that cannot be modified after creation

  3. Thread-Local Storage: Give each thread its own copy of data (see the sketch after this list)

  4. Lock-Free Programming: Use atomic operations and careful algorithm design
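
As a quick illustration of approach 3, here is a minimal C11 sketch (the counter name is illustrative): each thread increments its own _Thread_local copy, so no locking is needed.

// Each thread sees its own independent copy of this counter (C11 _Thread_local),
// so concurrent increments never race with one another.
static _Thread_local int local_counter = 0;

void record_event(void) {
    local_counter++;  // touches only the calling thread's copy
}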

Deadlock

Deadlock is a situation in which two or more threads are permanently blocked, each waiting for a resource held by another thread in the group, so that none of them can ever proceed. Deadlocks result from improper synchronization and can freeze program execution.

Four Necessary Conditions for Deadlock:

  1. Mutual Exclusion: Resources cannot be shared

  2. Hold and Wait: Threads hold resources while waiting for others

  3. No Preemption: Resources cannot be forcibly taken from threads

  4. Circular Wait: A circular chain of threads exists, each waiting for a resource held by the next

Prevention Strategies:

  • Lock ordering: Always acquire locks in the same order (see the sketch after this list)

  • Timeout mechanisms: Release locks if acquisition takes too long

  • Deadlock detection: Periodically check for cycles and break them
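
As a quick illustration of lock ordering, here is a minimal sketch using a hypothetical Account type: both mutexes are always acquired in a fixed global order (here, by address), so two concurrent transfers can never form a circular wait.

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    int balance;
} Account;

void transfer(Account *from, Account *to, int amount) {
    // Fixed global order (by address) prevents circular wait
    Account *first  = (from < to) ? from : to;
    Account *second = (from < to) ? to : from;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);

    from->balance -= amount;
    to->balance   += amount;

    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}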

Basic Synchronization Primitives

Locks (Mutexes)

A mutex (mutual exclusion lock) is the most fundamental synchronization primitive. It allows threads to acquire exclusive access to shared resources, ensuring only one thread executes in the critical section at a time.

Exclusive Lock

An exclusive or mutual exclusion lock can be held by only one thread at a time. While one thread holds the lock, all other threads attempting to acquire it must wait.

Operations:

  • lock(): Acquire the lock, blocking if already held

  • unlock(): Release the lock

  • trylock(): Attempt to acquire without blocking, returns success/failure

Use Cases:

  • Protecting shared data structures during modification

  • Ensuring atomicity of compound operations

  • Preventing race conditions

Shared Lock (Read-Write Lock)

A shared lock allows multiple threads to hold the lock simultaneously for read-only access while providing exclusive access for writes. This enables concurrent readers while still providing mutual exclusion with writers.

Operations:

  • read_lock(): Acquire shared read access (multiple readers allowed)

  • write_lock(): Acquire exclusive write access (only one writer)

  • unlock(): Release the lock

Use Cases:

  • Data structures with frequent reads and infrequent writes

  • Database systems with read-heavy workloads

  • Configuration data accessed by many threads

POSIX Mutex Example

#include <pthread.h>
#include <stdio.h>

// Shared data
int sharedValue = 0;
pthread_mutex_t mutex;

void* incrementSharedValue(void* arg) {
    for (int i = 0; i < 100000; i++) {
        // Lock the mutex to protect the critical section
        pthread_mutex_lock(&mutex);
        
        // Critical section: increment the shared value
        sharedValue++;
        
        // Unlock the mutex to release the critical section
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main() {
    // Initialize the mutex
    pthread_mutex_init(&mutex, NULL);
    
    // Create two threads
    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, incrementSharedValue, NULL);
    pthread_create(&thread2, NULL, incrementSharedValue, NULL);
    
    // Wait for threads to finish
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    
    // Destroy the mutex
    pthread_mutex_destroy(&mutex);
    
    printf("Final Value: %d\n", sharedValue);
    return 0;
}

Read-Write Lock Example

#include <pthread.h>
#include <stdio.h>

int sharedData = 0;
pthread_rwlock_t rwlock;

void* reader(void* arg) {
    pthread_rwlock_rdlock(&rwlock);
    printf("Reader %ld: Read data: %d\n", (long)arg, sharedData);
    pthread_rwlock_unlock(&rwlock);
    return NULL;
}

void* writer(void* arg) {
    pthread_rwlock_wrlock(&rwlock);
    sharedData++;
    printf("Writer %ld: Wrote data: %d\n", (long)arg, sharedData);
    pthread_rwlock_unlock(&rwlock);
    return NULL;
}

int main() {
    pthread_rwlock_init(&rwlock, NULL);
    
    // Create reader and writer threads
    pthread_t readerThreads[3], writerThreads[2];
    for (long i = 0; i < 3; i++) {
        pthread_create(&readerThreads[i], NULL, reader, (void*)i);
    }
    for (long i = 0; i < 2; i++) {
        pthread_create(&writerThreads[i], NULL, writer, (void*)i);
    }
    
    // Wait for threads to finish
    for (long i = 0; i < 3; i++) {
        pthread_join(readerThreads[i], NULL);
    }
    for (long i = 0; i < 2; i++) {
        pthread_join(writerThreads[i], NULL);
    }
    
    pthread_rwlock_destroy(&rwlock);
    return 0;
}

Semaphores

Semaphores are synchronization primitives that maintain a count and allow a specified number of threads to access a shared resource concurrently. They are more general than mutexes and can coordinate multiple threads.

Types:

  1. Binary Semaphore: Count is either 0 or 1, similar to a mutex

  2. Counting Semaphore: Count can be any non-negative integer

Operations:

  • wait() or P(): Decrement the count, blocking if count is 0

  • signal() or V(): Increment the count, potentially waking a waiting thread

Use Cases:

  • Limiting concurrent access to a resource pool

  • Producer-consumer synchronization

  • Signaling between threads

POSIX Semaphore Example

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

int sharedData = 0;
sem_t semaphore;

void* producer(void* arg) {
    for (int i = 1; i <= 10; i++) {
        // Produce data
        sharedData = i;
        printf("Produced: %d\n", sharedData);
        
        // Signal the semaphore to indicate new data is available
        sem_post(&semaphore);
        
        sleep(1);
    }
    return NULL;
}

void* consumer(void* arg) {
    for (int i = 1; i <= 10; i++) {
        // Wait for new data to be available
        sem_wait(&semaphore);
        
        // Consume the data
        printf("Consumed: %d\n", sharedData);
        sharedData = 0;
        
        sleep(1);
    }
    return NULL;
}

int main() {
    // Initialize the semaphore with initial value 0
    sem_init(&semaphore, 0, 0);
    
    pthread_t producerThread, consumerThread;
    pthread_create(&producerThread, NULL, producer, NULL);
    pthread_create(&consumerThread, NULL, consumer, NULL);
    
    pthread_join(producerThread, NULL);
    pthread_join(consumerThread, NULL);
    
    sem_destroy(&semaphore);
    return 0;
}

Condition Variables

Condition variables provide a mechanism for threads to wait until a specific condition is met before proceeding. They are always used in conjunction with a mutex to protect the condition check.

Operations:

  • wait(mutex): Atomically release mutex and wait for signal

  • signal(): Wake up one waiting thread

  • broadcast(): Wake up all waiting threads

Use Cases:

  • Producer-consumer patterns with complex conditions

  • Implementing monitors

  • Thread coordination based on state changes

Condition Variable Example

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

int sharedData = 0;
pthread_mutex_t mutex;
pthread_cond_t condition;

void* producer(void* arg) {
    for (int i = 1; i <= 10; i++) {
        pthread_mutex_lock(&mutex);
        
        sharedData = i;
        printf("Produced: %d\n", sharedData);
        
        // Signal the consumer that new data is available
        pthread_cond_signal(&condition);
        
        pthread_mutex_unlock(&mutex);
        sleep(1);
    }
    return NULL;
}

void* consumer(void* arg) {
    for (int i = 1; i <= 10; i++) {
        pthread_mutex_lock(&mutex);
        
        // Wait for new data to be available
        while (sharedData == 0) {
            pthread_cond_wait(&condition, &mutex);
        }
        
        printf("Consumed: %d\n", sharedData);
        sharedData = 0;
        
        pthread_mutex_unlock(&mutex);
    }
    return NULL;
}

int main() {
    pthread_mutex_init(&mutex, NULL);
    pthread_cond_init(&condition, NULL);
    
    pthread_t producerThread, consumerThread;
    pthread_create(&producerThread, NULL, producer, NULL);
    pthread_create(&consumerThread, NULL, consumer, NULL);
    
    pthread_join(producerThread, NULL);
    pthread_join(consumerThread, NULL);
    
    pthread_mutex_destroy(&mutex);
    pthread_cond_destroy(&condition);
    return 0;
}

Atomic Operations

Individual load and store instructions are atomic on most architectures, but implementing synchronization primitives requires atomicity across a read and a subsequent write. This necessitates specialized Read-Modify-Write (RMW) instructions that execute as indivisible units.

Read-Modify-Write (RMW) Instructions

RMW operations are sometimes generically called Fetch and φ instructions, where φ represents some operation. These instructions read from memory, modify the value in a register, and write it back atomically.

Test and Set

Returns the current value of a memory location and atomically sets it to one.

// Pseudocode: the entire body executes as one indivisible hardware instruction
int TestAndSet(int *L) {
    int old = *L;   // capture the previous value
    *L = 1;         // unconditionally set to "locked"
    return old;
}

Use Case: Basic spinlock implementation

Properties:

  • Single atomic operation

  • Returns previous state

  • Sets to locked state (1)

Fetch and Increment

Fetches the current value and atomically increments the memory location.

// Pseudocode: executes atomically as a single hardware instruction
int FetchAndIncrement(int *L) {
    int old = *L;
    *L = old + 1;
    return old;
}

Use Case:

  • Ticket locks for fair ordering

  • Generating unique sequence numbers

  • Atomic counters

Properties:

  • Returns old value

  • Increments by 1 (or specified amount)

  • Useful for FIFO ordering

Fetch and Store

Atomically returns the old value and stores a new value.

// Pseudocode: executes atomically as a single hardware instruction
int FetchAndStore(int *L, int new_value) {
    int old = *L;
    *L = new_value;
    return old;
}

Use Case:

  • MCS lock implementation

  • Atomic swaps

  • Building lock-free data structures

Properties:

  • Atomic swap operation

  • Returns previous value

  • Stores arbitrary new value

Compare and Swap (CAS)

Conditionally updates a memory location if its current value matches an expected value.

// Pseudocode: the comparison and the conditional store happen atomically
bool CompareAndSwap(int *L, int expected, int new_value) {
    if (*L == expected) {
        *L = new_value;
        return true;
    }
    return false;
}

Use Case:

  • Lock-free algorithms

  • Handling race conditions in complex scenarios

  • Building sophisticated synchronization primitives

Properties:

  • Conditional atomic update

  • Returns success/failure

  • Enables optimistic concurrency

  • Foundation for lock-free programming
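
As an illustration of the optimistic-concurrency pattern, here is a minimal C11 sketch of an atomic increment built from a CAS retry loop (the counter name is illustrative): the update is attempted repeatedly until no other thread has modified the value in between.

#include <stdatomic.h>

static atomic_int counter = 0;

void atomic_increment(void) {
    int expected = atomic_load(&counter);
    // Retry until the CAS succeeds; on failure, 'expected' is reloaded
    // with the value another thread wrote in the meantime.
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1)) {
        // loop and retry
    }
}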

Fetch and Decrement

Atomically fetches the current value and decrements it.

// Pseudocode: executes atomically as a single hardware instruction
int FetchAndDecrement(int *L) {
    int old = *L;
    *L = old - 1;
    return old;
}

Use Case:

  • Counting barriers

  • Reference counting

  • Resource allocation

Advanced Lock Algorithms

Implementing scalable synchronization primitives requires careful algorithm design to minimize latency, contention, and network traffic in parallel systems.

Performance Metrics

Latency: Time required for a thread to acquire a free lock. This should be minimized for optimal performance in the uncontended case.

Waiting Time: Time a thread waits when the lock is held by another thread. Largely application-dependent but affected by implementation fairness.

Contention: Time required for one thread to acquire a lock when multiple threads simultaneously attempt acquisition after a release. Critical for scalability in high-contention scenarios.

Spinlock Implementations

Naive Spinlock (Spin on Test and Set)

The simplest spinlock implementation where threads repeatedly execute TestAndSet until successful.

void Lock(int *L) {
    while (TestAndSet(L) == 1) {
        // spin (busy wait)
    }
}

void Unlock(int *L) {
    *L = 0;
}
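
As a concrete illustration, here is a minimal C11 sketch of the same spin-on-test-and-set lock using atomic_flag from <stdatomic.h> (function names are illustrative):

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void spin_lock(void) {
    // atomic_flag_test_and_set atomically sets the flag and returns its old value
    while (atomic_flag_test_and_set(&lock_flag)) {
        // spin (busy wait)
    }
}

void spin_unlock(void) {
    atomic_flag_clear(&lock_flag);
}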

Problems:

  • Heavy network contention from continuous TestAndSet operations

  • Cannot exploit caches (TestAndSet bypasses cache, accesses memory directly)

  • Disrupts useful work on other processors

  • O(N) bus transactions per lock acquisition under contention

When to Use:

  • Very low contention scenarios

  • When advanced RMW instructions are unavailable

  • Short critical sections

Caching Spinlock (Spin on Read)

Threads spin on a cached copy of the lock variable, only attempting TestAndSet when they observe the lock becoming free through cache coherence.

void Lock(int *L) {
    while (true) {
        // Spin on cached value
        while (*L == 1) {
            // read from cache
        }
        // Lock appears free, try to acquire
        if (TestAndSet(L) == 0) {
            return;  // acquired
        }
    }
}

void Unlock(int *L) {
    *L = 0;
}

Improvement:

  • Reduces contention while lock is held

  • Exploits cache coherence mechanisms

  • Less network traffic than the naive spinlock

Remaining Problem:

  • Thundering herd when lock is released

  • All waiters simultaneously attempt TestAndSet

  • O(N²) bus transactions for write-invalidate coherence

  • Still generates significant contention on release

Spinlock with Exponential Backoff

After failing to acquire a lock, threads delay for a period before retrying. The delay increases exponentially if contention persists, dynamically adapting to load.

void Lock(int *L) {
    int delay = BASE_DELAY;
    while (TestAndSet(L) == 1) {
        sleep(delay);                       // back off before retrying
        delay = min(delay * 2, MAX_DELAY);  // double the delay, up to a cap
    }
}

void Unlock(int *L) {
    *L = 0;
}

Advantages:

  • Works on non-cache-coherent (NCC) multiprocessors

  • Adapts to contention dynamically

  • Reduces network traffic significantly

  • Simple to implement

Disadvantages:

  • Non-deterministic acquisition order (unfair)

  • Delay tuning affects performance

  • May introduce unnecessary latency

  • No guarantee of bounded waiting

When to Use:

  • Systems with only TestAndSet available

  • Variable contention scenarios

  • When fairness is not critical

Ticket Lock

Uses two counters to ensure FIFO ordering: next_ticket (incremented for each requester) and now_serving (incremented on release).

typedef struct {
    int next_ticket;
    int now_serving;
} TicketLock;

void Lock(TicketLock *L) {
    int my_ticket = FetchAndIncrement(&L->next_ticket);
    while (L->now_serving != my_ticket) {
        // spin
    }
}

void Unlock(TicketLock *L) {
    L->now_serving++;   // plain write: only the lock holder updates this field
}

Advantages:

  • Fair: first-come, first-served ordering

  • Single RMW operation per acquisition

  • Bounded waiting time

  • Simple implementation

Disadvantages:

  • Still noisy: all waiters observe now_serving updates

  • O(N) cache coherence traffic per release

  • Not scalable for large N

  • All threads spin on same variable

When to Use:

  • Fairness is important

  • Moderate number of threads

  • Cache-coherent systems

Queueing Locks

Queueing locks address fairness and noisiness by giving each waiting thread a unique spin location and signaling only the next thread when the lock is released.

Array-Based Queueing Lock (Anderson)

Uses an array of flags sized to the number of processors, functioning as a circular queue.

typedef struct {
    int flags[N];      // has_lock or must_wait
    int queue_last;
} AndersonLock;

void Lock(AndersonLock *L, int *my_slot) {
    *my_slot = FetchAndIncrement(&L->queue_last) % N;
    while (L->flags[*my_slot] == MUST_WAIT) {
        // spin on unique location
    }
    L->flags[*my_slot] = MUST_WAIT;
}

void Unlock(AndersonLock *L, int my_slot) {
    int next_slot = (my_slot + 1) % N;
    L->flags[next_slot] = HAS_LOCK;
}
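
The code above assumes the lock has been initialized so that the first requester finds its slot in the HAS_LOCK state. A minimal initialization sketch under the same assumptions (the N, HAS_LOCK, and MUST_WAIT constants from the structure above):

void InitLock(AndersonLock *L) {
    L->queue_last = 0;
    L->flags[0] = HAS_LOCK;        // the first requester may proceed immediately
    for (int i = 1; i < N; i++) {
        L->flags[i] = MUST_WAIT;   // everyone else waits until signaled
    }
}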

Advantages:

  • Fair: FIFO ordering

  • Each thread spins on unique location

  • Exactly one thread signaled on release

  • Single RMW per critical section

  • Predictable performance

Disadvantages:

  • O(N) space per lock

  • Wasteful for large N or many locks

  • Array size must accommodate worst-case contention

  • Poor cache utilization if N is large

When to Use:

  • Cache-coherent systems

  • Known maximum thread count

  • High contention scenarios

  • When fairness is critical

List-Based Queueing Lock (MCS)

Uses a linked list of queue nodes with dynamic space allocation, developed by Mellor-Crummey and Scott.

typedef struct QNode {
    bool got_it;
    struct QNode *next;
} QNode;

typedef QNode* MCSLock;

void Lock(MCSLock *L, QNode *my_node) {
    my_node->next = NULL;
    my_node->got_it = false;
    
    QNode *predecessor = FetchAndStore(L, my_node);
    if (predecessor != NULL) {
        predecessor->next = my_node;
        while (!my_node->got_it) {
            // spin on local variable
        }
    }
}

void Unlock(MCSLock *L, QNode *my_node) {
    if (my_node->next == NULL) {
        if (CompareAndSwap(L, my_node, NULL)) {
            return;  // no successor
        }
        // Wait for successor to link in
        while (my_node->next == NULL) {
            // spin briefly
        }
    }
    my_node->next->got_it = true;
}
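
A brief usage sketch (names are illustrative): each thread supplies its own QNode, typically allocated on its stack or in thread-local storage.

MCSLock lock = NULL;   // empty queue: lock is free

void do_critical_work(void) {
    QNode my_node;          // per-thread node; lives only for this acquisition
    Lock(&lock, &my_node);
    // ... critical section ...
    Unlock(&lock, &my_node);
}

Because the node is supplied by the caller rather than embedded in the lock, the lock itself is a single pointer, and memory use grows only with the number of threads actually waiting.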

Advantages:

  • Fair: FIFO ordering

  • Space proportional to actual contention (dynamic)

  • Each thread spins on local variable

  • Typically one RMW per critical section

  • Excellent scalability

Disadvantages:

  • More complex implementation

  • Unlock requires CompareAndSwap for corner cases

  • Two RMW operations in rare scenarios

  • Linked-list maintenance overhead

When to Use:

  • High scalability requirements

  • Unknown or variable thread counts

  • Cache-coherent systems

  • Long critical sections where fairness matters

Algorithm Selection Guidelines

Scenario                                | Recommended Algorithm | Rationale
----------------------------------------|-----------------------|-----------------------------------
Low contention, short critical sections | Exponential backoff   | Simple, adapts well, low overhead
High contention, fairness required      | Anderson or MCS       | Fair ordering, scalable
Only TestAndSet available               | Exponential backoff   | Best option without advanced RMW
Variable contention                     | Exponential backoff   | Adapts dynamically to load
Large-scale parallel systems            | MCS Lock              | Best scalability, dynamic space
Cache-coherent with known N             | Anderson Lock         | Excellent performance, simple
Non-cache-coherent systems              | Exponential backoff   | Doesn't rely on cache coherence

Barrier Synchronization

Barrier synchronization ensures all participating threads reach a designated point before any proceed. This is essential for coordinating phases in parallel algorithms.

Semantics

All threads must arrive at the barrier and wait until all threads have arrived before any can proceed to the next computation phase. This establishes a global synchronization point.

Centralized Barrier (Counting Barrier)

Uses a shared counter initialized to N (the number of threads). Each arriving thread atomically decrements the counter; the last thread to arrive resets it, while the others spin until all threads have arrived.

typedef struct {
    int count;
    int N;
} CountingBarrier;

void Barrier(CountingBarrier *B) {
    if (FetchAndDecrement(&B->count) == 1) {
        // Last thread to arrive: reset the counter for the next barrier episode
        B->count = B->N;
    } else {
        // Spin episode 1: wait for all threads to arrive (count reaches 0)
        while (B->count > 0) {
            // spin
        }
        // Spin episode 2: wait for the reset, so no thread races into the
        // next barrier episode before count is back to N
        while (B->count != B->N) {
            // spin
        }
    }
}

Problem: Race condition if threads proceed before count is reset. Requires two spinning episodes for safety.

Advantages:

  • Simple implementation

  • Easy to understand

Disadvantages:

  • Two spin loops required

  • Centralized counter creates hot spot

  • Not scalable

Sense Reversing Barrier

Eliminates one spin loop by using a sense flag that reverses when all threads arrive.

typedef struct {
    int count;
    int N;
    bool sense;
} SenseBarrier;

// Per-thread variable
thread_local bool local_sense = true;

void Barrier(SenseBarrier *B) {
    local_sense = !local_sense;
    
    if (FetchAndDecrement(&B->count) == 1) {
        // Last thread: reset and release all
        B->count = B->N;
        B->sense = local_sense;
    } else {
        // Wait for sense to change
        while (B->sense != local_sense) {
            // spin
        }
    }
}

Advantages:

  • Single spinning episode

  • Simpler than counting barrier

  • Correct re-entry handling

Disadvantages:

  • Centralized count and sense create hot spots

  • Limited scalability for large N

  • High cache coherence traffic

When to Use:

  • Small to medium number of threads

  • Cache-coherent systems

  • Simple barrier needs

Tree Barrier

Hierarchical barrier using divide-and-conquer to limit sharing to small groups.

Structure:

  • N processors grouped into clusters of K processors

  • Forms a tree with log_K(N) levels

  • Each cluster has local count and sense variables

Arrival Phase:

  1. Thread decrements local group counter

  2. Last thread in group propagates arrival to parent level

  3. Continues up to root

Wake-up Phase:

  1. Root (last arriving thread) reverses sense

  2. Signal propagates down tree level by level

  3. Each parent releases its children
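
A compact sketch of this scheme, assuming the FetchAndDecrement primitive defined earlier, per-cluster nodes initialized with count = k and sense = false, and each thread's local_sense starting at false (the names and recursive structure are illustrative, not a definitive implementation):

typedef struct TreeNode {
    int count;                  // arrivals still expected at this node
    int k;                      // fan-in (cluster size) of this node
    bool sense;                 // release flag, flipped once per barrier episode
    struct TreeNode *parent;    // NULL at the root
} TreeNode;

// Carry an arrival up the tree; 'release_sense' is the value that signals release.
static void arrive(TreeNode *node, bool release_sense) {
    if (FetchAndDecrement(&node->count) == 1) {
        // Last arrival in this cluster: propagate the arrival to the parent
        if (node->parent != NULL) {
            arrive(node->parent, release_sense);
        }
        node->count = node->k;        // reset this cluster for the next episode
        node->sense = release_sense;  // wake-up: release threads spinning here
    } else {
        // Spin on this cluster's local sense until released from above
        while (node->sense != release_sense) {
            // spin
        }
    }
}

void TreeBarrier(TreeNode *my_leaf, bool *local_sense) {
    *local_sense = !*local_sense;     // per-thread sense, as in the sense-reversing barrier
    arrive(my_leaf, *local_sense);
}

The last arrival at each cluster carries the arrival upward; when it returns from the parent, it flips its own cluster's sense, which is how the wake-up propagates back down the tree level by level.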

Advantages:

  • Reduces contention through hierarchical grouping

  • O(log N) communication complexity

  • More scalable than centralized barriers

Disadvantages:

  • Spin locations dynamically determined (problematic for NCC NUMA)

  • Arity K affects contention levels

  • Complex implementation

MCS Barrier

Modified tree barrier with statically determined spin locations, using a 4-ary arrival tree and binary wake-up tree.

Arrival Tree (4-ary):

  • Each child signals parent at unique, pre-assigned location

  • Parent checks ChildNotReady vector

  • Static assignment enables efficient NUMA implementation

Wake-up Tree (Binary):

  • Parent uses ChildPointer structures to signal children

  • Children spin on statically determined locations

  • Binary structure based on theoretical performance results

Advantages:

  • Static spin locations optimize for NUMA

  • Reduces cache coherence traffic

  • Suitable for various architectures

  • Excellent scalability

Disadvantages:

  • More complex implementation

  • Less spatial locality than single cache line per group

  • Requires careful initialization

Tournament Barrier

Organizes barrier as a tournament with predetermined winners advancing each round.

Arrival Phase:

for round = 0 to log(N)-1:
    if I am winner in this round:
        wait for signal from opponent
        advance to next round
    else:
        signal opponent
        break

Wake-up Phase:

for round = log(N)-1 down to 0:
    if I was winner:
        signal opponent in this round

Advantages:

  • Statically determined spin locations (excellent for NCC NUMA)

  • Uses only atomic read and write (no RMW required)

  • O(log N) communication complexity

  • Exploits network parallelism

  • Works with message passing (clusters)

Disadvantages:

  • Cannot pack multiple spins in one cache line

  • Less spatial locality than MCS

  • Fixed tournament structure

Dissemination Barrier

Information diffusion where in round k, processor P_i sends to P_{(i + 2^k) mod N}.

for round = 0 to ceiling(log N) - 1:
    target = (my_id + 2^round) mod N
    send message to target
    wait for message from source = (my_id - 2^round) mod N

Advantages:

  • No hierarchical structure

  • Works on NCC NUMA and clusters

  • Independent, distributed decisions

  • N need not be a power of 2

  • Simple implementation

Disadvantages:

  • O(N log N) total messages

  • Higher communication complexity than tree barriers

  • More network traffic

Barrier Algorithm Comparison

Algorithm      | Communication | Spin Location | RMW Required | Best Architecture
---------------|---------------|---------------|--------------|--------------------
Counting       | O(N)          | Dynamic       | Yes          | Small CC systems
Sense Reversal | O(N)          | Dynamic       | Yes          | Small CC systems
Tree           | O(log N)      | Dynamic       | Yes          | CC systems
MCS            | O(log N)      | Static        | Yes          | NUMA, CC
Tournament     | O(log N)      | Static        | No           | NCC NUMA, Clusters
Dissemination  | O(N log N)    | Static        | No           | NCC NUMA, Clusters

Performance Considerations

Contention and Scalability

As the number of threads or processors increases, synchronization overhead can limit scalability. Key factors include:

Cache Coherence Traffic:

  • Write operations to shared lock variables trigger invalidations

  • More threads mean more cache coherence messages

  • Can saturate interconnection network

Memory Latency:

  • Access to shared synchronization variables may miss cache

  • NUMA effects make remote access expensive

  • Deep memory hierarchies increase latency variance

Fairness vs. Performance:

  • Fair algorithms (FIFO ordering) may have higher overhead

  • Unfair algorithms can lead to starvation

  • Trade-off depends on application requirements

False Sharing

False sharing occurs when independent variables reside in the same cache line, causing unnecessary cache coherence traffic.

Example:

struct {
    int counter_A;  // Updated by Thread 1
    int counter_B;  // Updated by Thread 2
} shared_data;  // Both in same cache line

Solutions:

  • Pad structures to cache line boundaries (typically 64 bytes), as sketched after this list

  • Align frequently updated variables separately

  • Use thread-local storage when possible
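
A minimal sketch of the padding approach; the 64-byte line size is an assumption and should be checked for the target platform:

#define CACHE_LINE_SIZE 64   // assumed cache line size

struct {
    int counter_A;                              // updated by Thread 1
    char pad[CACHE_LINE_SIZE - sizeof(int)];    // pushes counter_B onto its own cache line
    int counter_B;                              // updated by Thread 2
} shared_data;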

Lock-Free Programming

Lock-free programming designs algorithms that guarantee system-wide progress without using locks, typically using atomic operations like CAS.

Advantages:

  • No deadlock possible

  • Better scalability in some scenarios

  • Eliminates lock contention overhead

Disadvantages:

  • Complex algorithm design

  • Difficult to verify correctness

  • May have higher overhead in low-contention scenarios

  • ABA problem requires careful handling

Use Cases:

  • High-contention data structures

  • Real-time systems requiring bounded latency

  • Systems where lock overhead is prohibitive
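
As a small illustration, here is a sketch of a lock-free stack push (a Treiber-style stack) using C11 atomics; it deliberately ignores memory reclamation, and therefore the ABA problem noted above:

#include <stdatomic.h>
#include <stdlib.h>

typedef struct Node {
    int value;
    struct Node *next;
} Node;

static _Atomic(Node *) top = NULL;

void push(int value) {
    Node *n = malloc(sizeof *n);
    if (n == NULL) return;
    n->value = value;
    Node *old_top = atomic_load(&top);
    do {
        n->next = old_top;
        // Retry if another thread changed 'top' since we read it;
        // on failure, old_top is updated to the current head.
    } while (!atomic_compare_exchange_weak(&top, &old_top, n));
}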

Wait-Free Programming

Wait-free programming is stronger than lock-free: every operation completes in a bounded number of steps regardless of other threads' actions.

Properties:

  • Strongest progress guarantee

  • No thread can be indefinitely delayed

  • Ideal for real-time systems

Challenges:

  • Very difficult to implement

  • Often requires helping mechanisms

  • Higher overhead than lock-free
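
For contrast with a CAS retry loop (which is only lock-free), an increment built on an atomic fetch-and-add completes in a bounded number of steps for every thread on hardware that provides such an instruction, which is the wait-free property:

#include <stdatomic.h>

static atomic_int hits = 0;

void count_hit(void) {
    // Single atomic RMW: no retry loop, so progress does not depend on other threads
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}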

Implementation Examples

Producer-Consumer with Semaphores

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define BUFFER_SIZE 10

int buffer[BUFFER_SIZE];
int in = 0, out = 0;

sem_t empty, full;
pthread_mutex_t mutex;

// Simple stand-ins so the example is self-contained
int produce_item(void) { static int next = 0; return next++; }
void consume_item(int item) { printf("Consumed: %d\n", item); }

void* producer(void* arg) {
    for (int i = 0; i < 100; i++) {
        int item = produce_item();
        
        sem_wait(&empty);
        pthread_mutex_lock(&mutex);
        
        buffer[in] = item;
        in = (in + 1) % BUFFER_SIZE;
        
        pthread_mutex_unlock(&mutex);
        sem_post(&full);
    }
    return NULL;
}

void* consumer(void* arg) {
    for (int i = 0; i < 100; i++) {
        sem_wait(&full);
        pthread_mutex_lock(&mutex);
        
        int item = buffer[out];
        out = (out + 1) % BUFFER_SIZE;
        
        pthread_mutex_unlock(&mutex);
        sem_post(&empty);
        
        consume_item(item);
    }
    return NULL;
}

int main() {
    sem_init(&empty, 0, BUFFER_SIZE);
    sem_init(&full, 0, 0);
    pthread_mutex_init(&mutex, NULL);
    
    pthread_t prod, cons;
    pthread_create(&prod, NULL, producer, NULL);
    pthread_create(&cons, NULL, consumer, NULL);
    
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    
    sem_destroy(&empty);
    sem_destroy(&full);
    pthread_mutex_destroy(&mutex);
    
    return 0;
}

Reader-Writer Problem

#include <pthread.h>
#include <stdio.h>

int readers = 0;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

// Simple stand-ins so the example is self-contained
void read_data(void)  { printf("reading shared data\n"); }
void write_data(void) { printf("writing shared data\n"); }

void* reader(void* arg) {
    pthread_mutex_lock(&mutex);
    readers++;
    if (readers == 1) {
        pthread_mutex_lock(&write_lock);  // First reader locks writers
    }
    pthread_mutex_unlock(&mutex);
    
    // Read shared data
    read_data();
    
    pthread_mutex_lock(&mutex);
    readers--;
    if (readers == 0) {
        pthread_mutex_unlock(&write_lock);  // Last reader unlocks writers
    }
    pthread_mutex_unlock(&mutex);
    
    return NULL;
}

void* writer(void* arg) {
    pthread_mutex_lock(&write_lock);
    
    // Write to shared data
    write_data();
    
    pthread_mutex_unlock(&write_lock);
    return NULL;
}

Dining Philosophers (Deadlock Avoidance)

#include <pthread.h>
#include <unistd.h>

#define N 5

pthread_mutex_t forks[N];   // initialized with pthread_mutex_init before the philosophers start

// Simple stand-ins so the example is self-contained
void think(void) { usleep(1000); }
void eat(void)   { usleep(1000); }

void* philosopher(void* arg) {
    int id = *(int*)arg;
    int left = id;
    int right = (id + 1) % N;
    
    while (1) {
        think();
        
        // Avoid deadlock: acquire lower-numbered fork first
        if (left < right) {
            pthread_mutex_lock(&forks[left]);
            pthread_mutex_lock(&forks[right]);
        } else {
            pthread_mutex_lock(&forks[right]);
            pthread_mutex_lock(&forks[left]);
        }
        
        eat();
        
        pthread_mutex_unlock(&forks[left]);
        pthread_mutex_unlock(&forks[right]);
    }
    return NULL;
}

References

Course Materials:

  • CS 6210: Advanced Operating Systems - Georgia Tech OMSCS

  • CS 6200: Introduction to Operating Systems - Georgia Tech OMSCS

  • COMS W4118: Operating Systems I - Columbia University
