05 Database Implementation
Overview
This chapter covers the internal implementation of database systems, exploring how databases efficiently manage data storage, retrieval, and processing. Topics include storage management, buffer management, indexing structures, query execution models, and performance optimizations.
Key Implementation Concerns
Performance:
Minimize disk I/O (primary bottleneck)
Efficient memory utilization
Fast query execution
Reliability:
ACID transaction guarantees
Crash recovery mechanisms
Data durability
Scalability:
Handle growing data volumes
Support concurrent users
Efficient resource utilization
Why C++ for Databases?
Modern database systems are often implemented in C++ for several reasons:
Performance: Compiled to machine code, minimal runtime overhead
Hardware Control: Direct memory management, precise resource control
Type Safety: Compile-time error detection
Low-Level Access: Direct manipulation of bits, bytes, and memory
Performance Comparison:
C++ sum of 100M integers: 100 microseconds
Python equivalent: 5.24 seconds
Performance ratio: ~50,000× faster
Transactions and ACID Properties
Transactions are logical units of work that maintain database consistency and reliability. They are governed by ACID properties.
ACID Properties
Atomicity
"All or Nothing" - Transactions execute completely or not at all.
Characteristics:
No partial updates visible
Transaction failure causes rollback
Ensures database never in intermediate state
Implementation:
Write-Ahead Logging (WAL)
Undo logs for rollback
Redo logs for recovery
Example:
Bank transfer: $100 from Account A to Account B
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 'A'; -- Deduct
UPDATE accounts SET balance = balance + 100 WHERE id = 'B'; -- Add
COMMIT;
If system crashes after first UPDATE:
- Atomicity ensures rollback
- Neither account modified
- No money lost or created
Consistency
Maintains database integrity constraints.
Characteristics:
Database moves from one valid state to another
All constraints satisfied before and after transaction
Business rules preserved
Example:
Constraint: Total money in system remains constant
Before: A=$1000, B=$500, Total=$1500
Transfer $100 from A to B
After: A=$900, B=$600, Total=$1500 ✓
Consistency maintained: total unchanged
Isolation
Concurrent transactions don't interfere with each other.
Characteristics:
Transactions appear to execute sequentially
Intermediate states not visible to others
Prevents read/write conflicts
Isolation Levels:
Level              Dirty Read   Non-Repeatable Read   Phantom Read
Read Uncommitted   Possible     Possible              Possible
Read Committed     Prevented    Possible              Possible
Repeatable Read    Prevented    Prevented             Possible
Serializable       Prevented    Prevented             Prevented
Implementation:
Locking protocols (two-phase locking)
Multi-Version Concurrency Control (MVCC)
Timestamp ordering
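To make the locking approach concrete, here is a minimal sketch of strict two-phase locking (illustrative only: the LockManager and Transaction names are hypothetical, and deadlock detection and synchronization of the lock table itself are omitted). Locks are acquired as records are touched and released together only at commit:
#include <mutex>
#include <unordered_map>
#include <vector>
using RecordID = int;
// Hypothetical lock manager: one mutex per record.
// (A real system also latches the lock table itself and detects deadlocks.)
class LockManager {
public:
    void lock(RecordID rid) { table_[rid].lock(); }     // growing phase
    void unlock(RecordID rid) { table_[rid].unlock(); } // shrinking phase
private:
    std::unordered_map<RecordID, std::mutex> table_;
};
class Transaction {
public:
    explicit Transaction(LockManager& lm) : lm_(lm) {}
    void write(RecordID rid) {
        lm_.lock(rid);        // acquire before modifying; never release early
        held_.push_back(rid); // remember for release at commit
        // ... modify the record ...
    }
    void commit() {
        for (RecordID rid : held_) lm_.unlock(rid); // release all at commit
        held_.clear();
    }
private:
    LockManager& lm_;
    std::vector<RecordID> held_;
};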
Durability
Committed changes survive system failures.
Characteristics:
Changes persisted to non-volatile storage
Survive crashes, power failures
Write-Ahead Logging ensures durability
Implementation:
Force log records to disk before commit
Redo logs reconstruct committed transactions
Checkpoint mechanisms for recovery
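A minimal sketch of the force-log rule, assuming a hypothetical Log class: every update appends a redo record, and the commit record is flushed before commit() returns, so a crash afterwards can replay the change.
#include <fstream>
#include <string>
// Illustrative WAL: append a redo record for each change, and force the
// commit record toward stable storage before acknowledging the commit.
// flush() pushes data to the OS; a real system would also fsync().
class Log {
public:
    explicit Log(const std::string& path)
        : out_(path, std::ios::binary | std::ios::app) {}
    void appendUpdate(int txn_id, int page_id, int offset, int new_value) {
        out_ << "UPDATE " << txn_id << ' ' << page_id << ' '
             << offset << ' ' << new_value << '\n';
    }
    void commit(int txn_id) {
        out_ << "COMMIT " << txn_id << '\n';
        out_.flush(); // log record hits storage before commit returns
    }
private:
    std::ofstream out_;
};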
History of ACID
The ACID acronym was formalized by Theo Härder and Andreas Reuter (1983), building on earlier concepts from database theory.
Storage Management
Storage management handles the physical organization of data on disk and its transfer to memory.
Storage Hierarchy
Medium   Latency   Capacity   Volatile   Cost
DRAM     ~30 ns    GBs        Yes        High
SSD      ~100 μs   TBs        No         Medium
HDD      ~10 ms    TBs        No         Low
Key Insight: Memory access is orders of magnitude faster than disk (~300,000× vs. the HDD latency above, ~3,000× vs. SSD), and this gap drives all storage management decisions.
Data Lifecycle
Hot Data (DRAM):
Frequently accessed
Cached for speed
Volatile (lost on power failure)
Limited capacity
Durable Data (Disk):
Permanent storage
Slower access
Persistent across failures
Large capacity
Database manages movement between tiers based on access patterns.
File I/O in C++
Databases use file I/O for persistence:
#include <fstream>
#include <iostream>
#include <string>
// Write data to file
std::ofstream outputFile("data.db");
if (!outputFile) {
    std::cerr << "Error opening file for writing" << std::endl;
    // Handle error appropriately
}
outputFile << "Record data";
outputFile.close();
// Read data from file (error handling critical: check before using the stream)
std::ifstream inputFile("data.db");
if (!inputFile) {
    std::cerr << "Error opening file for reading" << std::endl;
    // Handle error appropriately
}
std::string data;
std::getline(inputFile, data); // read the whole record, not just one token
inputFile.close();
Memory Management
Heap vs. Stack Allocation
Stack:
Fast allocation/deallocation
Limited size
Automatic lifetime management
Scope-bound
Heap:
Dynamic allocation
Large pool of memory
Manual lifetime management (or smart pointers)
Survives beyond function scope
Database Usage:
Large data structures → heap
Temporary local variables → stack
Page buffers → heap
Control structures → stack
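A small example of this split (Page here is a stand-in type): the large page buffers go on the heap, while the control variables stay on the stack.
#include <memory>
#include <vector>
struct Page { char data[4096]; }; // illustrative 4KB page
void processPages(int num_pages) {
    int processed = 0;                       // stack: small, scope-bound
    std::vector<std::unique_ptr<Page>> pool; // heap: large, dynamically sized
    for (int i = 0; i < num_pages; ++i) {
        pool.push_back(std::make_unique<Page>());
        ++processed;
    }
    // Pages are freed automatically when pool goes out of scope.
}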
Smart Pointers (RAII)
Resource Acquisition Is Initialization (RAII) ties resource lifetime to object scope.
Problems with Raw Pointers:
Memory leaks
Dangling pointers
Double-free errors
Ownership ambiguity
Smart Pointer Solution:
// std::unique_ptr - exclusive ownership
std::unique_ptr<Page> page = std::make_unique<Page>();
// Automatically deleted when out of scope
// std::shared_ptr - shared ownership
std::shared_ptr<Buffer> buffer = std::make_shared<Buffer>();
// Deleted when last reference removed
// Move semantics for transfer
std::unique_ptr<Page> page2 = std::move(page);
// page is now null, page2 owns the resource
Page Structure
Page is the basic unit of disk storage:
Characteristics:
Fixed size (typically 4KB, 8KB, 16KB)
Atomic unit of I/O
Contains multiple tuples
Includes metadata (used space, free space)
Slotted Page Structure:
┌─────────────────────────────────────┐
│ Page Header │
├─────────────────────────────────────┤
│ Slot Array (offset, length pairs) │
├─────────────────────────────────────┤
│ Free Space │
├─────────────────────────────────────┤
│ Tuple N │
│ Tuple N-1 │
│ ... │
│ Tuple 1 │
└─────────────────────────────────────┘
Advantages:
Direct access to any tuple by slot index
Efficient space utilization
Simplified compaction
Support for variable-length tuples
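A hedged sketch of this layout (field names and the fixed slot capacity are illustrative; a real page packs header, slot array, and tuple data into one 4KB block and checks free space before inserting):
#include <cstdint>
#include <cstring>
// Slot array grows from the front, tuple data grows from the back,
// and free space sits in between.
struct Slot { uint16_t offset; uint16_t length; };
struct SlottedPage {
    static constexpr uint16_t kPageSize = 4096;
    uint16_t num_slots = 0;
    uint16_t free_end = kPageSize; // tuples are written below this point
    Slot slots[64];                // simplified fixed-capacity slot array
    char data[kPageSize];
    int insert(const char* tuple, uint16_t len) {
        free_end -= len;
        std::memcpy(data + free_end, tuple, len);
        slots[num_slots] = {free_end, len};
        return num_slots++;        // the slot index identifies the tuple
    }
    const char* get(int slot_id) const {
        return data + slots[slot_id].offset; // direct access by slot index
    }
};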
Serialization and Deserialization
Purpose: Convert in-memory structures to on-disk format and vice versa.
Serialization (Write):
void Page::serialize(const std::string& filename) {
std::ofstream file(filename, std::ios::binary);
// Write number of tuples
size_t num_tuples = tuples.size();
file.write(reinterpret_cast<const char*>(&num_tuples), sizeof(num_tuples));
// Write each tuple
for (const auto& tuple : tuples) {
tuple->serialize(file);
}
}
Deserialization (Read):
void Page::deserialize(const std::string& filename) {
std::ifstream file(filename, std::ios::binary);
// Read number of tuples
size_t num_tuples;
file.read(reinterpret_cast<char*>(&num_tuples), sizeof(num_tuples));
// Read each tuple
for (size_t i = 0; i < num_tuples; ++i) {
auto tuple = std::make_unique<Tuple>();
tuple->deserialize(file);
tuples.push_back(std::move(tuple));
}
}
Buffer Management
The buffer manager caches frequently accessed pages in memory to minimize disk I/O.
Buffer Pool Architecture
┌────────────────────────────────────┐
│ Query Processor │
├────────────────────────────────────┤
│ Buffer Manager │
│ ┌──────────────────────────────┐ │
│ │ Buffer Pool (DRAM) │ │
│ │ ┌─────┐ ┌─────┐ ┌─────┐ │ │
│ │ │Page │ │Page │ │Page │ │ │
│ │ │ 1 │ │ 2 │ │ 3 │ │ │
│ │ └─────┘ └─────┘ └─────┘ │ │
│ └──────────────────────────────┘ │
├────────────────────────────────────┤
│ Storage Manager │
├────────────────────────────────────┤
│ Disk Storage │
└────────────────────────────────────┘
Buffer Manager Components
Data Structures:
Page Table (Hash Map): Maps page IDs to buffer pool frames
Replacement Policy: Determines which page to evict
Dirty Bit: Tracks modified pages requiring write-back
Operations:
class BufferManager {
public:
Page* getPage(PageID page_id); // Fetch page (from buffer or disk)
void unpinPage(PageID page_id); // Release page reference
void flushPage(PageID page_id); // Write page to disk
void evictPage(); // Remove page from buffer
private:
std::unordered_map<PageID, Frame*> page_table;
ReplacementPolicy* policy;
};
Cache Eviction Policies
FIFO (First-In-First-Out)
Strategy: Evict oldest page in buffer.
Advantages:
Simple implementation
Low overhead
Disadvantages:
May evict frequently accessed pages
Ignores access patterns
Poor for workloads with hot pages
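A minimal FIFO sketch using std::queue; note that evict() ignores how often the victim page was accessed:
#include <queue>
using PageID = int;
// Pages leave strictly in arrival order.
class FIFOPolicy {
public:
    void admit(PageID page_id) { queue_.push(page_id); } // on page load
    PageID evict() {
        PageID victim = queue_.front(); // oldest page, hot or not
        queue_.pop();
        return victim;
    }
private:
    std::queue<PageID> queue_;
};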
LRU (Least Recently Used)
Strategy: Evict page with oldest access time.
Advantages:
Considers recency of access
Good for temporal locality
Better than FIFO for typical workloads
Disadvantages:
Sequential scans pollute cache
Overhead of tracking access times
Implementation:
// LRU using doubly-linked list + hash map
class LRUPolicy {
private:
std::list<std::pair<PageID, Page*>> lru_list; // Most recent at front
std::unordered_map<PageID, std::list<std::pair<PageID, Page*>>::iterator> page_map;
public:
void touch(PageID page_id) {
// Move to front (most recent)
auto it = page_map[page_id];
lru_list.splice(lru_list.begin(), lru_list, it);
}
PageID evict() {
// Remove from back (least recent)
PageID victim = lru_list.back().first;
lru_list.pop_back();
page_map.erase(victim);
return victim;
}
};
TwoQ Policy
Strategy: Separate queues for single-access and frequently-accessed pages.
Architecture:
┌─────────────────────────┐
│ FIFO Queue (Cold) │ ← New pages enter here
│ Read-once pages │
└─────────────────────────┘
↓ On second access
┌─────────────────────────┐
│ LRU Queue (Hot) │ ← Frequently accessed
│ Multi-access pages │
└─────────────────────────┘
Benefits:
Prevents buffer pool flooding from sequential scans
Prioritizes frequently accessed pages
Evicts cold pages first
Eviction Priority:
Evict from FIFO queue (cold pages)
Only if FIFO empty, evict from LRU queue
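A simplified TwoQ sketch (the published algorithm also keeps a "ghost" queue of recently evicted page IDs; that detail is omitted here): first access places a page in the cold FIFO queue, a second access promotes it to the hot LRU queue, and eviction drains the cold queue first.
#include <iterator>
#include <list>
#include <unordered_map>
using PageID = int;
class TwoQPolicy {
public:
    void touch(PageID id) {
        if (hot_pos_.count(id)) {             // already hot: LRU update
            hot_.splice(hot_.begin(), hot_, hot_pos_[id]);
        } else if (cold_pos_.count(id)) {     // second access: promote
            cold_.erase(cold_pos_[id]);
            cold_pos_.erase(id);
            hot_.push_front(id);
            hot_pos_[id] = hot_.begin();
        } else {                              // first access: cold FIFO queue
            cold_.push_back(id);
            cold_pos_[id] = std::prev(cold_.end());
        }
    }
    PageID evict() {
        if (!cold_.empty()) {                 // cold pages are evicted first
            PageID victim = cold_.front();
            cold_.pop_front();
            cold_pos_.erase(victim);
            return victim;
        }
        PageID victim = hot_.back();          // then the least-recent hot page
        hot_.pop_back();
        hot_pos_.erase(victim);
        return victim;
    }
private:
    std::list<PageID> cold_, hot_;            // FIFO and LRU queues
    std::unordered_map<PageID, std::list<PageID>::iterator> cold_pos_, hot_pos_;
};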
Buffer Pool Flooding
Problem: Sequential scans bring many pages into buffer, evicting hot pages.
Example:
Normal state:
Buffer: [Hot1, Hot2, Hot3, Hot4, Hot5] ← Frequently used
After sequential scan:
Buffer: [Scan98, Scan99, Scan100, Scan101, Scan102] ← Never reused
Lost: All hot pages evicted
Solution: TwoQ, LRU-K, or bypass the buffer for known sequential scans.
Indexing Structures
Indexes provide fast access paths to data without full table scans.
Hash Index
Structure: Maps keys to locations using hash function.
Operations:
class HashIndex {
private:
std::unordered_map<Key, std::vector<RecordID>> index;
public:
void insert(Key key, RecordID rid) {
index[key].push_back(rid);
}
std::vector<RecordID> find(Key key) {
auto it = index.find(key); // O(1) average case; avoids inserting an empty entry
return it != index.end() ? it->second : std::vector<RecordID>{};
}
};
Performance:
Point query: O(1) average, O(n) worst case
Range query: O(n) (must scan entire index)
Insert/Delete: O(1) average
Use Cases:
Equality searches
Primary key lookups
Hash joins
Collision Handling:
Chaining: Store colliding entries in linked list
Open Addressing: Probe for next available slot (linear, quadratic, double hashing)
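For contrast with the chained std::unordered_map above, a minimal open-addressing sketch with linear probing (illustration only: no resizing or deletion, non-negative int keys, and -1 marks an empty slot):
#include <cstddef>
#include <optional>
#include <vector>
class LinearProbeTable {
public:
    explicit LinearProbeTable(size_t capacity)
        : keys_(capacity, -1), values_(capacity, 0) {}
    void insert(int key, int value) {
        size_t i = hash(key);
        while (keys_[i] != -1 && keys_[i] != key)
            i = (i + 1) % keys_.size();        // probe next slot on collision
        keys_[i] = key;
        values_[i] = value;
    }
    std::optional<int> find(int key) const {
        size_t i = hash(key);
        while (keys_[i] != -1) {
            if (keys_[i] == key) return values_[i];
            i = (i + 1) % keys_.size();        // keep probing past collisions
        }
        return std::nullopt;                   // empty slot reached: not present
    }
private:
    size_t hash(int key) const { return static_cast<size_t>(key) % keys_.size(); }
    std::vector<int> keys_, values_;
};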
B+ Tree Index
Structure: Balanced tree optimized for disk-based storage.
[Root: 50]
/ \
[20, 40] [70, 90]
/ | \ / | \
[10,15] [30] [45] [60,65] [80] [95]
↓ ↓ ↓ ↓ ↓ ↓
Data Data Data Data Data Data
Properties:
Balanced: All leaves at same depth
Sorted: Keys in order
High Fanout: Each node contains many keys (typically 100-200)
Leaf Linking: Leaves linked for range scans
Node Structure:
Internal Node:
┌──────────────────────────────────┐
│ K1 | K2 | K3 | ... | Kn │ Keys
├──────────────────────────────────┤
│ P0 | P1 | P2 | P3 | ... | Pn │ Pointers to children
└──────────────────────────────────┘
Leaf Node:
┌──────────────────────────────────┐
│ K1 | K2 | K3 | ... | Kn │ Keys
├──────────────────────────────────┤
│ V1 | V2 | V3 | ... | Vn │ Values (RecordIDs)
├──────────────────────────────────┤
│ Next Leaf Pointer │ For range scans
└──────────────────────────────────┘
Performance:
Point query: O(log_F N) where F = fanout
Range query: O(log_F N + K) where K = result size
Insert: O(log_F N)
Delete: O(log_F N)
Example Capacity:
With 4KB pages, 8-byte keys, 8-byte pointers:
Fanout: ~250 entries per node
Height 3 tree: 250³ = 15.6 million records
Height 4 tree: 250⁴ = 3.9 billion records
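These capacities follow directly from the fanout arithmetic; a quick check (ignoring per-page header overhead, which is why this yields 256 rather than the ~250 used above):
#include <cstdint>
#include <cstdio>
int main() {
    const std::uint64_t page_size = 4096, key_bytes = 8, ptr_bytes = 8;
    std::uint64_t fanout = page_size / (key_bytes + ptr_bytes); // 256 entries
    std::uint64_t height3 = fanout * fanout * fanout;   // leaves reachable in 3 levels
    std::uint64_t height4 = height3 * fanout;           // and in 4 levels
    std::printf("fanout=%llu height3=%llu height4=%llu\n",
                (unsigned long long)fanout,
                (unsigned long long)height3,
                (unsigned long long)height4);
}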
Advantages:
Efficient range queries
Sequential access via leaf links
Guaranteed logarithmic performance
Self-balancing
Optimized for disk access
B+ Tree Operations
Search:
Value* search(Key key) {
Node* node = root;
// Descend to leaf
while (!node->isLeaf) {
int i = findChildIndex(node, key);
node = node->children[i];
}
// Search in leaf
return node->find(key);
}
Insert:
void insert(Key key, Value value) {
Node* leaf = findLeaf(key);
leaf->insert(key, value);
if (leaf->isFull()) {
splitNode(leaf); // May cascade up tree
}
}
Split Operation:
Before split (capacity 4):
┌────────────────────────────┐
│ 10 | 20 | 30 | 40 | 50 │ ← Overflow
└────────────────────────────┘
After split:
┌──────────────┐ ┌──────────────┐
│ 10 | 20 │ ← Left │ 40 | 50 │ ← Right
└──────────────┘ └──────────────┘
↓ ↓
Parent: [30] ← Middle key promoted
Query Execution
Query execution transforms SQL into physical operations that retrieve and process data.
Query Processing Pipeline
SQL Query
↓
Parser (Syntax check, create parse tree)
↓
Optimizer (Choose execution plan)
↓
Execution Engine (Execute operators)
↓
Buffer Manager (Fetch pages)
↓
Result
Operator Framework
Philosophy: Modular, composable operators like Lego blocks.
Base Operator Interface:
class Operator {
public:
virtual void open() = 0; // Initialize
virtual Tuple* next() = 0; // Get next tuple
virtual void close() = 0; // Cleanup
};
Common Operators:
Operator   Purpose          Example
Scan       Read table       SELECT * FROM table
Select     Filter rows      WHERE salary > 50000
Project    Choose columns   SELECT name, salary
Join       Combine tables   FROM t1 JOIN t2 ON t1.id = t2.id
GroupBy    Aggregate        GROUP BY department
Sort       Order results    ORDER BY name
Example Query Plan:
SELECT department, AVG(salary)
FROM employees
WHERE salary > 50000
GROUP BY department;
Execution Plan:
GroupBy(department, AVG(salary))
↓
Select(salary > 50000)
↓
Scan(employees)
Iterator Model (Volcano Model)
Characteristics:
Pull-based: Parent calls next() on child
Pipelined: Tuples flow through operators
Memory efficient: Process one tuple at a time
Example:
class SelectOperator : public Operator {
private:
Operator* child;
Predicate predicate;
public:
void open() { child->open(); }
Tuple* next() {
while (Tuple* t = child->next()) {
if (predicate.evaluate(t)) {
return t; // Pass matching tuple
}
}
return nullptr; // No more tuples
}
void close() { child->close(); }
};
Query Optimization
Goals:
Minimize disk I/O
Reduce intermediate result sizes
Choose efficient join algorithms
Leverage indexes
Optimization Techniques:
1. Predicate Pushdown:
Bad: Join → Filter
Good: Filter → Join (fewer tuples to join)
2. Index Selection:
Instead of: Full table scan
Use: Index scan when selective predicate exists
3. Join Order:
For: A ⋈ B ⋈ C
If |A| = 100, |B| = 1000, |C| = 10
Optimal: (A ⋈ C) ⋈ B
Bad: (B ⋈ C) ⋈ A
4. Join Algorithms:
Algorithm     Complexity     Best For
Nested Loop   O(N × M)       Small tables, no indexes
Index Join    O(N × log M)   Index on join key
Hash Join     O(N + M)       Equijoin, memory available
Merge Join    O(N + M)       Sorted inputs
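To make the O(N + M) hash join concrete, a sketch of its two phases (Row and its single join column are placeholders): build a hash table on the smaller input, then probe it with the larger one.
#include <unordered_map>
#include <utility>
#include <vector>
struct Row { int key; /* ... other columns ... */ };
// Equijoin of build (smaller) and probe (larger) inputs on Row::key.
std::vector<std::pair<Row, Row>> hashJoin(const std::vector<Row>& build,
                                          const std::vector<Row>& probe) {
    // Build phase: O(N) — hash every build-side row by its join key.
    std::unordered_multimap<int, Row> table;
    for (const Row& r : build) table.emplace(r.key, r);
    // Probe phase: O(M) — each probe row looks up its matching build rows.
    std::vector<std::pair<Row, Row>> result;
    for (const Row& p : probe) {
        auto [lo, hi] = table.equal_range(p.key);
        for (auto it = lo; it != hi; ++it)
            result.emplace_back(it->second, p);
    }
    return result;
}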
Storage Formats and Optimization
Row vs. Columnar Storage
Row Storage (NSM - N-ary Storage Model)
Structure: All attributes of a record stored together.
Page:
┌─────────────────────────────────────┐
│ Row1: [ID=1, Name=Alice, Age=30] │
│ Row2: [ID=2, Name=Bob, Age=25] │
│ Row3: [ID=3, Name=Carol, Age=35] │
└─────────────────────────────────────┘
Advantages:
Efficient for fetching entire rows
Good for OLTP workloads
Simple implementation
Disadvantages:
Reads unnecessary columns
Poor cache locality for column scans
Inefficient for analytical queries
Columnar Storage (DSM - Decomposition Storage Model)
Structure: Each attribute stored separately.
ID Column: [1, 2, 3, ...]
Name Column: [Alice, Bob, Carol, ...]
Age Column: [30, 25, 35, ...]
Advantages:
Read only needed columns
Better compression (similar data together)
Excellent cache locality
Efficient for OLAP workloads
Disadvantages:
Expensive row reconstruction
Poor for transactional workloads
Complex implementation
Performance Comparison:
Query: SELECT AVG(temperature) FROM weather WHERE timestamp BETWEEN ...
Format     Pages Read   Time
Row        1000         112 ms
Columnar   10           3.6 ms
Reason: Only temperature column needed, not all columns.
Compression
Columnar storage enables effective compression due to:
Uniform data types per column
Data redundancy (repeating values)
Sorted data in many cases
Compression Techniques:
Technique             Best For               Example
Delta Encoding        Sorted numeric data    [100, 102, 105] → [100, +2, +3]
Run-Length Encoding   Repeated values        [A, A, A, B, B] → [3×A, 2×B]
Dictionary Encoding   Categorical data       [NY, CA, NY, TX] → [0, 1, 0, 2] + dict
Bit-Packing           Small-range integers   7-bit values in 7 bits (not 8)
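As a worked example of one technique, a minimal run-length encoder/decoder over a column of ints, matching the [A, A, A, B, B] → [3×A, 2×B] illustration above:
#include <utility>
#include <vector>
// Consecutive equal values collapse to (count, value) pairs.
std::vector<std::pair<int, int>> rleEncode(const std::vector<int>& column) {
    std::vector<std::pair<int, int>> runs;
    for (int v : column) {
        if (!runs.empty() && runs.back().second == v)
            ++runs.back().first;          // extend the current run
        else
            runs.push_back({1, v});       // start a new run
    }
    return runs;
}
// Expand the runs back into the original column.
std::vector<int> rleDecode(const std::vector<std::pair<int, int>>& runs) {
    std::vector<int> column;
    for (auto [count, value] : runs)
        column.insert(column.end(), count, value);
    return column;
}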
Compression Benefits:
Reduced storage space (2-10× typical)
Reduced I/O (fewer pages to read)
Faster queries (less data to process)
Operates directly on compressed data (where possible)
Vectorized Execution
Problem: Tuple-at-a-time processing has high overhead.
Solution: Process batches of tuples (vectors) together.
Benefits:
Reduced function call overhead
Better CPU cache utilization
SIMD (Single Instruction Multiple Data) opportunities
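One way to picture the batching idea before the SIMD example below (a sketch, not any particular engine's API): next() returns a whole batch of values, so the per-tuple virtual-call overhead disappears and the inner loop becomes cache- and SIMD-friendly.
#include <vector>
struct Batch { std::vector<int> values; }; // one column, simplified
struct VectorizedSelect {
    Batch next(const Batch& input) {
        Batch out;
        out.values.reserve(input.values.size());
        for (int v : input.values)         // tight loop over the whole batch
            if (v > 50000) out.values.push_back(v);
        return out;                        // one call processed many tuples
    }
};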
SIMD Example:
#include <immintrin.h>
// Without SIMD: 4 separate operations
result[0] = a[0] + b[0];
result[1] = a[1] + b[1];
result[2] = a[2] + b[2];
result[3] = a[3] + b[3];
// With SIMD: 1 operation (loadu/storeu avoid the 16-byte alignment requirement)
__m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a)); // Load 4 integers
__m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
__m128i vr = _mm_add_epi32(va, vb); // Add 4 integers simultaneously
_mm_storeu_si128(reinterpret_cast<__m128i*>(result), vr); // Store 4 results
Performance Impact:
4× throughput with 128-bit SIMD
8× throughput with 256-bit AVX
16× throughput with 512-bit AVX-512
References
Course Materials:
CS 6400: Database Systems Concepts and Design - Georgia Tech OMSCS
CS 6422: Database System Implementation - Georgia Tech OMSCS