05 Database Implementation
Table of Contents
Overview
This chapter covers the internal implementation of database systems, exploring how databases efficiently manage data storage, retrieval, and processing. Topics include storage management, buffer management, indexing structures, query execution models, and performance optimizations.
Key Implementation Concerns
Performance:
Minimize disk I/O (primary bottleneck)
Efficient memory utilization
Fast query execution
Reliability:
ACID transaction guarantees
Crash recovery mechanisms
Data durability
Scalability:
Handle growing data volumes
Support concurrent users
Efficient resource utilization
Why C++ for Databases?
Modern database systems are often implemented in C++ for several reasons:
Performance
Compiled to machine code, minimal runtime overhead
Hardware Control
Direct memory management, precise resource control
Type Safety
Compile-time error detection
Low-Level Access
Direct manipulation of bits, bytes, and memory
Performance Comparison:
C++ sum of 100M integers: 100 microseconds
Python equivalent: 5.24 seconds
Performance ratio: ~50,000× faster
Transactions and ACID Properties
Transactions are logical units of work that maintain database consistency and reliability. They are governed by ACID properties.
ACID Properties
Atomicity
"All or Nothing" - Transactions execute completely or not at all.
Characteristics:
No partial updates visible
Transaction failure causes rollback
Ensures database never in intermediate state
Implementation:
Write-Ahead Logging (WAL)
Undo logs for rollback
Redo logs for recovery
Example:
Consistency
Maintains database integrity constraints.
Characteristics:
Database moves from one valid state to another
All constraints satisfied before and after transaction
Business rules preserved
Example:
Isolation
Concurrent transactions don't interfere with each other.
Characteristics:
Transactions appear to execute sequentially
Intermediate states not visible to others
Prevents read/write conflicts
Isolation Levels:
Read Uncommitted
Possible
Possible
Possible
Read Committed
Prevented
Possible
Possible
Repeatable Read
Prevented
Prevented
Possible
Serializable
Prevented
Prevented
Prevented
Implementation:
Locking protocols (two-phase locking)
Multi-Version Concurrency Control (MVCC)
Timestamp ordering
Durability
Committed changes survive system failures.
Characteristics:
Changes persisted to non-volatile storage
Survive crashes, power failures
Write-Ahead Logging ensures durability
Implementation:
Force log records to disk before commit
Redo logs reconstruct committed transactions
Checkpoint mechanisms for recovery
History of ACID
The ACID acronym was formalized by Andreas Reuter (1983) and Theo Härder, building on earlier concepts from database theory.
Storage Management
Storage management handles the physical organization of data on disk and its transfer to memory.
Storage Hierarchy
DRAM
~30 ns
GB
Yes
High
SSD
~100 μs
TB
No
Medium
HDD
~10 ms
TB
No
Low
Key Insight: Memory is ~100,000× faster than disk, driving all storage management decisions.
Data Lifecycle
Hot Data (DRAM):
Frequently accessed
Cached for speed
Volatile (lost on power failure)
Limited capacity
Durable Data (Disk):
Permanent storage
Slower access
Persistent across failures
Large capacity
Database manages movement between tiers based on access patterns.
File I/O in C++
Databases use file I/O for persistence:
Memory Management
Heap vs. Stack Allocation
Stack:
Fast allocation/deallocation
Limited size
Automatic lifetime management
Scope-bound
Heap:
Dynamic allocation
Large pool of memory
Manual lifetime management (or smart pointers)
Survives beyond function scope
Database Usage:
Large data structures → heap
Temporary local variables → stack
Page buffers → heap
Control structures → stack
Smart Pointers (RAII)
Resource Acquisition Is Initialization (RAII) ties resource lifetime to object scope.
Problems with Raw Pointers:
Memory leaks
Dangling pointers
Double-free errors
Ownership ambiguity
Smart Pointer Solution:
Page Structure
Page is the basic unit of disk storage:
Characteristics:
Fixed size (typically 4KB, 8KB, 16KB)
Atomic unit of I/O
Contains multiple tuples
Includes metadata (used space, free space)
Slotted Page Structure:
Advantages:
Direct access to any tuple by slot index
Efficient space utilization
Simplified compaction
Support for variable-length tuples
Serialization and Deserialization
Purpose: Convert in-memory structures to on-disk format and vice versa.
Serialization (Write):
Deserialization (Read):
Buffer Management
The buffer manager caches frequently accessed pages in memory to minimize disk I/O.
Buffer Pool Architecture
Buffer Manager Components
Data Structures:
Page Table (Hash Map): Maps page IDs to buffer pool frames
Replacement Policy: Determines which page to evict
Dirty Bit: Tracks modified pages requiring write-back
Operations:
Cache Eviction Policies
FIFO (First-In-First-Out)
Strategy: Evict oldest page in buffer.
Advantages:
Simple implementation
Low overhead
Disadvantages:
May evict frequently accessed pages
Ignores access patterns
Poor for workloads with hot pages
LRU (Least Recently Used)
Strategy: Evict page with oldest access time.
Advantages:
Considers recency of access
Good for temporal locality
Better than FIFO for typical workloads
Disadvantages:
Sequential scans pollute cache
Overhead of tracking access times
Implementation:
TwoQ Policy
Strategy: Separate queues for single-access and frequently-accessed pages.
Architecture:
Benefits:
Prevents buffer pool flooding from sequential scans
Prioritizes frequently accessed pages
Evicts cold pages first
Eviction Priority:
Evict from FIFO queue (cold pages)
Only if FIFO empty, evict from LRU queue
Buffer Pool Flooding
Problem: Sequential scans bring many pages into buffer, evicting hot pages.
Example:
Solution: TwoQ, LRU-K, or bypass buffer for known sequential scans.
Indexing Structures
Indexes provide fast access paths to data without full table scans.
Hash Index
Structure: Maps keys to locations using hash function.
Operations:
Performance:
Point query: O(1) average, O(n) worst case
Range query: O(n) (must scan entire index)
Insert/Delete: O(1) average
Use Cases:
Equality searches
Primary key lookups
Hash joins
Collision Handling:
Chaining: Store colliding entries in linked list
Open Addressing: Probe for next available slot (linear, quadratic, double hashing)
B+ Tree Index
Structure: Balanced tree optimized for disk-based storage.
Properties:
Balanced: All leaves at same depth
Sorted: Keys in order
High Fanout: Each node contains many keys (typically 100-200)
Leaf Linking: Leaves linked for range scans
Node Structure:
Internal Node:
Leaf Node:
Performance:
Point query: O(log_F N) where F = fanout
Range query: O(log_F N + K) where K = result size
Insert: O(log_F N)
Delete: O(log_F N)
Example Capacity:
With 4KB pages, 8-byte keys, 8-byte pointers:
Fanout: ~250 entries per node
Height 3 tree: 250³ = 15.6 million records
Height 4 tree: 250⁴ = 3.9 billion records
Advantages:
Efficient range queries
Sequential access via leaf links
Guaranteed logarithmic performance
Self-balancing
Optimized for disk access
B+ Tree Operations
Search:
Insert:
Split Operation:
Query Execution
Query execution transforms SQL into physical operations that retrieve and process data.
Query Processing Pipeline
Operator Framework
Philosophy: Modular, composable operators like Lego blocks.
Base Operator Interface:
Common Operators:
Scan
Read table
SELECT * FROM table
Select
Filter rows
WHERE salary > 50000
Project
Choose columns
SELECT name, salary
Join
Combine tables
FROM t1 JOIN t2 ON t1.id = t2.id
GroupBy
Aggregate
GROUP BY department
Sort
Order results
ORDER BY name
Example Query Plan:
Execution Plan:
Iterator Model (Volcano Model)
Characteristics:
Pull-based: Parent calls next() on child
Pipelined: Tuples flow through operators
Memory efficient: Process one tuple at a time
Example:
Query Optimization
Goals:
Minimize disk I/O
Reduce intermediate result sizes
Choose efficient join algorithms
Leverage indexes
Optimization Techniques:
1. Predicate Pushdown:
2. Index Selection:
3. Join Order:
4. Join Algorithms:
Nested Loop
O(N × M)
Small tables, no indexes
Index Join
O(N × log M)
Index on join key
Hash Join
O(N + M)
Equijoin, memory available
Merge Join
O(N + M)
Sorted inputs
Storage Formats and Optimization
Row vs. Columnar Storage
Row Storage (NSM - N-ary Storage Model)
Structure: All attributes of a record stored together.
Advantages:
Efficient for fetching entire rows
Good for OLTP workloads
Simple implementation
Disadvantages:
Reads unnecessary columns
Poor cache locality for column scans
Inefficient for analytical queries
Columnar Storage (DSM - Decomposition Storage Model)
Structure: Each attribute stored separately.
Advantages:
Read only needed columns
Better compression (similar data together)
Excellent cache locality
Efficient for OLAP workloads
Disadvantages:
Expensive row reconstruction
Poor for transactional workloads
Complex implementation
Performance Comparison:
Query: SELECT AVG(temperature) FROM weather WHERE timestamp BETWEEN ...
Row
1000
112 ms
Columnar
10
3.6 ms
Reason: Only temperature column needed, not all columns.
Compression
Columnar storage enables effective compression due to:
Uniform data types per column
Data redundancy (repeating values)
Sorted data in many cases
Compression Techniques:
Delta Encoding
Sorted numeric data
[100, 102, 105] → [100, +2, +3]
Run-Length Encoding
Repeated values
[A, A, A, B, B] → [3×A, 2×B]
Dictionary Encoding
Categorical data
[NY, CA, NY, TX] → [0, 1, 0, 2] + dict
Bit-Packing
Small-range integers
7-bit values in 7 bits (not 8)
Compression Benefits:
Reduced storage space (2-10× typical)
Reduced I/O (fewer pages to read)
Faster queries (less data to process)
Operates directly on compressed data (where possible)
Vectorized Execution
Problem: Tuple-at-a-time processing has high overhead.
Solution: Process batches of tuples (vectors) together.
Benefits:
Reduced function call overhead
Better CPU cache utilization
SIMD (Single Instruction Multiple Data) opportunities
SIMD Example:
Performance Impact:
4× throughput with 128-bit SIMD
8× throughput with 256-bit AVX
16× throughput with 512-bit AVX-512
References
Course Materials:
CS 6400: Database Systems Concepts and Design - Georgia Tech OMSCS
CS 6422: Database System Implementation - Georgia Tech OMSCS
Last updated