03 Database Design
Table of Contents
Overview
Database design involves organizing data to minimize redundancy, ensure data integrity, and optimize performance. This chapter covers normalization theory, which provides a systematic approach to database design, as well as physical storage considerations.
Key Goals:
Eliminate data redundancy
Prevent anomalies (insertion, deletion, update)
Ensure data integrity
Optimize storage and access performance
Normalization
Normalization is the process of organizing data in a database to reduce redundancy and dependency. It involves decomposing tables into smaller, well-structured tables without losing information.
Non-Normalized Problems
Non-normalized databases suffer from several critical issues:
1. Redundancy Problems
Data Redundancy occurs when the same information is stored in multiple places, leading to:
Insertion Anomaly:
Cannot insert data without all related information
Example: Cannot add a new department unless an employee is assigned to it
Deletion Anomaly:
Deleting one piece of information causes unintended loss of other information
Example: Deleting the last employee in a department removes all department information
Update Anomaly:
Changes must be made in multiple places
Inconsistency if updates are not applied everywhere
Example: Changing a department name requires updating every employee record in that department
Example of Redundant Table:
1
Alice
10
Sales
Building A
2
Bob
10
Sales
Building A
3
Carol
20
Engineering
Building B
Problems:
"Sales" and "Building A" repeated for every sales employee
If "Sales" moves to "Building C", must update multiple rows
Deleting all Sales employees loses department information
2. Information Loss
Information Loss happens when decomposing a table improperly, leading to:
Inability to reconstruct original data
Loss of relationships between data
Incorrect data representation
Example of Lossy Decomposition:
Original Table:
1
CS101
Dr. Smith
Bad Decomposition: Table 1: (StudentID, Instructor) Table 2: (CourseID, Instructor)
Problem: Cannot determine which student took which course with which instructor.
3. Dependency Loss
Dependency Loss occurs when functional dependencies that existed in the original table cannot be enforced in the decomposed tables.
Example:
Original constraint: "Each course has one instructor"
After bad decomposition: This constraint cannot be enforced
Functional Dependencies
Functional Dependency (FD) is a relationship between attributes where one set of attributes uniquely determines another set of attributes.
Definition
For a relation R, X → Y (X functionally determines Y) means:
For each value of X in R, there is precisely one value of Y in R
Whenever two tuples have the same value for X, they must have the same value for Y
Notation: X → Y reads as "X determines Y" or "Y is functionally dependent on X"
Types of Functional Dependencies
Full Functional Dependency
Y is fully functionally dependent on X if:
Y is dependent on X, AND
Y is not dependent on any proper subset of X
Example:
(StudentID, CourseID) → Grade
This is a full dependency because:
- Grade depends on both StudentID AND CourseID
- Grade does not depend on StudentID alone
- Grade does not depend on CourseID alonePartial Dependency
Y is partially dependent on X if:
Y is dependent on X, BUT
Y is also dependent on a proper subset of X
Example:
(StudentID, CourseID) → StudentName
This is a partial dependency because:
- StudentName depends on (StudentID, CourseID)
- StudentName also depends on just StudentIDTransitive Dependency
Y is transitively dependent on X if:
X → Z AND Z → Y
Therefore X → Y through Z
Example:
StudentID → DepartmentID
DepartmentID → DepartmentName
Therefore: StudentID → DepartmentName (transitive)Keys and Functional Dependencies
Keys are used to enforce functional dependencies:
Super Key
Any set of attributes that determines all other attributes
Candidate Key
Minimal super key (no proper subset is a super key)
Primary Key
Chosen candidate key used for tuple identification
Making X the key in a relation enforces the dependency that X determines all other attributes (X → Y for all Y).
Enforcing Functional Dependencies
Functional dependencies ensure:
Uniqueness: Each X value maps to exactly one Y value
One-to-One Relationship: Between X and Y values in the relation
Implementation:
Use PRIMARY KEY constraints
Use UNIQUE constraints
Define foreign keys appropriately
Normal Forms
Normal forms provide a progression of rules for organizing database tables, each building on the previous.
First Normal Form (1NF)
Definition: All domain values in a relation R are atomic (indivisible).
Requirements:
No repeating groups
No arrays or lists in a single cell
Each cell contains a single value
All entries in a column are of the same data type
Note: All proper relations are automatically in 1NF due to the definition of the relational model.
Example:
Violation of 1NF:
1
Alice
555-1234, 555-5678
Corrected to 1NF:
1
Alice
555-1234
1
Alice
555-5678
Second Normal Form (2NF)
Definition: R is in 2NF if:
R is in 1NF, AND
Every non-key attribute is fully functionally dependent on the key (no partial dependencies)
Target: Eliminate partial dependencies
Process:
Identify partial dependencies
Decompose table to remove them
Example:
Violation of 2NF:
Enrollment(StudentID, CourseID, StudentName, CourseName, Grade)
Primary Key: (StudentID, CourseID)
Partial Dependencies:
- StudentID → StudentName
- CourseID → CourseNameCorrected to 2NF:
Student(StudentID, StudentName)
Course(CourseID, CourseName)
Enrollment(StudentID, CourseID, Grade)Third Normal Form (3NF)
Definition: R is in 3NF if:
R is in 2NF, AND
Every non-key attribute is non-transitively dependent on the key
Target: Eliminate transitive dependencies
Process:
Identify transitive dependencies
Decompose table to remove them
Example:
Violation of 3NF:
Employee(EmployeeID, DepartmentID, DepartmentName)
Transitive Dependency:
- EmployeeID → DepartmentID
- DepartmentID → DepartmentName
- Therefore: EmployeeID → DepartmentName (transitive)Corrected to 3NF:
Employee(EmployeeID, DepartmentID)
Department(DepartmentID, DepartmentName)Boyce-Codd Normal Form (BCNF)
Definition: R is in BCNF if:
R is in 3NF, AND
Every determinant is a candidate key
Determinant: A set of attributes in R on which some other attribute is fully functionally dependent.
BCNF is stricter than 3NF:
3NF allows non-key attributes to determine other non-key attributes
BCNF requires every determinant to be a candidate key
Example:
Violation of BCNF (but in 3NF):
CourseSchedule(StudentID, Course, Instructor)
Primary Key: (StudentID, Course)
Functional Dependencies:
- (StudentID, Course) → Instructor (satisfies 3NF)
- Instructor → Course (violates BCNF, Instructor not a candidate key)Corrected to BCNF:
StudentInstructor(StudentID, Instructor)
InstructorCourse(Instructor, Course)Normal Forms Summary
1NF
Repeating groups
Atomic values
2NF
Partial dependencies
Full dependency on key
3NF
Transitive dependencies
Non-transitive dependency on key
BCNF
All non-candidate key determinants
Every determinant is a candidate key
Progression:
1NF ⊂ 2NF ⊂ 3NF ⊂ BCNFEvery table in BCNF is also in 3NF, 2NF, and 1NF.
Armstrong's Axioms
Armstrong's Axioms are a set of inference rules for deriving functional dependencies. They are sound (only valid FDs are derived) and complete (all valid FDs can be derived).
Primary Rules
1. Reflexivity
Rule: If Y is a subset of X, then X → Y
Example:
If X = {StudentID, Name}, Y = {StudentID}
Then X → Y (trivial dependency)Application: Trivial dependencies always hold
2. Augmentation
Rule: If X → Y, then WX → WY (for any W)
Example:
If StudentID → Name
Then (StudentID, CourseID) → (Name, CourseID)Application: Adding the same attributes to both sides preserves the dependency
3. Transitivity
Rule: If X → Y and Y → Z, then X → Z
Example:
If StudentID → DepartmentID
And DepartmentID → DepartmentName
Then StudentID → DepartmentNameApplication: Dependencies can be chained
Derived Rules
From the primary rules, additional useful rules can be derived:
Union
Rule: If X → Y and X → Z, then X → YZ
Proof:
X → Y (given)
X → Z (given)
X → XY (augmentation of 1)
XY → YZ (augmentation of 2)
X → YZ (transitivity of 3 and 4)
Decomposition
Rule: If X → YZ, then X → Y and X → Z
Proof:
X → YZ (given)
YZ → Y (reflexivity)
X → Y (transitivity of 1 and 2)
Similarly, X → Z
Pseudotransitivity
Rule: If X → Y and WY → Z, then WX → Z
Application: Useful for more complex dependency inference
Applications
Armstrong's Axioms are used to:
Derive all functional dependencies from a given set
Determine candidate keys
Verify normal forms
Optimize database schemas
Database Architecture
Database architecture describes how data is physically stored and accessed.
Storage Technologies
Modern databases use a hierarchy of storage technologies:
CPU Registers
Fastest
Bytes-KB
Volatile
Highest
CPU Cache (L1/L2/L3)
Very Fast
KB-MB
Volatile
Very High
DRAM (RAM)
Fast
GB
Volatile
High
SSD
Medium
GB-TB
Non-volatile
Medium
HDD
Slow
TB
Non-volatile
Low
Tape/Cloud
Very Slow
PB+
Non-volatile
Lowest
Volatile vs. Persistent Storage
Volatile Storage (DRAM)
Characteristics:
Data Lost Without Power
Fast access (nanoseconds)
Limited capacity
Expensive
Usage:
Cache frequently accessed data ("hot data")
Buffer pool
Query execution workspace
Persistent Storage (Disk)
Characteristics:
Data is Safe after power loss
Slower access (milliseconds)
Large capacity
Inexpensive
Usage:
Permanent data storage
Transaction logs
Database files
Data Lifecycle
The database manages data movement between storage tiers:
┌─────────────────────┐
│ Hot Data (RAM) │ ← Frequently accessed
├─────────────────────┤
│ Warm Data (SSD) │ ← Moderately accessed
├─────────────────────┤
│ Cold Data (HDD) │ ← Rarely accessed
└─────────────────────┘Strategies:
Least Recently Used (LRU)
Most Frequently Used (MFU)
Clock algorithms
Data Access Performance
Main Memory Access:
Latency: ~30 nanoseconds (10^-7 seconds)
Bandwidth: ~GB/s per channel
Disk Access:
Latency: ~10 milliseconds (10^-2 seconds)
Seek time: 3-8 ms
Rotational delay: 2-3 ms
Transfer time: 0.5-1 ms
Performance Gap:
Main memory is approximately 100,000 times faster than disk
This gap drives database architecture decisions
Data Organization
How data is organized on disk significantly impacts performance.
Disk Structure
Modern hard drives consist of:
Components:
Platters: Circular disks with magnetic coating
Tracks: Concentric circles on platter surfaces
Sectors: Fixed-size segments of tracks (typically 512 bytes or 4KB)
Cylinders: Collection of tracks at the same position across platters
Read-Write Heads: Magnetic sensors for each platter surface
Actuator: Moves heads across tracks

Blocks:
Unit of information transferred between disk and main memory
Typically 4KB, 8KB, or larger
Multiple records stored in a single block
Files consist of multiple blocks connected by address pointers
Storage Strategies
Heap (Unsorted Files)
Characteristics:
Records stored in no particular order
New records appended to end of file
Simplest organization
Performance:
Insert
O(1)
Append to end
Search
O(N/2) average
Must scan all blocks
Delete
O(N/2)
Find then remove
Update
O(N/2)
Find then modify
Use Cases:
Small tables
Tables with mostly sequential scans
Temporary tables
Sorted Files
Characteristics:
Records ordered by a key attribute
Insertion requires maintaining order
Binary search possible
Performance:
Insert
O(N)
Must maintain order
Search
O(log N)
Binary search
Range Query
O(log N + K)
K = result size
Advantages:
Fast searches on sort key
Efficient range queries
Sequential access benefits
Disadvantages:
Expensive insertions
Only one sort order
Indexing
Indexes provide fast access paths to data without requiring full table scans.
Sparse Primary Index
Structure:
One index entry per data block
Index stores: (first key in block, pointer to block)
Requires data to be sorted
Performance:
Search: O(log F) where F = number of blocks
Space: Minimal index size
Dense Primary Index
Structure:
One index entry per record
Index stores: (key value, pointer to record)
Requires data to be sorted
Performance:
Search: O(log N) where N = number of records
Space: Larger index size
Secondary Index
Structure:
Built on non-key attributes
Must be dense (one entry per record)
Data doesn't need to be sorted on this attribute
Use Cases:
Point queries on non-key attributes
Attributes frequently used in WHERE clauses
Multilevel Index
Concept: Create an index on the index
Structure:
┌──────────────────────┐
│ Top-Level Index │ ← Smallest, fits in memory
├──────────────────────┤
│ Second-Level Index │
├──────────────────────┤
│ First-Level Index │ ← Dense or sparse
├──────────────────────┤
│ Data Blocks │
└──────────────────────┘Performance:
Search: O(log_F N) where F = fanout
Reduces number of disk accesses significantly
B+ Tree Index
B+ Trees are the most common index structure in databases.
Characteristics:
Balanced: All leaves at same depth
Sorted: Keys in order within nodes
High Fanout: Each node contains many keys
Leaf Links: Leaf nodes linked for range scans
Advantages:
Guaranteed logarithmic search time
Efficient insertions and deletions
Range queries via leaf traversal
Self-balancing
Structure:
Internal Nodes: Contain keys and pointers to children
Leaf Nodes: Contain keys and pointers to data (or data itself)
Order N: Each node contains between ⌈N/2⌉ and N children
Performance:
Search: O(log_F N)
Insert: O(log_F N)
Delete: O(log_F N)
Range Query: O(log_F N + K) where K = result size
Hashing
Hashing provides constant-time average access for point queries.
Static Hashing
Structure:
Hash function: h(key) = bucket number
Each bucket contains one or more data blocks
Overflow blocks for collisions
Hash Function Properties:
Uniform distribution of values
Minimal collisions
Efficient computation
Performance:
Search: O(1) average, O(N) worst case
Insert: O(1) average
Range Query: O(N) (no ordering)
Limitations:
Fixed number of buckets
Performance degrades as data grows
No support for range queries
Dynamic Hashing
Approaches:
Extendible Hashing: Adjust directory size dynamically
Linear Hashing: Grow buckets incrementally
Advantages:
Adapts to data growth
Maintains constant-time access
No full reorganization needed
References
Course Materials:
CS 6400: Database Systems Concepts and Design - Georgia Tech OMSCS
Last updated