03 Database Design

Overview

Database design involves organizing data to minimize redundancy, ensure data integrity, and optimize performance. This chapter covers normalization theory, which provides a systematic approach to database design, as well as physical storage considerations.

Key Goals:

Eliminate data redundancy
Prevent anomalies (insertion, deletion, update)
Ensure data integrity
Optimize storage and access performance

Normalization

Normalization is the process of organizing data in a database to reduce redundancy and dependency. It involves decomposing tables into smaller, well-structured tables without losing information.

Non-Normalized Problems

Non-normalized databases suffer from several critical issues:

1. Redundancy Problems

Data Redundancy occurs when the same information is stored in multiple places, leading to:

Insertion Anomaly:

Cannot insert data without all related information
Example: Cannot add a new department unless an employee is assigned to it

Deletion Anomaly:

Deleting one piece of information causes unintended loss of other information
Example: Deleting the last employee in a department removes all department information

Update Anomaly:

Changes must be made in multiple places
Inconsistency if updates are not applied everywhere
Example: Changing a department name requires updating every employee record in that department

Example of Redundant Table:

EmployeeID

Name

DepartmentID

DepartmentName

DepartmentLocation

Alice

Sales

Building A

Bob

Sales

Building A

Carol

Engineering

Building B

Problems:

"Sales" and "Building A" repeated for every sales employee
If "Sales" moves to "Building C", must update multiple rows
Deleting all Sales employees loses department information

2. Information Loss

Information Loss happens when decomposing a table improperly, leading to:

Inability to reconstruct original data
Loss of relationships between data
Incorrect data representation

Example of Lossy Decomposition:

Original Table:

StudentID

CourseID

Instructor

CS101

Dr. Smith

Bad Decomposition: Table 1: (StudentID, Instructor) Table 2: (CourseID, Instructor)

Problem: Cannot determine which student took which course with which instructor.

3. Dependency Loss

Dependency Loss occurs when functional dependencies that existed in the original table cannot be enforced in the decomposed tables.

Example:

Original constraint: "Each course has one instructor"
After bad decomposition: This constraint cannot be enforced

Functional Dependencies

Functional Dependency (FD) is a relationship between attributes where one set of attributes uniquely determines another set of attributes.

Definition

For a relation R, X → Y (X functionally determines Y) means:

For each value of X in R, there is precisely one value of Y in R
Whenever two tuples have the same value for X, they must have the same value for Y

Notation: X → Y reads as "X determines Y" or "Y is functionally dependent on X"

Types of Functional Dependencies

Full Functional Dependency

Y is fully functionally dependent on X if:

Y is dependent on X, AND
Y is not dependent on any proper subset of X

Example:

(StudentID, CourseID) → Grade

This is a full dependency because:
- Grade depends on both StudentID AND CourseID
- Grade does not depend on StudentID alone
- Grade does not depend on CourseID alone

Partial Dependency

Y is partially dependent on X if:

Y is dependent on X, BUT
Y is also dependent on a proper subset of X

Example:

(StudentID, CourseID) → StudentName

This is a partial dependency because:
- StudentName depends on (StudentID, CourseID)
- StudentName also depends on just StudentID

Transitive Dependency

Y is transitively dependent on X if:

X → Z AND Z → Y
Therefore X → Y through Z

Example:

StudentID → DepartmentID
DepartmentID → DepartmentName
Therefore: StudentID → DepartmentName (transitive)

Keys and Functional Dependencies

Keys are used to enforce functional dependencies:

Key Type

Role in FD

Super Key

Any set of attributes that determines all other attributes

Candidate Key

Minimal super key (no proper subset is a super key)

Primary Key

Chosen candidate key used for tuple identification

Making X the key in a relation enforces the dependency that X determines all other attributes (X → Y for all Y).

Enforcing Functional Dependencies

Functional dependencies ensure:

Uniqueness: Each X value maps to exactly one Y value
One-to-One Relationship: Between X and Y values in the relation

Implementation:

Use PRIMARY KEY constraints
Use UNIQUE constraints
Define foreign keys appropriately

Normal Forms

Normal forms provide a progression of rules for organizing database tables, each building on the previous.

First Normal Form (1NF)

Definition: All domain values in a relation R are atomic (indivisible).

Requirements:

No repeating groups
No arrays or lists in a single cell
Each cell contains a single value
All entries in a column are of the same data type

Note: All proper relations are automatically in 1NF due to the definition of the relational model.

Example:

Violation of 1NF:

StudentID

Name

PhoneNumbers

Alice

555-1234, 555-5678

Corrected to 1NF:

StudentID

Name

PhoneNumber

Alice

555-1234

Alice

555-5678

Second Normal Form (2NF)

Definition: R is in 2NF if:

R is in 1NF, AND
Every non-key attribute is fully functionally dependent on the key (no partial dependencies)

Target: Eliminate partial dependencies

Process:

Identify partial dependencies
Decompose table to remove them

Example:

Violation of 2NF:

Enrollment(StudentID, CourseID, StudentName, CourseName, Grade)
Primary Key: (StudentID, CourseID)

Partial Dependencies:
- StudentID → StudentName
- CourseID → CourseName

Corrected to 2NF:

Student(StudentID, StudentName)
Course(CourseID, CourseName)
Enrollment(StudentID, CourseID, Grade)

Third Normal Form (3NF)

Definition: R is in 3NF if:

R is in 2NF, AND
Every non-key attribute is non-transitively dependent on the key

Target: Eliminate transitive dependencies

Process:

Identify transitive dependencies
Decompose table to remove them

Example:

Violation of 3NF:

Employee(EmployeeID, DepartmentID, DepartmentName)

Transitive Dependency:
- EmployeeID → DepartmentID
- DepartmentID → DepartmentName
- Therefore: EmployeeID → DepartmentName (transitive)

Corrected to 3NF:

Employee(EmployeeID, DepartmentID)
Department(DepartmentID, DepartmentName)

Boyce-Codd Normal Form (BCNF)

Definition: R is in BCNF if:

R is in 3NF, AND
Every determinant is a candidate key

Determinant: A set of attributes in R on which some other attribute is fully functionally dependent.

BCNF is stricter than 3NF:

3NF allows non-key attributes to determine other non-key attributes
BCNF requires every determinant to be a candidate key

Example:

Violation of BCNF (but in 3NF):

CourseSchedule(StudentID, Course, Instructor)
Primary Key: (StudentID, Course)

Functional Dependencies:
- (StudentID, Course) → Instructor (satisfies 3NF)
- Instructor → Course (violates BCNF, Instructor not a candidate key)

Corrected to BCNF:

StudentInstructor(StudentID, Instructor)
InstructorCourse(Instructor, Course)

Normal Forms Summary

Normal Form

Eliminates

Key Requirement

1NF

Repeating groups

Atomic values

2NF

Partial dependencies

Full dependency on key

3NF

Transitive dependencies

Non-transitive dependency on key

BCNF

All non-candidate key determinants

Every determinant is a candidate key

Progression:

1NF ⊂ 2NF ⊂ 3NF ⊂ BCNF

Every table in BCNF is also in 3NF, 2NF, and 1NF.

Armstrong's Axioms

Armstrong's Axioms are a set of inference rules for deriving functional dependencies. They are sound (only valid FDs are derived) and complete (all valid FDs can be derived).

Primary Rules

1. Reflexivity

Rule: If Y is a subset of X, then X → Y

Example:

If X = {StudentID, Name}, Y = {StudentID}
Then X → Y (trivial dependency)

Application: Trivial dependencies always hold

2. Augmentation

Rule: If X → Y, then WX → WY (for any W)

Example:

If StudentID → Name
Then (StudentID, CourseID) → (Name, CourseID)

Application: Adding the same attributes to both sides preserves the dependency

3. Transitivity

Rule: If X → Y and Y → Z, then X → Z

Example:

If StudentID → DepartmentID
And DepartmentID → DepartmentName
Then StudentID → DepartmentName

Application: Dependencies can be chained

Derived Rules

From the primary rules, additional useful rules can be derived:

Union

Rule: If X → Y and X → Z, then X → YZ

Proof:

X → Y (given)
X → Z (given)
X → XY (augmentation of 1)
XY → YZ (augmentation of 2)
X → YZ (transitivity of 3 and 4)

Decomposition

Rule: If X → YZ, then X → Y and X → Z

Proof:

X → YZ (given)
YZ → Y (reflexivity)
X → Y (transitivity of 1 and 2)
Similarly, X → Z

Pseudotransitivity

Rule: If X → Y and WY → Z, then WX → Z

Application: Useful for more complex dependency inference

Applications

Armstrong's Axioms are used to:

Derive all functional dependencies from a given set
Determine candidate keys
Verify normal forms
Optimize database schemas

Database Architecture

Database architecture describes how data is physically stored and accessed.

Storage Technologies

Modern databases use a hierarchy of storage technologies:

Storage Type

Speed

Capacity

Volatility

Cost/GB

CPU Registers

Fastest

Bytes-KB

Volatile

Highest

CPU Cache (L1/L2/L3)

Very Fast

KB-MB

Volatile

Very High

DRAM (RAM)

Fast

Volatile

High

SSD

Medium

GB-TB

Non-volatile

Medium

HDD

Slow

Non-volatile

Low

Tape/Cloud

Very Slow

PB+

Non-volatile

Lowest

Volatile vs. Persistent Storage

Volatile Storage (DRAM)

Characteristics:

Data Lost Without Power
Fast access (nanoseconds)
Limited capacity
Expensive

Usage:

Cache frequently accessed data ("hot data")
Buffer pool
Query execution workspace

Persistent Storage (Disk)

Characteristics:

Data is Safe after power loss
Slower access (milliseconds)
Large capacity
Inexpensive

Usage:

Permanent data storage
Transaction logs
Database files

Data Lifecycle

The database manages data movement between storage tiers:

┌─────────────────────┐
│    Hot Data (RAM)   │ ← Frequently accessed
├─────────────────────┤
│  Warm Data (SSD)    │ ← Moderately accessed
├─────────────────────┤
│   Cold Data (HDD)   │ ← Rarely accessed
└─────────────────────┘

Strategies:

Least Recently Used (LRU)
Most Frequently Used (MFU)
Clock algorithms

Data Access Performance

Main Memory Access:

Latency: ~30 nanoseconds (10^-7 seconds)
Bandwidth: ~GB/s per channel

Disk Access:

Latency: ~10 milliseconds (10^-2 seconds)
Seek time: 3-8 ms
Rotational delay: 2-3 ms
Transfer time: 0.5-1 ms

Performance Gap:

Main memory is approximately 100,000 times faster than disk
This gap drives database architecture decisions

Data Organization

How data is organized on disk significantly impacts performance.

Disk Structure

Modern hard drives consist of:

Components:

Platters: Circular disks with magnetic coating
Tracks: Concentric circles on platter surfaces
Sectors: Fixed-size segments of tracks (typically 512 bytes or 4KB)
Cylinders: Collection of tracks at the same position across platters
Read-Write Heads: Magnetic sensors for each platter surface
Actuator: Moves heads across tracks

Blocks:

Unit of information transferred between disk and main memory
Typically 4KB, 8KB, or larger
Multiple records stored in a single block
Files consist of multiple blocks connected by address pointers

Storage Strategies

Heap (Unsorted Files)

Characteristics:

Records stored in no particular order
New records appended to end of file
Simplest organization

Performance:

Operation

Complexity

Notes

Insert

O(1)

Append to end

Search

O(N/2) average

Must scan all blocks

Delete

O(N/2)

Find then remove

Update

O(N/2)

Find then modify

Use Cases:

Small tables
Tables with mostly sequential scans
Temporary tables

Sorted Files

Characteristics:

Records ordered by a key attribute
Insertion requires maintaining order
Binary search possible

Performance:

Operation

Complexity

Notes

Insert

O(N)

Must maintain order

Search

O(log N)

Binary search

Range Query

O(log N + K)

K = result size

Advantages:

Fast searches on sort key
Efficient range queries
Sequential access benefits

Disadvantages:

Expensive insertions
Only one sort order

Indexing

Indexes provide fast access paths to data without requiring full table scans.

Sparse Primary Index

Structure:

One index entry per data block
Index stores: (first key in block, pointer to block)
Requires data to be sorted

Performance:

Search: O(log F) where F = number of blocks
Space: Minimal index size

Dense Primary Index

Structure:

One index entry per record
Index stores: (key value, pointer to record)
Requires data to be sorted

Performance:

Search: O(log N) where N = number of records
Space: Larger index size

Secondary Index

Structure:

Built on non-key attributes
Must be dense (one entry per record)
Data doesn't need to be sorted on this attribute

Use Cases:

Point queries on non-key attributes
Attributes frequently used in WHERE clauses

Multilevel Index

Concept: Create an index on the index

Structure:

┌──────────────────────┐
│   Top-Level Index    │ ← Smallest, fits in memory
├──────────────────────┤
│ Second-Level Index   │
├──────────────────────┤
│  First-Level Index   │ ← Dense or sparse
├──────────────────────┤
│     Data Blocks      │
└──────────────────────┘

Performance:

Search: O(log_F N) where F = fanout
Reduces number of disk accesses significantly

B+ Tree Index

B+ Trees are the most common index structure in databases.

Characteristics:

Balanced: All leaves at same depth
Sorted: Keys in order within nodes
High Fanout: Each node contains many keys
Leaf Links: Leaf nodes linked for range scans

Advantages:

Guaranteed logarithmic search time
Efficient insertions and deletions
Range queries via leaf traversal
Self-balancing

Structure:

Internal Nodes: Contain keys and pointers to children
Leaf Nodes: Contain keys and pointers to data (or data itself)
Order N: Each node contains between ⌈N/2⌉ and N children

Performance:

Search: O(log_F N)
Insert: O(log_F N)
Delete: O(log_F N)
Range Query: O(log_F N + K) where K = result size

Hashing

Hashing provides constant-time average access for point queries.

Static Hashing

Structure:

Hash function: h(key) = bucket number
Each bucket contains one or more data blocks
Overflow blocks for collisions

Hash Function Properties:

Uniform distribution of values
Minimal collisions
Efficient computation

Performance:

Search: O(1) average, O(N) worst case
Insert: O(1) average
Range Query: O(N) (no ordering)

Limitations:

Fixed number of buckets
Performance degrades as data grows
No support for range queries

Dynamic Hashing

Approaches:

Extendible Hashing: Adjust directory size dynamically
Linear Hashing: Grow buckets incrementally

Advantages:

Adapts to data growth
Maintains constant-time access
No full reorganization needed

References

Course Materials:

CS 6400: Database Systems Concepts and Design - Georgia Tech OMSCS

Previous02 Data Model Next04 SQL

Last updated 12 days ago

Table of Contents

Overview

Normalization

Non-Normalized Problems

1. Redundancy Problems

2. Information Loss

3. Dependency Loss

Functional Dependencies

Definition

Types of Functional Dependencies

Full Functional Dependency

Partial Dependency

Transitive Dependency

Keys and Functional Dependencies

Enforcing Functional Dependencies

Normal Forms

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

Boyce-Codd Normal Form (BCNF)

Normal Forms Summary

Armstrong's Axioms

Primary Rules

1. Reflexivity

2. Augmentation

3. Transitivity

Derived Rules

Union

Decomposition

Pseudotransitivity

Applications

Database Architecture

Storage Technologies

Volatile vs. Persistent Storage

Volatile Storage (DRAM)

Persistent Storage (Disk)

Data Lifecycle

Data Access Performance

Data Organization

Disk Structure

Storage Strategies

Heap (Unsorted Files)

Sorted Files

Indexing

Sparse Primary Index

Dense Primary Index

Secondary Index

Multilevel Index

B+ Tree Index

Hashing

Static Hashing

Dynamic Hashing

References