07 Virtualization
Overview
Virtualization extends the concept of operating system extensibility by enabling the simultaneous co-existence of multiple complete operating systems on top of the same hardware platform. This technology allows multiple guest operating systems to run independently on shared hardware resources, managed by a specialized layer called a hypervisor or Virtual Machine Monitor (VMM). The concept, pioneered by IBM VM/370 in the 1960s and 70s, has become fundamental to modern cloud computing and data center operations.
Motivation and Utility Computing
Cost Reduction Through Resource Sharing
The primary motivation for virtualization is cost reduction through efficient resource sharing. Virtualization enables companies to provide users with an experience equivalent to owning dedicated hardware platforms at a fraction of the cost by providing virtual platforms instead. From the user's perspective, a virtual platform functions identically to a real one, running applications without concern for the underlying implementation.
Resource Aggregation
Virtualization facilitates utility computing by sharing hardware resources across multiple user communities, significantly reducing the cost of ownership and maintenance. The underlying principle leverages the fact that resource usage is typically bursty. By aggregating bursty demands from many users, a cumulative usage pattern emerges that can be serviced by a shared pool of resources, giving individual users access to more resources than they could afford on their own, at a fraction of the cost.
Modern Cloud Computing
Major cloud providers like Amazon Web Services and Microsoft employ this model, providing resources on a shared, rental, or utility basis. Virtualization represents the logical extension of the idea of extensibility or specialization of services, applied at a much larger granularity: an entire operating system.
Historical Context
The concept of virtualization dates back to the IBM VM/370 in the 1960s and 70s, which aimed to give every user the illusion that the computer was theirs exclusively. Modern virtualization saw a resurgence in the early 2000s with projects like Stanford's SimOS (which formed the basis for VMware) and technologies like Xen. Initial applications included application mobility, server consolidation, co-located hosting facilities, and distributed web services.
Today, virtualization has become mainstream because companies can now provide resources with complete performance isolation and bill each individual user separately. It has transformed computing into a utility, much like electricity and water services.
Hypervisors
To manage multiple operating systems on shared hardware, a specialized Virtual Machine Monitor (VMM) or hypervisor is needed, essentially an "operating system of operating systems." The operating systems running on top of the shared hardware are called virtual machines (VMs) or guest operating systems.
Types of Hypervisors
Native Hypervisor (Bare Metal):
Runs directly on top of bare hardware
Guest operating systems run on top of this hypervisor
Offers the best performance for guest operating systems due to minimal interference
Examples: VMware ESXi, Xen
Hosted Hypervisor:
Runs as an application process on top of a host operating system
Guest operating systems are clients of this hosted hypervisor
Examples: VMware Workstation, VirtualBox
Native (Bare Metal) Hypervisor           Hosted Hypervisor
=====================================    =====================================
┌───────────┬───────────┬───────────┐    ┌───────────┬───────────┬───────────┐
│  Guest    │  Guest    │  Guest    │    │  Guest    │  Guest    │  Guest    │
│  OS 1     │  OS 2     │  OS 3     │    │  OS 1     │  OS 2     │  OS 3     │
├───────────┴───────────┴───────────┤    ├───────────┴───────────┴───────────┤
│                                   │    │    Hypervisor (Application)       │
│         Hypervisor (VMM)          │    ├───────────────────────────────────┤
│                                   │    │     Host Operating System         │
├───────────────────────────────────┤    ├───────────────────────────────────┤
│        Hardware Platform          │    │        Hardware Platform          │
└───────────────────────────────────┘    └───────────────────────────────────┘
Virtualization Approaches
Full Virtualization
In full virtualization, the operating system remains completely untouched; not even a single line of code is modified. The guest operating system believes it is running on bare metal.
Trap and Emulate Strategy:
Guest operating systems run as user-level processes
When a guest OS attempts to execute a privileged instruction, it triggers a trap into the hypervisor
The hypervisor emulates the intended functionality of the instruction on behalf of the guest operating system
Full Virtualization: Trap and Emulate
======================================
┌────────────────────────────────────────────────────────┐
│                 Guest Operating System                 │
│           (Unmodified - runs at user level)            │
│                                                        │
│  Application ──► System Call ──► Privileged Instr.     │
└────────────────────────────┬───────────────────────────┘
                             │
                             │ Trap (implicit)
                             ▼
┌────────────────────────────────────────────────────────┐
│                       Hypervisor                       │
│                                                        │
│  1. Catch trap from privileged instruction             │
│  2. Validate operation                                 │
│  3. Emulate the intended functionality                 │
│  4. Return control to Guest OS                         │
└────────────────────────────┬───────────────────────────┘
                             │
                             ▼
                      Hardware Access
Architectural Challenges:
A significant issue with full virtualization, particularly in older Intel architectures, is that some privileged instructions may fail silently rather than trapping to the hypervisor. To counter this, hypervisors employ a binary translation strategy, identifying and modifying problematic instructions in the guest operating system binaries. Newer hardware from Intel and AMD includes virtualization support to mitigate these issues.
Example: VMware ESX Server
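As a toy illustration of the binary-rewriting idea (not VMware's actual translator), the C sketch below scans a buffer of guest machine code for POPF (opcode 0x9D), an x86 instruction that silently ignores privileged flag updates when executed at user level, and patches it to INT3 (0xCC) so that executing it traps into a handler. A real translator decodes whole instruction streams rather than scanning raw bytes, so treat this purely as a sketch of the concept.

/* Toy illustration of binary rewriting for a "silently failing" x86
 * instruction. POPF (0x9D) executed at user level quietly ignores updates
 * to the interrupt flag instead of trapping, so a binary-translating
 * hypervisor would replace it with something that does trap (here INT3,
 * 0xCC, purely for illustration). */
#include <stdio.h>
#include <stddef.h>

#define OPC_POPF 0x9D   /* pops flags; does not trap at user level      */
#define OPC_INT3 0xCC   /* breakpoint: guaranteed to trap to a handler  */

static size_t rewrite_popf(unsigned char *code, size_t len)
{
    size_t patched = 0;
    for (size_t i = 0; i < len; i++) {
        if (code[i] == OPC_POPF) {  /* found a problematic instruction */
            code[i] = OPC_INT3;     /* force a trap into the VMM       */
            patched++;
        }
    }
    return patched;
}

int main(void)
{
    /* pushf; popf; nop; ret -- a fake guest code snippet */
    unsigned char guest_code[] = { 0x9C, 0x9D, 0x90, 0xC3 };

    size_t n = rewrite_popf(guest_code, sizeof guest_code);
    printf("patched %zu instruction(s); byte 1 is now 0x%02X\n",
           n, (unsigned)guest_code[1]);
    return 0;
}
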
Para-virtualization
Para-virtualization involves modifying the source code of the guest operating system. This makes it possible to avoid problematic instructions entirely and to add optimizations, such as letting the guest operating system see the real hardware resources beneath the hypervisor.
Key Characteristics:
Guest OS source code is modified to be aware of the hypervisor
Applications remain unchanged; the interfaces that applications see are exactly the interfaces provided by the operating system
Modifications are typically small (on the order of 1-2% of the codebase for Linux and Windows XP)
Uses explicit hypercalls instead of implicit traps
Para-virtualization: Explicit Hypercalls
=========================================
┌────────────────────────────────────────────────────────┐
│                 Guest Operating System                 │
│            (Modified - aware of hypervisor)            │
│                                                        │
│  Application ──► System Call ──► Hypercall API         │
└────────────────────────────┬───────────────────────────┘
                             │
                             │ Hypercall (explicit)
                             ▼
┌────────────────────────────────────────────────────────┐
│                       Hypervisor                       │
│                                                        │
│  Hypercall Interface:                                  │
│    • create_page_table()                               │
│    • switch_to_page_table()                            │
│    • update_page_table()                               │
│    • device_operation()                                │
│                                                        │
│  Direct execution of privileged operations             │
└────────────────────────────┬───────────────────────────┘
                             │
                             ▼
                      Hardware Access
Example: Xen
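To make the hypercall path concrete, here is a minimal C sketch of an explicit hypercall interface modeled loosely on the calls listed in the diagram above. The hypercall numbers, the dispatch function, and the plain function-call transport are assumptions for illustration; a real para-virtualized guest transfers control to the hypervisor with a trapping instruction rather than a C call.

/* Minimal sketch of an explicit hypercall interface, with made-up
 * hypercall numbers and a function-call "transport". */
#include <stdio.h>

enum hypercall {                    /* hypothetical hypercall numbers */
    HC_CREATE_PAGE_TABLE = 1,
    HC_SWITCH_TO_PAGE_TABLE,
    HC_UPDATE_PAGE_TABLE,
    HC_DEVICE_OPERATION
};

/* Hypervisor side: validate and carry out the privileged operation. */
static long hypervisor_dispatch(enum hypercall nr, unsigned long arg)
{
    switch (nr) {
    case HC_CREATE_PAGE_TABLE:
        printf("hypervisor: allocate frame %lu as a page table\n", arg);
        return 0;
    case HC_SWITCH_TO_PAGE_TABLE:
        printf("hypervisor: install page table %lu in hardware\n", arg);
        return 0;
    case HC_UPDATE_PAGE_TABLE:
        printf("hypervisor: update entry for page %lu\n", arg);
        return 0;
    case HC_DEVICE_OPERATION:
        printf("hypervisor: perform device operation %lu\n", arg);
        return 0;
    default:
        return -1;                  /* unknown hypercall: reject */
    }
}

/* Guest side: a para-virtualized kernel calls the hypervisor explicitly
 * instead of executing a privileged instruction and trapping. */
int main(void)
{
    hypervisor_dispatch(HC_CREATE_PAGE_TABLE, 42);
    hypervisor_dispatch(HC_SWITCH_TO_PAGE_TABLE, 42);
    return 0;
}
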
Comparison of Virtualization Approaches
GOS Code Modification
  Full virtualization: None; the binary remains unchanged
  Para-virtualization: Source code is modified (typically 1-2% of the codebase)
GOS Awareness
  Full virtualization: Unaware of the hypervisor; believes it runs on bare metal
  Para-virtualization: Aware of the hypervisor's presence
Privileged Operations
  Full virtualization: Trap and emulate on privileged instructions
  Para-virtualization: Explicit hypercalls to the hypervisor
Handling Architectural Quirks
  Full virtualization: Binary translation/rewriting of silently failing instructions
  Para-virtualization: Modified source code avoids problematic instructions
GOS → Hypervisor Control Transfer
  Full virtualization: Implicit, via traps
  Para-virtualization: Explicit, via hypercalls
Memory Management
  Full virtualization: Hypervisor maintains the PPN-to-MPN mapping in a shadow page table
  Para-virtualization: GOS manages the PPN-to-MPN mapping using hypercalls
I/O Virtualization
  Full virtualization: Trap and emulate; little scope for innovation
  Para-virtualization: Clean device abstractions, shared buffers, zero-copy transfers
Interrupt Control
  Full virtualization: Events delivered as they occur
  Para-virtualization: GOS has explicit control over event delivery
Examples
  Full virtualization: VMware ESX Server
  Para-virtualization: Xen
CPU Virtualization
CPU virtualization aims to give each guest operating system the illusion that it owns the CPU and is unaware of other guests on the same CPU.
Scheduling and Accountability
The hypervisor acts as a scheduler, allocating CPU time to different VMs:
Scheduling Policies:
Proportional Share Scheduler (VMware ESX Server): Allocates CPU shares based on service agreements
Fair Share Scheduler: Gives an equal share of the CPU to each guest operating system
The hypervisor accounts for the time used on the CPU by each guest for billing purposes, including time stolen from a particular VM to service external interrupts that did not belong to it.
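A small sketch of proportional-share allocation and time accounting is shown below, using invented VM names, share values, and a 100 ms scheduling period; it is not the actual ESX scheduler, just the arithmetic behind it.

/* Sketch of proportional-share CPU allocation with simple accounting.
 * Each VM is entitled to a slice of the period proportional to its shares;
 * time stolen to service other VMs' interrupts is tracked separately so
 * billing reflects what each VM actually received. */
#include <stdio.h>

struct vm {
    const char *name;
    unsigned shares;      /* from the service agreement        */
    double   used_ms;     /* CPU time actually consumed        */
    double   stolen_ms;   /* time taken for others' interrupts */
};

int main(void)
{
    struct vm vms[] = {
        { "vm-a", 400, 0, 0 },
        { "vm-b", 200, 0, 0 },
        { "vm-c", 200, 0, 0 },
    };
    const int n = (int)(sizeof vms / sizeof vms[0]);
    const double period_ms = 100.0;           /* scheduling period */

    unsigned total_shares = 0;
    for (int i = 0; i < n; i++)
        total_shares += vms[i].shares;

    for (int i = 0; i < n; i++) {
        double slice = period_ms * vms[i].shares / total_shares;
        vms[i].used_ms += slice;              /* pretend the VM ran its full slice */
        printf("%s: entitled to %.1f ms of every %.0f ms\n",
               vms[i].name, slice, period_ms);
    }

    /* Accounting for stolen time: vm-a loses 2 ms handling an interrupt
     * destined for another VM, so the ledger records it accordingly. */
    vms[0].stolen_ms += 2.0;
    vms[0].used_ms   -= 2.0;
    printf("%s billed %.1f ms this period (%.1f ms stolen for others)\n",
           vms[0].name, vms[0].used_ms, vms[0].stolen_ms);
    return 0;
}
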
Handling Program Discontinuities
When a process executes on the CPU within a VM, various program discontinuities can occur, which the hypervisor must field and pass to the appropriate guest operating system:
System calls: Calls into the guest OS
Page faults: Occur when a virtual page is not in machine memory
Exceptions: Such as divide by zero
External interrupts: Asynchronous events not initiated by the process
All these events are packaged as software interrupts by the hypervisor and delivered to the guest operating system, which then handles them as it would on bare metal.
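That dispatch can be pictured with the small C sketch below; the event names, the struct layout, and deliver_to_guest() are illustrative assumptions rather than any real hypervisor's interface.

/* Sketch of a hypervisor fielding program discontinuities and handing them
 * to the right guest as software interrupts. */
#include <stdio.h>

enum guest_event { EV_SYSCALL, EV_PAGE_FAULT, EV_EXCEPTION, EV_EXTERNAL_IRQ };

struct event {
    enum guest_event kind;
    int guest_id;             /* which VM the event belongs to       */
    unsigned long info;       /* faulting address, trap number, ...  */
};

/* In a real system this would raise a software interrupt in the guest so
 * its own handler (syscall entry, page-fault handler, ...) runs next. */
static void deliver_to_guest(const struct event *ev)
{
    static const char *names[] = {
        "system call", "page fault", "exception", "external interrupt"
    };
    printf("deliver %s (info=0x%lx) to guest %d as a software interrupt\n",
           names[ev->kind], ev->info, ev->guest_id);
}

int main(void)
{
    struct event evs[] = {
        { EV_SYSCALL,      1, 0x80   },
        { EV_PAGE_FAULT,   2, 0xb000 },
        { EV_EXTERNAL_IRQ, 1, 0x21   },
    };
    for (int i = 0; i < 3; i++)
        deliver_to_guest(&evs[i]);   /* hypervisor fields, then forwards */
    return 0;
}
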
Challenges with Privileged Instructions
Full Virtualization Challenges:
Guest OSes run at user-level privilege
When a guest OS tries to execute a privileged instruction, it should trap into the hypervisor
In older x86 architectures, some privileged instructions fail silently when executed at user level
Hypervisors must be aware of hardware quirks and employ binary rewriting to detect and handle silently failing instructions
Newer versions of Intel and AMD architectures include virtualization support to address these issues
Communication Mechanisms:
Full Virtualization: Communication is always implicit via traps
Para-virtualization: Communication is explicit via hypercalls
Memory Virtualization
Memory virtualization is crucial to performance in virtualized environments due to its impact on the entire memory hierarchy.
Memory Address Translation
In a virtualized setup, the hypervisor controls the machine memory (real physical memory), while each guest operating system's physical memory is itself an illusion. This guest-perceived physical memory is often non-contiguous in real machine memory due to dynamic allocation and sharing.
Two-Step Translation Process:
Virtual Page Number (VPN) to Physical Page Number (PPN): Handled by the guest OS's page table
Physical Page Number (PPN) to Machine Page Number (MPN): Handled by the hypervisor, typically using a shadow page table
The shadow page table acts as the hardware page table in architectures where the CPU uses the page table for address translation.
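The C sketch below models the two-step translation and the shadow page table that collapses it into a single VPN-to-MPN lookup; the tiny arrays and page numbers are invented, and real page tables are multi-level hardware structures walked by the MMU.

/* Sketch of VPN -> PPN -> MPN translation and a shadow page table. */
#include <stdio.h>

#define NPAGES 8

/* Guest's page table: VPN -> PPN (the guest's illusion of physical memory). */
static int guest_pt[NPAGES] = { 3, 1, 7, -1, 0, -1, 2, 5 };

/* Hypervisor's map: PPN -> MPN (where "physical" pages really live). */
static int pmap[NPAGES]     = { 12, 40, 9, 33, 18, 25, 7, 60 };

/* Shadow page table: VPN -> MPN, what the MMU actually walks. */
static int shadow_pt[NPAGES];

int main(void)
{
    /* When the guest installs a VPN -> PPN entry (a trapped privileged
     * operation), the hypervisor composes it with PPN -> MPN and writes
     * the direct VPN -> MPN entry into the shadow table. */
    for (int vpn = 0; vpn < NPAGES; vpn++)
        shadow_pt[vpn] = (guest_pt[vpn] < 0) ? -1 : pmap[guest_pt[vpn]];

    /* Later accesses translate in one step, at hardware speed. */
    for (int vpn = 0; vpn < NPAGES; vpn++) {
        if (shadow_pt[vpn] < 0)
            printf("VPN %d: not mapped\n", vpn);
        else
            printf("VPN %d -> PPN %d -> MPN %d (shadow holds VPN %d -> MPN %d)\n",
                   vpn, guest_pt[vpn], pmap[guest_pt[vpn]], vpn, shadow_pt[vpn]);
    }
    return 0;
}
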
Efficient Mapping in Full Virtualization
Since a fully virtualized guest OS is unaware of machine pages, the hypervisor maintains the PPN to MPN mapping. To ensure efficiency on every memory access:
The hypervisor intercepts guest OS attempts to update its page tables (privileged instructions that cause traps)
The hypervisor directly updates the shadow page table (or hardware page table/TLB) with the VPN to MPN mapping
This bypasses the guest OS's page table for future translations, allowing direct VPN to MPN translation at hardware speeds
This technique is used in VMware ESX Server.
Efficient Mapping in Para-virtualization
In para-virtualization, the guest OS knows that its physical memory is not contiguous and can be made aware of machine memory. Therefore, the burden of PPN to MPN mapping can be shifted to the guest OS.
Hypercalls for Memory Management:
create_page_table: Allocate and initialize a page frame as a page table
switch_to_page_table: Change the active hardware page table during a context switch
update_page_table: Update specific entries in the page table after events like page faults
This cooperative approach allows the guest OS to manage its page tables, leading to efficient address translation at hardware speeds.
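As a rough sketch of where the bookkeeping moves under para-virtualization, the fragment below keeps the physical-to-machine table inside the guest and funnels the final installation of a mapping through an update_page_table hypercall modeled as a plain C function; the signature and the page numbers are invented for illustration.

/* Para-virtualization sketch: the guest knows PPN -> MPN and asks the
 * hypervisor to install VPN -> MPN mappings via a hypercall. The hypercall
 * is modeled as an ordinary function; real guests trap into the VMM. */
#include <stdio.h>

#define NPAGES 4

/* Guest-maintained physical-to-machine table (PPN -> MPN). */
static int p2m[NPAGES] = { 17, 4, 22, 9 };

/* Assumed hypercall: hypervisor validates and writes the hardware PTE. */
static int update_page_table(int vpn, int mpn)
{
    printf("hypercall: map VPN %d -> MPN %d in the hardware page table\n",
           vpn, mpn);
    return 0;
}

int main(void)
{
    /* Servicing a page fault on VPN 2: the guest picks PPN 1, looks up
     * the machine frame itself, and issues the hypercall. */
    int vpn = 2, ppn = 1;
    update_page_table(vpn, p2m[ppn]);
    return 0;
}
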
Dynamic Memory Management: Ballooning
To handle bursty memory requirements and over-commitment of memory, a technique called ballooning is used:
Inflation Process:
A special balloon device driver is installed in each guest OS
If the hypervisor needs more memory, it contacts a guest OS that is not actively using all of its memory via a private channel
It instructs the balloon driver to inflate, causing it to request memory from the guest OS
The guest OS pages out data to disk to satisfy the balloon's requests
The memory acquired by the balloon driver is returned to the hypervisor, freeing up machine memory
Deflation Process:
When a guest needs more memory, the hypervisor tells the balloon to deflate, releasing memory back to the guest OS, which can then page in data from disk.
Ballooning assumes cooperation with the guest operating system and is applicable to both full and para-virtualized environments.
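A toy model of inflation and deflation is sketched below, with made-up sizes in megabytes and none of the real paging activity; it only tracks how machine memory moves between a guest's balloon and the hypervisor's pool.

/* Toy model of balloon inflation/deflation. Inflating the balloon takes
 * memory away from the guest and hands the underlying machine memory back
 * to the hypervisor; deflating returns it to the guest. */
#include <stdio.h>

struct guest {
    const char *name;
    int free_mb;        /* memory the guest is not actively using       */
    int balloon_mb;     /* memory currently held by the balloon driver  */
};

static int hypervisor_pool_mb = 256;

/* Hypervisor asks the balloon driver (over a private channel) to inflate;
 * in reality the guest may page data out to disk to satisfy the request. */
static int balloon_inflate(struct guest *g, int want_mb)
{
    int got = want_mb < g->free_mb ? want_mb : g->free_mb;
    g->free_mb    -= got;
    g->balloon_mb += got;
    hypervisor_pool_mb += got;          /* machine pages returned to the pool */
    printf("%s: balloon inflated by %d MB\n", g->name, got);
    return got;
}

static void balloon_deflate(struct guest *g, int give_mb)
{
    int back = give_mb < g->balloon_mb ? give_mb : g->balloon_mb;
    g->balloon_mb -= back;
    g->free_mb    += back;
    hypervisor_pool_mb -= back;         /* machine pages granted back to guest */
    printf("%s: balloon deflated by %d MB\n", g->name, back);
}

int main(void)
{
    struct guest idle_vm = { "idle-vm", 512, 0 };

    balloon_inflate(&idle_vm, 128);     /* reclaim memory for a busy VM  */
    balloon_deflate(&idle_vm, 64);      /* give some of it back later    */
    printf("hypervisor pool: %d MB\n", hypervisor_pool_mb);
    return 0;
}
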
Memory Ballooning Mechanism
============================
Inflation (Reclaim Memory):          Deflation (Return Memory):
┌──────────────────┐                 ┌──────────────────┐
│   Guest OS 1     │                 │   Guest OS 1     │
│  ┌────────────┐  │                 │  ┌────────────┐  │
│  │  Balloon   │  │ ◄── Inflate     │  │  Balloon   │  │ ◄── Deflate
│  │  Driver    │  │                 │  │  Driver    │  │
│  │     ▲      │  │                 │  │     ▼      │  │
│  │     │      │  │                 │  │     │      │  │
│  │  Requests  │  │                 │  │  Releases  │  │
│  │  Memory    │  │                 │  │  Memory    │  │
│  └────────────┘  │                 │  └────────────┘  │
│        │         │                 │        ▲         │
│        ▼         │                 │        │         │
│  Pages out to    │                 │  Pages in from   │
│  disk            │                 │  disk            │
└─────────┬────────┘                 └─────────┬────────┘
          │                                    ▲
          │ Returns                            │ Allocates
          │ Memory                             │ Memory
          ▼                                    │
┌───────────────────────────────────────────────────────────────┐
│                           Hypervisor                          │
│  ┌─────────────────────────────────────────────────────────┐  │
│  │            Machine Memory Pool (Physical RAM)           │  │
│  │  [Reclaimed] ◄──────────        ──────────► [Allocated] │  │
│  └─────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────┘
VM-Oblivious Page Sharing
To maximize memory utilization, virtualization allows sharing of identical memory pages across different VMs, particularly for core operating system pages or common applications.
Cooperative Sharing:
With guest OS cooperation, the hypervisor can mark shared pages as copy-on-write (COW). If a VM writes to such a page, a copy is made, and the VM receives its own unique page.
VM-Oblivious Page Sharing (VMware ESX Server):
This technique works completely oblivious to the guest operating system:
The hypervisor maintains a hash table containing content hashes of machine pages
It periodically scans VM pages, calculates their content hashes, and checks for matches in the hash table
A content hash match is a hint, requiring a full comparison to confirm identical content
If a full match is confirmed, the PPN to MPN mapping for the newly discovered identical page is modified to point to the existing shared machine page
The reference count for the shared machine page is incremented, and the entries are marked as COW
The duplicate machine page can then be freed
This scanning is performed as a background activity when the system is lightly loaded due to its labor-intensive nature. This method applies to both full and para-virtualized settings.
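The scan-and-share logic can be sketched as follows, with toy 16-byte pages, a simple djb2-style content hash, and a hard-coded pair of pages to compare; the real ESX implementation works on full machine pages and keeps a global hash table of hints.

/* Sketch of VM-oblivious page sharing: hash page contents, use a hash match
 * only as a hint, confirm with a full comparison, then point both PPNs at
 * one machine page marked copy-on-write. */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 16
#define NPAGES    3

struct mpage {
    unsigned char data[PAGE_SIZE];
    int refcount;
    int cow;                      /* marked copy-on-write when shared */
};

static unsigned long hash_page(const unsigned char *p)
{
    unsigned long h = 5381;       /* djb2-style toy content hash */
    for (int i = 0; i < PAGE_SIZE; i++)
        h = h * 33 + p[i];
    return h;
}

int main(void)
{
    struct mpage mem[NPAGES] = {
        { "hello world....", 1, 0 },
        { "hello world....", 1, 0 },   /* identical content in another VM */
        { "something else.", 1, 0 },
    };
    int ppn_to_mpn[NPAGES] = { 0, 1, 2 };

    /* Background scan: compare page 1 against page 0. */
    if (hash_page(mem[1].data) == hash_page(mem[0].data) &&     /* hint  */
        memcmp(mem[1].data, mem[0].data, PAGE_SIZE) == 0) {     /* proof */
        ppn_to_mpn[1] = 0;        /* remap the duplicate onto the shared page */
        mem[0].refcount++;
        mem[0].cow = 1;           /* future writes trigger a private copy */
        printf("PPN 1 now maps to MPN %d (refcount=%d, cow=%d); old MPN 1 freed\n",
               ppn_to_mpn[1], mem[0].refcount, mem[0].cow);
    }
    return 0;
}
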
Memory Allocation Policies
Hypervisors implement policies for allocating and reclaiming memory, balancing performance, fairness, and utilization:
Pure Share-Based Policy:
Resources allocated based on service level agreement
Drawback: hoarding of unused resources
Principle: "pay less, get less"
Working-Set-Based Approach:
Allocates memory dynamically based on the VM's active working set
Dynamic Idle-Adjusted Shares (VMware ESX Server):
Hybrid approach that taxes idle pages more than active pages
If a VM is hoarding idle pages, the hypervisor reclaims a fraction of them according to the tax rate (e.g., 50% or 75%)
Allows reclaiming most idle memory while providing a buffer for sudden working-set increases (a numeric sketch follows this list)
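A back-of-the-envelope sketch of the idle-memory tax idea, assuming a 75% tax rate and invented VM sizes; this is a simplification of the actual ESX policy rather than its exact formula.

/* Simplified illustration of taxing idle memory: a VM's reclaimable memory
 * is its idle portion times the tax rate, leaving a buffer for growth. */
#include <stdio.h>

int main(void)
{
    double allocated_mb = 1024.0;   /* what the VM currently holds        */
    double active_mb    = 256.0;    /* estimated working set              */
    double tax_rate     = 0.75;     /* fraction of idle pages reclaimable */

    double idle_mb        = allocated_mb - active_mb;
    double reclaimable_mb = idle_mb * tax_rate;

    printf("idle: %.0f MB, reclaimable under a %.0f%% tax: %.0f MB\n",
           idle_mb, tax_rate * 100.0, reclaimable_mb);
    printf("buffer left for working-set growth: %.0f MB\n",
           idle_mb - reclaimable_mb);
    return 0;
}
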
Device Virtualization
Device virtualization aims to give the guest operating system the illusion that it owns the I/O devices.
Full Virtualization Approach
Trap and Emulate Technique:
When a guest OS attempts any access to devices, it results in a trap into the hypervisor
The hypervisor emulates the functionality
Not much scope for innovation beyond ensuring legality and permissions of I/O operations
Para-virtualization Approach
Para-virtualization offers more opportunities for innovation because guest OSes are aware of the hypervisor:
Hypervisors can provide clean and simple device abstractions
Shared buffers can be exposed via APIs to allow efficient data transfer between the guest OS and the hypervisor
Eliminates overhead of copying data multiple times
Innovations in event delivery are possible
Control Transfer
Guest → Hypervisor
  Full virtualization: Implicit, via traps on privileged instructions
  Para-virtualization: Explicit, via hypercalls
Hypervisor → Guest
  Full virtualization: Software interrupts or events
  Para-virtualization: Software interrupts
Event Control
  Full virtualization: Events delivered as they occur
  Para-virtualization: GOS controls when event notifications are delivered via hypercalls
Data Transfer in Para-virtualization
Data transfer is explicit in para-virtualized settings with significant scope for innovation.
Asynchronous I/O Rings (Xen):
Xen uses shared data structures called I/O rings for efficient communication between guests and the hypervisor. The I/O ring is a circular queue of descriptors that supports asynchronous request/response communication between the guest domain and Xen.
Key Characteristics:
I/O rings are a set of descriptors allocated from shared memory pages
Each request has a unique ID; Xen places responses back in the ring with the same ID
The guest is the request producer, and Xen is the request consumer
Xen is the response producer, and the guest is the response consumer
Four pointers manage the ring: request producer pointer, request consumer pointer, response producer pointer, and response consumer pointer
Descriptors contain pointers to guest-owned machine pages where data resides or should be placed
Xen accesses these pages directly, avoiding data copying (zero-copy semantics)

The asynchronous I/O ring design is described in Barham et al., "Xen and the Art of Virtualization," SOSP 2003.
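A compact C sketch of such a ring is given below: a circular buffer of descriptors with separate request and response producer/consumer indices, where the "shared page" is modeled as an ordinary array and the descriptor fields are assumptions rather than Xen's exact layout.

/* Sketch of an asynchronous I/O ring: a circular buffer of descriptors
 * shared between guest and hypervisor. The guest produces requests and
 * consumes responses; the hypervisor does the opposite. Buffers are
 * referenced by page number, so no data is copied through the ring. */
#include <stdio.h>

#define RING_SIZE 8                   /* power of two so masking works */

struct desc {
    unsigned id;                      /* matches a response to its request   */
    unsigned long buffer_mpn;         /* guest-owned page: zero-copy handoff */
    int op;                           /* e.g. 0 = read, 1 = write            */
};

struct io_ring {
    struct desc slots[RING_SIZE];
    unsigned req_prod, req_cons;      /* guest produces, hypervisor consumes */
    unsigned rsp_prod, rsp_cons;      /* hypervisor produces, guest consumes */
};

/* Guest side: enqueue a request descriptor. */
static void guest_submit(struct io_ring *r, unsigned id, unsigned long mpn, int op)
{
    struct desc *d = &r->slots[r->req_prod++ % RING_SIZE];
    d->id = id; d->buffer_mpn = mpn; d->op = op;
}

/* Hypervisor side: consume requests, operate on the guest's page directly
 * (no copy), and produce a response carrying the same id. */
static void hypervisor_service(struct io_ring *r)
{
    while (r->req_cons != r->req_prod) {
        struct desc req = r->slots[r->req_cons++ % RING_SIZE];
        printf("backend: op %d on MPN %lu (id %u)\n",
               req.op, req.buffer_mpn, req.id);
        r->slots[r->rsp_prod++ % RING_SIZE] = req;   /* response, same id */
    }
}

int main(void)
{
    struct io_ring ring = { 0 };

    guest_submit(&ring, 1, 100, 0);
    guest_submit(&ring, 2, 101, 1);
    hypervisor_service(&ring);

    /* Guest side: consume responses asynchronously, matched by id. */
    while (ring.rsp_cons != ring.rsp_prod) {
        struct desc rsp = ring.slots[ring.rsp_cons++ % RING_SIZE];
        printf("guest: completion for request id %u\n", rsp.id);
    }
    return 0;
}
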
Network Virtualization (Xen):
Each guest has separate I/O rings for transmission and reception:
Transmission: Guest enqueues descriptors with pointers to its packet buffers; Xen retrieves packets directly without copying and uses a round-robin scheduler to transmit; pages are pinned during transmission
Reception: Xen receives packets and directly places them into pre-allocated network buffers (guest-owned pages) or swaps the received machine page with a guest-owned page, avoiding copies
Disk I/O Virtualization (Xen):
Similar to network virtualization, each VM has a dedicated I/O ring for disk I/O:
Descriptors include pointers to guest operating system buffers
Xen manages requests asynchronously and can reorder them for efficiency
Provides a reorder barrier hypercall for guests to enforce operations in a specific order
Measuring Time and Resource Usage
The fundamental principle of utility computing is that resources are shared by multiple clients, requiring mechanisms for billing. Virtualized environments accurately record time and space usage for:
CPU time
Memory allocation
Storage utilization
Network bandwidth
This metering enables precise billing based on actual resource consumption.
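As a minimal illustration, with invented per-unit prices and usage figures, a bill reduces to multiplying each metered quantity by its rate:

/* Minimal sketch of per-VM resource metering for utility-style billing.
 * The resource categories mirror the list above; rates and usage numbers
 * are made up. */
#include <stdio.h>

struct usage {
    const char *vm;
    double cpu_hours, mem_gb_hours, storage_gb, net_gb;
};

int main(void)
{
    /* hypothetical per-unit prices */
    const double p_cpu = 0.05, p_mem = 0.01, p_sto = 0.02, p_net = 0.08;

    struct usage u = { "vm-a", 10.0, 40.0, 100.0, 3.5 };

    double bill = u.cpu_hours    * p_cpu
                + u.mem_gb_hours * p_mem
                + u.storage_gb   * p_sto
                + u.net_gb       * p_net;

    printf("%s: $%.2f for metered CPU, memory, storage, and network use\n",
           u.vm, bill);
    return 0;
}
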
Modern Hardware Support
Modern processor chips from both Intel and AMD include virtualization support in hardware to ease hypervisor implementation. These hardware features address the architectural quirks that previously required techniques like binary rewriting in full virtualization.
Hardware-Assisted Virtualization:
Para-virtualized systems like Xen now exploit these hardware features to support unmodified operating systems through hardware-assisted virtualization, combining the benefits of both approaches.
References
Course Materials:
CS 6210: Advanced Operating Systems - Georgia Tech OMSCS
Foundational Systems:
IBM VM/370 - Pioneering virtualization system (1960s-70s)
Stanford SimOS - Foundation for VMware
Xen - Open-source hypervisor
VMware ESX Server - Commercial bare-metal hypervisor