IBM Research
Research Hypervisor Principles
Draft Version (0.20)
IBM Research Hypervisor Group
2005
International Business Machines Corporation
&revision;
&date;
Hypervisors allow different operating systems, or different
instances of a single operating system, to run on the same
hardware at the same time. In this sense, Hypervisors resemble
virtual machine systems such as VMware. However, recent
developments in Hypervisors have shown that performance need not
be compromised as it is with VM systems[XEN]. An OS and its
applications, running on top of a Hypervisor, can run at the
native speed of the machine; in this steady state, the OS need
never use Hypervisor services.
Introduction
The Research Hypervisor project creates a virtual machine
environment that is a Logical representation of a
machine that has been Partitioned from the
original base machine. We call this environment the Logical
Partition (LPAR).
In Xen parlance an LPAR is called a Domain.
The LPAR is different from a pure virtual machine in that it
uses para-virtualization techniques first
described by the Denali Project[DEN].
Introduce the "types" of hypervisors somewhere and don't
forget the ref.
Para-virtualization requires the software that runs in the
LPAR to be aware that it is not running
on bare metal. This LPAR
awareness comes from the use of software
abstractions to deal with the loss of access to certain specific
machine resources that are normally the domain of the software.
These resources are abstracted by a series of interfaces to the
Hypervisor (&HV;), known as Hypervisor Calls (or &hcall;s).
The Hypervisor is primarily concerned with the management of
memory, processors, interrupts and some simple transports. The
heart of the Research Hypervisor design is to keep the
core &HV; restricted to these items and
keep the code as small and simple as possible. All other
services are the domain of surrounding
cooperative LPARs. Although these
services can be used to create an LPAR environment, they are
beyond the scope of the &HV; itself.
In the following sections we will discuss our history, specific
goals, design decisions, and how logical abstractions were created
in the &HV; to facilitate an LPAR environment. Where
necessary, we will also discuss where the two currently supported
ISAs differ.
This document is a companion to the following documents:
Power Architecture Platform Requirements
The PAPR document, available from
IBM at: [URL]. It describes
the environment that a general purpose operating
system will run in, including bootstrap, runtime and
shutdown. Of particular interest should be the
description of the LPAR option.
The Research Hypervisor LPAR Extension
This document describes the additional &HV; calls
supported by the Research Hypervisor in order to
create and manage LPARs, available with this document.
The Research Hypervisor Cell Processor Extension
This document provides the description of the
additional &HV; interfaces necessary to access the
features of the Cell Processor Architecture, available
from the Cell Architecture Library.
If you have to ask, you'll never know [cite].
History
Machine abstractions have been around since the early data
processing years. Modern operating systems introduce a simpler
machine abstraction for user applications by centralizing
complexity. Micro- and exo-kernels attempt to simplify resource
management by distributing the complexity while still
maintaining a simple application environment. New projects
continue to emerge with slightly differing approaches; this
project is another.
IBM System/390 introduced an architecture that enabled machine
virtualization by introducing a new processor mode that allowed
access to privileged resources to be efficiently programmed
using a micro- and milli-code instruction set. In the last few
years IBM has extended the PowerPC architecture to introduce a
new processor mode that has exclusive access to specific
processor resources yet retains the same programming
environment and instruction set. Although not as efficient as
executing processor milli-code instructions, it is uniquely
suited to creating a para-virtualized environment.
The IBM PHYP Hypervisor product, which was released with the
recent Squadron pSeries machines[Ref], is the first software
&HV; to take advantage of this new processor mode. It is
capable of running several heterogeneous operating systems at
once, and provides each LPAR with an amazing array of RAS,
high end IO, and server consolidation benefits [Ref]. Other
processor manufacturers, in particular Intel and AMD, are also
designing (or have designed) architectural enhancements that
introduce a similar processor mode to their processors.
Regardless of whether or not this new processor mode exists,
this document shall refer to the processor mode that the &HV;
executes in as Hypervisor Mode.
This project began as an open source reference implementation of
the PHYP para-virtualization interfaces as defined by the Power
Architecture Platform Requirements (PAPR) in
order to explore new processor and architecture innovations. It
is evolving into a &HV; core that is capable of creating a
common LPAR environment on a multitude of different
architectures, with and without explicit &HV; support, the ia32
architecture being the most robust example of this ability.
Goals
The creation of an LPAR environment should go beyond the
simple virtualization of the underlying hardware. Instead, it
should create an environment that presents a simpler machine,
which software (including OSes) can take, customize, and use to
easily take advantage of architectural enhancements at the
processor level. With this in mind, and the PAPR as a starting
point, we set out to explore:
Support Open Source OSes like Linux, the BSDs, Darwin,
etc., and run on all machines they run on.
Security, by providing complete isolation, attestation, as well as
TPM[ref] services completely in software.
Small, auditable, and configurable source space.
Explore architectural and processor enhancements.
Small one-off LPARs that perform
specific isolated tasks.
Library OS creating an even simpler C Library/POSIX-like LPAR.
Create a Real Time LPAR environment.
Full virtualization from within an LPAR.
LPAR management.
New logical transports and inter-LPAR services.
Some of these goals have been explored extensively, others are
just beginning, but it is all still being researched for
improvement.
Hypervisor Principles
The Logical Model
It has been said that any problem becomes easier if you add a
layer of abstraction [cite]. The creation of a set of logical
resources for an LPAR represents that abstraction. A quick
overview of the benefits is as follows.
The &hcall;
The mechanism by which an LPAR submits a request to the &HV;
is generally &ISA; specific, but the calling semantics are
similar to those used by Unix System Calls. They normally
require the processor to suffer a trap and an address space
change while the processor enters the &HV; mode. System calls
used by most operating systems are generally targeted to C
library and/or POSIX semantics, where
arbitrary buffer pointers are passed into the kernel and some
cross address space copying is required to support these
semantics. &HV;s take a more shared memory centric view and
reserve the &hcall; for use as a control channel for small
messages that can usually fit in parameter registers reserved
by the ABI and, in the pathological case of IA32, on
the caller's stack.
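The control-channel idea can be illustrated with a small model. This is only a sketch under stated assumptions: the opcodes, handler names, the eight-register limit, and the 64-bit register width are all hypothetical, and the return codes are illustrative values, not definitions taken from this document.

```python
# Toy model of an hcall control channel. Every name, opcode, and limit
# below is hypothetical and chosen only for illustration.
REGISTER_BITS = 64        # assumed width of a parameter register
MAX_REGISTER_ARGS = 8     # assumed number of registers the ABI reserves

H_SUCCESS = 0             # illustrative return codes
H_FUNCTION = -2           # unknown opcode
H_PARAMETER = -4          # argument does not fit the calling convention


def h_yield(lpar_id):
    """Example handler: no arguments beyond the caller's identity."""
    return H_SUCCESS


def h_add(lpar_id, a, b):
    """Example handler: returns a status plus one result register."""
    return H_SUCCESS, (a + b) & ((1 << REGISTER_BITS) - 1)


HCALL_TABLE = {0x10: h_yield, 0x14: h_add}   # made-up opcodes


def hcall(lpar_id, opcode, *args):
    """Dispatch a request roughly the way a trap handler might."""
    # Only small, register-sized values travel over the control channel;
    # bulk data is expected to go through shared memory instead.
    if len(args) > MAX_REGISTER_ARGS:
        return H_PARAMETER
    for value in args:
        if not isinstance(value, int) or not 0 <= value < (1 << REGISTER_BITS):
            return H_PARAMETER
    handler = HCALL_TABLE.get(opcode)
    if handler is None:
        return H_FUNCTION
    return handler(lpar_id, *args)
```

Note that no buffer pointers cross the boundary in this model, which is why no cross address space copying is needed.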
Memory
The software running on an LPAR is assigned
chunks of contiguous memory that become
accessible to the LPAR when it requests that the &HV; map a
virtual page to a logical page. A collection of these chunks
creates a logical address space which the software can use to
describe a virtual to logical mapping. The &HV; then takes
this tuple of LPAR identifier and logical
page identifier, resolves the actual physical page identifier,
and finally inserts the translation into the page table that
the &HV; controls.
Partitioning physical memory into these large chunks
simplifies and reduces the meta-data necessary for the &HV; to
maintain the access, translation, and (in particular cases)
reverse translation information that is required to manage
several logical address spaces.
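The translation path described above can be sketched as a toy model. The chunk size, page size, and all class and method names are assumptions made for this example; they are not the Research Hypervisor's actual data structures.

```python
# Toy model of logical-to-physical resolution. Sizes and names are
# assumptions for illustration only.
CHUNK_SIZE = 16 * 1024 * 1024   # assumed chunk size (16 MiB)
PAGE_SIZE = 4096                # assumed page size


class LogicalAddressSpace:
    """One LPAR's memory: an ordered list of physical chunk bases."""

    def __init__(self, physical_chunk_bases):
        # Index = logical chunk number, value = physical base address.
        self.chunks = list(physical_chunk_bases)

    def logical_to_physical(self, logical_addr):
        if logical_addr < 0:
            raise ValueError("logical address outside the LPAR")
        index, offset = divmod(logical_addr, CHUNK_SIZE)
        if index >= len(self.chunks):
            raise ValueError("logical address outside the LPAR")
        return self.chunks[index] + offset


class Hypervisor:
    def __init__(self):
        self.lpars = {}        # lpar_id -> LogicalAddressSpace
        self.page_table = {}   # (lpar_id, virtual page) -> physical page

    def map_page(self, lpar_id, virtual_page, logical_page):
        """Resolve (LPAR id, logical page) and install the translation."""
        space = self.lpars[lpar_id]
        physical = space.logical_to_physical(logical_page * PAGE_SIZE)
        physical_page = physical // PAGE_SIZE
        self.page_table[(lpar_id, virtual_page)] = physical_page
        return physical_page
```

In this model the per-LPAR metadata is one entry per chunk rather than one per page, which is the reduction in bookkeeping that large chunks are meant to provide.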
The logical address space also has a RAS
benefit: faulty memory banks can be identified and
easily vacated so that the memory can be removed and/or
replaced without the software running on the LPAR having any
knowledge of it.
PowerPC Memory
The PowerPC memory model is ...
... The translation from effective to
virtual remains the domain of the LPAR;
however, the translation from the virtual to
the physical is controlled by the &HV;, and therefore the
hashed page table (HTAB) must be abstracted from the LPAR.
In order for the LPAR to track the
HTAB entries, the &HV; and the LPAR
must both agree on the geometry. The log (base 2) of the number of bytes in
the HTAB is the second word of the
ibm,pft_size property of each
cpu node.
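As a worked example of deriving the geometry from that property, consider the sketch below. It assumes the logarithm is base 2 and that the table uses the 64-bit PowerPC layout of 16-byte entries grouped eight to a page table entry group (PTEG); the function name and the returned dictionary are hypothetical.

```python
# Sketch of deriving HTAB geometry from the two-word property value.
# The entry and group sizes are assumptions (64-bit PowerPC layout).
PTE_BYTES = 16       # assumed size of one hashed page table entry
PTES_PER_PTEG = 8    # assumed entries per page table entry group


def htab_geometry(pft_size):
    """pft_size: the two words of the property; the second is log2(bytes)."""
    log_bytes = pft_size[1]
    htab_bytes = 1 << log_bytes
    pteg_bytes = PTE_BYTES * PTES_PER_PTEG
    num_ptegs = htab_bytes // pteg_bytes
    return {
        "bytes": htab_bytes,
        "ptegs": num_ptegs,
        "ptes": num_ptegs * PTES_PER_PTEG,
    }
```

For instance, a second word of 24 would describe a 16 MiB table, which under these assumptions holds 131072 groups.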
Processors
Processors are usually time shared by the &HV;. Each instance
of a processor in an LPAR is a logical processor, so
there can be more logical processors than
there are physical processors. There is also the ability to
take a processor thread, which may not have full processor
semantics, and represent it as a full logical processor.
Synergistic cores (SPEs), like those found
in the Cell architecture, can be arbitrarily assigned and
isolated to different LPARs. This allows the workload
assigned to an SPE to continue independent
of which LPAR is currently executing on the host processor.
Interrupt Controller
As different interrupt sources (such as processors and
devices) are assigned to different LPARs, it is important to
be able to reflect the instantiated interrupt to the
appropriate LPAR. This is done by virtualizing the external
interrupt controllers (XIRR).
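The routing step can be sketched as follows. This is a deliberately simplified model: a real XIRR also carries priority information, which is omitted here, and the class and method names are hypothetical.

```python
# Toy model of reflecting interrupts to the owning LPAR. Priorities are
# omitted and all names are hypothetical.
from collections import defaultdict, deque


class VirtualInterruptController:
    def __init__(self):
        self.owner = {}                     # source id -> LPAR id
        self.pending = defaultdict(deque)   # LPAR id -> queued source ids

    def assign(self, source, lpar_id):
        """Record which LPAR owns an interrupt source."""
        self.owner[source] = lpar_id

    def raise_interrupt(self, source):
        """Queue the interrupt for its owner; return the target LPAR id."""
        lpar_id = self.owner.get(source)
        if lpar_id is None:
            return None   # unassigned source: not reflected to any LPAR
        self.pending[lpar_id].append(source)
        return lpar_id

    def accept(self, lpar_id):
        """What an LPAR would see when reading its virtual XIRR."""
        queue = self.pending[lpar_id]
        return queue.popleft() if queue else None
```

Each LPAR only ever observes sources that were assigned to it, so an interrupt from another partition's device is never visible.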
Physical Devices
Direct access to a physical device
from an LPAR is simply a matter of arranging for the device to be
mappable by that LPAR. Once that mapping and the necessary
interrupt routing are established, the native driver is capable
of controlling the device naturally.
This access is usually done from a trusted
LPAR that is known not to program the device in a
malicious manner.
One could imagine that such an LPAR could program a
device to DMA to arbitrary parts of physical memory, thereby
effectively destroying other LPARs or the &HV; itself.
However, some machines are equipped with an I/O translation
mechanism (&IOMMU;) that is capable of
instantiating a unique I/O address space that can be
partitioned and/or tagged in such a way that each device can
access only a specific set of physical address ranges.
&IOMMU;s
&IOMMU;s come in several forms: some work at the
slot level, some work at the bus level, and
some self-virtualizing devices have I/O MMUs on
the adapter. Each presents isolation at a different
granularity.
For every &IOMMU; on the system the &HV; instantiates a unique
I/O address space. For convenience this address space starts
at address 0 and can be of arbitrary size (usually selected
for the performance characteristics of the machine).
The I/O address space is then partitioned such that a contiguous
range of the I/O address space is assigned to every uniquely
identifiable device; this is usually some slot identifier on
that bus. For example, there is an I/O bus available for the
Cell processor architecture that contains several adapters
directly attached to the bus. Each adapter has its own unique
identifier, and thus can be partitioned and assigned to
individual LPARs.
Some buses (usually called a south bridge
or host bridge) are proprietary buses that
contain adapters to industry standard buses like
PCI or PCI-X. It may be the case that
the standard bus does not uniquely identify each
slot on that bus and therefore that
standard bus is not partitionable.
Each I/O address space partition is described by the
ibm,my-dma-window property in the node of
the device. The LPAR is able to use addresses in the
window to program DMA transactions with the
device and use the
H_TCE_* &HV;
calls to create translations from the window address to the
LPAR logical address.
Programming a device to DMA beyond its DMA window will
usually result in the device being taken offline and reused
by the system.
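The partitioning and window check can be sketched as a toy model. The I/O page size, equal-sized windows, and every class, method, and function name are assumptions for illustration; `put_tce` is only a rough analogue of the H_TCE_* calls, not their real interface.

```python
# Toy model of per-device DMA windows and their translation entries.
# Sizes and names are assumptions; windows are assumed page aligned.
IO_PAGE_SIZE = 4096   # assumed I/O page size


class DmaWindow:
    def __init__(self, base, size):
        self.base = base
        self.size = size
        self.tce = {}   # I/O page number -> LPAR logical page number

    def contains(self, io_addr, length=1):
        return self.base <= io_addr and io_addr + length <= self.base + self.size

    def put_tce(self, io_addr, logical_addr):
        """Install a translation for the I/O page containing io_addr."""
        page_addr = io_addr - io_addr % IO_PAGE_SIZE
        if not self.contains(page_addr, IO_PAGE_SIZE):
            return False
        self.tce[page_addr // IO_PAGE_SIZE] = logical_addr // IO_PAGE_SIZE
        return True

    def translate(self, io_addr):
        """Resolve a device DMA address to an LPAR logical address."""
        if not self.contains(io_addr):
            raise PermissionError("DMA outside the device's window")
        page = self.tce.get(io_addr // IO_PAGE_SIZE)
        if page is None:
            raise LookupError("no translation installed for this I/O page")
        return page * IO_PAGE_SIZE + io_addr % IO_PAGE_SIZE


def partition_io_space(device_ids, window_size):
    """Give each device a contiguous window, starting at I/O address 0."""
    return {
        dev: DmaWindow(index * window_size, window_size)
        for index, dev in enumerate(device_ids)
    }
```

An access outside the window is rejected before any translation is consulted, which models the isolation between devices owned by different LPARs.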
Logical Devices
Logical devices are introduced to allow single devices to be
shared by multiple LPARs, but an additional benefit is the
drastic reduction of the number of devices each LPAR must
support.
The virtualization of devices is not specifically a function
of the &HV; but rather an inter-partition
service that the &HV; enables.
Console
By virtualizing the console (VTTYs), there is no need for
low level drivers to directly access a Graphic Console or a
Serial UART on the machine. This enables early boot and low
level debugging/panic services to be drastically simplified.
Network
By virtualizing the network (VETH/ILLAN), LPARs can benefit from a
shared memory transport, effectively presenting the
NIC with an arbitrarily sized
MTU. More than one can be selected so
that the most efficient size can be chosen per payload. For
network communication that is destined out of the box, the
LPAR actually operating the physical
NIC can use a packet bridge,
IP forwarding, or Network Address
Translation technology to forward the communication to the
outside world.
Block Devices
Block devices, such as disks, CD-ROMs, DVDs, or even disk
images, can be hosted and virtualized by the LPARs that
have access to them. Then, by using the VSCSI protocol,
other LPARs can have direct and shared access to them.
Performance Characteristics
The Punt
Unfortunately, detailed performance results were beyond the scope
of this project (due to funding restrictions). However, early
performance characteristics using unoptimized code paths have
shown less than 4% overhead in CPU bound performance in order
to accommodate a Hypervisor and a single controlling/IO hosting
Linux. On I/O bound workloads, preliminary results show that
virtualizing I/O produces a 20% loss from the adapter's
theoretical bandwidth for disk and/or networking.
In a heterogeneous environment (such as the Cell Processor)
where the workload is self-contained in the
SPE, the performance impact was negligible and
in some cases immeasurable.