|  |  | 
|  | Timekeeping Virtualization for X86-Based Architectures | 
|  |  | 
|  | Zachary Amsden <zamsden@redhat.com> | 
|  | Copyright (c) 2010, Red Hat.  All rights reserved. | 
|  |  | 
|  | 1) Overview | 
|  | 2) Timing Devices | 
|  | 3) TSC Hardware | 
|  | 4) Virtualization Problems | 
|  |  | 
|  | ========================================================================= | 
|  |  | 
|  | 1) Overview | 
|  |  | 
|  | One of the most complicated parts of the X86 platform, and specifically, | 
|  | the virtualization of this platform is the plethora of timing devices available | 
|  | and the complexity of emulating those devices.  In addition, virtualization of | 
|  | time introduces a new set of challenges because it introduces a multiplexed | 
|  | division of time beyond the control of the guest CPU. | 
|  |  | 
|  | First, we will describe the various timekeeping hardware available, then | 
|  | present some of the problems which arise and solutions available, giving | 
|  | specific recommendations for certain classes of KVM guests. | 
|  |  | 
|  | The purpose of this document is to collect data and information relevant to | 
|  | timekeeping which may be difficult to find elsewhere, specifically, | 
|  | information relevant to KVM and hardware-based virtualization. | 
|  |  | 
|  | ========================================================================= | 
|  |  | 
|  | 2) Timing Devices | 
|  |  | 
|  | First we discuss the basic hardware devices available.  TSC and the related | 
|  | KVM clock are special enough to warrant a full exposition and are described in | 
|  | the following section. | 
|  |  | 
|  | 2.1) i8254 - PIT | 
|  |  | 
|  | One of the first timer devices available is the programmable interrupt timer, | 
|  | or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three | 
|  | channels which can be programmed to deliver periodic or one-shot interrupts. | 
|  | These three channels can be configured in different modes and have individual | 
|  | counters.  Channel 1 and 2 were not available for general use in the original | 
|  | IBM PC, and historically were connected to control RAM refresh and the PC | 
|  | speaker.  Now the PIT is typically integrated as part of an emulated chipset | 
|  | and a separate physical PIT is not used. | 
|  |  | 
|  | The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done | 
|  | using single or multiple byte access to the I/O ports.  There are 6 modes | 
|  | available, but not all modes are available to all timers, as only timer 2 | 
|  | has a connected gate input, required for modes 1 and 5.  The gate line is | 
|  | controlled by port 61h, bit 0, as illustrated in the following diagram. | 
|  |  | 
|  | --------------             ---------------- | 
|  | |              |           |                | | 
|  | |  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0 | 
|  | |    Clock     |   |       |                | | 
|  | --------------    |    +->| GATE  TIMER 0  | | 
|  | |        ---------------- | 
|  | | | 
|  | |        ---------------- | 
|  | |       |                | | 
|  | |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM | 
|  | |       |                |            (aka /dev/null) | 
|  | |    +->| GATE  TIMER 1  | | 
|  | |        ---------------- | 
|  | | | 
|  | |        ---------------- | 
|  | |       |                | | 
|  | |------>| CLOCK      OUT | ---------> Port 61h, bit 5 | 
|  | |                |      | | 
|  | Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----   ____ | 
|  | ----------------         _|    )--|LPF|---Speaker | 
|  | / *----   \___/ | 
|  | Port 61h, bit 1 -----------------------------------/ | 
|  |  | 
|  | The timer modes are now described. | 
|  |  | 
|  | Mode 0: Single Timeout.   This is a one-shot software timeout that counts down | 
|  | when the gate is high (always true for timers 0 and 1).  When the count | 
|  | reaches zero, the output goes high. | 
|  |  | 
|  | Mode 1: Triggered One-shot.  The output is initially set high.  When the gate | 
|  | line is set high, a countdown is initiated (which does not stop if the gate is | 
|  | lowered), during which the output is set low.  When the count reaches zero, | 
|  | the output goes high. | 
|  |  | 
|  | Mode 2: Rate Generator.  The output is initially set high.  When the countdown | 
|  | reaches 1, the output goes low for one count and then returns high.  The value | 
|  | is reloaded and the countdown automatically resumes.  If the gate line goes | 
|  | low, the count is halted.  If the output is low when the gate is lowered, the | 
|  | output automatically goes high (this only affects timer 2). | 
|  |  | 
|  | Mode 3: Square Wave.   This generates a high / low square wave.  The count | 
|  | determines the length of the pulse, which alternates between high and low | 
|  | when zero is reached.  The count only proceeds when gate is high and is | 
|  | automatically reloaded on reaching zero.  The count is decremented twice at | 
|  | each clock to generate a full high / low cycle at the full periodic rate. | 
|  | If the count is even, the clock remains high for N/2 counts and low for N/2 | 
|  | counts; if the clock is odd, the clock is high for (N+1)/2 counts and low | 
|  | for (N-1)/2 counts.  Only even values are latched by the counter, so odd | 
|  | values are not observed when reading.  This is the intended mode for timer 2, | 
|  | which generates sine-like tones by low-pass filtering the square wave output. | 
|  |  | 
|  | Mode 4: Software Strobe.  After programming this mode and loading the counter, | 
|  | the output remains high until the counter reaches zero.  Then the output | 
|  | goes low for 1 clock cycle and returns high.  The counter is not reloaded. | 
|  | Counting only occurs when gate is high. | 
|  |  | 
|  | Mode 5: Hardware Strobe.  After programming and loading the counter, the | 
|  | output remains high.  When the gate is raised, a countdown is initiated | 
|  | (which does not stop if the gate is lowered).  When the counter reaches zero, | 
|  | the output goes low for 1 clock cycle and then returns high.  The counter is | 
|  | not reloaded. | 
|  |  | 
|  | In addition to normal binary counting, the PIT supports BCD counting.  The | 
|  | command port, 0x43 is used to set the counter and mode for each of the three | 
|  | timers. | 
|  |  | 
|  | PIT commands, issued to port 0x43, using the following bit encoding: | 
|  |  | 
|  | Bit 7-4: Command (See table below) | 
|  | Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined) | 
|  | Bit 0  : Binary (0) / BCD (1) | 
|  |  | 
|  | Command table: | 
|  |  | 
|  | 0000 - Latch Timer 0 count for port 0x40 | 
|  | sample and hold the count to be read in port 0x40; | 
|  | additional commands ignored until counter is read; | 
|  | mode bits ignored. | 
|  |  | 
|  | 0001 - Set Timer 0 LSB mode for port 0x40 | 
|  | set timer to read LSB only and force MSB to zero; | 
|  | mode bits set timer mode | 
|  |  | 
|  | 0010 - Set Timer 0 MSB mode for port 0x40 | 
|  | set timer to read MSB only and force LSB to zero; | 
|  | mode bits set timer mode | 
|  |  | 
|  | 0011 - Set Timer 0 16-bit mode for port 0x40 | 
|  | set timer to read / write LSB first, then MSB; | 
|  | mode bits set timer mode | 
|  |  | 
|  | 0100 - Latch Timer 1 count for port 0x41 - as described above | 
|  | 0101 - Set Timer 1 LSB mode for port 0x41 - as described above | 
|  | 0110 - Set Timer 1 MSB mode for port 0x41 - as described above | 
|  | 0111 - Set Timer 1 16-bit mode for port 0x41 - as described above | 
|  |  | 
|  | 1000 - Latch Timer 2 count for port 0x42 - as described above | 
|  | 1001 - Set Timer 2 LSB mode for port 0x42 - as described above | 
|  | 1010 - Set Timer 2 MSB mode for port 0x42 - as described above | 
|  | 1011 - Set Timer 2 16-bit mode for port 0x42 as described above | 
|  |  | 
|  | 1101 - General counter latch | 
|  | Latch combination of counters into corresponding ports | 
|  | Bit 3 = Counter 2 | 
|  | Bit 2 = Counter 1 | 
|  | Bit 1 = Counter 0 | 
|  | Bit 0 = Unused | 
|  |  | 
|  | 1110 - Latch timer status | 
|  | Latch combination of counter mode into corresponding ports | 
|  | Bit 3 = Counter 2 | 
|  | Bit 2 = Counter 1 | 
|  | Bit 1 = Counter 0 | 
|  |  | 
|  | The output of ports 0x40-0x42 following this command will be: | 
|  |  | 
|  | Bit 7 = Output pin | 
|  | Bit 6 = Count loaded (0 if timer has expired) | 
|  | Bit 5-4 = Read / Write mode | 
|  | 01 = MSB only | 
|  | 10 = LSB only | 
|  | 11 = LSB / MSB (16-bit) | 
|  | Bit 3-1 = Mode | 
|  | Bit 0 = Binary (0) / BCD mode (1) | 
|  |  | 
|  | 2.2) RTC | 
|  |  | 
|  | The second device which was available in the original PC was the MC146818 real | 
|  | time clock.  The original device is now obsolete, and usually emulated by the | 
|  | system chipset, sometimes by an HPET and some frankenstein IRQ routing. | 
|  |  | 
|  | The RTC is accessed through CMOS variables, which uses an index register to | 
|  | control which bytes are read.  Since there is only one index register, read | 
|  | of the CMOS and read of the RTC require lock protection (in addition, it is | 
|  | dangerous to allow userspace utilities such as hwclock to have direct RTC | 
|  | access, as they could corrupt kernel reads and writes of CMOS memory). | 
|  |  | 
|  | The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt | 
|  | can function as a periodic timer, an additional once a day alarm, and can issue | 
|  | interrupts after an update of the CMOS registers by the MC146818 is complete. | 
|  | The type of interrupt is signalled in the RTC status registers. | 
|  |  | 
|  | The RTC will update the current time fields by battery power even while the | 
|  | system is off.  The current time fields should not be read while an update is | 
|  | in progress, as indicated in the status register. | 
|  |  | 
|  | The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be | 
|  | programmed to a 32kHz divider if the RTC is to count seconds. | 
|  |  | 
|  | This is the RAM map originally used for the RTC/CMOS: | 
|  |  | 
|  | Location    Size    Description | 
|  | ------------------------------------------ | 
|  | 00h         byte    Current second (BCD) | 
|  | 01h         byte    Seconds alarm (BCD) | 
|  | 02h         byte    Current minute (BCD) | 
|  | 03h         byte    Minutes alarm (BCD) | 
|  | 04h         byte    Current hour (BCD) | 
|  | 05h         byte    Hours alarm (BCD) | 
|  | 06h         byte    Current day of week (BCD) | 
|  | 07h         byte    Current day of month (BCD) | 
|  | 08h         byte    Current month (BCD) | 
|  | 09h         byte    Current year (BCD) | 
|  | 0Ah         byte    Register A | 
|  | bit 7   = Update in progress | 
|  | bit 6-4 = Divider for clock | 
|  | 000 = 4.194 MHz | 
|  | 001 = 1.049 MHz | 
|  | 010 = 32 kHz | 
|  | 10X = test modes | 
|  | 110 = reset / disable | 
|  | 111 = reset / disable | 
|  | bit 3-0 = Rate selection for periodic interrupt | 
|  | 000 = periodic timer disabled | 
|  | 001 = 3.90625 uS | 
|  | 010 = 7.8125 uS | 
|  | 011 = .122070 mS | 
|  | 100 = .244141 mS | 
|  | ... | 
|  | 1101 = 125 mS | 
|  | 1110 = 250 mS | 
|  | 1111 = 500 mS | 
|  | 0Bh         byte    Register B | 
|  | bit 7   = Run (0) / Halt (1) | 
|  | bit 6   = Periodic interrupt enable | 
|  | bit 5   = Alarm interrupt enable | 
|  | bit 4   = Update-ended interrupt enable | 
|  | bit 3   = Square wave interrupt enable | 
|  | bit 2   = BCD calendar (0) / Binary (1) | 
|  | bit 1   = 12-hour mode (0) / 24-hour mode (1) | 
|  | bit 0   = 0 (DST off) / 1 (DST enabled) | 
|  | OCh         byte    Register C (read only) | 
|  | bit 7   = interrupt request flag (IRQF) | 
|  | bit 6   = periodic interrupt flag (PF) | 
|  | bit 5   = alarm interrupt flag (AF) | 
|  | bit 4   = update interrupt flag (UF) | 
|  | bit 3-0 = reserved | 
|  | ODh         byte    Register D (read only) | 
|  | bit 7   = RTC has power | 
|  | bit 6-0 = reserved | 
|  | 32h         byte    Current century BCD (*) | 
|  | (*) location vendor specific and now determined from ACPI global tables | 
|  |  | 
|  | 2.3) APIC | 
|  |  | 
|  | On Pentium and later processors, an on-board timer is available to each CPU | 
|  | as part of the Advanced Programmable Interrupt Controller.  The APIC is | 
|  | accessed through memory-mapped registers and provides interrupt service to each | 
|  | CPU, used for IPIs and local timer interrupts. | 
|  |  | 
|  | Although in theory the APIC is a safe and stable source for local interrupts, | 
|  | in practice, many bugs and glitches have occurred due to the special nature of | 
|  | the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect | 
|  | the use of the APIC and that workarounds may be required.  In addition, some of | 
|  | these workarounds pose unique constraints for virtualization - requiring either | 
|  | extra overhead incurred from extra reads of memory-mapped I/O or additional | 
|  | functionality that may be more computationally expensive to implement. | 
|  |  | 
|  | Since the APIC is documented quite well in the Intel and AMD manuals, we will | 
|  | avoid repetition of the detail here.  It should be pointed out that the APIC | 
|  | timer is programmed through the LVT (local vector timer) register, is capable | 
|  | of one-shot or periodic operation, and is based on the bus clock divided down | 
|  | by the programmable divider register. | 
|  |  | 
|  | 2.4) HPET | 
|  |  | 
|  | HPET is quite complex, and was originally intended to replace the PIT / RTC | 
|  | support of the X86 PC.  It remains to be seen whether that will be the case, as | 
|  | the de facto standard of PC hardware is to emulate these older devices.  Some | 
|  | systems designated as legacy free may support only the HPET as a hardware timer | 
|  | device. | 
|  |  | 
|  | The HPET spec is rather loose and vague, requiring at least 3 hardware timers, | 
|  | but allowing implementation freedom to support many more.  It also imposes no | 
|  | fixed rate on the timer frequency, but does impose some extremal values on | 
|  | frequency, error and slew. | 
|  |  | 
|  | In general, the HPET is recommended as a high precision (compared to PIT /RTC) | 
|  | time source which is independent of local variation (as there is only one HPET | 
|  | in any given system).  The HPET is also memory-mapped, and its presence is | 
|  | indicated through ACPI tables by the BIOS. | 
|  |  | 
|  | Detailed specification of the HPET is beyond the current scope of this | 
|  | document, as it is also very well documented elsewhere. | 
|  |  | 
|  | 2.5) Offboard Timers | 
|  |  | 
|  | Several cards, both proprietary (watchdog boards) and commonplace (e1000) have | 
|  | timing chips built into the cards which may have registers which are accessible | 
|  | to kernel or user drivers.  To the author's knowledge, using these to generate | 
|  | a clocksource for a Linux or other kernel has not yet been attempted and is in | 
|  | general frowned upon as not playing by the agreed rules of the game.  Such a | 
|  | timer device would require additional support to be virtualized properly and is | 
|  | not considered important at this time as no known operating system does this. | 
|  |  | 
|  | ========================================================================= | 
|  |  | 
|  | 3) TSC Hardware | 
|  |  | 
|  | The TSC or time stamp counter is relatively simple in theory; it counts | 
|  | instruction cycles issued by the processor, which can be used as a measure of | 
|  | time.  In practice, due to a number of problems, it is the most complicated | 
|  | timekeeping device to use. | 
|  |  | 
|  | The TSC is represented internally as a 64-bit MSR which can be read with the | 
|  | RDMSR, RDTSC, or RDTSCP (when available) instructions.  In the past, hardware | 
|  | limitations made it possible to write the TSC, but generally on old hardware it | 
|  | was only possible to write the low 32-bits of the 64-bit counter, and the upper | 
|  | 32-bits of the counter were cleared.  Now, however, on Intel processors family | 
|  | 0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction | 
|  | has been lifted and all 64-bits are writable.  On AMD systems, the ability to | 
|  | write the TSC MSR is not an architectural guarantee. | 
|  |  | 
|  | The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by | 
|  | means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access. | 
|  |  | 
|  | Some vendors have implemented an additional instruction, RDTSCP, which returns | 
|  | atomically not just the TSC, but an indicator which corresponds to the | 
|  | processor number.  This can be used to index into an array of TSC variables to | 
|  | determine offset information in SMP systems where TSCs are not synchronized. | 
|  | The presence of this instruction must be determined by consulting CPUID feature | 
|  | bits. | 
|  |  | 
|  | Both VMX and SVM provide extension fields in the virtualization hardware which | 
|  | allows the guest visible TSC to be offset by a constant.  Newer implementations | 
|  | promise to allow the TSC to additionally be scaled, but this hardware is not | 
|  | yet widely available. | 
|  |  | 
|  | 3.1) TSC synchronization | 
|  |  | 
|  | The TSC is a CPU-local clock in most implementations.  This means, on SMP | 
|  | platforms, the TSCs of different CPUs may start at different times depending | 
|  | on when the CPUs are powered on.  Generally, CPUs on the same die will share | 
|  | the same clock, however, this is not always the case. | 
|  |  | 
|  | The BIOS may attempt to resynchronize the TSCs during the poweron process and | 
|  | the operating system or other system software may attempt to do this as well. | 
|  | Several hardware limitations make the problem worse - if it is not possible to | 
|  | write the full 64-bits of the TSC, it may be impossible to match the TSC in | 
|  | newly arriving CPUs to that of the rest of the system, resulting in | 
|  | unsynchronized TSCs.  This may be done by BIOS or system software, but in | 
|  | practice, getting a perfectly synchronized TSC will not be possible unless all | 
|  | values are read from the same clock, which generally only is possible on single | 
|  | socket systems or those with special hardware support. | 
|  |  | 
|  | 3.2) TSC and CPU hotplug | 
|  |  | 
|  | As touched on already, CPUs which arrive later than the boot time of the system | 
|  | may not have a TSC value that is synchronized with the rest of the system. | 
|  | Either system software, BIOS, or SMM code may actually try to establish the TSC | 
|  | to a value matching the rest of the system, but a perfect match is usually not | 
|  | a guarantee.  This can have the effect of bringing a system from a state where | 
|  | TSC is synchronized back to a state where TSC synchronization flaws, however | 
|  | small, may be exposed to the OS and any virtualization environment. | 
|  |  | 
|  | 3.3) TSC and multi-socket / NUMA | 
|  |  | 
|  | Multi-socket systems, especially large multi-socket systems are likely to have | 
|  | individual clocksources rather than a single, universally distributed clock. | 
|  | Since these clocks are driven by different crystals, they will not have | 
|  | perfectly matched frequency, and temperature and electrical variations will | 
|  | cause the CPU clocks, and thus the TSCs to drift over time.  Depending on the | 
|  | exact clock and bus design, the drift may or may not be fixed in absolute | 
|  | error, and may accumulate over time. | 
|  |  | 
|  | In addition, very large systems may deliberately slew the clocks of individual | 
|  | cores.  This technique, known as spread-spectrum clocking, reduces EMI at the | 
|  | clock frequency and harmonics of it, which may be required to pass FCC | 
|  | standards for telecommunications and computer equipment. | 
|  |  | 
|  | It is recommended not to trust the TSCs to remain synchronized on NUMA or | 
|  | multiple socket systems for these reasons. | 
|  |  | 
|  | 3.4) TSC and C-states | 
|  |  | 
|  | C-states, or idling states of the processor, especially C1E and deeper sleep | 
|  | states may be problematic for TSC as well.  The TSC may stop advancing in such | 
|  | a state, resulting in a TSC which is behind that of other CPUs when execution | 
|  | is resumed.  Such CPUs must be detected and flagged by the operating system | 
|  | based on CPU and chipset identifications. | 
|  |  | 
|  | The TSC in such a case may be corrected by catching it up to a known external | 
|  | clocksource. | 
|  |  | 
|  | 3.5) TSC frequency change / P-states | 
|  |  | 
|  | To make things slightly more interesting, some CPUs may change frequency.  They | 
|  | may or may not run the TSC at the same rate, and because the frequency change | 
|  | may be staggered or slewed, at some points in time, the TSC rate may not be | 
|  | known other than falling within a range of values.  In this case, the TSC will | 
|  | not be a stable time source, and must be calibrated against a known, stable, | 
|  | external clock to be a usable source of time. | 
|  |  | 
|  | Whether the TSC runs at a constant rate or scales with the P-state is model | 
|  | dependent and must be determined by inspecting CPUID, chipset or vendor | 
|  | specific MSR fields. | 
|  |  | 
|  | In addition, some vendors have known bugs where the P-state is actually | 
|  | compensated for properly during normal operation, but when the processor is | 
|  | inactive, the P-state may be raised temporarily to service cache misses from | 
|  | other processors.  In such cases, the TSC on halted CPUs could advance faster | 
|  | than that of non-halted processors.  AMD Turion processors are known to have | 
|  | this problem. | 
|  |  | 
|  | 3.6) TSC and STPCLK / T-states | 
|  |  | 
|  | External signals given to the processor may also have the effect of stopping | 
|  | the TSC.  This is typically done for thermal emergency power control to prevent | 
|  | an overheating condition, and typically, there is no way to detect that this | 
|  | condition has happened. | 
|  |  | 
|  | 3.7) TSC virtualization - VMX | 
|  |  | 
|  | VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP | 
|  | instructions, which is enough for full virtualization of TSC in any manner.  In | 
|  | addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET | 
|  | field specified in the VMCS.  Special instructions must be used to read and | 
|  | write the VMCS field. | 
|  |  | 
|  | 3.8) TSC virtualization - SVM | 
|  |  | 
|  | SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP | 
|  | instructions, which is enough for full virtualization of TSC in any manner.  In | 
|  | addition, SVM allows passing through the host TSC plus an additional offset | 
|  | field specified in the SVM control block. | 
|  |  | 
|  | 3.9) TSC feature bits in Linux | 
|  |  | 
|  | In summary, there is no way to guarantee the TSC remains in perfect | 
|  | synchronization unless it is explicitly guaranteed by the architecture.  Even | 
|  | if so, the TSCs in multi-sockets or NUMA systems may still run independently | 
|  | despite being locally consistent. | 
|  |  | 
|  | The following feature bits are used by Linux to signal various TSC attributes, | 
|  | but they can only be taken to be meaningful for UP or single node systems. | 
|  |  | 
|  | X86_FEATURE_TSC 		: The TSC is available in hardware | 
|  | X86_FEATURE_RDTSCP		: The RDTSCP instruction is available | 
|  | X86_FEATURE_CONSTANT_TSC 	: The TSC rate is unchanged with P-states | 
|  | X86_FEATURE_NONSTOP_TSC		: The TSC does not stop in C-states | 
|  | X86_FEATURE_TSC_RELIABLE	: TSC sync checks are skipped (VMware) | 
|  |  | 
|  | 4) Virtualization Problems | 
|  |  | 
|  | Timekeeping is especially problematic for virtualization because a number of | 
|  | challenges arise.  The most obvious problem is that time is now shared between | 
|  | the host and, potentially, a number of virtual machines.  Thus the virtual | 
|  | operating system does not run with 100% usage of the CPU, despite the fact that | 
|  | it may very well make that assumption.  It may expect it to remain true to very | 
|  | exacting bounds when interrupt sources are disabled, but in reality only its | 
|  | virtual interrupt sources are disabled, and the machine may still be preempted | 
|  | at any time.  This causes problems as the passage of real time, the injection | 
|  | of machine interrupts and the associated clock sources are no longer completely | 
|  | synchronized with real time. | 
|  |  | 
|  | This same problem can occur on native harware to a degree, as SMM mode may | 
|  | steal cycles from the naturally on X86 systems when SMM mode is used by the | 
|  | BIOS, but not in such an extreme fashion.  However, the fact that SMM mode may | 
|  | cause similar problems to virtualization makes it a good justification for | 
|  | solving many of these problems on bare metal. | 
|  |  | 
|  | 4.1) Interrupt clocking | 
|  |  | 
|  | One of the most immediate problems that occurs with legacy operating systems | 
|  | is that the system timekeeping routines are often designed to keep track of | 
|  | time by counting periodic interrupts.  These interrupts may come from the PIT | 
|  | or the RTC, but the problem is the same: the host virtualization engine may not | 
|  | be able to deliver the proper number of interrupts per second, and so guest | 
|  | time may fall behind.  This is especially problematic if a high interrupt rate | 
|  | is selected, such as 1000 HZ, which is unfortunately the default for many Linux | 
|  | guests. | 
|  |  | 
|  | There are three approaches to solving this problem; first, it may be possible | 
|  | to simply ignore it.  Guests which have a separate time source for tracking | 
|  | 'wall clock' or 'real time' may not need any adjustment of their interrupts to | 
|  | maintain proper time.  If this is not sufficient, it may be necessary to inject | 
|  | additional interrupts into the guest in order to increase the effective | 
|  | interrupt rate.  This approach leads to complications in extreme conditions, | 
|  | where host load or guest lag is too much to compensate for, and thus another | 
|  | solution to the problem has risen: the guest may need to become aware of lost | 
|  | ticks and compensate for them internally.  Although promising in theory, the | 
|  | implementation of this policy in Linux has been extremely error prone, and a | 
|  | number of buggy variants of lost tick compensation are distributed across | 
|  | commonly used Linux systems. | 
|  |  | 
|  | Windows uses periodic RTC clocking as a means of keeping time internally, and | 
|  | thus requires interrupt slewing to keep proper time.  It does use a low enough | 
|  | rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in | 
|  | practice. | 
|  |  | 
|  | 4.2) TSC sampling and serialization | 
|  |  | 
|  | As the highest precision time source available, the cycle counter of the CPU | 
|  | has aroused much interest from developers.  As explained above, this timer has | 
|  | many problems unique to its nature as a local, potentially unstable and | 
|  | potentially unsynchronized source.  One issue which is not unique to the TSC, | 
|  | but is highlighted because of its very precise nature is sampling delay.  By | 
|  | definition, the counter, once read is already old.  However, it is also | 
|  | possible for the counter to be read ahead of the actual use of the result. | 
|  | This is a consequence of the superscalar execution of the instruction stream, | 
|  | which may execute instructions out of order.  Such execution is called | 
|  | non-serialized.  Forcing serialized execution is necessary for precise | 
|  | measurement with the TSC, and requires a serializing instruction, such as CPUID | 
|  | or an MSR read. | 
|  |  | 
|  | Since CPUID may actually be virtualized by a trap and emulate mechanism, this | 
|  | serialization can pose a performance issue for hardware virtualization.  An | 
|  | accurate time stamp counter reading may therefore not always be available, and | 
|  | it may be necessary for an implementation to guard against "backwards" reads of | 
|  | the TSC as seen from other CPUs, even in an otherwise perfectly synchronized | 
|  | system. | 
|  |  | 
|  | 4.3) Timespec aliasing | 
|  |  | 
|  | Additionally, this lack of serialization from the TSC poses another challenge | 
|  | when using results of the TSC when measured against another time source.  As | 
|  | the TSC is much higher precision, many possible values of the TSC may be read | 
|  | while another clock is still expressing the same value. | 
|  |  | 
|  | That is, you may read (T,T+10) while external clock C maintains the same value. | 
|  | Due to non-serialized reads, you may actually end up with a range which | 
|  | fluctuates - from (T-1.. T+10).  Thus, any time calculated from a TSC, but | 
|  | calibrated against an external value may have a range of valid values. | 
|  | Re-calibrating this computation may actually cause time, as computed after the | 
|  | calibration, to go backwards, compared with time computed before the | 
|  | calibration. | 
|  |  | 
|  | This problem is particularly pronounced with an internal time source in Linux, | 
|  | the kernel time, which is expressed in the theoretically high resolution | 
|  | timespec - but which advances in much larger granularity intervals, sometimes | 
|  | at the rate of jiffies, and possibly in catchup modes, at a much larger step. | 
|  |  | 
|  | This aliasing requires care in the computation and recalibration of kvmclock | 
|  | and any other values derived from TSC computation (such as TSC virtualization | 
|  | itself). | 
|  |  | 
|  | 4.4) Migration | 
|  |  | 
|  | Migration of a virtual machine raises problems for timekeeping in two ways. | 
|  | First, the migration itself may take time, during which interrupts cannot be | 
|  | delivered, and after which, the guest time may need to be caught up.  NTP may | 
|  | be able to help to some degree here, as the clock correction required is | 
|  | typically small enough to fall in the NTP-correctable window. | 
|  |  | 
|  | An additional concern is that timers based off the TSC (or HPET, if the raw bus | 
|  | clock is exposed) may now be running at different rates, requiring compensation | 
|  | in some way in the hypervisor by virtualizing these timers.  In addition, | 
|  | migrating to a faster machine may preclude the use of a passthrough TSC, as a | 
|  | faster clock cannot be made visible to a guest without the potential of time | 
|  | advancing faster than usual.  A slower clock is less of a problem, as it can | 
|  | always be caught up to the original rate.  KVM clock avoids these problems by | 
|  | simply storing multipliers and offsets against the TSC for the guest to convert | 
|  | back into nanosecond resolution values. | 
|  |  | 
|  | 4.5) Scheduling | 
|  |  | 
|  | Since scheduling may be based on precise timing and firing of interrupts, the | 
|  | scheduling algorithms of an operating system may be adversely affected by | 
|  | virtualization.  In theory, the effect is random and should be universally | 
|  | distributed, but in contrived as well as real scenarios (guest device access, | 
|  | causes of virtualization exits, possible context switch), this may not always | 
|  | be the case.  The effect of this has not been well studied. | 
|  |  | 
|  | In an attempt to work around this, several implementations have provided a | 
|  | paravirtualized scheduler clock, which reveals the true amount of CPU time for | 
|  | which a virtual machine has been running. | 
|  |  | 
|  | 4.6) Watchdogs | 
|  |  | 
|  | Watchdog timers, such as the lock detector in Linux may fire accidentally when | 
|  | running under hardware virtualization due to timer interrupts being delayed or | 
|  | misinterpretation of the passage of real time.  Usually, these warnings are | 
|  | spurious and can be ignored, but in some circumstances it may be necessary to | 
|  | disable such detection. | 
|  |  | 
|  | 4.7) Delays and precision timing | 
|  |  | 
|  | Precise timing and delays may not be possible in a virtualized system.  This | 
|  | can happen if the system is controlling physical hardware, or issues delays to | 
|  | compensate for slower I/O to and from devices.  The first issue is not solvable | 
|  | in general for a virtualized system; hardware control software can't be | 
|  | adequately virtualized without a full real-time operating system, which would | 
|  | require an RT aware virtualization platform. | 
|  |  | 
|  | The second issue may cause performance problems, but this is unlikely to be a | 
|  | significant issue.  In many cases these delays may be eliminated through | 
|  | configuration or paravirtualization. | 
|  |  | 
|  | 4.8) Covert channels and leaks | 
|  |  | 
|  | In addition to the above problems, time information will inevitably leak to the | 
|  | guest about the host in anything but a perfect implementation of virtualized | 
|  | time.  This may allow the guest to infer the presence of a hypervisor (as in a | 
|  | red-pill type detection), and it may allow information to leak between guests | 
|  | by using CPU utilization itself as a signalling channel.  Preventing such | 
|  | problems would require completely isolated virtual time which may not track | 
|  | real time any longer.  This may be useful in certain security or QA contexts, | 
|  | but in general isn't recommended for real-world deployment scenarios. |