| Zachary Amsden | f392eb2 | 2010-08-19 22:07:33 -1000 | [diff] [blame] | 1 |  | 
|  | 2 | Timekeeping Virtualization for X86-Based Architectures | 
|  | 3 |  | 
|  | 4 | Zachary Amsden <zamsden@redhat.com> | 
|  | 5 | Copyright (c) 2010, Red Hat.  All rights reserved. | 
|  | 6 |  | 
|  | 7 | 1) Overview | 
|  | 8 | 2) Timing Devices | 
|  | 9 | 3) TSC Hardware | 
|  | 10 | 4) Virtualization Problems | 
|  | 11 |  | 
|  | 12 | ========================================================================= | 
|  | 13 |  | 
|  | 14 | 1) Overview | 
|  | 15 |  | 
|  | 16 | One of the most complicated parts of the X86 platform, and specifically of | 
|  | 17 | its virtualization, is the plethora of timing devices available and the | 
|  | 18 | complexity of emulating those devices.  In addition, virtualization of time | 
|  | 19 | introduces a new set of challenges because it introduces a multiplexed | 
|  | 20 | division of time beyond the control of the guest CPU. | 
|  | 21 |  | 
|  | 22 | First, we will describe the various timekeeping hardware available, then | 
|  | 23 | present some of the problems which arise and solutions available, giving | 
|  | 24 | specific recommendations for certain classes of KVM guests. | 
|  | 25 |  | 
|  | 26 | The purpose of this document is to collect data and information relevant to | 
|  | 27 | timekeeping which may be difficult to find elsewhere, specifically, | 
|  | 28 | information relevant to KVM and hardware-based virtualization. | 
|  | 29 |  | 
|  | 30 | ========================================================================= | 
|  | 31 |  | 
|  | 32 | 2) Timing Devices | 
|  | 33 |  | 
|  | 34 | First we discuss the basic hardware devices available.  TSC and the related | 
|  | 35 | KVM clock are special enough to warrant a full exposition and are described in | 
|  | 36 | the following section. | 
|  | 37 |  | 
|  | 38 | 2.1) i8254 - PIT | 
|  | 39 |  | 
|  | 40 | One of the first timer devices available is the programmable interval timer, | 
|  | 41 | or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three | 
|  | 42 | channels which can be programmed to deliver periodic or one-shot interrupts. | 
|  | 43 | These three channels can be configured in different modes and have individual | 
|  | 44 | counters.  Channel 1 and 2 were not available for general use in the original | 
|  | 45 | IBM PC, and historically were connected to control RAM refresh and the PC | 
|  | 46 | speaker.  Now the PIT is typically integrated as part of an emulated chipset | 
|  | 47 | and a separate physical PIT is not used. | 
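
As a minimal sketch of the arithmetic implied by the fixed base clock, the
following computes the 16-bit reload value for a desired interrupt rate.  The
function names are hypothetical and chosen for illustration; the sketch assumes
binary (not BCD) counting, where a programmed value of 0 is interpreted by the
counter as 65536.

```python
PIT_HZ = 1193182  # fixed base clock of the PIT, as given in the text


def pit_reload_value(target_hz):
    """Return the 16-bit reload value giving roughly target_hz interrupts/s."""
    if target_hz <= 0:
        raise ValueError("target frequency must be positive")
    divisor = round(PIT_HZ / target_hz)
    # The counter is 16 bits wide; clamp to the representable range.
    divisor = max(1, min(divisor, 65536))
    # A divisor of 65536 is programmed as 0 (assumption: binary counting).
    return divisor & 0xFFFF


def pit_actual_hz(reload_value):
    """Frequency actually produced by a given 16-bit reload value."""
    count = reload_value if reload_value != 0 else 65536
    return PIT_HZ / count
```

The slowest rate reachable this way is about 18.2 Hz, which is why requests
below that rate clamp to a reload value of 0.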
|  | 48 |  | 
|  | 49 | The PIT uses I/O ports 0x40 - 0x43.  Access to the 16-bit counters is done | 
|  | 50 | using single or multiple byte access to the I/O ports.  There are 6 modes | 
|  | 51 | available, but not all modes are available to all timers, as only timer 2 | 
|  | 52 | has a connected gate input, required for modes 1 and 5.  The gate line is | 
|  | 53 | controlled by port 61h, bit 0, as illustrated in the following diagram. | 
|  | 54 |  | 
|  | 55 |  --------------             ---------------- | 
|  | 56 | |              |           |                | | 
|  | 57 | |  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0 | 
|  | 58 | |    Clock     |   |       |                | | 
|  | 59 |  --------------    |    +->| GATE  TIMER 0  | | 
|  | 60 |                    |        ---------------- | 
|  | 61 |                    | | 
|  | 62 |                    |        ---------------- | 
|  | 63 |                    |       |                | | 
|  | 64 |                    |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM | 
|  | 65 |                    |       |                |            (aka /dev/null) | 
|  | 66 |                    |    +->| GATE  TIMER 1  | | 
|  | 67 |                    |        ---------------- | 
|  | 68 |                    | | 
|  | 69 |                    |        ---------------- | 
|  | 70 |                    |       |                | | 
|  | 71 |                    |------>| CLOCK      OUT | ---------> Port 61h, bit 5 | 
|  | 72 |                            |                |      | | 
|  | 73 | Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----   ____ | 
|  | 74 |                             ----------------         _|    )--|LPF|---Speaker | 
|  | 75 |                                                     / *----   \___/ | 
|  | 76 | Port 61h, bit 1 -----------------------------------/ | 
|  | 77 |  | 
|  | 78 | The timer modes are now described. | 
|  | 79 |  | 
|  | 80 | Mode 0: Single Timeout.   This is a one-shot software timeout that counts down | 
|  | 81 | when the gate is high (always true for timers 0 and 1).  When the count | 
|  | 82 | reaches zero, the output goes high. | 
|  | 83 |  | 
|  | 84 | Mode 1: Triggered One-shot.  The output is initially set high.  When the gate | 
|  | 85 | line is set high, a countdown is initiated (which does not stop if the gate is | 
|  | 86 | lowered), during which the output is set low.  When the count reaches zero, | 
|  | 87 | the output goes high. | 
|  | 88 |  | 
|  | 89 | Mode 2: Rate Generator.  The output is initially set high.  When the countdown | 
|  | 90 | reaches 1, the output goes low for one count and then returns high.  The value | 
|  | 91 | is reloaded and the countdown automatically resumes.  If the gate line goes | 
|  | 92 | low, the count is halted.  If the output is low when the gate is lowered, the | 
|  | 93 | output automatically goes high (this only affects timer 2). | 
|  | 94 |  | 
|  | 95 | Mode 3: Square Wave.   This generates a high / low square wave.  The count | 
|  | 96 | determines the length of the pulse, which alternates between high and low | 
|  | 97 | when zero is reached.  The count only proceeds when gate is high and is | 
|  | 98 | automatically reloaded on reaching zero.  The count is decremented twice at | 
|  | 99 | each clock to generate a full high / low cycle at the full periodic rate. | 
|  | 100 | If the count is even, the clock remains high for N/2 counts and low for N/2 | 
|  | 101 | counts; if the clock is odd, the clock is high for (N+1)/2 counts and low | 
|  | 102 | for (N-1)/2 counts.  Only even values are latched by the counter, so odd | 
|  | 103 | values are not observed when reading.  This is the intended mode for timer 2, | 
|  | 104 | which generates sine-like tones by low-pass filtering the square wave output. | 
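
The even / odd split described for mode 3 can be sketched as follows.  The
helper names are hypothetical; the 1193182 Hz constant is the PIT base clock
from section 2.1, and the tone helper simply picks the nearest integer count.

```python
PIT_HZ = 1193182  # PIT base clock in Hz


def square_wave_phases(n):
    """Return (high_counts, low_counts) for a mode 3 reload value n."""
    if n < 2:
        raise ValueError("mode 3 needs a count of at least 2")
    high = (n + 1) // 2   # N/2 for even N, (N+1)/2 for odd N
    low = n // 2          # N/2 for even N, (N-1)/2 for odd N
    return high, low


def speaker_reload(tone_hz):
    """Nearest reload value for timer 2 to produce tone_hz."""
    return round(PIT_HZ / tone_hz)
```

In both cases the two phases add up to N, so the full period is N base clock
cycles regardless of parity.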
|  | 105 |  | 
|  | 106 | Mode 4: Software Strobe.  After programming this mode and loading the counter, | 
|  | 107 | the output remains high until the counter reaches zero.  Then the output | 
|  | 108 | goes low for 1 clock cycle and returns high.  The counter is not reloaded. | 
|  | 109 | Counting only occurs when gate is high. | 
|  | 110 |  | 
|  | 111 | Mode 5: Hardware Strobe.  After programming and loading the counter, the | 
|  | 112 | output remains high.  When the gate is raised, a countdown is initiated | 
|  | 113 | (which does not stop if the gate is lowered).  When the counter reaches zero, | 
|  | 114 | the output goes low for 1 clock cycle and then returns high.  The counter is | 
|  | 115 | not reloaded. | 
|  | 116 |  | 
|  | 117 | In addition to normal binary counting, the PIT supports BCD counting.  The | 
|  | 118 | command port, 0x43 is used to set the counter and mode for each of the three | 
|  | 119 | timers. | 
|  | 120 |  | 
|  | 121 | PIT commands, issued to port 0x43, using the following bit encoding: | 
|  | 122 |  | 
|  | 123 | Bit 7-4: Command (See table below) | 
|  | 124 | Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, X10 = Mode 2, X11 = Mode 3) | 
|  | 125 | Bit 0  : Binary (0) / BCD (1) | 
|  | 126 |  | 
|  | 127 | Command table: | 
|  | 128 |  | 
|  | 129 | 0000 - Latch Timer 0 count for port 0x40 | 
|  | 130 | sample and hold the count to be read in port 0x40; | 
|  | 131 | additional commands ignored until counter is read; | 
|  | 132 | mode bits ignored. | 
|  | 133 |  | 
|  | 134 | 0001 - Set Timer 0 LSB mode for port 0x40 | 
|  | 135 | set timer to read LSB only and force MSB to zero; | 
|  | 136 | mode bits set timer mode | 
|  | 137 |  | 
|  | 138 | 0010 - Set Timer 0 MSB mode for port 0x40 | 
|  | 139 | set timer to read MSB only and force LSB to zero; | 
|  | 140 | mode bits set timer mode | 
|  | 141 |  | 
|  | 142 | 0011 - Set Timer 0 16-bit mode for port 0x40 | 
|  | 143 | set timer to read / write LSB first, then MSB; | 
|  | 144 | mode bits set timer mode | 
|  | 145 |  | 
|  | 146 | 0100 - Latch Timer 1 count for port 0x41 - as described above | 
|  | 147 | 0101 - Set Timer 1 LSB mode for port 0x41 - as described above | 
|  | 148 | 0110 - Set Timer 1 MSB mode for port 0x41 - as described above | 
|  | 149 | 0111 - Set Timer 1 16-bit mode for port 0x41 - as described above | 
|  | 150 |  | 
|  | 151 | 1000 - Latch Timer 2 count for port 0x42 - as described above | 
|  | 152 | 1001 - Set Timer 2 LSB mode for port 0x42 - as described above | 
|  | 153 | 1010 - Set Timer 2 MSB mode for port 0x42 - as described above | 
|  | 154 | 1011 - Set Timer 2 16-bit mode for port 0x42 - as described above | 
|  | 155 |  | 
|  | 156 | 1101 - General counter latch | 
|  | 157 | Latch combination of counters into corresponding ports | 
|  | 158 | Bit 3 = Counter 2 | 
|  | 159 | Bit 2 = Counter 1 | 
|  | 160 | Bit 1 = Counter 0 | 
|  | 161 | Bit 0 = Unused | 
|  | 162 |  | 
|  | 163 | 1110 - Latch timer status | 
|  | 164 | Latch combination of counter mode into corresponding ports | 
|  | 165 | Bit 3 = Counter 2 | 
|  | 166 | Bit 2 = Counter 1 | 
|  | 167 | Bit 1 = Counter 0 | 
|  | 168 |  | 
|  | 169 | The output of ports 0x40-0x42 following this command will be: | 
|  | 170 |  | 
|  | 171 | Bit 7 = Output pin | 
|  | 172 | Bit 6 = Count loaded (0 if timer has expired) | 
|  | 173 | Bit 5-4 = Read / Write mode | 
|  | 174 | 01 = LSB only | 
|  | 175 | 10 = MSB only | 
|  | 176 | 11 = LSB / MSB (16-bit) | 
|  | 177 | Bit 3-1 = Mode | 
|  | 178 | Bit 0 = Binary (0) / BCD mode (1) | 
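
The bit layout above can be sketched as a pair of helpers, one building a
command byte for port 0x43 and one decoding a latched status byte.  The names
are hypothetical, and the access field follows the encoding used by the command
table (01 = LSB only, 10 = MSB only, 11 = LSB then MSB).  No port I/O is
performed; this only illustrates the encoding.

```python
ACCESS_LATCH, ACCESS_LSB, ACCESS_MSB, ACCESS_16BIT = 0, 1, 2, 3


def pit_command(timer, access, mode, bcd=False):
    """Build the command byte written to port 0x43."""
    if timer not in (0, 1, 2):
        raise ValueError("timer must be 0, 1 or 2")
    if access not in (0, 1, 2, 3):
        raise ValueError("access must be 0..3")
    if not 0 <= mode <= 5:
        raise ValueError("mode must be 0..5")
    # Bits 7-4: command (timer select + access), bits 3-1: mode, bit 0: BCD.
    return (timer << 6) | (access << 4) | (mode << 1) | int(bool(bcd))


def pit_decode_status(status):
    """Split a latched status byte into its documented fields."""
    return {
        "output": (status >> 7) & 1,      # bit 7: state of the output pin
        "count_flag": (status >> 6) & 1,  # bit 6: count loaded indicator
        "access": (status >> 4) & 3,      # bits 5-4: read / write mode
        "mode": (status >> 1) & 7,        # bits 3-1: timer mode
        "bcd": status & 1,                # bit 0: binary (0) / BCD (1)
    }
```

For example, pit_command(0, ACCESS_16BIT, 2) yields 0x34, the value commonly
used to program timer 0 as a rate generator.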
|  | 179 |  | 
|  | 180 | 2.2) RTC | 
|  | 181 |  | 
|  | 182 | The second device which was available in the original PC was the MC146818 real | 
|  | 183 | time clock.  The original device is now obsolete, and usually emulated by the | 
|  | 184 | system chipset, sometimes by an HPET and some frankenstein IRQ routing. | 
|  | 185 |  | 
|  | 186 | The RTC is accessed through CMOS variables, which use an index register to | 
|  | 187 | control which bytes are read.  Since there is only one index register, reads | 
|  | 188 | of the CMOS and reads of the RTC require lock protection (in addition, it is | 
|  | 189 | dangerous to allow userspace utilities such as hwclock to have direct RTC | 
|  | 190 | access, as they could corrupt kernel reads and writes of CMOS memory). | 
|  | 191 |  | 
|  | 192 | The RTC generates an interrupt which is usually routed to IRQ 8.  The interrupt | 
|  | 193 | can function as a periodic timer or as a once-a-day alarm, and can also be | 
|  | 194 | issued after an update of the CMOS registers by the MC146818 is complete. | 
|  | 195 | The type of interrupt is signalled in the RTC status registers. | 
|  | 196 |  | 
|  | 197 | The RTC will update the current time fields by battery power even while the | 
|  | 198 | system is off.  The current time fields should not be read while an update is | 
|  | 199 | in progress, as indicated in the status register. | 
|  | 200 |  | 
|  | 201 | The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be | 
|  | 202 | programmed to a 32kHz divider if the RTC is to count seconds. | 
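
With the 32.768kHz time base selected, the periodic interrupt rate follows from
the rate selection bits (bits 3-0 of register A) by a power-of-two division.
The sketch below is a hypothetical helper; it assumes the MC146818 behaviour in
which selections 1 and 2 alias to 256 Hz and 128 Hz rather than continuing the
doubling pattern.

```python
RTC_CRYSTAL_HZ = 32768  # 32.768 kHz crystal


def rtc_periodic_hz(rate_select):
    """Periodic interrupt frequency in Hz for a 4-bit rate selection."""
    if not 0 <= rate_select <= 15:
        raise ValueError("rate selection is a 4-bit field")
    if rate_select == 0:
        return 0  # periodic timer disabled
    if rate_select <= 2:
        # Assumption: selections 1 and 2 give 256 Hz and 128 Hz.
        return RTC_CRYSTAL_HZ >> (rate_select + 6)
    return RTC_CRYSTAL_HZ >> (rate_select - 1)
```

A selection of 6 gives 1024 Hz, and the slowest selection, 15, gives 2 Hz,
which matches the 500 mS entry in the register map.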
|  | 203 |  | 
|  | 204 | This is the RAM map originally used for the RTC/CMOS: | 
|  | 205 |  | 
|  | 206 | Location    Size    Description | 
|  | 207 | ------------------------------------------ | 
|  | 208 | 00h         byte    Current second (BCD) | 
|  | 209 | 01h         byte    Seconds alarm (BCD) | 
|  | 210 | 02h         byte    Current minute (BCD) | 
|  | 211 | 03h         byte    Minutes alarm (BCD) | 
|  | 212 | 04h         byte    Current hour (BCD) | 
|  | 213 | 05h         byte    Hours alarm (BCD) | 
|  | 214 | 06h         byte    Current day of week (BCD) | 
|  | 215 | 07h         byte    Current day of month (BCD) | 
|  | 216 | 08h         byte    Current month (BCD) | 
|  | 217 | 09h         byte    Current year (BCD) | 
|  | 218 | 0Ah         byte    Register A | 
|  | 219 | bit 7   = Update in progress | 
|  | 220 | bit 6-4 = Divider for clock | 
|  | 221 | 000 = 4.194 MHz | 
|  | 222 | 001 = 1.049 MHz | 
|  | 223 | 010 = 32 kHz | 
|  | 224 | 10X = test modes | 
|  | 225 | 110 = reset / disable | 
|  | 226 | 111 = reset / disable | 
|  | 227 | bit 3-0 = Rate selection for periodic interrupt | 
|  | 228 | 0000 = periodic timer disabled | 
|  | 229 | 0001 = 3.90625 mS | 
|  | 230 | 0010 = 7.8125 mS | 
|  | 231 | 0011 = .122070 mS | 
|  | 232 | 0100 = .244141 mS | 
|  | 233 | ... | 
|  | 234 | 1101 = 125 mS | 
|  | 235 | 1110 = 250 mS | 
|  | 236 | 1111 = 500 mS | 
|  | 237 | 0Bh         byte    Register B | 
|  | 238 | bit 7   = Run (0) / Halt (1) | 
|  | 239 | bit 6   = Periodic interrupt enable | 
|  | 240 | bit 5   = Alarm interrupt enable | 
|  | 241 | bit 4   = Update-ended interrupt enable | 
|  | 242 | bit 3   = Square wave output enable | 
|  | 243 | bit 2   = BCD calendar (0) / Binary (1) | 
|  | 244 | bit 1   = 12-hour mode (0) / 24-hour mode (1) | 
|  | 245 | bit 0   = 0 (DST off) / 1 (DST enabled) | 
|  | 246 | 0Ch         byte    Register C (read only) | 
|  | 247 | bit 7   = interrupt request flag (IRQF) | 
|  | 248 | bit 6   = periodic interrupt flag (PF) | 
|  | 249 | bit 5   = alarm interrupt flag (AF) | 
|  | 250 | bit 4   = update interrupt flag (UF) | 
|  | 251 | bit 3-0 = reserved | 
|  | 252 | 0Dh         byte    Register D (read only) | 
|  | 253 | bit 7   = RTC has power | 
|  | 254 | bit 6-0 = reserved | 
|  | 255 | 32h         byte    Current century BCD (*) | 
|  | 256 | (*) location vendor specific and now determined from ACPI global tables | 
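
A minimal sketch of reading the time fields from this map follows.  The
read_cmos(index) accessor is hypothetical and stands in for whatever locked
index / data port access the platform provides.  The sketch assumes 24-hour
mode and skips the read while the update-in-progress bit of register A is set,
re-checking afterwards in case an update began mid-read.

```python
def bcd_to_int(value):
    """Convert one packed BCD byte (e.g. 0x59) to an integer (59)."""
    return (value >> 4) * 10 + (value & 0x0F)


def read_rtc_time(read_cmos, max_tries=1000):
    """Return the current time fields using a caller-supplied CMOS reader."""
    fields = (0x00, 0x02, 0x04, 0x07, 0x08, 0x09)
    for _ in range(max_tries):
        if read_cmos(0x0A) & 0x80:      # register A bit 7: update in progress
            continue
        raw = [read_cmos(index) for index in fields]
        if read_cmos(0x0A) & 0x80:      # an update started while reading
            continue
        if not read_cmos(0x0B) & 0x04:  # register B bit 2 clear: BCD calendar
            raw = [bcd_to_int(v) for v in raw]
        names = ("second", "minute", "hour", "day", "month", "year")
        return dict(zip(names, raw))
    raise TimeoutError("RTC update-in-progress bit never cleared")
```

The year returned is the two-digit value from location 09h; combining it with
the century byte is left out because that location is vendor specific.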
|  | 257 |  | 
|  | 258 | 2.3) APIC | 
|  | 259 |  | 
|  | 260 | On Pentium and later processors, an on-board timer is available to each CPU | 
|  | 261 | as part of the Advanced Programmable Interrupt Controller.  The APIC is | 
|  | 262 | accessed through memory-mapped registers and provides interrupt service to each | 
|  | 263 | CPU, used for IPIs and local timer interrupts. | 
|  | 264 |  | 
|  | 265 | Although in theory the APIC is a safe and stable source for local interrupts, | 
|  | 266 | in practice, many bugs and glitches have occurred due to the special nature of | 
|  | 267 | the APIC CPU-local memory-mapped hardware.  Beware that CPU errata may affect | 
|  | 268 | the use of the APIC and that workarounds may be required.  In addition, some of | 
|  | 269 | these workarounds pose unique constraints for virtualization - requiring either | 
|  | 270 | extra overhead incurred from extra reads of memory-mapped I/O or additional | 
|  | 271 | functionality that may be more computationally expensive to implement. | 
|  | 272 |  | 
|  | 273 | Since the APIC is documented quite well in the Intel and AMD manuals, we will | 
|  | 274 | avoid repetition of the detail here.  It should be pointed out that the APIC | 
|  | 275 | timer is programmed through the LVT (local vector table) timer register, is | 
|  | 276 | capable of one-shot or periodic operation, and is based on the bus clock | 
|  | 277 | divided down by the programmable divider register. | 
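
The relationship between bus clock, divider and initial count can be sketched
as simple arithmetic.  The function names and the example bus frequency are
assumptions for illustration; real code must first calibrate the bus clock
against another time source, since it is not architecturally fixed.

```python
VALID_DIVIDERS = (1, 2, 4, 8, 16, 32, 64, 128)


def apic_timer_period_ns(bus_hz, divider, initial_count):
    """Time in nanoseconds for the timer to count initial_count ticks."""
    if divider not in VALID_DIVIDERS:
        raise ValueError("divider must be a power of two from 1 to 128")
    return initial_count * divider * 10**9 // bus_hz


def apic_initial_count(bus_hz, divider, period_ns):
    """Initial count that yields approximately period_ns per expiry."""
    if divider not in VALID_DIVIDERS:
        raise ValueError("divider must be a power of two from 1 to 128")
    return period_ns * bus_hz // (divider * 10**9)
```

With an assumed 100 MHz bus clock and a divider of 16, an initial count of
6250 corresponds to a 1 millisecond period.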
|  | 278 |  | 
|  | 279 | 2.4) HPET | 
|  | 280 |  | 
|  | 281 | HPET is quite complex, and was originally intended to replace the PIT / RTC | 
|  | 282 | support of the X86 PC.  It remains to be seen whether that will be the case, as | 
|  | 283 | the de facto standard of PC hardware is to emulate these older devices.  Some | 
|  | 284 | systems designated as legacy free may support only the HPET as a hardware timer | 
|  | 285 | device. | 
|  | 286 |  | 
|  | 287 | The HPET spec is rather loose and vague, requiring at least 3 hardware timers, | 
|  | 288 | but allowing implementation freedom to support many more.  It also imposes no | 
|  | 289 | fixed rate on the timer frequency, but does impose some extremal values on | 
|  | 290 | frequency, error and slew. | 
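
Because the rate is not fixed, software has to derive it from the counter
period that the HPET reports.  The sketch below assumes the period is given in
femtoseconds and bounded above by 100 nanoseconds, as in the HPET capabilities
register; the function names are hypothetical.

```python
FS_PER_SECOND = 10**15
HPET_MAX_PERIOD_FS = 100_000_000  # assumed upper bound: 100 ns per tick


def hpet_frequency_hz(period_fs):
    """Counter frequency in Hz for a reported period in femtoseconds."""
    if not 0 < period_fs <= HPET_MAX_PERIOD_FS:
        raise ValueError("period outside the range assumed for HPET")
    return FS_PER_SECOND // period_fs


def hpet_ticks_to_ns(ticks, period_fs):
    """Convert a main counter delta to nanoseconds."""
    return ticks * period_fs // 10**6
```

A reported period of 69841279 femtoseconds, for instance, corresponds to a
counter running at roughly 14.318 MHz.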
|  | 291 |  | 
|  | 292 | In general, the HPET is recommended as a high precision (compared to PIT / RTC) | 
|  | 293 | time source which is independent of local variation (as there is only one HPET | 
|  | 294 | in any given system).  The HPET is also memory-mapped, and its presence is | 
|  | 295 | indicated through ACPI tables by the BIOS. | 
|  | 296 |  | 
|  | 297 | Detailed specification of the HPET is beyond the current scope of this | 
|  | 298 | document, as it is also very well documented elsewhere. | 
|  | 299 |  | 
|  | 300 | 2.5) Offboard Timers | 
|  | 301 |  | 
|  | 302 | Several cards, both proprietary (watchdog boards) and commonplace (e1000) have | 
|  | 303 | timing chips built into the cards which may have registers which are accessible | 
|  | 304 | to kernel or user drivers.  To the author's knowledge, using these to generate | 
|  | 305 | a clocksource for a Linux or other kernel has not yet been attempted and is in | 
|  | 306 | general frowned upon as not playing by the agreed rules of the game.  Such a | 
|  | 307 | timer device would require additional support to be virtualized properly and is | 
|  | 308 | not considered important at this time as no known operating system does this. | 
|  | 309 |  | 
|  | 310 | ========================================================================= | 
|  | 311 |  | 
|  | 312 | 3) TSC Hardware | 
|  | 313 |  | 
|  | 314 | The TSC or time stamp counter is relatively simple in theory; it counts | 
|  | 315 | instruction cycles issued by the processor, which can be used as a measure of | 
|  | 316 | time.  In practice, due to a number of problems, it is the most complicated | 
|  | 317 | timekeeping device to use. | 
|  | 318 |  | 
|  | 319 | The TSC is represented internally as a 64-bit MSR which can be read with the | 
|  | 320 | RDMSR, RDTSC, or RDTSCP (when available) instructions.  The TSC can also be | 
|  | 321 | written, but on old hardware it was generally only possible to write the low | 
|  | 322 | 32 bits of the 64-bit counter, and the upper 32 bits were cleared.  Now, | 
|  | 323 | however, on Intel processors family 0Fh, for models 3, 4 and 6, and family | 
|  | 324 | 06h, models e and f, this restriction has been lifted and all 64 bits are | 
|  | 325 | writable.  On AMD systems, the ability to write the TSC MSR is not an | 
|  | 326 | architectural guarantee. | 
|  | 327 |  | 
|  | 328 | The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by | 
|  | 329 | means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access. | 
|  | 330 |  | 
|  | 331 | Some vendors have implemented an additional instruction, RDTSCP, which returns | 
|  | 332 | atomically not just the TSC, but an indicator which corresponds to the | 
|  | 333 | processor number.  This can be used to index into an array of TSC variables to | 
|  | 334 | determine offset information in SMP systems where TSCs are not synchronized. | 
|  | 335 | The presence of this instruction must be determined by consulting CPUID feature | 
|  | 336 | bits. | 
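
The indexing scheme described above can be sketched as follows.  The pair
(raw_tsc, cpu) stands for the two values RDTSCP returns together, and the
offsets table is a hypothetical per-CPU array that system software would have
to measure and maintain; neither is an existing interface.

```python
MASK64 = (1 << 64) - 1


def normalized_tsc(raw_tsc, cpu, offsets):
    """Apply the offset of the CPU the sample was taken on."""
    # The TSC and the CPU number come from one atomic read, so the offset
    # applied here always belongs to the CPU that produced raw_tsc, even if
    # the thread migrates immediately afterwards.
    return (raw_tsc + offsets[cpu]) & MASK64
```

For example, with offsets [0, 500, -120], a raw value of 1000 read on CPU 1
is reported as 1500 and the same raw value read on CPU 2 as 880.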
|  | 337 |  | 
|  | 338 | Both VMX and SVM provide extension fields in the virtualization hardware which | 
|  | 339 | allow the guest visible TSC to be offset by a constant.  Newer implementations | 
|  | 340 | promise to allow the TSC to additionally be scaled, but this hardware is not | 
|  | 341 | yet widely available. | 
|  | 342 |  | 
|  | 343 | 3.1) TSC synchronization | 
|  | 344 |  | 
|  | 345 | The TSC is a CPU-local clock in most implementations.  This means, on SMP | 
|  | 346 | platforms, the TSCs of different CPUs may start at different times depending | 
|  | 347 | on when the CPUs are powered on.  Generally, CPUs on the same die will share | 
|  | 348 | the same clock, however, this is not always the case. | 
|  | 349 |  | 
|  | 350 | The BIOS may attempt to resynchronize the TSCs during the power-on process and | 
|  | 351 | the operating system or other system software may attempt to do this as well. | 
|  | 352 | Several hardware limitations make the problem worse - if it is not possible to | 
|  | 353 | write the full 64-bits of the TSC, it may be impossible to match the TSC in | 
|  | 354 | newly arriving CPUs to that of the rest of the system, resulting in | 
|  | 355 | unsynchronized TSCs.  Resynchronization may be attempted by BIOS or system | 
|  | 356 | software, but in practice, getting a perfectly synchronized TSC will not be | 
|  | 357 | possible unless all values are read from the same clock, which generally is | 
|  | 358 | only possible on single socket systems or those with special hardware support. | 
|  | 359 |  | 
|  | 360 | 3.2) TSC and CPU hotplug | 
|  | 361 |  | 
|  | 362 | As touched on already, CPUs which arrive later than the boot time of the system | 
|  | 363 | may not have a TSC value that is synchronized with the rest of the system. | 
|  | 364 | Either system software, BIOS, or SMM code may actually try to establish the TSC | 
|  | 365 | to a value matching the rest of the system, but a perfect match is usually not | 
|  | 366 | a guarantee.  This can have the effect of bringing a system from a state where | 
|  | 367 | TSC is synchronized back to a state where TSC synchronization flaws, however | 
|  | 368 | small, may be exposed to the OS and any virtualization environment. | 
|  | 369 |  | 
|  | 370 | 3.3) TSC and multi-socket / NUMA | 
|  | 371 |  | 
|  | 372 | Multi-socket systems, especially large multi-socket systems, are likely to have | 
|  | 373 | individual clocksources rather than a single, universally distributed clock. | 
|  | 374 | Since these clocks are driven by different crystals, they will not have | 
|  | 375 | perfectly matched frequency, and temperature and electrical variations will | 
|  | 376 | cause the CPU clocks, and thus the TSCs to drift over time.  Depending on the | 
|  | 377 | exact clock and bus design, the drift may or may not be fixed in absolute | 
|  | 378 | error, and may accumulate over time. | 
|  | 379 |  | 
|  | 380 | In addition, very large systems may deliberately slew the clocks of individual | 
|  | 381 | cores.  This technique, known as spread-spectrum clocking, reduces EMI at the | 
|  | 382 | clock frequency and harmonics of it, which may be required to pass FCC | 
|  | 383 | standards for telecommunications and computer equipment. | 
|  | 384 |  | 
|  | 385 | It is recommended not to trust the TSCs to remain synchronized on NUMA or | 
|  | 386 | multiple socket systems for these reasons. | 
|  | 387 |  | 
|  | 388 | 3.4) TSC and C-states | 
|  | 389 |  | 
|  | 390 | C-states, or idling states of the processor, especially C1E and deeper sleep | 
|  | 391 | states, may be problematic for TSC as well.  The TSC may stop advancing in such | 
|  | 392 | a state, resulting in a TSC which is behind that of other CPUs when execution | 
|  | 393 | is resumed.  Such CPUs must be detected and flagged by the operating system | 
|  | 394 | based on CPU and chipset identifications. | 
|  | 395 |  | 
|  | 396 | The TSC in such a case may be corrected by catching it up to a known external | 
|  | 397 | clocksource. | 
|  | 398 |  | 
|  | 399 | 3.5) TSC frequency change / P-states | 
|  | 400 |  | 
|  | 401 | To make things slightly more interesting, some CPUs may change frequency.  They | 
|  | 402 | may or may not run the TSC at the same rate, and because the frequency change | 
|  | 403 | may be staggered or slewed, at some points in time, the TSC rate may not be | 
|  | 404 | known other than falling within a range of values.  In this case, the TSC will | 
|  | 405 | not be a stable time source, and must be calibrated against a known, stable, | 
|  | 406 | external clock to be a usable source of time. | 
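
Calibration against an external clock amounts to counting TSC cycles over a
known number of reference ticks.  The sketch below uses the PIT base clock as
the default reference; the function name and the sampling procedure around it
are assumptions, and a real implementation would repeat the measurement and
discard outliers.

```python
MASK64 = (1 << 64) - 1
PIT_HZ = 1193182  # default reference clock: the PIT base frequency


def calibrate_tsc_hz(tsc_start, tsc_end, ref_ticks, ref_hz=PIT_HZ):
    """Estimate the TSC frequency from samples bracketing ref_ticks."""
    if ref_ticks <= 0:
        raise ValueError("reference interval must be positive")
    delta = (tsc_end - tsc_start) & MASK64  # tolerate 64-bit wraparound
    return delta * ref_hz // ref_ticks
```

The estimate only holds while the TSC rate stays constant, so on hardware
where the rate follows the P-state it has to be redone after every frequency
change.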
|  | 407 |  | 
|  | 408 | Whether the TSC runs at a constant rate or scales with the P-state is model | 
|  | 409 | dependent and must be determined by inspecting CPUID, chipset or vendor | 
|  | 410 | specific MSR fields. | 
|  | 411 |  | 
|  | 412 | In addition, some vendors have known bugs where the P-state is actually | 
|  | 413 | compensated for properly during normal operation, but when the processor is | 
|  | 414 | inactive, the P-state may be raised temporarily to service cache misses from | 
|  | 415 | other processors.  In such cases, the TSC on halted CPUs could advance faster | 
|  | 416 | than that of non-halted processors.  AMD Turion processors are known to have | 
|  | 417 | this problem. | 
|  | 418 |  | 
|  | 419 | 3.6) TSC and STPCLK / T-states | 
|  | 420 |  | 
|  | 421 | External signals given to the processor may also have the effect of stopping | 
|  | 422 | the TSC.  This is typically done for thermal emergency power control to prevent | 
|  | 423 | an overheating condition, and typically, there is no way to detect that this | 
|  | 424 | condition has happened. | 
|  | 425 |  | 
|  | 426 | 3.7) TSC virtualization - VMX | 
|  | 427 |  | 
|  | 428 | VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP | 
|  | 429 | instructions, which is enough for full virtualization of TSC in any manner.  In | 
|  | 430 | addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET | 
|  | 431 | field specified in the VMCS.  Special instructions must be used to read and | 
|  | 432 | write the VMCS field. | 
|  | 433 |  | 
|  | 434 | 3.8) TSC virtualization - SVM | 
|  | 435 |  | 
|  | 436 | SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP | 
|  | 437 | instructions, which is enough for full virtualization of TSC in any manner.  In | 
|  | 438 | addition, SVM allows passing through the host TSC plus an additional offset | 
|  | 439 | field specified in the SVM control block. | 
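
For both VMX and SVM, the offset mechanism reduces to modular 64-bit
arithmetic.  The following sketch, with hypothetical function names, shows how
a hypervisor might choose the offset so that the guest observes a given TSC
value at a given host TSC value; it does not model TSC scaling.

```python
MASK64 = (1 << 64) - 1


def tsc_offset_for(guest_tsc, host_tsc):
    """Offset that makes the guest read guest_tsc when the host reads host_tsc."""
    return (guest_tsc - host_tsc) & MASK64


def guest_tsc(host_tsc, offset):
    """Value the guest observes for a given host TSC and offset field."""
    return (host_tsc + offset) & MASK64
```

An offset computed at guest creation so that the guest starts at zero keeps
the guest TSC advancing in step with the host from then on.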
|  | 440 |  | 
|  | 441 | 3.9) TSC feature bits in Linux | 
|  | 442 |  | 
|  | 443 | In summary, there is no way to guarantee the TSC remains in perfect | 
|  | 444 | synchronization unless it is explicitly guaranteed by the architecture.  Even | 
|  | 445 | if so, the TSCs in multi-sockets or NUMA systems may still run independently | 
|  | 446 | despite being locally consistent. | 
|  | 447 |  | 
|  | 448 | The following feature bits are used by Linux to signal various TSC attributes, | 
|  | 449 | but they can only be taken to be meaningful for UP or single node systems. | 
|  | 450 |  | 
|  | 451 | X86_FEATURE_TSC 		: The TSC is available in hardware | 
|  | 452 | X86_FEATURE_RDTSCP		: The RDTSCP instruction is available | 
|  | 453 | X86_FEATURE_CONSTANT_TSC 	: The TSC rate is unchanged with P-states | 
|  | 454 | X86_FEATURE_NONSTOP_TSC		: The TSC does not stop in C-states | 
|  | 455 | X86_FEATURE_TSC_RELIABLE	: TSC sync checks are skipped (VMware) | 
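
On a running Linux system these bits are typically visible from userspace as
lowercase names on the "flags" line of /proc/cpuinfo.  The sketch below parses
such text; the function name is hypothetical, and the mapping from feature
constant to flag name is an assumption based on the usual naming convention.

```python
TSC_FLAGS = ("tsc", "rdtscp", "constant_tsc", "nonstop_tsc", "tsc_reliable")


def tsc_flags(cpuinfo_text):
    """Report which TSC related flags appear for the first CPU listed."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            present = set(line.split(":", 1)[1].split())
            return {name: name in present for name in TSC_FLAGS}
    return {name: False for name in TSC_FLAGS}
```

It could be applied to the text read from /proc/cpuinfo; flags are compared as
whole words, so "tsc" does not match "constant_tsc".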
|  | 456 |  | 
|  | 457 | 4) Virtualization Problems | 
|  | 458 |  | 
|  | 459 | Timekeeping is especially problematic for virtualization because a number of | 
|  | 460 | challenges arise.  The most obvious problem is that time is now shared between | 
|  | 461 | the host and, potentially, a number of virtual machines.  Thus the virtual | 
|  | 462 | operating system does not run with 100% usage of the CPU, despite the fact that | 
|  | 463 | it may very well make that assumption.  It may expect it to remain true to very | 
|  | 464 | exacting bounds when interrupt sources are disabled, but in reality only its | 
|  | 465 | virtual interrupt sources are disabled, and the machine may still be preempted | 
|  | 466 | at any time.  This causes problems as the passage of real time, the injection | 
|  | 467 | of machine interrupts and the associated clock sources are no longer completely | 
|  | 468 | synchronized with real time. | 
|  | 469 |  | 
|  | 470 | This same problem can occur on native hardware to a degree, as SMM mode may | 
|  | 471 | steal cycles from the system when it is used by the BIOS, which happens | 
|  | 472 | naturally on X86 systems, but not in such an extreme fashion.  However, the | 
|  | 473 | fact that SMM mode may cause similar problems to virtualization makes it a | 
|  | 474 | good justification for solving many of these problems on bare metal. | 
|  | 475 |  | 
|  | 476 | 4.1) Interrupt clocking | 
|  | 477 |  | 
|  | 478 | One of the most immediate problems that occurs with legacy operating systems | 
|  | 479 | is that the system timekeeping routines are often designed to keep track of | 
|  | 480 | time by counting periodic interrupts.  These interrupts may come from the PIT | 
|  | 481 | or the RTC, but the problem is the same: the host virtualization engine may not | 
|  | 482 | be able to deliver the proper number of interrupts per second, and so guest | 
|  | 483 | time may fall behind.  This is especially problematic if a high interrupt rate | 
|  | 484 | is selected, such as 1000 Hz, which is unfortunately the default for many Linux | 
|  | 485 | guests. | 
|  | 486 |  | 
|  | 487 | There are three approaches to solving this problem; first, it may be possible | 
|  | 488 | to simply ignore it.  Guests which have a separate time source for tracking | 
|  | 489 | 'wall clock' or 'real time' may not need any adjustment of their interrupts to | 
|  | 490 | maintain proper time.  If this is not sufficient, it may be necessary to inject | 
|  | 491 | additional interrupts into the guest in order to increase the effective | 
|  | 492 | interrupt rate.  This approach leads to complications in extreme conditions, | 
|  | 493 | where host load or guest lag is too much to compensate for, and thus another | 
|  | 494 | solution to the problem has arisen: the guest may need to become aware of lost | 
|  | 495 | ticks and compensate for them internally.  Although promising in theory, the | 
|  | 496 | implementation of this policy in Linux has been extremely error prone, and a | 
|  | 497 | number of buggy variants of lost tick compensation are distributed across | 
|  | 498 | commonly used Linux systems. | 
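
The second approach, injecting additional interrupts, can be sketched as a
simple backlog calculation on the host side.  The function name and the burst
limit are hypothetical; the limit models the point where host load or guest lag
is too large to make up in one go.

```python
def ticks_to_inject(elapsed_ns, delivered, hz, max_burst):
    """Number of timer interrupts to deliver now to catch the guest up."""
    # Ticks the guest should have seen since it started counting.
    expected = elapsed_ns * hz // 10**9
    owed = expected - delivered
    # Never inject a negative amount, and cap the burst size.
    return max(0, min(owed, max_burst))
```

A guest running at 1000 Hz that has received 20 ticks after 50 milliseconds is
owed 30 more; with a burst limit of 10 the remainder stays queued for later
calls.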
|  | 499 |  | 
|  | 500 | Windows uses periodic RTC clocking as a means of keeping time internally, and | 
|  | 501 | thus requires interrupt slewing to keep proper time.  It does use a low enough | 
|  | 502 | rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in | 
|  | 503 | practice. | 
|  | 504 |  | 
|  | 505 | 4.2) TSC sampling and serialization | 
|  | 506 |  | 
|  | 507 | As the highest precision time source available, the cycle counter of the CPU | 
|  | 508 | has aroused much interest from developers.  As explained above, this timer has | 
|  | 509 | many problems unique to its nature as a local, potentially unstable and | 
|  | 510 | potentially unsynchronized source.  One issue which is not unique to the TSC, | 
|  | 511 | but is highlighted because of its very precise nature, is sampling delay.  By | 
|  | 512 | definition, the counter, once read, is already old.  However, it is also | 
|  | 513 | possible for the counter to be read ahead of the actual use of the result. | 
|  | 514 | This is a consequence of the superscalar execution of the instruction stream, | 
|  | 515 | which may execute instructions out of order.  Such execution is called | 
|  | 516 | non-serialized.  Forcing serialized execution is necessary for precise | 
|  | 517 | measurement with the TSC, and requires a serializing instruction, such as CPUID | 
|  | 518 | or an MSR read. | 
|  | 519 |  | 
|  | 520 | Since CPUID may actually be virtualized by a trap and emulate mechanism, this | 
|  | 521 | serialization can pose a performance issue for hardware virtualization.  An | 
|  | 522 | accurate time stamp counter reading may therefore not always be available, and | 
|  | 523 | it may be necessary for an implementation to guard against "backwards" reads of | 
|  | 524 | the TSC as seen from other CPUs, even in an otherwise perfectly synchronized | 
|  | 525 | system. | 
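
One way to guard against such backwards reads is to remember the largest value
handed out so far and never return anything smaller.  This is a sketch of that
idea only, with a hypothetical class name; it trades a small loss of resolution
for monotonicity and serializes readers on a lock.

```python
import threading


class MonotonicTsc:
    """Clamp a stream of TSC samples so that it never goes backwards."""

    def __init__(self):
        self._last = 0
        self._lock = threading.Lock()

    def observe(self, raw_tsc):
        """Return raw_tsc, or the previous maximum if raw_tsc is older."""
        with self._lock:
            if raw_tsc > self._last:
                self._last = raw_tsc
            return self._last
```

Feeding the samples 100, 105, 103, 110 through observe() yields 100, 105, 105,
110: the out-of-order sample is replaced by the last value already seen.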
|  | 526 |  | 
|  | 527 | 4.3) Timespec aliasing | 
|  | 528 |  | 
|  | 529 | Additionally, this lack of serialization poses another challenge when TSC | 
|  | 530 | readings are measured against another time source.  As the TSC has much | 
|  | 531 | higher precision, many different values of the TSC may be read while the | 
|  | 532 | other clock is still reporting the same value. | 
|  | 533 |  | 
|  | 534 | That is, you may read any TSC value in T..T+10 while external clock C maintains | 
|  | 535 | the same value.  Due to non-serialized reads, you may actually end up with a | 
|  | 536 | range which fluctuates, such as T-1..T+10.  Thus, any time calculated from a TSC | 
|  | 537 | but calibrated against an external value may have a range of valid values. | 
|  | 538 | Re-calibrating this computation may actually cause time, as computed after the | 
|  | 539 | calibration, to go backwards, compared with time computed before the | 
|  | 540 | calibration. | 
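
The backwards step can be illustrated with a small sketch.  Everything here is
hypothetical (struct tsc_calib, calib_to_ns and calib_update are names invented
for this example), and the clamp shown is only one possible way of keeping
time from going backwards across a recalibration.  It is not a description of
what kvmclock does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical calibration record: one (TSC, time) pair plus a rate. */
struct tsc_calib {
    uint64_t base_tsc;  /* TSC value at the calibration point */
    uint64_t base_ns;   /* time in ns assigned to base_tsc */
    uint64_t tsc_khz;   /* assumed TSC frequency in kHz */
};

/* Time in ns for a TSC reading at or after the calibration point. */
uint64_t calib_to_ns(const struct tsc_calib *c, uint64_t tsc)
{
    assert(tsc >= c->base_tsc);
    /* ticks / kHz gives milliseconds, so scale by 10^6 to get ns. */
    return c->base_ns + (tsc - c->base_tsc) * 1000000u / c->tsc_khz;
}

/*
 * Re-anchor the calibration at (tsc_now, ext_ns), where ext_ns comes
 * from a coarser external clock.  The external clock may still report
 * a value older than what the previous calibration already returned,
 * so clamp to the old result to keep time from stepping backwards.
 */
void calib_update(struct tsc_calib *c, uint64_t tsc_now, uint64_t ext_ns)
{
    uint64_t old_ns = calib_to_ns(c, tsc_now);

    c->base_ns = ext_ns > old_ns ? ext_ns : old_ns;
    c->base_tsc = tsc_now;
}
```

With a 1 GHz TSC calibrated at (T, C), a read at T+10 yields C+10 ns.  If the
external clock still reports C at that moment, re-anchoring naively at
(T+10, C) would make the same instant read as C, a step of 10 ns backwards.
The clamp keeps it at C+10.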
|  | 541 |  | 
|  | 542 | This problem is particularly pronounced with an internal time source in Linux, | 
|  | 543 | the kernel time, which is expressed in the theoretically high resolution | 
|  | 544 | timespec - but which advances in much coarser intervals, sometimes at the | 
|  | 545 | rate of jiffies, and possibly, in catchup modes, in even larger steps. | 
|  | 546 |  | 
|  | 547 | This aliasing requires care in the computation and recalibration of kvmclock | 
|  | 548 | and any other values derived from TSC computation (such as TSC virtualization | 
|  | 549 | itself). | 
|  | 550 |  | 
|  | 551 | 4.4) Migration | 
|  | 552 |  | 
|  | 553 | Migration of a virtual machine raises problems for timekeeping in two ways. | 
|  | 554 | First, the migration itself may take time, during which interrupts cannot be | 
|  | 555 | delivered, and after which, the guest time may need to be caught up.  NTP may | 
|  | 556 | be able to help to some degree here, as the clock correction required is | 
|  | 557 | typically small enough to fall in the NTP-correctable window. | 
|  | 558 |  | 
|  | 559 | An additional concern is that timers based on the TSC (or HPET, if the raw bus | 
|  | 560 | clock is exposed) may now be running at different rates, requiring compensation | 
|  | 561 | in some way in the hypervisor by virtualizing these timers.  In addition, | 
|  | 562 | migrating to a faster machine may preclude the use of a passthrough TSC, as a | 
|  | 563 | faster clock cannot be made visible to a guest without the potential of time | 
|  | 564 | advancing faster than usual.  A slower clock is less of a problem, as it can | 
|  | 565 | always be caught up to the original rate.  KVM clock avoids these problems by | 
|  | 566 | simply storing multipliers and offsets against the TSC for the guest to convert | 
|  | 567 | back into nanosecond resolution values. | 
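
The multiplier-and-offset idea can be sketched as follows.  This is an
illustration under stated assumptions, not the actual KVM clock ABI: the
structure pv_time_info, its field names and the function pv_clock_read are
all invented for this example.  The host is assumed to rewrite the record
whenever the TSC rate or offset changes, for instance after migration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record which the host shares with the guest. */
struct pv_time_info {
    uint64_t tsc_timestamp; /* host TSC when this record was written */
    uint64_t system_time;   /* guest time in ns at tsc_timestamp */
    uint32_t tsc_to_ns_mul; /* 32.32 fixed-point multiplier */
    int8_t   tsc_shift;     /* pre-scaling shift; negative means right */
};

/* Convert a TSC reading into nanoseconds using the shared record. */
uint64_t pv_clock_read(const struct pv_time_info *ti, uint64_t tsc)
{
    uint64_t delta = tsc - ti->tsc_timestamp;

    assert(ti->tsc_shift > -64 && ti->tsc_shift < 64);
    if (ti->tsc_shift >= 0)
        delta <<= ti->tsc_shift;
    else
        delta >>= -ti->tsc_shift;

    /* (delta * mul) >> 32, split so it stays in 64-bit arithmetic. */
    return ti->system_time
         + (delta >> 32) * ti->tsc_to_ns_mul
         + (((delta & 0xffffffffu) * ti->tsc_to_ns_mul) >> 32);
}
```

For a 2 GHz TSC the host could publish a multiplier of 0x80000000 (that is,
0.5) with a shift of 0.  After migration to a 4 GHz host it would publish the
same multiplier with a shift of -1, together with a fresh tsc_timestamp and
system_time, and the guest keeps computing nanoseconds without ever seeing the
raw rate change.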
|  | 568 |  | 
|  | 569 | 4.5) Scheduling | 
|  | 570 |  | 
|  | 571 | Since scheduling may be based on precise timing and firing of interrupts, the | 
|  | 572 | scheduling algorithms of an operating system may be adversely affected by | 
|  | 573 | virtualization.  In theory, the effect is random and should be uniformly | 
|  | 574 | distributed, but in contrived as well as real scenarios (guest device access, | 
|  | 575 | causes of virtualization exits, possible context switches), this may not always | 
|  | 576 | be the case.  The effect of this has not been well studied. | 
|  | 577 |  | 
|  | 578 | In an attempt to work around this, several implementations have provided a | 
|  | 579 | paravirtualized scheduler clock, which reveals the true amount of CPU time for | 
|  | 580 | which a virtual machine has been running. | 
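
As a rough sketch of what such a clock provides, assume the hypervisor
publishes, per virtual CPU, both the elapsed real time and the time during
which the virtual CPU was not actually running.  The structure vcpu_runtime
and both functions below are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-VCPU accounting published by the hypervisor. */
struct vcpu_runtime {
    uint64_t wall_ns;   /* elapsed real time since the VCPU started */
    uint64_t stolen_ns; /* part of wall_ns during which it did not run */
};

/* CPU time which the virtual CPU actually received. */
uint64_t pv_sched_clock(const struct vcpu_runtime *rt)
{
    assert(rt->stolen_ns <= rt->wall_ns);
    return rt->wall_ns - rt->stolen_ns;
}

/* Run time to charge to a task between two samples of the record. */
uint64_t pv_sched_delta(const struct vcpu_runtime *prev,
                        const struct vcpu_runtime *now)
{
    return pv_sched_clock(now) - pv_sched_clock(prev);
}
```

A guest scheduler using such a clock would charge a task only for the time it
really ran.  For example, if 10 ms of real time pass between two samples but
6 ms of that were spent descheduled by the host, the task is charged 4 ms.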
|  | 581 |  | 
|  | 582 | 4.6) Watchdogs | 
|  | 583 |  | 
|  | 584 | Watchdog timers, such as the lockup detector in Linux, may fire spuriously when | 
|  | 585 | running under hardware virtualization due to timer interrupts being delayed or | 
|  | 586 | misinterpretation of the passage of real time.  Usually, these warnings are | 
|  | 587 | spurious and can be ignored, but in some circumstances it may be necessary to | 
|  | 588 | disable such detection. | 
|  | 589 |  | 
|  | 590 | 4.7) Delays and precision timing | 
|  | 591 |  | 
|  | 592 | Precise timing and delays may not be possible in a virtualized system.  This | 
|  | 593 | matters if the guest is controlling physical hardware, or issues delays to | 
|  | 594 | compensate for slower I/O to and from devices.  The first issue is not solvable | 
|  | 595 | in general for a virtualized system; hardware control software can't be | 
|  | 596 | adequately virtualized without a full real-time operating system, which would | 
|  | 597 | require an RT aware virtualization platform. | 
|  | 598 |  | 
|  | 599 | The second issue may cause performance problems, but this is unlikely to be a | 
|  | 600 | significant issue.  In many cases these delays may be eliminated through | 
|  | 601 | configuration or paravirtualization. | 
|  | 602 |  | 
|  | 603 | 4.8) Covert channels and leaks | 
|  | 604 |  | 
|  | 605 | In addition to the above problems, timing information about the host will | 
|  | 606 | inevitably leak to the guest in anything but a perfect implementation of | 
|  | 607 | virtualized time.  This may allow the guest to infer the presence of a | 
|  | 608 | hypervisor (as in a red-pill type detection), and it may allow information to | 
|  | 609 | leak between guests by using CPU utilization itself as a signalling channel. | 
|  | 610 | Preventing such problems would require completely isolated virtual time which | 
|  | 611 | may no longer track real time.  This may be useful in certain security or QA | 
|  | 612 | contexts, but in general isn't recommended for real-world deployment scenarios. |