|  | KVM-specific MSRs. | 
|  | Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 | 
|  | ===================================================== | 
|  |  | 
|  | KVM makes use of some custom MSRs to service some requests. | 
|  |  | 
|  | Custom MSRs have a range reserved for them, that goes from | 
|  | 0x4b564d00 to 0x4b564dff. There are MSRs outside this area, | 
|  | but they are deprecated and their use is discouraged. | 
|  |  | 
|  | Custom MSR list | 
|  | -------- | 
|  |  | 
|  | The current supported Custom MSR list is: | 
|  |  | 
|  | MSR_KVM_WALL_CLOCK_NEW:   0x4b564d00 | 
|  |  | 
|  | data: 4-byte alignment physical address of a memory area which must be | 
|  | in guest RAM. This memory is expected to hold a copy of the following | 
|  | structure: | 
|  |  | 
|  | struct pvclock_wall_clock { | 
|  | u32   version; | 
|  | u32   sec; | 
|  | u32   nsec; | 
|  | } __attribute__((__packed__)); | 
|  |  | 
|  | whose data will be filled in by the hypervisor. The hypervisor is only | 
|  | guaranteed to update this data at the moment of MSR write. | 
|  | Users that want to reliably query this information more than once have | 
|  | to write more than once to this MSR. Fields have the following meanings: | 
|  |  | 
|  | version: guest has to check version before and after grabbing | 
|  | time information and check that they are both equal and even. | 
|  | An odd version indicates an in-progress update. | 
|  |  | 
|  | sec: number of seconds for wallclock. | 
|  |  | 
|  | nsec: number of nanoseconds for wallclock. | 
|  |  | 
|  | Note that although MSRs are per-CPU entities, the effect of this | 
|  | particular MSR is global. | 
|  |  | 
|  | Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid | 
|  | leaf prior to usage. | 
|  |  | 
|  | MSR_KVM_SYSTEM_TIME_NEW:  0x4b564d01 | 
|  |  | 
|  | data: 4-byte aligned physical address of a memory area which must be in | 
|  | guest RAM, plus an enable bit in bit 0. This memory is expected to hold | 
|  | a copy of the following structure: | 
|  |  | 
|  | struct pvclock_vcpu_time_info { | 
|  | u32   version; | 
|  | u32   pad0; | 
|  | u64   tsc_timestamp; | 
|  | u64   system_time; | 
|  | u32   tsc_to_system_mul; | 
|  | s8    tsc_shift; | 
|  | u8    flags; | 
|  | u8    pad[2]; | 
|  | } __attribute__((__packed__)); /* 32 bytes */ | 
|  |  | 
|  | whose data will be filled in by the hypervisor periodically. Only one | 
|  | write, or registration, is needed for each VCPU. The interval between | 
|  | updates of this structure is arbitrary and implementation-dependent. | 
|  | The hypervisor may update this structure at any time it sees fit until | 
|  | anything with bit0 == 0 is written to it. | 
|  |  | 
|  | Fields have the following meanings: | 
|  |  | 
|  | version: guest has to check version before and after grabbing | 
|  | time information and check that they are both equal and even. | 
|  | An odd version indicates an in-progress update. | 
|  |  | 
|  | tsc_timestamp: the tsc value at the current VCPU at the time | 
|  | of the update of this structure. Guests can subtract this value | 
|  | from current tsc to derive a notion of elapsed time since the | 
|  | structure update. | 
|  |  | 
|  | system_time: a host notion of monotonic time, including sleep | 
|  | time at the time this structure was last updated. Unit is | 
|  | nanoseconds. | 
|  |  | 
|  | tsc_to_system_mul: a function of the tsc frequency. One has | 
|  | to multiply any tsc-related quantity by this value to get | 
|  | a value in nanoseconds, besides dividing by 2^tsc_shift | 
|  |  | 
|  | tsc_shift: cycle to nanosecond divider, as a power of two, to | 
|  | allow for shift rights. One has to shift right any tsc-related | 
|  | quantity by this value to get a value in nanoseconds, besides | 
|  | multiplying by tsc_to_system_mul. | 
|  |  | 
|  | With this information, guests can derive per-CPU time by | 
|  | doing: | 
|  |  | 
|  | time = (current_tsc - tsc_timestamp) | 
|  | time = (time * tsc_to_system_mul) >> tsc_shift | 
|  | time = time + system_time | 
|  |  | 
|  | flags: bits in this field indicate extended capabilities | 
|  | coordinated between the guest and the hypervisor. Availability | 
|  | of specific flags has to be checked in 0x40000001 cpuid leaf. | 
|  | Current flags are: | 
|  |  | 
|  | flag bit   | cpuid bit    | meaning | 
|  | ------------------------------------------------------------- | 
|  | |	           | time measures taken across | 
|  | 0      |	   24      | multiple cpus are guaranteed to | 
|  | |		   | be monotonic | 
|  | ------------------------------------------------------------- | 
|  |  | 
|  | Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid | 
|  | leaf prior to usage. | 
|  |  | 
|  |  | 
|  | MSR_KVM_WALL_CLOCK:  0x11 | 
|  |  | 
|  | data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead. | 
|  |  | 
|  | This MSR falls outside the reserved KVM range and may be removed in the | 
|  | future. Its usage is deprecated. | 
|  |  | 
|  | Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid | 
|  | leaf prior to usage. | 
|  |  | 
|  | MSR_KVM_SYSTEM_TIME: 0x12 | 
|  |  | 
|  | data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead. | 
|  |  | 
|  | This MSR falls outside the reserved KVM range and may be removed in the | 
|  | future. Its usage is deprecated. | 
|  |  | 
|  | Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid | 
|  | leaf prior to usage. | 
|  |  | 
|  | The suggested algorithm for detecting kvmclock presence is then: | 
|  |  | 
|  | if (!kvm_para_available())    /* refer to cpuid.txt */ | 
|  | return NON_PRESENT; | 
|  |  | 
|  | flags = cpuid_eax(0x40000001); | 
|  | if (flags & 3) { | 
|  | msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW; | 
|  | msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW; | 
|  | return PRESENT; | 
|  | } else if (flags & 0) { | 
|  | msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; | 
|  | msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; | 
|  | return PRESENT; | 
|  | } else | 
|  | return NON_PRESENT; | 
|  |  | 
|  | MSR_KVM_ASYNC_PF_EN: 0x4b564d02 | 
|  | data: Bits 63-6 hold 64-byte aligned physical address of a | 
|  | 64 byte memory area which must be in guest RAM and must be | 
|  | zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1 | 
|  | when asynchronous page faults are enabled on the vcpu 0 when | 
|  | disabled. Bit 2 is 1 if asynchronous page faults can be injected | 
|  | when vcpu is in cpl == 0. | 
|  |  | 
|  | First 4 byte of 64 byte memory location will be written to by | 
|  | the hypervisor at the time of asynchronous page fault (APF) | 
|  | injection to indicate type of asynchronous page fault. Value | 
|  | of 1 means that the page referred to by the page fault is not | 
|  | present. Value 2 means that the page is now available. Disabling | 
|  | interrupt inhibits APFs. Guest must not enable interrupt | 
|  | before the reason is read, or it may be overwritten by another | 
|  | APF. Since APF uses the same exception vector as regular page | 
|  | fault guest must reset the reason to 0 before it does | 
|  | something that can generate normal page fault.  If during page | 
|  | fault APF reason is 0 it means that this is regular page | 
|  | fault. | 
|  |  | 
|  | During delivery of type 1 APF cr2 contains a token that will | 
|  | be used to notify a guest when missing page becomes | 
|  | available. When page becomes available type 2 APF is sent with | 
|  | cr2 set to the token associated with the page. There is special | 
|  | kind of token 0xffffffff which tells vcpu that it should wake | 
|  | up all processes waiting for APFs and no individual type 2 APFs | 
|  | will be sent. | 
|  |  | 
|  | If APF is disabled while there are outstanding APFs, they will | 
|  | not be delivered. | 
|  |  | 
|  | Currently type 2 APF will be always delivered on the same vcpu as | 
|  | type 1 was, but guest should not rely on that. | 
|  |  | 
|  | MSR_KVM_STEAL_TIME: 0x4b564d03 | 
|  |  | 
|  | data: 64-byte alignment physical address of a memory area which must be | 
|  | in guest RAM, plus an enable bit in bit 0. This memory is expected to | 
|  | hold a copy of the following structure: | 
|  |  | 
|  | struct kvm_steal_time { | 
|  | __u64 steal; | 
|  | __u32 version; | 
|  | __u32 flags; | 
|  | __u32 pad[12]; | 
|  | } | 
|  |  | 
|  | whose data will be filled in by the hypervisor periodically. Only one | 
|  | write, or registration, is needed for each VCPU. The interval between | 
|  | updates of this structure is arbitrary and implementation-dependent. | 
|  | The hypervisor may update this structure at any time it sees fit until | 
|  | anything with bit0 == 0 is written to it. Guest is required to make sure | 
|  | this structure is initialized to zero. | 
|  |  | 
|  | Fields have the following meanings: | 
|  |  | 
|  | version: a sequence counter. In other words, guest has to check | 
|  | this field before and after grabbing time information and make | 
|  | sure they are both equal and even. An odd version indicates an | 
|  | in-progress update. | 
|  |  | 
|  | flags: At this point, always zero. May be used to indicate | 
|  | changes in this structure in the future. | 
|  |  | 
|  | steal: the amount of time in which this vCPU did not run, in | 
|  | nanoseconds. Time during which the vcpu is idle, will not be | 
|  | reported as steal time. |