KVM-specific MSRs.
Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010
=====================================================

KVM makes use of some custom MSRs to service some requests.

Custom MSRs have a range reserved for them, which goes from
0x4b564d00 to 0x4b564dff. There are MSRs outside this range,
but they are deprecated and their use is discouraged.

Custom MSR list
---------------

The currently supported custom MSRs are:

MSR_KVM_WALL_CLOCK_NEW:   0x4b564d00

	data: 4-byte aligned physical address of a memory area which must be
	in guest RAM. This memory is expected to hold a copy of the following
	structure:

	struct pvclock_wall_clock {
		u32   version;
		u32   sec;
		u32   nsec;
	} __attribute__((__packed__));

	whose data will be filled in by the hypervisor. The hypervisor is only
	guaranteed to update this data at the moment of MSR write.
	Users that want to reliably query this information more than once have
	to write more than once to this MSR. Fields have the following meanings:

		version: guest has to check version before and after grabbing
		time information and check that they are both equal and even.
		An odd version indicates an in-progress update.

		sec: number of seconds for wallclock at time of boot.

		nsec: number of nanoseconds for wallclock at time of boot.

	In order to get the current wallclock time, the system_time from
	MSR_KVM_SYSTEM_TIME_NEW needs to be added.

	Note that although MSRs are per-CPU entities, the effect of this
	particular MSR is global.

	Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
	leaf prior to usage.
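
The version check described above can be sketched in C as follows. This is
only an illustration under stated assumptions: read_wall_clock() and
wall_clock_now_ns() are hypothetical helper names, the fixed-width types
stand in for the kernel's u32, and a real guest would also need read
barriers between the loads, which are omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Layout from the text above; filled in by the hypervisor on MSR write. */
struct pvclock_wall_clock {
	uint32_t version;
	uint32_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

static_assert(sizeof(struct pvclock_wall_clock) == 12, "unexpected layout");

/*
 * Hypothetical helper: copy sec/nsec out of the shared area, retrying
 * while the version is odd (update in progress) or changes mid-copy.
 */
static void read_wall_clock(const volatile struct pvclock_wall_clock *wc,
			    uint32_t *sec, uint32_t *nsec)
{
	uint32_t before, after;

	do {
		before = wc->version;
		*sec = wc->sec;
		*nsec = wc->nsec;
		after = wc->version;
	} while ((before & 1) || before != after);
}

/*
 * Hypothetical helper: boot-time wallclock plus system_time (in
 * nanoseconds, from MSR_KVM_SYSTEM_TIME_NEW) gives the current wallclock.
 */
static uint64_t wall_clock_now_ns(uint32_t sec, uint32_t nsec,
				  uint64_t system_time_ns)
{
	return (uint64_t)sec * 1000000000ull + nsec + system_time_ns;
}
```

A caller would first write the area's physical address to the MSR, then
call read_wall_clock() on the mapped area and pass the result, together
with the current system_time, to wall_clock_now_ns().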

MSR_KVM_SYSTEM_TIME_NEW:  0x4b564d01

	data: 4-byte aligned physical address of a memory area which must be in
	guest RAM, plus an enable bit in bit 0. This memory is expected to hold
	a copy of the following structure:

	struct pvclock_vcpu_time_info {
		u32   version;
		u32   pad0;
		u64   tsc_timestamp;
		u64   system_time;
		u32   tsc_to_system_mul;
		s8    tsc_shift;
		u8    flags;
		u8    pad[2];
	} __attribute__((__packed__)); /* 32 bytes */

	whose data will be filled in by the hypervisor periodically. Only one
	write, or registration, is needed for each VCPU. The interval between
	updates of this structure is arbitrary and implementation-dependent.
	The hypervisor may update this structure at any time it sees fit until
	a value with bit 0 == 0 is written to this MSR.

	Fields have the following meanings:

		version: guest has to check version before and after grabbing
		time information and check that they are both equal and even.
		An odd version indicates an in-progress update.

		tsc_timestamp: the tsc value at the current VCPU at the time
		of the update of this structure. Guests can subtract this value
		from current tsc to derive a notion of elapsed time since the
		structure update.

		system_time: a host notion of monotonic time, including sleep
		time, at the time this structure was last updated. Unit is
		nanoseconds.

		tsc_to_system_mul: multiplier to be used when converting
		tsc-related quantity to nanoseconds.

		tsc_shift: shift to be used when converting tsc-related
		quantity to nanoseconds. This shift will ensure that
		multiplication with tsc_to_system_mul does not overflow.
		A positive value denotes a left shift, a negative value
		a right shift.

		The conversion from tsc to nanoseconds involves an additional
		right shift by 32 bits. With this information, guests can
		derive per-CPU time by doing:

			time = (current_tsc - tsc_timestamp);
			if (tsc_shift >= 0)
				time <<= tsc_shift;
			else
				time >>= -tsc_shift;
			time = (time * tsc_to_system_mul) >> 32;
			time = time + system_time;

		flags: bits in this field indicate extended capabilities
		coordinated between the guest and the hypervisor. Availability
		of specific flags has to be checked in 0x40000001 cpuid leaf.
		Current flags are:

		 flag bit   | cpuid bit    | meaning
		-------------------------------------------------------------
		            |              | time measures taken across
		     0      |      24      | multiple cpus are guaranteed to
		            |              | be monotonic
		-------------------------------------------------------------
		            |              | guest vcpu has been paused by
		     1      |     N/A      | the host
		            |              | See 4.70 in api.txt
		-------------------------------------------------------------

	Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
	leaf prior to usage.
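
The per-CPU time computation for MSR_KVM_SYSTEM_TIME_NEW might look like
this in C. It is a sketch, not the kernel's implementation:
pvclock_scale_delta() and pvclock_read_ns() are hypothetical names, the
caller is assumed to supply the current tsc, memory barriers are omitted,
and the multiplication is split into two halves so that the intermediate
product cannot overflow 64 bits.

```c
#include <assert.h>
#include <stdint.h>

struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  flags;
	uint8_t  pad[2];
} __attribute__((__packed__));

static_assert(sizeof(struct pvclock_vcpu_time_info) == 32, "must be 32 bytes");

/* Scale a tsc delta to nanoseconds: apply tsc_shift, multiply, then >> 32. */
static uint64_t pvclock_scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	uint64_t hi, lo;

	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;

	/* (delta * mul) >> 32, computed as hi * mul + ((lo * mul) >> 32). */
	hi = delta >> 32;
	lo = delta & 0xffffffffu;
	return hi * mul + ((lo * mul) >> 32);
}

/* Read a consistent snapshot and return system time in nanoseconds. */
static uint64_t pvclock_read_ns(const volatile struct pvclock_vcpu_time_info *ti,
				uint64_t current_tsc)
{
	uint32_t before, after;
	uint64_t ns;

	do {
		before = ti->version;
		ns = ti->system_time +
		     pvclock_scale_delta(current_tsc - ti->tsc_timestamp,
					 ti->tsc_to_system_mul, ti->tsc_shift);
		after = ti->version;
	} while ((before & 1) || before != after);

	return ns;
}
```

For example, with tsc_to_system_mul = 0x80000000 (a factor of 0.5 after
the 32-bit right shift) and tsc_shift = 1, a delta of 1000 tsc ticks
scales to 1000 nanoseconds.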

MSR_KVM_WALL_CLOCK:  0x11

	data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.

	This MSR falls outside the reserved KVM range and may be removed in the
	future. Its usage is deprecated.

	Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
	leaf prior to usage.

MSR_KVM_SYSTEM_TIME: 0x12

	data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.

	This MSR falls outside the reserved KVM range and may be removed in the
	future. Its usage is deprecated.

	Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
	leaf prior to usage.

	The suggested algorithm for detecting kvmclock presence is then:

		if (!kvm_para_available())    /* refer to cpuid.txt */
			return NON_PRESENT;

		flags = cpuid_eax(0x40000001);
		if (flags & (1 << 3)) {          /* bit 3: new MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
			return PRESENT;
		} else if (flags & (1 << 0)) {   /* bit 0: deprecated MSRs */
			msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
			msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
			return PRESENT;
		} else
			return NON_PRESENT;

MSR_KVM_ASYNC_PF_EN: 0x4b564d02
	data: Bits 63-6 hold the 64-byte aligned physical address of a
	64-byte memory area which must be in guest RAM and must be
	zeroed. Bits 5-2 are reserved and should be zero. Bit 0 is 1
	when asynchronous page faults are enabled on the vcpu, 0 when
	disabled. Bit 1 is 1 if asynchronous page faults can be injected
	when the vcpu is in cpl == 0.

	The first 4 bytes of the 64-byte memory area will be written to by
	the hypervisor at the time of asynchronous page fault (APF)
	injection to indicate the type of asynchronous page fault. A value
	of 1 means that the page referred to by the page fault is not
	present. A value of 2 means that the page is now available. Disabling
	interrupts inhibits APFs. The guest must not enable interrupts
	before the reason is read, or it may be overwritten by another
	APF. Since APF uses the same exception vector as a regular page
	fault, the guest must reset the reason to 0 before it does
	something that can generate a normal page fault. If the APF
	reason is 0 during a page fault, it is a regular page fault.

	During delivery of a type 1 APF, cr2 contains a token that will
	be used to notify the guest when the missing page becomes
	available. When the page becomes available, a type 2 APF is sent with
	cr2 set to the token associated with the page. There is a special
	token, 0xffffffff, which tells the vcpu that it should wake
	up all processes waiting for APFs and that no individual type 2 APFs
	will be sent.

	If APF is disabled while there are outstanding APFs, they will
	not be delivered.

	Currently a type 2 APF will always be delivered on the same vcpu as
	the type 1 APF was, but the guest should not rely on that.
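
As a rough illustration of the reason handling described above, a guest's
page fault path could classify the fault along these lines. All names here
(apf_msr_value, apf_classify, the enum) are hypothetical. The sketch only
sets the enable bit, assumes the wake-all token arrives with a type 2 APF,
and assumes it runs with interrupts still disabled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of what the page fault handler saw. */
enum apf_event {
	APF_REGULAR_FAULT,	/* reason 0: ordinary page fault */
	APF_PAGE_NOT_PRESENT,	/* reason 1: wait on the token in cr2 */
	APF_PAGE_READY,		/* reason 2: wake the waiter for the token */
	APF_WAKE_ALL,		/* reason 2 with token 0xffffffff */
	APF_UNKNOWN
};

/* Compose the MSR value: 64-byte aligned address plus enable bit 0. */
static uint64_t apf_msr_value(uint64_t area_pa)
{
	if (area_pa & 0x3f)
		return 0;	/* misaligned: leave APF disabled */
	return area_pa | 1;
}

/*
 * Read the reason word and reset it to 0 before anything else can
 * fault, then decide what the fault means.
 */
static enum apf_event apf_classify(volatile uint32_t *reason, uint64_t cr2)
{
	uint32_t r = *reason;

	*reason = 0;
	switch (r) {
	case 0:
		return APF_REGULAR_FAULT;
	case 1:
		return APF_PAGE_NOT_PRESENT;
	case 2:
		return cr2 == 0xffffffffu ? APF_WAKE_ALL : APF_PAGE_READY;
	default:
		return APF_UNKNOWN;
	}
}
```

Resetting the reason inside the same helper that reads it keeps the two
steps together, so a later regular page fault is not mistaken for an APF.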

MSR_KVM_STEAL_TIME: 0x4b564d03

	data: 64-byte aligned physical address of a memory area which must be
	in guest RAM, plus an enable bit in bit 0. This memory is expected to
	hold a copy of the following structure:

	struct kvm_steal_time {
		__u64 steal;
		__u32 version;
		__u32 flags;
		__u32 pad[12];
	};

	whose data will be filled in by the hypervisor periodically. Only one
	write, or registration, is needed for each VCPU. The interval between
	updates of this structure is arbitrary and implementation-dependent.
	The hypervisor may update this structure at any time it sees fit until
	a value with bit 0 == 0 is written to this MSR. The guest is required
	to make sure this structure is initialized to zero.

	Fields have the following meanings:

		version: a sequence counter. In other words, guest has to check
		this field before and after grabbing time information and make
		sure they are both equal and even. An odd version indicates an
		in-progress update.

		flags: At this point, always zero. May be used to indicate
		changes in this structure in the future.

		steal: the amount of time in which this vCPU did not run, in
		nanoseconds. Time during which the vcpu is idle will not be
		reported as steal time.
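
A guest-side reader for the steal time area could be sketched as below.
The helper names steal_msr_value() and read_steal_ns() are hypothetical,
uint64_t/uint32_t stand in for __u64/__u32, and the read barriers a real
guest needs around the version checks are left out.

```c
#include <assert.h>
#include <stdint.h>

struct kvm_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint32_t pad[12];
};

static_assert(sizeof(struct kvm_steal_time) == 64, "must be 64 bytes");

/* Compose the MSR value: 64-byte aligned address plus enable bit 0. */
static uint64_t steal_msr_value(uint64_t area_pa)
{
	if (area_pa & 0x3f)
		return 0;	/* misaligned: leave steal time disabled */
	return area_pa | 1;
}

/*
 * Return the accumulated steal time in nanoseconds, retrying while the
 * version is odd (update in progress) or changes during the read.
 */
static uint64_t read_steal_ns(const volatile struct kvm_steal_time *st)
{
	uint32_t before, after;
	uint64_t steal;

	do {
		before = st->version;
		steal = st->steal;
		after = st->version;
	} while ((before & 1) || before != after);

	return steal;
}
```

Because steal is cumulative, a guest scheduler would typically subtract
the previous reading from the current one to get the time stolen during
an interval.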

MSR_KVM_EOI_EN: 0x4b564d04
	data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
	when disabled.  Bit 1 is reserved and must be zero.  When PV end of
	interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
	physical address of a 4-byte memory area which must be in guest RAM and
	must be zeroed.

	The least significant bit of the 4-byte memory location will be
	written to by the hypervisor, typically at the time of interrupt
	injection.  A value of 1 means that the guest can skip writing EOI to
	the APIC (using MSR or MMIO write); instead, it is sufficient to signal
	EOI by clearing the bit in guest memory. This location will
	later be polled by the hypervisor.
	A value of 0 means that the EOI write is required.

	It is always safe for the guest to ignore the optimization and perform
	the APIC EOI write anyway.

	The hypervisor is guaranteed to only modify this least
	significant bit while in the current VCPU context. This means that
	the guest does not need to use either a lock prefix or memory ordering
	primitives to synchronise with the hypervisor.

	However, the hypervisor can set and clear this memory bit at any time.
	It could therefore interrupt the guest and clear the bit in the
	window between the guest testing it (to detect whether it can skip
	the APIC EOI write) and the guest clearing it (to signal EOI to the
	hypervisor). To close this window, the guest must both read the
	least significant bit in the memory area and clear it using a single
	CPU instruction, such as test and clear, or compare and exchange.
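
The single-instruction test and clear can be approximated in portable C as
shown here. This is a sketch with a hypothetical helper name. It uses C11
atomic_fetch_and(), which on x86 typically compiles to a locked
instruction and is therefore stronger than the text requires; a real guest
could use a non-locked test-and-clear instruction instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: atomically read and clear bit 0 of the PV EOI
 * word. Returns true if the bit was set, meaning the APIC EOI write
 * can be skipped; returns false if the guest must still write EOI.
 */
static bool pv_eoi_try_skip(_Atomic uint32_t *pv_eoi)
{
	uint32_t old = atomic_fetch_and(pv_eoi, ~UINT32_C(1));

	return (old & 1) != 0;
}
```

A guest EOI path would call pv_eoi_try_skip() first and fall back to the
normal APIC EOI write only when it returns false.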