Nested VMX
==========

Overview
--------

On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions)
to easily and efficiently run guest operating systems. Normally, these guests
*cannot* themselves be hypervisors running their own guests, because in VMX,
guests cannot use VMX instructions.

The "Nested VMX" feature adds this missing capability: running guest
hypervisors (which use VMX) with their own nested guests. It does so by
allowing a guest to use VMX instructions, and correctly and efficiently
emulating them using the single level of VMX available in the hardware.

The theory behind the nested VMX feature, its implementation, and its
performance characteristics are described in much greater detail in the
OSDI 2010 paper "The Turtles Project: Design and Implementation of Nested
Virtualization", available at:

	http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf


Terminology
-----------

Single-level virtualization has two levels: the host (KVM) and the guests.
In nested virtualization, we have three levels: the host (KVM), which we call
L0; the guest hypervisor, which we call L1; and its nested guest, which we
call L2.


Known limitations
-----------------

The current code supports running Linux guests under KVM guests.
Only 64-bit guest hypervisors are supported.

Additional patches for running Windows under guest KVM and Linux under
guest VMware server, as well as support for nested EPT, are currently running
in the lab and will be sent as follow-on patchsets.


Running nested VMX
------------------

The nested VMX feature is disabled by default. It can be enabled by giving
the "nested=1" option to the kvm-intel module.

No modifications are required to user space (qemu). However, qemu's default
emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be
explicitly enabled by giving qemu one of the following options:

     -cpu host              (emulated CPU has all features of the real CPU)

     -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type)


ABIs
----

Nested VMX aims to present a standard and (eventually) fully-functional VMX
implementation for a guest hypervisor to use. As such, the official
specification of the ABI that it provides is Intel's VMX specification,
namely volume 3B of their "Intel 64 and IA-32 Architectures Software
Developer's Manual". Not all of VMX's features are currently fully supported,
but the goal is to eventually support them all, starting with the VMX features
which are used in practice by popular hypervisors (KVM and others).

As a VMX implementation, nested VMX presents a VMCS structure to L1.
As mandated by the spec, other than the two fields revision_id and abort,
this structure is *opaque* to its user, who is not supposed to know or care
about its internal structure. Rather, the structure is accessed through the
VMREAD and VMWRITE instructions.
Still, for debugging purposes, KVM developers might be interested to know the
internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c.

The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we
also have "vmcs01", the VMCS that L0 builds for L1, and "vmcs02", the VMCS
that L0 builds to actually run L2; how this is done is explained in the
aforementioned paper.

For convenience, we repeat the content of struct vmcs12 here. If the internals
of this structure change, this can break live migration across KVM versions.
VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner
struct shadow_vmcs is ever changed.

	typedef u64 natural_width;
	struct __packed vmcs12 {
		/* According to the Intel spec, a VMCS region must start with
		 * these two user-visible fields */
		u32 revision_id;
		u32 abort;

		u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
		u32 padding[7]; /* room for future expansion */

		u64 io_bitmap_a;
		u64 io_bitmap_b;
		u64 msr_bitmap;
		u64 vm_exit_msr_store_addr;
		u64 vm_exit_msr_load_addr;
		u64 vm_entry_msr_load_addr;
		u64 tsc_offset;
		u64 virtual_apic_page_addr;
		u64 apic_access_addr;
		u64 ept_pointer;
		u64 guest_physical_address;
		u64 vmcs_link_pointer;
		u64 guest_ia32_debugctl;
		u64 guest_ia32_pat;
		u64 guest_ia32_efer;
		u64 guest_pdptr0;
		u64 guest_pdptr1;
		u64 guest_pdptr2;
		u64 guest_pdptr3;
		u64 host_ia32_pat;
		u64 host_ia32_efer;
		u64 padding64[8]; /* room for future expansion */
		natural_width cr0_guest_host_mask;
		natural_width cr4_guest_host_mask;
		natural_width cr0_read_shadow;
		natural_width cr4_read_shadow;
		natural_width cr3_target_value0;
		natural_width cr3_target_value1;
		natural_width cr3_target_value2;
		natural_width cr3_target_value3;
		natural_width exit_qualification;
		natural_width guest_linear_address;
		natural_width guest_cr0;
		natural_width guest_cr3;
		natural_width guest_cr4;
		natural_width guest_es_base;
		natural_width guest_cs_base;
		natural_width guest_ss_base;
		natural_width guest_ds_base;
		natural_width guest_fs_base;
		natural_width guest_gs_base;
		natural_width guest_ldtr_base;
		natural_width guest_tr_base;
		natural_width guest_gdtr_base;
		natural_width guest_idtr_base;
		natural_width guest_dr7;
		natural_width guest_rsp;
		natural_width guest_rip;
		natural_width guest_rflags;
		natural_width guest_pending_dbg_exceptions;
		natural_width guest_sysenter_esp;
		natural_width guest_sysenter_eip;
		natural_width host_cr0;
		natural_width host_cr3;
		natural_width host_cr4;
		natural_width host_fs_base;
		natural_width host_gs_base;
		natural_width host_tr_base;
		natural_width host_gdtr_base;
		natural_width host_idtr_base;
		natural_width host_ia32_sysenter_esp;
		natural_width host_ia32_sysenter_eip;
		natural_width host_rsp;
		natural_width host_rip;
		natural_width paddingl[8]; /* room for future expansion */
		u32 pin_based_vm_exec_control;
		u32 cpu_based_vm_exec_control;
		u32 exception_bitmap;
		u32 page_fault_error_code_mask;
		u32 page_fault_error_code_match;
		u32 cr3_target_count;
		u32 vm_exit_controls;
		u32 vm_exit_msr_store_count;
		u32 vm_exit_msr_load_count;
		u32 vm_entry_controls;
		u32 vm_entry_msr_load_count;
		u32 vm_entry_intr_info_field;
		u32 vm_entry_exception_error_code;
		u32 vm_entry_instruction_len;
		u32 tpr_threshold;
		u32 secondary_vm_exec_control;
		u32 vm_instruction_error;
		u32 vm_exit_reason;
		u32 vm_exit_intr_info;
		u32 vm_exit_intr_error_code;
		u32 idt_vectoring_info_field;
		u32 idt_vectoring_error_code;
		u32 vm_exit_instruction_len;
		u32 vmx_instruction_info;
		u32 guest_es_limit;
		u32 guest_cs_limit;
		u32 guest_ss_limit;
		u32 guest_ds_limit;
		u32 guest_fs_limit;
		u32 guest_gs_limit;
		u32 guest_ldtr_limit;
		u32 guest_tr_limit;
		u32 guest_gdtr_limit;
		u32 guest_idtr_limit;
		u32 guest_es_ar_bytes;
		u32 guest_cs_ar_bytes;
		u32 guest_ss_ar_bytes;
		u32 guest_ds_ar_bytes;
		u32 guest_fs_ar_bytes;
		u32 guest_gs_ar_bytes;
		u32 guest_ldtr_ar_bytes;
		u32 guest_tr_ar_bytes;
		u32 guest_interruptibility_info;
		u32 guest_activity_state;
		u32 guest_sysenter_cs;
		u32 host_ia32_sysenter_cs;
		u32 padding32[8]; /* room for future expansion */
		u16 virtual_processor_id;
		u16 guest_es_selector;
		u16 guest_cs_selector;
		u16 guest_ss_selector;
		u16 guest_ds_selector;
		u16 guest_fs_selector;
		u16 guest_gs_selector;
		u16 guest_ldtr_selector;
		u16 guest_tr_selector;
		u16 host_es_selector;
		u16 host_cs_selector;
		u16 host_ss_selector;
		u16 host_ds_selector;
		u16 host_fs_selector;
		u16 host_gs_selector;
		u16 host_tr_selector;
	};


Authors
-------

These patches were written by:
     Abel Gordon, abelg <at> il.ibm.com
     Nadav Har'El, nyh <at> il.ibm.com
     Orit Wasserman, oritw <at> il.ibm.com
     Ben-Ami Yassor, benami <at> il.ibm.com
     Muli Ben-Yehuda, muli <at> il.ibm.com

With contributions by:
     Anthony Liguori, aliguori <at> us.ibm.com
     Mike Day, mdday <at> us.ibm.com
     Michael Factor, factor <at> il.ibm.com
     Zvi Dubitzky, dubi <at> il.ibm.com

And valuable reviews by:
     Avi Kivity, avi <at> redhat.com
     Gleb Natapov, gleb <at> redhat.com
     Marcelo Tosatti, mtosatti <at> redhat.com
     Kevin Tian, kevin.tian <at> intel.com
     and others.