|  | Nested VMX | 
|  | ========== | 
|  |  | 
|  | Overview | 
|  | --------- | 
|  |  | 
|  | On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions) | 
|  | to easily and efficiently run guest operating systems. Normally, these guests | 
|  | *cannot* themselves be hypervisors running their own guests, because in VMX, | 
|  | guests cannot use VMX instructions. | 
|  |  | 
|  | The "Nested VMX" feature adds this missing capability: running guest | 
|  | hypervisors (which use VMX) with their own nested guests. It does so by | 
|  | allowing a guest to use VMX instructions, and correctly and efficiently | 
|  | emulating them using the single level of VMX available in the hardware. | 
|  |  | 
|  | We describe in much greater detail the theory behind the nested VMX feature, | 
|  | its implementation and its performance characteristics, in the OSDI 2010 paper | 
|  | "The Turtles Project: Design and Implementation of Nested Virtualization", | 
|  | available at: | 
|  |  | 
|  | http://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf | 
|  |  | 
|  |  | 
|  | Terminology | 
|  | ----------- | 
|  |  | 
|  | Single-level virtualization has two levels - the host (KVM) and the guests. | 
|  | In nested virtualization, we have three levels: the host (KVM), which we call | 
|  | L0, the guest hypervisor, which we call L1, and its nested guest, which we | 
|  | call L2. | 
|  |  | 
|  |  | 
|  | Known limitations | 
|  | ----------------- | 
|  |  | 
|  | The current code supports running Linux guests under KVM guests. | 
|  | Only 64-bit guest hypervisors are supported. | 
|  |  | 
|  | Additional patches for running Windows under guest KVM and Linux under | 
|  | guest VMware server, and for nested EPT support, are currently running in | 
|  | the lab, and will be sent as follow-on patchsets. | 
|  |  | 
|  |  | 
|  | Running nested VMX | 
|  | ------------------ | 
|  |  | 
|  | The nested VMX feature is disabled by default. It can be enabled by giving | 
|  | the "nested=1" option to the kvm-intel module. | 
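To confirm that the option took effect, the module parameter can be read back from sysfs. The following is a minimal sketch, assuming the parameter is exposed at /sys/module/kvm_intel/parameters/nested and reads as "Y"/"N" or "1"/"0" depending on the kernel version. The helper names nested_param_enabled() and nested_vmx_enabled() are made up for this illustration and are not part of KVM.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed sysfs location of the kvm-intel "nested" parameter. */
#define NESTED_PARAM_PATH "/sys/module/kvm_intel/parameters/nested"

/*
 * Interpret the text read from the parameter file.
 * Assumption: the file starts with 'Y' or '1' when nesting is enabled.
 */
static int nested_param_enabled(const char *text)
{
	return text != NULL &&
	       (text[0] == 'Y' || text[0] == 'y' || text[0] == '1');
}

/*
 * Returns 1 if nested VMX is enabled, 0 if it is disabled, and -1 if the
 * file cannot be read (for example because kvm-intel is not loaded).
 */
static int nested_vmx_enabled(const char *path)
{
	char buf[8] = "";
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) == NULL) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return nested_param_enabled(buf);
}
```

Calling nested_vmx_enabled(NESTED_PARAM_PATH) on the host (L0) gives a quick check before starting an L1 guest. A result of -1 usually just means the module is not loaded.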
|  |  | 
|  | No modifications are required to user space (qemu). However, qemu's default | 
|  | emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be | 
|  | explicitly enabled, by giving qemu one of the following options: | 
|  |  | 
|  | -cpu host              (emulated CPU has all features of the real CPU) | 
|  |  | 
|  | -cpu qemu64,+vmx       (add just the vmx feature to a named CPU type) | 
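Whether the VMX feature actually reached the guest can be checked from inside L1 with CPUID: leaf 1 reports VMX support in bit 5 of ECX. The sketch below assumes a GCC or Clang toolchain, whose <cpuid.h> provides __get_cpuid(). The helper names ecx_has_vmx() and cpu_advertises_vmx() are illustrative only.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID.1:ECX bit 5 is the VMX feature flag. */
#define CPUID_1_ECX_VMX (UINT32_C(1) << 5)

/* Pure helper: test the VMX bit in a CPUID.1:ECX value. */
static int ecx_has_vmx(uint32_t ecx)
{
	return (ecx & CPUID_1_ECX_VMX) != 0;
}

/*
 * Returns 1 if the current CPU advertises VMX, 0 if it does not, and -1
 * if CPUID leaf 1 cannot be queried (including on non-x86 builds).
 */
static int cpu_advertises_vmx(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return -1;
	return ecx_has_vmx(ecx);
#else
	return -1;
#endif
}
```

Run inside an L1 guest started with the default qemu64 CPU type, cpu_advertises_vmx() would be expected to return 0. With either of the -cpu options above it should return 1, provided nesting is enabled on the host.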
|  |  | 
|  |  | 
|  | ABIs | 
|  | ---- | 
|  |  | 
|  | Nested VMX aims to present a standard and (eventually) fully-functional VMX | 
|  | implementation for a guest hypervisor to use. As such, the official | 
|  | specification of the ABI that it provides is Intel's VMX specification, | 
|  | namely volume 3B of their "Intel 64 and IA-32 Architectures Software | 
|  | Developer's Manual". Not all of VMX's features are currently fully supported, | 
|  | but the goal is to eventually support them all, starting with the VMX features | 
|  | which are used in practice by popular hypervisors (KVM and others). | 
|  |  | 
|  | As a VMX implementation, nested VMX presents a VMCS structure to L1. | 
|  | As mandated by the spec, other than the two fields revision_id and abort, | 
|  | this structure is *opaque* to its user, who is not supposed to know or care | 
|  | about its internal structure. Rather, the structure is accessed through the | 
|  | VMREAD and VMWRITE instructions. | 
|  | Still, for debugging purposes, KVM developers might be interested to know the | 
|  | internals of this structure; this is struct vmcs12 from arch/x86/kvm/vmx.c. | 
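VMREAD and VMWRITE name each field by a numeric encoding rather than by an offset into the structure. According to Intel's specification, bits 14:13 of the encoding give the field width and bits 11:10 give the field type. The sketch below decodes those two properties. The enum and function names are invented for this illustration and are not KVM's; the encodings in the comments are taken from the Intel manual.

```c
#include <stdint.h>

/* Field width, from bits 14:13 of a VMCS field encoding. */
enum vmcs_width {
	VMCS_WIDTH_16 = 0,
	VMCS_WIDTH_64 = 1,
	VMCS_WIDTH_32 = 2,
	VMCS_WIDTH_NATURAL = 3,
};

/* Field type, from bits 11:10 of a VMCS field encoding. */
enum vmcs_type {
	VMCS_TYPE_CONTROL = 0,
	VMCS_TYPE_EXIT_INFO = 1,	/* read-only VM-exit information */
	VMCS_TYPE_GUEST_STATE = 2,
	VMCS_TYPE_HOST_STATE = 3,
};

static enum vmcs_width vmcs_field_width(uint32_t encoding)
{
	return (enum vmcs_width)((encoding >> 13) & 0x3);
}

static enum vmcs_type vmcs_field_type(uint32_t encoding)
{
	return (enum vmcs_type)((encoding >> 10) & 0x3);
}

/*
 * Examples (encodings as listed in the Intel manual):
 *   0x0800  guest ES selector  -> 16-bit,        guest state
 *   0x2000  I/O bitmap A       -> 64-bit,        control
 *   0x4402  VM-exit reason     -> 32-bit,        exit information
 *   0x681e  guest RIP          -> natural width, guest state
 *   0x6c16  host RIP           -> natural width, host state
 */
```

The four widths correspond to the four member types used in struct vmcs12: u16, u64, u32 and natural_width.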
|  |  | 
|  | The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we | 
|  | also have "vmcs01", the VMCS that L0 builds for L1, and "vmcs02", the VMCS | 
|  | that L0 builds to actually run L2. How this is done is explained in the | 
|  | aforementioned paper. | 
|  |  | 
|  | For convenience, we repeat the content of struct vmcs12 here. If the internals | 
|  | of this structure change, this can break live migration across KVM versions. | 
|  | VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner | 
|  | struct shadow_vmcs is ever changed. | 
|  |  | 
|  | typedef u64 natural_width; | 
|  | struct __packed vmcs12 { | 
|  |         /* According to the Intel spec, a VMCS region must start with | 
|  |          * these two user-visible fields */ | 
|  |         u32 revision_id; | 
|  |         u32 abort; | 
|  |  | 
|  |         u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */ | 
|  |         u32 padding[7]; /* room for future expansion */ | 
|  |  | 
|  |         u64 io_bitmap_a; | 
|  |         u64 io_bitmap_b; | 
|  |         u64 msr_bitmap; | 
|  |         u64 vm_exit_msr_store_addr; | 
|  |         u64 vm_exit_msr_load_addr; | 
|  |         u64 vm_entry_msr_load_addr; | 
|  |         u64 tsc_offset; | 
|  |         u64 virtual_apic_page_addr; | 
|  |         u64 apic_access_addr; | 
|  |         u64 ept_pointer; | 
|  |         u64 guest_physical_address; | 
|  |         u64 vmcs_link_pointer; | 
|  |         u64 guest_ia32_debugctl; | 
|  |         u64 guest_ia32_pat; | 
|  |         u64 guest_ia32_efer; | 
|  |         u64 guest_pdptr0; | 
|  |         u64 guest_pdptr1; | 
|  |         u64 guest_pdptr2; | 
|  |         u64 guest_pdptr3; | 
|  |         u64 host_ia32_pat; | 
|  |         u64 host_ia32_efer; | 
|  |         u64 padding64[8]; /* room for future expansion */ | 
|  |         natural_width cr0_guest_host_mask; | 
|  |         natural_width cr4_guest_host_mask; | 
|  |         natural_width cr0_read_shadow; | 
|  |         natural_width cr4_read_shadow; | 
|  |         natural_width cr3_target_value0; | 
|  |         natural_width cr3_target_value1; | 
|  |         natural_width cr3_target_value2; | 
|  |         natural_width cr3_target_value3; | 
|  |         natural_width exit_qualification; | 
|  |         natural_width guest_linear_address; | 
|  |         natural_width guest_cr0; | 
|  |         natural_width guest_cr3; | 
|  |         natural_width guest_cr4; | 
|  |         natural_width guest_es_base; | 
|  |         natural_width guest_cs_base; | 
|  |         natural_width guest_ss_base; | 
|  |         natural_width guest_ds_base; | 
|  |         natural_width guest_fs_base; | 
|  |         natural_width guest_gs_base; | 
|  |         natural_width guest_ldtr_base; | 
|  |         natural_width guest_tr_base; | 
|  |         natural_width guest_gdtr_base; | 
|  |         natural_width guest_idtr_base; | 
|  |         natural_width guest_dr7; | 
|  |         natural_width guest_rsp; | 
|  |         natural_width guest_rip; | 
|  |         natural_width guest_rflags; | 
|  |         natural_width guest_pending_dbg_exceptions; | 
|  |         natural_width guest_sysenter_esp; | 
|  |         natural_width guest_sysenter_eip; | 
|  |         natural_width host_cr0; | 
|  |         natural_width host_cr3; | 
|  |         natural_width host_cr4; | 
|  |         natural_width host_fs_base; | 
|  |         natural_width host_gs_base; | 
|  |         natural_width host_tr_base; | 
|  |         natural_width host_gdtr_base; | 
|  |         natural_width host_idtr_base; | 
|  |         natural_width host_ia32_sysenter_esp; | 
|  |         natural_width host_ia32_sysenter_eip; | 
|  |         natural_width host_rsp; | 
|  |         natural_width host_rip; | 
|  |         natural_width paddingl[8]; /* room for future expansion */ | 
|  |         u32 pin_based_vm_exec_control; | 
|  |         u32 cpu_based_vm_exec_control; | 
|  |         u32 exception_bitmap; | 
|  |         u32 page_fault_error_code_mask; | 
|  |         u32 page_fault_error_code_match; | 
|  |         u32 cr3_target_count; | 
|  |         u32 vm_exit_controls; | 
|  |         u32 vm_exit_msr_store_count; | 
|  |         u32 vm_exit_msr_load_count; | 
|  |         u32 vm_entry_controls; | 
|  |         u32 vm_entry_msr_load_count; | 
|  |         u32 vm_entry_intr_info_field; | 
|  |         u32 vm_entry_exception_error_code; | 
|  |         u32 vm_entry_instruction_len; | 
|  |         u32 tpr_threshold; | 
|  |         u32 secondary_vm_exec_control; | 
|  |         u32 vm_instruction_error; | 
|  |         u32 vm_exit_reason; | 
|  |         u32 vm_exit_intr_info; | 
|  |         u32 vm_exit_intr_error_code; | 
|  |         u32 idt_vectoring_info_field; | 
|  |         u32 idt_vectoring_error_code; | 
|  |         u32 vm_exit_instruction_len; | 
|  |         u32 vmx_instruction_info; | 
|  |         u32 guest_es_limit; | 
|  |         u32 guest_cs_limit; | 
|  |         u32 guest_ss_limit; | 
|  |         u32 guest_ds_limit; | 
|  |         u32 guest_fs_limit; | 
|  |         u32 guest_gs_limit; | 
|  |         u32 guest_ldtr_limit; | 
|  |         u32 guest_tr_limit; | 
|  |         u32 guest_gdtr_limit; | 
|  |         u32 guest_idtr_limit; | 
|  |         u32 guest_es_ar_bytes; | 
|  |         u32 guest_cs_ar_bytes; | 
|  |         u32 guest_ss_ar_bytes; | 
|  |         u32 guest_ds_ar_bytes; | 
|  |         u32 guest_fs_ar_bytes; | 
|  |         u32 guest_gs_ar_bytes; | 
|  |         u32 guest_ldtr_ar_bytes; | 
|  |         u32 guest_tr_ar_bytes; | 
|  |         u32 guest_interruptibility_info; | 
|  |         u32 guest_activity_state; | 
|  |         u32 guest_sysenter_cs; | 
|  |         u32 host_ia32_sysenter_cs; | 
|  |         u32 padding32[8]; /* room for future expansion */ | 
|  |         u16 virtual_processor_id; | 
|  |         u16 guest_es_selector; | 
|  |         u16 guest_cs_selector; | 
|  |         u16 guest_ss_selector; | 
|  |         u16 guest_ds_selector; | 
|  |         u16 guest_fs_selector; | 
|  |         u16 guest_gs_selector; | 
|  |         u16 guest_ldtr_selector; | 
|  |         u16 guest_tr_selector; | 
|  |         u16 host_es_selector; | 
|  |         u16 host_cs_selector; | 
|  |         u16 host_ss_selector; | 
|  |         u16 host_ds_selector; | 
|  |         u16 host_fs_selector; | 
|  |         u16 host_gs_selector; | 
|  |         u16 host_tr_selector; | 
|  | }; | 
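
Because the layout of this structure matters for live migration, an accidental change is worth catching at build time. The following is only a sketch of one way to do that, not something the source says KVM does: it pins the offsets of the first few members with C11 static assertions. struct vmcs12_prefix is a hypothetical, truncated copy made for this illustration, with the kernel's u32/u64 and __packed replaced by their userspace equivalents (GCC/Clang attribute syntax assumed).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical truncated copy of the start of struct vmcs12. */
struct __attribute__((packed)) vmcs12_prefix {
	uint32_t revision_id;
	uint32_t abort;
	uint32_t launch_state;
	uint32_t padding[7];
	uint64_t io_bitmap_a;
	uint64_t io_bitmap_b;
};

/* Fail the build if a member moves away from its expected offset. */
#define CHECK_OFFSET(field, off) \
	_Static_assert(offsetof(struct vmcs12_prefix, field) == (off), \
		       "vmcs12 layout changed: " #field)

CHECK_OFFSET(revision_id, 0);
CHECK_OFFSET(abort, 4);
CHECK_OFFSET(launch_state, 8);
CHECK_OFFSET(padding, 12);
CHECK_OFFSET(io_bitmap_a, 40);
CHECK_OFFSET(io_bitmap_b, 48);

/* Size of the truncated prefix: ten u32 words plus two u64 fields. */
static size_t vmcs12_prefix_size(void)
{
	return sizeof(struct vmcs12_prefix);
}
```

Inserting a new member anywhere before io_bitmap_a would shift the later offsets and break the build. That is a natural point at which to change VMCS12_REVISION as well.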
|  |  | 
|  |  | 
|  | Authors | 
|  | ------- | 
|  |  | 
|  | These patches were written by: | 
|  | Abel Gordon, abelg <at> il.ibm.com | 
|  | Nadav Har'El, nyh <at> il.ibm.com | 
|  | Orit Wasserman, oritw <at> il.ibm.com | 
|  | Ben-Ami Yassor, benami <at> il.ibm.com | 
|  | Muli Ben-Yehuda, muli <at> il.ibm.com | 
|  |  | 
|  | With contributions by: | 
|  | Anthony Liguori, aliguori <at> us.ibm.com | 
|  | Mike Day, mdday <at> us.ibm.com | 
|  | Michael Factor, factor <at> il.ibm.com | 
|  | Zvi Dubitzky, dubi <at> il.ibm.com | 
|  |  | 
|  | And valuable reviews by: | 
|  | Avi Kivity, avi <at> redhat.com | 
|  | Gleb Natapov, gleb <at> redhat.com | 
|  | Marcelo Tosatti, mtosatti <at> redhat.com | 
|  | Kevin Tian, kevin.tian <at> intel.com | 
|  | and others. |