{
  "log": [
    {
      "commit": "cddb8a5c14aa89810b40495d94d3d2a0faee6619",
      "tree": "d0b47b071f7d2dd1d6f9c36084aa8cfcef90d1da",
      "parents": [
        "7906d00cd1f687268f0a3599442d113767795ae6"
      ],
      "author": {
        "name": "Andrea Arcangeli",
        "email": "andrea@qumranet.com",
        "time": "Mon Jul 28 15:46:29 2008 -0700"
      },
      "committer": {
        "name": "Linus Torvalds",
        "email": "torvalds@linux-foundation.org",
        "time": "Mon Jul 28 16:30:21 2008 -0700"
      },
      "message": "mmu-notifiers: core\n\nWith KVM/GFP/XPMEM there isn\u0027t just the primary CPU MMU pointing to pages.\n There are secondary MMUs (with secondary sptes and secondary tlbs) too.\nsptes in the kvm case are shadow pagetables, but when I say spte in\nmmu-notifier context, I mean \"secondary pte\".  In GRU case there\u0027s no\nactual secondary pte and there\u0027s only a secondary tlb because the GRU\nsecondary MMU has no knowledge about sptes and every secondary tlb miss\nevent in the MMU always generates a page fault that has to be resolved by\nthe CPU (this is not the case of KVM where the a secondary tlb miss will\nwalk sptes in hardware and it will refill the secondary tlb transparently\nto software if the corresponding spte is present).  The same way\nzap_page_range has to invalidate the pte before freeing the page, the spte\n(and secondary tlb) must also be invalidated before any page is freed and\nreused.\n\nCurrently we take a page_count pin on every page mapped by sptes, but that\nmeans the pages can\u0027t be swapped whenever they\u0027re mapped by any spte\nbecause they\u0027re part of the guest working set.  Furthermore a spte unmap\nevent can immediately lead to a page to be freed when the pin is released\n(so requiring the same complex and relatively slow tlb_gather smp safe\nlogic we have in zap_page_range and that can be avoided completely if the\nspte unmap event doesn\u0027t require an unpin of the page previously mapped in\nthe secondary MMU).\n\nThe mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk-\u003emm and know\nwhen the VM is swapping or freeing or doing anything on the primary MMU so\nthat the secondary MMU code can drop sptes before the pages are freed,\navoiding all page pinning and allowing 100% reliable swapping of guest\nphysical address space.  
Furthermore it spares the code that tears down the\nmappings of the secondary MMU from implementing logic like tlb_gather in\nzap_page_range, which would require many IPIs to flush other cpu tlbs for\neach fixed number of sptes unmapped.\n\nTo make an example: if what happens on the primary MMU is a protection\ndowngrade (from writeable to wrprotect) the secondary MMU mappings will be\ninvalidated, and the next secondary-mmu-page-fault will call\nget_user_pages and trigger a do_wp_page through get_user_pages if it\ncalled get_user_pages with write\u003d1, and it\u0027ll re-establish an updated\nspte or secondary-tlb-mapping on the copied page.  Or it will set up a\nreadonly spte or readonly tlb mapping if it\u0027s a guest-read, if it calls\nget_user_pages with write\u003d0.  This is just an example.\n\nThis allows mapping any page pointed to by any pte (and in turn visible in the\nprimary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or a\nfull MMU with both sptes and secondary-tlb like the shadow-pagetable layer\nwith kvm), or a remote DMA in software like XPMEM (hence the need to\nschedule in XPMEM code to send the invalidate to the remote node, while no\nneed to schedule in kvm/gru as it\u0027s an immediate event like invalidating\nprimary-mmu pte).\n\nAt least for KVM without this patch it\u0027s impossible to swap guests\nreliably.  And having this feature and removing the page pin allows\nseveral other optimizations that simplify life considerably.\n\nDependencies:\n\n1) mm_take_all_locks() to register the mmu notifier when the whole VM\n   isn\u0027t doing anything with \"mm\".  This allows mmu notifier users to keep\n   track if the VM is in the middle of the invalidate_range_begin/end\n   critical section with an atomic counter increased in range_begin and\n   decreased in range_end.  
No secondary MMU page fault is allowed to map\n   any spte or secondary tlb reference, while the VM is in the middle of\n   range_begin/end as any page returned by get_user_pages in that critical\n   section could later immediately be freed without any further\n   -\u003einvalidate_page notification (invalidate_range_begin/end works on\n   ranges and -\u003einvalidate_page isn\u0027t called immediately before freeing\n   the page).  To stop all page freeing and pagetable overwrites the\n   mmap_sem must be taken in write mode and all other anon_vma/i_mmap\n   locks must be taken too.\n\n2) It\u0027d be a waste to add branches in the VM if nobody could possibly\n   run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if\n   CONFIG_KVM\u003dm/y.  In the current kernel kvm won\u0027t yet take advantage of\n   mmu notifiers, but this already allows compiling a KVM external module\n   against a kernel with mmu notifiers enabled and from the next pull from\n   kvm.git we\u0027ll start using them.  And GRU/XPMEM will also be able to\n   continue the development by enabling KVM\u003dm in their config, until they\n   submit all GRU/XPMEM GPLv2 code to the mainline kernel.  Then they can\n   also enable MMU_NOTIFIER in the same way KVM does it (even if KVM\u003dn).\n   This guarantees nobody selects MMU_NOTIFIER\u003dy if KVM and GRU and XPMEM\n   are all \u003dn.\n\nThe mmu_notifier_register call can fail because mm_take_all_locks may be\ninterrupted by a signal and return -EINTR.  Because mmu_notifier_register\nis used when a driver starts up, a failure can be gracefully handled.  
Here is\nan example of the change applied to kvm to register the mmu notifiers.\nUsually when a driver starts up other allocations are required anyway and\n-ENOMEM failure paths exist already.\n\n struct  kvm *kvm_arch_create_vm(void)\n {\n        struct kvm *kvm \u003d kzalloc(sizeof(struct kvm), GFP_KERNEL);\n+       int err;\n\n        if (!kvm)\n                return ERR_PTR(-ENOMEM);\n\n        INIT_LIST_HEAD(\u0026kvm-\u003earch.active_mmu_pages);\n\n+       kvm-\u003earch.mmu_notifier.ops \u003d \u0026kvm_mmu_notifier_ops;\n+       err \u003d mmu_notifier_register(\u0026kvm-\u003earch.mmu_notifier, current-\u003emm);\n+       if (err) {\n+               kfree(kvm);\n+               return ERR_PTR(err);\n+       }\n+\n        return kvm;\n }\n\nmmu_notifier_unregister returns void and it\u0027s reliable.\n\nThe patch also adds a few needed but missing includes that would prevent\nthe kernel from compiling after these changes on non-x86 archs (x86 didn\u0027t need\nthem by luck).\n\n[akpm@linux-foundation.org: coding-style fixes]\n[akpm@linux-foundation.org: fix mm/filemap_xip.c build]\n[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]\nSigned-off-by: Andrea Arcangeli \u003candrea@qumranet.com\u003e\nSigned-off-by: Nick Piggin \u003cnpiggin@suse.de\u003e\nSigned-off-by: Christoph Lameter \u003ccl@linux-foundation.org\u003e\nCc: Jack Steiner \u003csteiner@sgi.com\u003e\nCc: Robin Holt \u003cholt@sgi.com\u003e\nCc: Nick Piggin \u003cnpiggin@suse.de\u003e\nCc: Peter Zijlstra \u003ca.p.zijlstra@chello.nl\u003e\nCc: Kanoj Sarcar \u003ckanojsarcar@yahoo.com\u003e\nCc: Roland Dreier \u003crdreier@cisco.com\u003e\nCc: Steve Wise \u003cswise@opengridcomputing.com\u003e\nCc: Avi Kivity \u003cavi@qumranet.com\u003e\nCc: Hugh Dickins \u003chugh@veritas.com\u003e\nCc: Rusty Russell \u003crusty@rustcorp.com.au\u003e\nCc: Anthony Liguori \u003caliguori@us.ibm.com\u003e\nCc: Chris Wright \u003cchrisw@redhat.com\u003e\nCc: Marcelo Tosatti \u003cmarcelo@kvack.org\u003e\nCc: Eric 
Dumazet \u003cdada1@cosmosbay.com\u003e\nCc: \"Paul E. McKenney\" \u003cpaulmck@us.ibm.com\u003e\nCc: Izik Eidus \u003cizike@qumranet.com\u003e\nCc: Anthony Liguori \u003caliguori@us.ibm.com\u003e\nCc: Rik van Riel \u003criel@redhat.com\u003e\nSigned-off-by: Andrew Morton \u003cakpm@linux-foundation.org\u003e\nSigned-off-by: Linus Torvalds \u003ctorvalds@linux-foundation.org\u003e\n"
    }
  ]
}
