)]}'
{
  "log": [
    {
      "commit": "f9b4192923fa6e38331e88214b1fe5fc21583fcc",
      "tree": "a75d460256e44d79f04cd28895f689f6e7cb7bad",
      "parents": [
        "37d54111c133bea05fbae9dfe6d3d61a1b19c09b"
      ],
      "author": {
        "name": "Akinobu Mita",
        "email": "mita@miraclelinux.com",
        "time": "Sun Mar 26 01:40:00 2006 -0800"
      },
      "committer": {
        "name": "Linus Torvalds",
        "email": "torvalds@g5.osdl.org",
        "time": "Sun Mar 26 08:59:30 2006 -0800"
      },
      "message": "[PATCH] bitops: hweight() speedup\n\n\u003clinux@horizon.com\u003e wrote:\n\nThis is an extremely well-known technique.  You can see a similar version that\nuses a multiply for the last few steps at\nhttp://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel whch\nrefers to \"Software Optimization Guide for AMD Athlon 64 and Opteron\nProcessors\"\nhttp://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF\n\nIt\u0027s section 8.6, \"Efficient Implementation of Population-Count Function in\n32-bit Mode\", pages 179-180.\n\nIt uses the name that I am more familiar with, \"popcount\" (population count),\nalthough \"Hamming weight\" also makes sense.\n\nAnyway, the proof of correctness proceeds as follows:\n\n\tb \u003d a - ((a \u003e\u003e 1) \u0026 0x55555555);\n\tc \u003d (b \u0026 0x33333333) + ((b \u003e\u003e 2) \u0026 0x33333333);\n\td \u003d (c + (c \u003e\u003e 4)) \u0026 0x0f0f0f0f;\n#if SLOW_MULTIPLY\n\te \u003d d + (d \u003e\u003e 8)\n\tf \u003d e + (e \u003e\u003e 16);\n\treturn f \u0026 63;\n#else\n\t/* Useful if multiply takes at most 4 cycles */\n\treturn (d * 0x01010101) \u003e\u003e 24;\n#endif\n\nThe input value a can be thought of as 32 1-bit fields each holding their own\nhamming weight.  Now look at it as 16 2-bit fields.  Each 2-bit field a1..a0\nhas the value 2*a1 + a0.  This can be converted into the hamming weight of the\n2-bit field a1+a0 by subtracting a1.\n\nThat\u0027s what the (a \u003e\u003e 1) \u0026 mask subtraction does.  Since there can be no\nborrows, you can just do it all at once.\n\nEnumerating the 4 possible cases:\n\n0b00 \u003d 0  -\u003e  0 - 0 \u003d 0\n0b01 \u003d 1  -\u003e  1 - 0 \u003d 1\n0b10 \u003d 2  -\u003e  2 - 1 \u003d 1\n0b11 \u003d 3  -\u003e  3 - 1 \u003d 2\n\nThe next step consists of breaking up b (made of 16 2-bir fields) into\neven and odd halves and adding them into 4-bit fields.  Since the largest\npossible sum is 2+2 \u003d 4, which will not fit into a 4-bit field, the 2-bit\n                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n                          \"which will not fit into a 2-bit field\"\n\nfields have to be masked before they are added.\n\nAfter this point, the masking can be delayed.  Each 4-bit field holds a\npopulation count from 0..4, taking at most 3 bits.  These numbers can be added\nwithout overflowing a 4-bit field, so we can compute c + (c \u003e\u003e 4), and only\nthen mask off the unwanted bits.\n\nThis produces d, a number of 4 8-bit fields, each in the range 0..8.  From\nthis point, we can shift and add d multiple times without overflowing an 8-bit\nfield, and only do a final mask at the end.\n\nThe number to mask with has to be at least 63 (so that 32 on\u0027t be truncated),\nbut can also be 128 or 255.  The x86 has a special encoding for signed\nimmediate byte values -128..127, so the value of 255 is slower.  On other\nprocessors, a special \"sign extend byte\" instruction might be faster.\n\nOn a processor with fast integer multiplies (Athlon but not P4), you can\nreduce the final few serially dependent instructions to a single integer\nmultiply.  Consider d to be 3 8-bit values d3, d2, d1 and d0, each in the\nrange 0..8.  The multiply forms the partial products:\n\n\t           d3 d2 d1 d0\n\t        d3 d2 d1 d0\n\t     d3 d2 d1 d0\n\t+ d3 d2 d1 d0\n\t----------------------\n\t           e3 e2 e1 e0\n\nWhere e3 \u003d d3 + d2 + d1 + d0.   e2, e1 and e0 obviously cannot generate\nany carries.\n\nSigned-off-by: Akinobu Mita \u003cmita@miraclelinux.com\u003e\nSigned-off-by: Andrew Morton \u003cakpm@osdl.org\u003e\nSigned-off-by: Linus Torvalds \u003ctorvalds@osdl.org\u003e\n"
    },
    {
      "commit": "3b9ed1a5d2d121f32d2cb4f2b05f1fc57c99c946",
      "tree": "d8ea52d6b490c0e3b3851e1887c01700cc0115c6",
      "parents": [
        "09020adb61416c4307de35941a9725a5e33d9beb"
      ],
      "author": {
        "name": "Akinobu Mita",
        "email": "mita@miraclelinux.com",
        "time": "Sun Mar 26 01:39:13 2006 -0800"
      },
      "committer": {
        "name": "Linus Torvalds",
        "email": "torvalds@g5.osdl.org",
        "time": "Sun Mar 26 08:57:11 2006 -0800"
      },
      "message": "[PATCH] bitops: generic hweight{64,32,16,8}()\n\nThis patch introduces the C-language equivalents of the functions below:\n\nunsigned int hweight32(unsigned int w);\nunsigned int hweight16(unsigned int w);\nunsigned int hweight8(unsigned int w);\nunsigned long hweight64(__u64 w);\n\nIn include/asm-generic/bitops/hweight.h\n\nThis code largely copied from: include/linux/bitops.h\n\nSigned-off-by: Akinobu Mita \u003cmita@miraclelinux.com\u003e\nSigned-off-by: Andrew Morton \u003cakpm@osdl.org\u003e\nSigned-off-by: Linus Torvalds \u003ctorvalds@osdl.org\u003e\n"
    }
  ]
}
