背景

最近在看Linux video相关的框架和一些简单的使用教程。在看到X11系统的一个教程的时候,我看到了一个不太理解的代码,它的使用背景是在X11 client向server发送一个op请求的时候,会有一个请求flag参数。这个flag告诉server有多少额外的信息(field)会发送到server以及这些信息的内容。这意味着client需要根据不同的flag来申请不同大小的空间,flag的二进制表示中有多少个1就代表有多少个field,其中每一个field都是32bit大小。这段代码如下所示:

1
2
3
4
5
6
7
8
9
10
11
12
13
    //MIT HACKMEM bit counting algorithm (这个注释貌似不太对,用的应该是parallel bitcount 那个算法)
    int count_bits(uint32_t n)
    {
        unsigned int c;

        c = n - ((n >> 1) & 0x55555555);
        c = ((c >> 2) & 0x33333333) + (c & 0x33333333);
        c = ((c >> 4) + c) & 0x0F0F0F0F;
        c = ((c >> 8) + c) & 0x00FF00FF;
        c = ((c >> 16) + c) & 0x0000FFFF;

        return c;
    }

第一眼看到的时候很懵,不过作者很友好,在教程中给出了算法的来源,这个网站里面列举了很多bit count routines,我这篇博客的目的就是以自己的语言分析一下每一种的原理。

Bit Counting 函数

以下的算法大多来自于这里。注意,它们大多数只适用于32位整数!

Iterated Bit Count

代码:

1
2
3
4
5
6
7
8
9
10
11
12
    /* Iterated bitcount iterates over each bit. The while condition sometimes helps
       terminates the loop earlier */
    int iterated_bitcount (unsigned int n)
    {
        int count=0;    
        while (n)
        {
            count += n & 0x1u ;    
            n >>= 1 ;
        }
        return count ;
    }

这个算法比较直观,就不作解释了。

Sparse Ones & Dense Ones

代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
    /* Sparse Ones runs proportional to the number of ones in n.
       The line   n &= (n-1)   simply sets the last 1 bit in n to zero. */
    int sparse_ones_bitcount (unsigned int n)
    {
        int count=0 ;
        while (n)
        {
            count++ ;
            n &= (n - 1) ;     
        }
        return count ;
    }

    /* Dense Ones runs proportional to the number of zeros in n.
       It first toggles all bits in n, then diminishes count repeatedly */
    int dense_ones_bitcount (unsigned int n)
    {
        int count = 8 * sizeof(int) ;   
        n ^= (unsigned int) -1 ;
        while (n)
        {
            count-- ;
            n &= (n - 1) ;    
        }
        return count ;
    }

这两个算法实际的思想是一致的,它们和直接迭代类似,区别在于把位移操作以与减一后的自己做位与来代替了。 n & (n-1) 保留了高位部分不变,把最低位的1变为0.

查表法

代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
    /* Precomputed bitcount uses a precomputed array that stores the number of ones
       in each char. */
    static int bits_in_char [256] ;

    void compute_bits_in_char (void)
    {
        unsigned int i ;    
        for (i = 0; i < 256; i++)
            bits_in_char [i] = iterated_bitcount (i) ;
        return ;
    }

    int precomputed_bitcount (unsigned int n)
    {
        // works only for 32-bit ints
        
        return bits_in_char [n         & 0xffu]
            +  bits_in_char [(n >>  8) & 0xffu]
            +  bits_in_char [(n >> 16) & 0xffu]
            +  bits_in_char [(n >> 24) & 0xffu] ;
    }


    /* Here is another version of precomputed bitcount that uses a precomputed array
       that stores the number of ones in each short. */

    static char bits_in_16bits [0x1u << 16] ;

    void compute_bits_in_16bits (void)
    {
        unsigned int i ;    
        for (i = 0; i < (0x1u<<16); i++)
            bits_in_16bits [i] = iterated_bitcount (i) ;
        return ;
    }

    int precomputed16_bitcount (unsigned int n)
    {
        // works only for 32-bit int
        
        return bits_in_16bits [n         & 0xffffu]
            +  bits_in_16bits [(n >> 16) & 0xffffu] ;
    }

牺牲空间复杂度来降低时间复杂度的方法。

Parallel Count

代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
    /* Parallel   Count   carries   out    bit   counting   in   a   parallel
       fashion.   Consider   n   after    the   first   line   has   finished
       executing. Imagine splitting n into  pairs of bits. Each pair contains
       the <em>number of ones</em> in those two bit positions in the original
       n.  After the second line has finished executing, each nibble contains
       the  <em>number of  ones</em>  in  those four  bits  positions in  the
       original n. Continuing  this for five iterations, the  64 bits contain
       the  number  of ones  among  these  sixty-four  bit positions  in  the
       original n. That is what we wanted to compute. */

    #define TWO(c) (0x1u << (c))
    #define MASK(c) (((unsigned int)(-1)) / (TWO(TWO(c)) + 1u))
    #define COUNT(x,c) ((x) & MASK(c)) + (((x) >> (TWO(c))) & MASK(c))

    int parallel_bitcount (unsigned int n)
    {
        n = COUNT(n, 0) ;
        n = COUNT(n, 1) ;
        n = COUNT(n, 2) ;
        n = COUNT(n, 3) ;
        n = COUNT(n, 4) ;
        /* n = COUNT(n, 5) ;    for 64-bit integers */
        return n ;
    }

那几个宏没怎么看懂,不过好像和下面的算法差不多,那么我就只解释下面这个好了.

Nifty Parallel Count

代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
    /* Nifty  Parallel Count works  the same  way as  Parallel Count  for the
       first three iterations. At the end  of the third line (just before the
       return), each byte of n contains the number of ones in those eight bit
       positions in  the original n. A  little thought then  explains why the
       remainder modulo 255 works. */

    #define MASK_01010101 (((unsigned int)(-1))/3)
    #define MASK_00110011 (((unsigned int)(-1))/5)
    #define MASK_00001111 (((unsigned int)(-1))/17)

    int nifty_bitcount (unsigned int n)
    {
        n = (n & MASK_01010101) + ((n >> 1) & MASK_01010101) ;
        n = (n & MASK_00110011) + ((n >> 2) & MASK_00110011) ;
        n = (n & MASK_00001111) + ((n >> 4) & MASK_00001111) ;        
        return n % 255 ;
    }

这个算法的核心思想是先把一个32位二进制数分成4个部分:最高的8位,次高的8位,次低的8位,最低8位.对于每一个8位二进制数 (abcdefgh)2 进行以下运算:

n = 128a+64b+32c+16d+8e+4f+2g+h

n >> 1 = 64a+32b+16c+8d+4e+2f+g                     
n & 0b01010101          = 64b+16d+4f+h                  (1)   
(n >> 1) & 0b01010101   = 64a+16c+4e+g                  (2)

(1) + (2) = 64(a+b) + 16(c+d) + 4(e+f) + (g+h)          (3)  - 对应于代码中的第一行

(3) >> 2                = 16(a+b) + 4(c+d) + (e+f)
(3)  & 0b00110011       = 16(c+d) + (g+h)               (4)
((3) >> 2) & 0b00110011 = 16(a+b) + (e+f)               (5)  - 对应于代码中的第二行

(4) + (5) = 16(a+b+c+d) + (e+f+g+h)                     (6)

(6) >> 4                = (a+b+c+d)
(6) & 0b00001111        = (e+f+g+h)                     (7)
((6) >> 4) & 0b00001111 = (a+b+c+d)                     (8)

(7) + (8) = a+b+c+d+e+f+g+h                                  - 对应于代码中的第三行

于是,这个32位二进制数每8位代表的数字都代表这8位里面有1的个数。我们现在要想办法把这四个8位数加起来得到总和。代码中直接把它对255取余。这背后的原因如下:

考虑一个26位N进制数: (abcde..xyz)N

(abcde..xyz)N

= a*N^25 + b*N^24 + ... + y*N + z
= (a * (N^25-1) + a) + (b * (N^24-1) + b) + ... + y * (N-1) +y + z
= (a+b+c+...+x+y+z) + (a * (N^25-1) + b * (N^24-1) + ... + y * (N-1))

根据等比数列求和公式:

1 + Q + Q^2 + ... + Q^(n-1) = (1-Q^n) / (1-Q)
(Q^n-1) = (Q-1) * (1+Q+Q^2+...+Q^(n-1))

带入上式:

= (a+b+c+...+x+y+z) + (a*(N-1)*(1+N+...+N^24) + b*(N-1)*(1+N+...+N^23) +...+ y*(N-1))
= (a+b+c+...+x+y+z) + (N-1) (a*(1+N+...+N^24) + b*(1+N+...+N^23) + ... + y)

因此,只要 (a+b+c...+x+y+z) < (N-1) , 那么 (abcde...xyz)N % (N-1) = (a+b+c+...+x+y+z)

由于32位整数有二进制1的个数一定小于等于32的,因此可以把这个32位二进制数想象成4位256进制数再对255取余即可。

MIT BitCount

代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
    /* MIT Bitcount

       Consider a 3 bit number as being
            4a+2b+c
       if we shift it right 1 bit, we have
            2a+b
      subtracting this from the original gives
            2a+b+c
      if we shift the original 2 bits right we get
            a
      and so with another subtraction we have
            a+b+c
      which is the number of bits in the original number.

      Suitable masking  allows the sums of  the octal digits  in a 32 bit  number to
      appear in  each octal digit.  This  isn't much help  unless we can get  all of
      them summed together.   This can be done by modulo  arithmetic (sum the digits
      in a number by  molulo the base of the number minus  one) the old "casting out
      nines" trick  they taught  in school before  calculators were  invented.  Now,
      using mod 7 wont help us, because our number will very likely have more than 7
      bits set.   So add  the octal digits  together to  get base64 digits,  and use
      modulo 63.   (Those of you  with 64  bit machines need  to add 3  octal digits
      together to get base512 digits, and use mod 511.)
     
      This is HACKMEM 169, as used in X11 sources.
      Source: MIT AI Lab memo, late 1970's.
    */

    int mit_bitcount(unsigned int n)
    {
        /* works for 32-bit numbers only */
        register unsigned int tmp;
        
        tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
        return ((tmp + (tmp >> 3)) & 030707070707) % 63;
    }

这个算法和前两个类似,只是这里把32位二进制数分成了11个部分,每个部分是一个3位二进制数。然后对于每个3位二进制数,通过注释中的方法得到这3位中1的个数,作为自身的值。

然后,将这11个3位二进制数邻近两两相加,把和保存在低3位,高3位清零(因为可以肯定的是它们的和 <= 6 = 3+3 < 2^3-1). 这个运算后的结果于是可以看成6位64进制数。由于32位整数最多只可能有32个二进制1,小于(64-1), 因此可以对其直接取模。