[Programming Pearls] Column 1 – bitvec

持续到2月,坚持至少一周2更, on algorithms…

Use case of a bitvec:
v = bitvec(capacity)
v.set(45322)
v.set(1000234)

if v.has(453222): print(‘ok it’s there’)

Data structure:
bytearray

Key concepts on storage:
- 1M bytes == 8M bits, meaning bitvec on 1MB memory will be able to handle 8M numbers, with the max number to be 8M – 1 (assuming we need to store 0 in the bitvec)
- 10M numbers requests 10/8M bytes – Assuming ‘M’ is always 2^20, not necessarily 1,000,000.

Key implementation problems:
In the case of v.set(num),
- What is the corresponding position (naming it as pos) for num in the underlying bytearray?
pos = num / 8 (or num >> 3)
- What is the corresponding mask index (naming it as r, remainder) in bytearray[pos]?
r = num % 8 (or num & 0×7)

The implementation of bit vector:

class prog_1_2_bitvec:
    """
    >>> v = prog_1_2_bitvec(6)
    >>> v.set(2)
    >>> v.set(5)
    >>> print('%x' % v.bytes[0]) # should be 100100
    24
    >>> v.has(2)
    True
    >>> v.has(5)
    True
    >>> v.unset(2)
    >>> v.has(2)
    False
    >>> v.unset(5)
    >>> v.has(5)
    False
    >>> v.set(3)
    >>> v.has(3)
    True
    """

    def __init__(self, max):
        if max <= 0: raise Exception('invalid max - length <= 0')
        self.bytes = bytearray(int(max / 8 + 1))
        self.max = max

    def set(self, num):
        self._valid(num)
        pos, mask = self._divide(num)
        self.bytes[pos] |= mask

    def has(self, num):
        self._valid(num)
        pos, mask = self._divide(num)
        return self.bytes[pos] & mask == mask

    def unset(self, num):
        self._valid(num)
        pos, mask = self._divide(num)
        self.bytes[pos] &= ~mask

    def count(self):
        # lie: TODO: best way to count the number of 1
        # will cover it later
        pass

    def _divide(self, num):
        d = num >> 3
        r = num & 0x7
        return (d, 1 << r)

    def _valid(self, num):
        if num < 0: raise Exception('invalid num - negative number')
        if num >= self.max: raise Exception('invalid num - num > %d' % self.max)

The implementation of multi-pass bitvec sort:

def prog_1_3_bitsort_multipass(l, max, k):
    """
    TODO: margin conditions...

    >>> max = 1 << 8
    >>> l = prog_1_4_gen_list(max, max >> 1)
    >>> print(l[0:20])  # doctest:+ELLIPSIS
    [...]
    >>> l = prog_1_3_bitsort_multipass(l, max, 2)
    >>> print(l[0:20])  # doctest:+ELLIPSIS
    [...]
    >>> test_verify_sorted(l)
    True
    """
    def sort_singlepass(l, max, i, k):
        # assumptions:
        # 1. let slice = max / k, the func is to handle numbers from range i * slice ~ (i + 1) * slice
        # 2. The mapping relationship is n => n - i * slice
        # exception cases (needs assertion):
        # 1. max / k has remainder
        slice = int(max / k)
        upper = (i + 1) * slice
        lower = i * slice
        v = prog_1_2_bitvec(slice)
        for n in l:
            if n < lower or n >= upper: continue
            mapped_n = n - lower
            v.set(mapped_n)
        sortedlist = [n + i * slice for n in range(slice) if v.has(n)]
        return sortedlist

    sortedlist = []
    for i in range(0, k):
        sortedlist += sort_singlepass(l, max, i, k)
    return sortedlist

Hints:
- Overall: define use case – UT, method signature, etc
- Write logic: write mainstraem logic with notes (in comments) or assertions for exception / corner cases.
- Organize mind: cases!!

Tests (TODO):
- Areas: UT – like UT for sort_singlepass, basic E2E (positive), negative like to pass numbers less than 0 or exceeding data volume, code coverage (previous test cases may have covered most), performance

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>