Algorithms and Data Structures for Massive Datasets

Part 1: Hash-based sketches

Chapter 2: hashing in general

Hashing complexity. Collision resolution: chaining (linked list for each bucket), linear probing (open addressing - looking for unoccupied bucket after busy one). Consistent Hashing (using Hashring)

Chapter 3: Approximate membership: Bloom and quotient filters

Usage: whether we’ve seen element before (without storing all previous)

Consists of: bit array, N hash functions (function can be same, with different seeds). Incoming element hashed, bit set to 1 for each obtained hash value.

Existence check: hash element, check if all values are set to 1.

Configuring Bloom filter. Tunable params: false positive rate, num of bits, num of hash functions

Bloom filter doesn’t have deletion. “Counting” bloom filter does.

Quotient filter: can delete, merge, resize. First N bits determine bucket, rest hash bits are stored.

QF has better performance on uniform inserts and success lookups. Takes more space

Chapter 4: Frequency estimation and count-min sketch

Measuring frequency of big amount of distinct items. Heavy hitters problem

CMS (count-min sketch) supports update and estimate. Represented by matrix, where rows correspond to hash functions.

Update adds counter value to “hashed column” in each “hash row” (when hashes are same, numbers sum up). Estimate returns min count from all hash rows.

Configuring CMS: band of overestimate, and probability of error.

Can support Range Queries. Dyadic ranges: storing several CMSs on different range level. So on upper levels we already have ranges of elements instead of single element, so all “high” dyadic ranges are smaller than initial CMS. There would be O(logN) of such intervals. This way we have something similar to BST of intervals.

Chapter 5: Cardinality estimation and HyperLogLog (HLL)

Multiset - can contain multiple instances of element, but still unordered.