If there is a big burst of data, we may not be able to examine all of it.
- Need approaches such that lost data causes the least harm
Thus, we consider algorithmic solutions to this problem.
Sampling
- Involves randomly sampling the data in a fair manner
- To sample the data in a fair manner, we use reservoir sampling (a code sketch follows this list)
- Need to select $S$ values uniformly at random from a stream of $N$ values
- $N$ is very big, not known ahead of time (stream)
- Store first $S$ values
- When receiving the $k$th value, keep it with probability $S/k$
- If keeping the $k$th element, randomly discard one of the existing $S$ elements to make room
- Property: For $k \geq S$, all values $V_1, ..., V_k$ have a probability $S/k$ of being in the reservoir.
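A minimal sketch of reservoir sampling in Python, following the steps above; the function name `reservoir_sample` and the use of `random.randrange` are illustrative choices, not part of the original notes.

```python
import random

def reservoir_sample(stream, s):
    """Return a uniform random sample of size s from a stream of unknown length."""
    reservoir = []
    for k, value in enumerate(stream, start=1):
        if k <= s:
            # Store the first S values outright.
            reservoir.append(value)
        else:
            # Keep the k-th value with probability S/k: randrange(k) is
            # uniform over {0, ..., k-1}, so P(j < s) = s/k.
            j = random.randrange(k)
            if j < s:
                # Evict a uniformly chosen existing element to make room.
                reservoir[j] = value
    return reservoir

# Example: sample 5 values uniformly at random from a stream of 10,000.
print(reservoir_sample(range(10_000), 5))
```

Note that reusing the same random index $j$ both for the keep decision and for choosing which slot to overwrite is a standard trick: it gives the keep probability $S/k$ and a uniformly chosen victim in one draw.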
Hashing
We may not be able to store data in a convenient manner to answer questions quickly. However, hashing lets us estimate the answers to common multiset queries:
- Cardinality - how many unique elements are in a multiset?
- Membership - is $X$ a member of the multiset?
- Frequency - how often does $X$ appear in a given multiset?
To estimate the cardinality of a multiset, we use a HyperLogLog Counter (HLL).
- Observe - when we hash an item, we obtain a vector of 32 bits (integer)
- With a good hash function, there is a 50% chance that any given bit is 0
- 1/2 of items have a hash code starting with 0
- 1/4 of items have hash code starting with 00
- 1/8 of items have hash code starting with 000
- etc.
- We expect that after seeing around $n$ elements, some hash code will start with about $\log_2 n$ zeroes
- We record the longest run of leading 0s seen in any hash code
- If the longest run of leading 0s is $x$, we expect to have seen approximately $2^x$ unique items
- Log-log counter - number of leading 0s in a 32-bit number $X$ is approximately $32 - \log_2 (X)$
- HLL uses this count as an estimator of $\log_2$ of the cardinality of the set (a simplified code sketch follows below)
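The following is a simplified, single-register sketch of the leading-zero idea in Python. A real HyperLogLog counter hashes items into many buckets and combines the per-bucket estimates to reduce variance; the class name `LeadingZeroCounter` and the choice of MD5 truncated to 32 bits are assumptions made for illustration.

```python
import hashlib

def leading_zeros_32(x):
    # For a 32-bit integer x > 0, the number of leading zeros is
    # 32 - x.bit_length(), which is approximately 32 - log2(x); x == 0 gives 32.
    return 32 - x.bit_length()

class LeadingZeroCounter:
    """Single-register cardinality sketch: track the longest run of leading
    zeros seen in any hash code and estimate the count as 2 ** that length."""

    def __init__(self):
        self.max_zeros = 0

    def add(self, item):
        # Hash the item to a 32-bit integer (MD5 truncated to its low
        # 32 bits here; any well-mixed hash function would do).
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        self.max_zeros = max(self.max_zeros, leading_zeros_32(h))

    def estimate(self):
        # If the longest run of leading zeros is x, we expect roughly
        # 2^x unique items to have been seen.
        return 2 ** self.max_zeros

# Example: feed in 100,000 distinct items and compare the estimate.
counter = LeadingZeroCounter()
for i in range(100_000):
    counter.add(i)
print(counter.estimate())  # a rough power-of-two estimate of 100,000
```

Because this sketch keeps only a single register, its estimate is a power of two and has high variance; averaging many such registers, as HLL does, is what makes the estimate usable in practice.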