If there is a big burst of data, we may not be able to examine all of it.
- Need approaches such that lost data causes the least harm
Thus, we consider algorithmic solutions to this problem.
Sampling
- Involves randomly sampling the data in a fair manner
- To sample the data in a fair manner, we use reservoir sampling (a code sketch follows this list)
- Need to select $S$ values uniformly at random from a stream of $N$ values
- $N$ is very big, not known ahead of time (stream)
- Store first $S$ values
- When receiving the $k$th value, keep it with probability $S/k$
- If keeping the $k$th element, randomly discard one of the existing $S$ elements to make room
- Property: For $k \geq S$, all values $V_1, ..., V_k$ have a probability $S/k$ of being in the reservoir.
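A minimal sketch of reservoir sampling in Python, following the steps above; the function name `reservoir_sample` and the use of `random.randrange` are illustrative choices, not part of the original notes.

```python
import random

def reservoir_sample(stream, s):
    """Return a uniform random sample of size s from a stream of unknown length."""
    reservoir = []
    for k, value in enumerate(stream, start=1):
        if k <= s:
            # Store the first S values outright.
            reservoir.append(value)
        else:
            # Keep the k-th value with probability S/k: randrange(k) is
            # uniform over {0, ..., k-1}, so P(j < s) = s/k.
            j = random.randrange(k)
            if j < s:
                # Evict a uniformly chosen existing element to make room.
                reservoir[j] = value
    return reservoir

# Example: sample 5 values uniformly at random from a stream of 10,000.
print(reservoir_sample(range(10_000), 5))
```

Note that reusing the same random index $j$ both for the keep decision and for choosing which slot to overwrite is a standard trick: it gives the keep probability $S/k$ and a uniformly chosen victim in one draw.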
Hashing
We may not be able to store data in a convenient manner to answer questions quickly. However, hashing lets us estimate the answers to common multiset queries:
- Cardinality - how many unique elements are in a multiset?
- Membership - is $X$ a member of the multiset?
- Frequency - how often does $X$ appear in a given multiset?
To estimate the cardinality of a multiset, we use a HyperLogLog Counter (HLL).
- Observe - when we hash an item, we obtain a vector of 32 bits (integer)
- With a good hash function, there is a 50% chance that any given bit is 0
- 1/2 of items have a hash code starting with 0
- 1/4 of items have hash code starting with 00
- 1/8 of items have hash code starting with 000
- etc.
- We expect that after seeing around $n$ elements, some hash code will start with about $\log_2 n$ zeroes
- We record the longest run of leading 0s seen in any hash code
- If the longest run of leading 0s is $x$, we expect to have seen approximately $2^x$ unique items
- Log-log counter - number of leading 0s in a 32-bit number $X$ is approximately $32 - \log_2 (X)$
- HLL uses this count as an estimator of $\log_2$ of the cardinality of the set (a simplified code sketch follows below)
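The following is a simplified, single-register sketch of the leading-zero idea in Python. A real HyperLogLog counter hashes items into many buckets and combines the per-bucket estimates to reduce variance; the class name `LeadingZeroCounter` and the choice of MD5 truncated to 32 bits are assumptions made for illustration.

```python
import hashlib

def leading_zeros_32(x):
    # For a 32-bit integer x > 0, the number of leading zeros is
    # 32 - x.bit_length(), which is approximately 32 - log2(x); x == 0 gives 32.
    return 32 - x.bit_length()

class LeadingZeroCounter:
    """Single-register cardinality sketch: track the longest run of leading
    zeros seen in any hash code and estimate the count as 2 ** that length."""

    def __init__(self):
        self.max_zeros = 0

    def add(self, item):
        # Hash the item to a 32-bit integer (MD5 truncated to its low
        # 32 bits here; any well-mixed hash function would do).
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        self.max_zeros = max(self.max_zeros, leading_zeros_32(h))

    def estimate(self):
        # If the longest run of leading zeros is x, we expect roughly
        # 2^x unique items to have been seen.
        return 2 ** self.max_zeros

# Example: feed in 100,000 distinct items and compare the estimate.
counter = LeadingZeroCounter()
for i in range(100_000):
    counter.add(i)
print(counter.estimate())  # a rough power-of-two estimate of 100,000
```

Because this sketch keeps only a single register, its estimate is a power of two and has high variance; averaging many such registers, as HLL does, is what makes the estimate usable in practice.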