Count-min Sketch (Cms)
Use to estimate the frequency stream of data with a fixed memory. For example if we want to count the frequency of this stream of data:
A A A B C
So it should return A: 3, B: 1
Algorithm
Count-min sketch can be constructed using a 2 dimensional array with width
normally very big (1000) and height very small represent the number of hash function
To add the frequency
For example in this case, we have 5 hash functions and 7 columns:
When a new element comes, we calculate the hash value of each of the 5 hash function and add 1
to the correspondent cell (maybe with %
)
For example A
comes, we have the following
When another A
comes, we have
Same for another A
Then B arrives, however we have a hash collision that will accidentally increase the frequency of A
then C
arrives and we have another collision with B
and A
:
To retrieve the frequency
To receive A
, we found the minimum of all the counters. In this case should be 3
. That's why we need multiple hash functions.
Choosing width and height
$$width = \text{Math.ceil}(\frac{2.0}{e})$$
$$height = \text{Math.ceil}(-\frac{\text{Math.log}(1-d)} {\text{Math.log}(2)})$$
In which
- $e$: the accuracy we want to have
- $d$: the certainty which we reach the accuracy
Use cases
- Top K elements
- Keep a heap of top-k elements with the given Count-min sketch
- Calculate frequency with fixed size hash