Count-min Sketch (Cms)

Use to estimate the frequency stream of data with a fixed memory. For example if we want to count the frequency of this stream of data:

A A A B C

So it should return A: 3, B: 1

Algorithm

Count-min sketch can be constructed using a 2 dimensional array with width normally very big (1000) and height very small represent the number of hash function

To add the frequency

For example in this case, we have 5 hash functions and 7 columns:

Pasted image 20230901204957.png

When a new element comes, we calculate the hash value of each of the 5 hash function and add 1 to the correspondent cell (maybe with %)

For example A comes, we have the following

Pasted image 20230901205114.png

When another A comes, we have
Pasted image 20230901205149.png

Same for another A
Pasted image 20230901205204.png

Then B arrives, however we have a hash collision that will accidentally increase the frequency of A

Pasted image 20230901205240.png

then C arrives and we have another collision with B and A:

Pasted image 20230901205323.png

To retrieve the frequency

To receive A, we found the minimum of all the counters. In this case should be 3. That's why we need multiple hash functions.

Choosing width and height

$$width = \text{Math.ceil}(\frac{2.0}{e})$$
$$height = \text{Math.ceil}(-\frac{\text{Math.log}(1-d)} {\text{Math.log}(2)})$$
In which

  • $e$: the accuracy we want to have
  • $d$: the certainty which we reach the accuracy

Use cases

  • Top K elements
    • Keep a heap of top-k elements with the given Count-min sketch
  • Calculate frequency with fixed size hash