Floral foam 🌼, elephants 🐘 and evil mayors 😈: the useful HyperLogLog algorithm explained as if you were 10 years old!
Another take on HyperLogLog?
I explained the HyperLogLog algorithm by Flajolet et al. (2007) in an earlier blog post but noted that using the coin flip analogy I wasn’t quite able to explain it in an elevator pitch or in a bar for my friends. The learning curve from flipping a coin to binary representations of strings was too far (and even further for social media components) - especially for non-programmers.
Also, I noted that I needed to refresh my own memories myself after not working with the algorithm for a couple of months forcing myself to come up with a good visual analogy!
The following intro reduces the algorithm to its minimum first and builds up step-by-step.
Counting children - fight for your rights!
🕑 We just went back in time, you’re 10 years old again!🕑
Let’s imagine that your city wants to abolish your favorite playground.
Damn! You go to your city’s mayor and complain:
“Do not abolish the playground, I love it!” 😭"I'm very sorry but apparently only
a handful of children use the playground
so it's not needed anymore. Instead, we will
build expensive freehold appartments." 😈
You are surprised as you see plenty of children playing there everyday:
“That’s not true! All of my friends
come to the playground almost everyday.
It’s really crowded!”Well, so prove me wrong.
If you can convince me that
1000 different children play there per week
it shall remain.
You accept the task and think of a plan how you could possibly count all the children…
Watching and failing
Instead of going to school you decide to go to the playground Monday morning and count all children during the whole day.
“Alright, that’s some 2-year-olds with their moms, Felix and Laura which makes 2…” 🧑🏿🤝🧑🏽
The rest of the morning it’s quite empty, but after school plenty of your friends arrive!
“Great, that’s John, Abdul, Maria, Giovanni… so 4+2… " 🧑🏿🤝🧑🏽🧑🏿🤝🧑🏽🧑🏿🤝🧑🏽
Maria leaves the playground to buy a drink and reenters. You don’t remember whether she already entered before or not.
You get tired of counting and join your friends playing totally losing track of your count.
“Damnit! All the work for nothing. Also I cannot be here the entire day watching everyone. It’s too confusing to keep track of who entered already… What could I possibly do?”
Luckily you remember your friend’s dad working in this weird geographic niche. Once, he tried to explain you something with a coin but you don’t quite remember. You call him and he gives you a few tips…
Remember floral foam? You know, this foam that keeps its form if you touch it.
Photo by London Flower School
You go to your favorite DIY-store and buy a 2x2m wall made from floral foam.
Here comes the trick: instead of sitting next to the playground, you simply install the 2x2m wall of floral foam right at the entrance of the playground so everyone needs to pass it to enter. You get some weird looks first but leave a message and people understand that it’s for the good of the children.
Alright, how is a gigantic floral foam wall related to a complicated algorithm?
Everyone who needs to pass the wall, leaves it’s personal form in the foam.
See this shape? This was kid #1. Some foam is gone now, the rest keeps its shape. The more people pass the less foam there will be in the end. Simple right?
Kind #2 comes and leaves its shape. The colors here are only kept for distinguishing the individual areas.
Then kid #3 …
You get it. Now, imagine there was an app on your phone that could tell you just by taking a picture how many people passed the gate. Sounds crazy? Well this app is - in a way - just like HyperLogLog! Based on the area of remaining foam it can - metaphorically speaking - calculate through an empiric function how many different people passed the gate with a small error rate.
Kids #4, #5, #6 pass.
The final form of the entrance foam wall would eventually look like this.
See how the final form is a mix of all the different kids having passed the entrance?
Thanks to the foam gate and your friend’s dad you were able to conclude that plenty of kids use the playground. Finally the mayor was convinced to leave the playgrounds where they were. ;)
Now - assuming kids keep their “shape” - nothing changes if they pass 2, 3, 10 or 100 times as their form is already memorized by the foam even though we do not know exactly who passed.
Instead, if a new kid passes it usually has a unique shape and removes a tiny bit of foam while passing.
It might also be possible that a small child passes the gate and nothing changes. That’s totally alright as with the app (or rather HyperLogLog algorithm) we just guess the number of different kids anyway.
Ok, but what happens if an elephant passes the gate?
Well, yeah it might break and all your work is ruined. However, there is a good way to mitigate that risk!
Imagine, you not only had one foam gate, but let’s say 26 - one for each letter in the latin alphabet.
Depending on the first letter of your name, you are only allowed to pass through that gate.
So Marco and Maria take the M-gate. Alex and Abdul take the A-gate and elephant - you guess it - takes the E-gate.
The consequence is that if 1 elephant or 100 come, they won’t ruin everything, but just 1 out of 26 gates! Therefore, the app or HyperLogLog is still quite good at guessing.
For HyperLogLog, the gates are called buckets or registers.
If there were more playgrounds at stake and you would need to count the number of distinct children for two of them, you just link them!
So let’s take playground 1…
… and add it up to playground 2.
This would still lead to the exact same guess without increasing the error rate!
Speaking with HyperLogLog it’s called union.
Kids using both playgrounds
There is even a mathematical method of knowing how many people used both playgrounds at some point. It’s called intersection.
Privacy and HyperLogLog
Talking through the foam gate analogy:
- If you want to know whether a particular kid passed already, you make it go through the gate. If nothing changes, the kid might have gone through already, but you cannot know for sure, as maybe the gate is already very big and plenty of people can pass without any changes to the foam.
- On the other hand, you can easily tell, that a certain kid has not yet passed if the foam changes!
That’s a very important characteristic of HyperLogLog: you can only prove that someone has not been somewhere but never that someone actually has been somewhere.
Therefore the strict definition of differential privacy is not fulfilled and Desfontaines, Lochbihler & Basin (2019) claim that “Cardinality Estimators do not Preserve Privacy.”
However, you can never assure, that a smart kid (beeing grounded at the moment) did not simply jump over the fence in order to play secretely! In this way the kid was at the playground but carrying out the before-mentioned test, it would show that it wasn’t. This behavior can be used to make HyperLogLog even more secure by assuring that you never registered 100% of all kids (or social media users) but rather randomely drop some of them. Differential privacy still wouldn’t be met, but you never have certainty about anything which is a core principle of privacy.
No matter, how many kids passed the gate, the effort or computational time for the app (or HyperLogLog) remains (almost) the same. A HyperLogLog set can remember plenty of different kids’ shapes. In other words: the foam (or a HLL-set) can hence remember billions of elements in very little memory or 1.000.000.000 elements in 1.5kb of memory with only 2% error rate (Flajolet et al. 2007:127).
HyperLogLog and social media
For LBSN-Dashboard HyperLogLog is part of the data model in order to not put social media users at risk. The idea is that we create plenty of HyperLogLog sets, one for each social media component like coordinates, different terms like “tree” or “food” or social media locations like “botanical garden”. In this way, for each of these components, the HyperLogLog set represents the playground foam gate whereas the kids are represented by a unique user id or unique email.
Original posts can never be restored and one can never be sure that a particular user posted something about something. The big (data) picture however allows for general statements and (relative) distributions very valuable to the people and municipal decision-making.
Take a look at Dunkel, Löchner & Burghardt (2020) for an in-depth-take on HyperLogLog, social media and privacy.
If you made until here, read through my former tutorial or feel free to contact me! I’d love to hear some feedback whether it helped you to understand the nature of HyperLogLog.
DESFONTAINES, D., LOCHBIHLER, A. & BASIN, D. (2019). Cardinality Estimators do not Preserve Privacy. Proceedings on Privacy Enhancing Technologies, 2019(2), 26-46.
DUNKEL, A., LÖCHNER, M., & BURGHARDT, D. (2020). Privacy-Aware Visualization of Volunteered Geographic Information (VGI) to Analyze Spatial Activity: A Benchmark Implementation. ISPRS International Journal of Geo-Information, 9(10), 1-21.
FLAJOLET, P., FUSY, É., GANDOUET, O., & MEUNIER, F. (2007). Hyperloglog: the analysis of a nearoptimal cardinality estimation algorithm. Analysis of Algorithms 2007 (AofA07), 127–146.