Naren, DAU and MAU are key metrics online copmanies calculate using Data warehouses. You just simplified the calculations for that significantly! Nice explanation.
THIS COMMENT IS FOR MY PERSONAL REFERENCE. TO UNDERSTAND PROPERLY WATCH THE FULL VIDEO -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- concept 2:45 bitmaps array 5:23 active user count 6:56 user login everyday 8:17 different device count 10:07
Hi Narendra , That was very helpful. I was thinking of an improvement in there. How about using multiple bit map arrays instead of one single bitmap array ? Say , we can use 200 bitmap arrays for 200 M users and another array of 200 size which will store count of active users in ith bitmap array. This way, we can reduce time complexity by not traversing ith bitmap array if the active users for that bitmap array is 0. Correct me if I am missing any edge case. Thanks.
Great Video Sir, But I have few questions. Where do we store this bitmap? In Db or Memory? If in Db, do DBMS providers provide this functionality inbuilt? If in memory, are there any client libraries providing this? Also, what about persistance/fault tolerance if we are storing in memory?
Thankyou Narendra, very good explanation. You always explain DS as though it's a cake walk. Why have you stopped posting new videos? Hope you are doing safe and healthy. I was just thinking, why don't we store the data to day-on-day information in database with 4 bit integer for logged in per day basis or it can even be an entry into a row type DB for the last logged in date with a bit 0/1 per user. Anyways we would need to think of a purging system as we can't retain this for long even in case of bitmap array
1. our user ids might not be auto incrementing integers. in that case, I will need to store the mapping of a user id with auto incrementing integer first and for every request have to check the mapping first to make updates in bitmaps. I guess we can use something like in memory datastore like redis for that. 2. Also, we could have used count min sketch to get approximate numbers for these analytics. Thoughts?
hmmm i don't know about that. Imagine the concurrency problem you would have to update this map as the logins are happening like crazy! Also, How would you efficiently traverse the array to update the bit correspondent to the user logging in?
Perhaps I missed this, but how would you actually traverse the array to find all active users? Would you have to do a linear time scan and count how many '1's are there ?
correct me if i got it wrong, this bitmap array index went with the assumption that we would've users ids as integers right? But that's not the case in most dbs, we use mostly the uuid's then how would we correctly map/say array bitmap index corresponds to this user id? do we use some kind of hash function? if yes, then how we'd go write the unique hash function as it has collision problems ? Thanks
The last ID of your Bitmap is 999.999.999, not 1 Billion. For archive purposes this is storage efficient, but a simple SQL-DB wouldn't take much more. You can make relations and store its statistics this way, you don't have to store the entire user data every day. Also how is the system able to determine if the user is the same user or a different user? Where is the information stored? I still need a solid SQL DB, because I need to store key information to identify the user non the less. But other than that, yes, bit shifting is very fast to compute active users from specific days. As soon as you have to plot those, it gets slower, though.
Good catch about 999..., Yes you are right we need a persistent db to store who is who, this is about collecting metrics in lighting fast with less overhead. I am sure not many people will use this in there daily life. But some systems do implement. Eg: Routers to find the unique combination of source vs destination IP address for various security implementation. citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.724&rep=rep1&type=pdf
@@TechDummiesNarendraL Thanks for reply, yes this is a good reason and would've been a good example in your video as well as the clarification that this is not a replacement for a database to address interested beginners in this environment. Good video nonetheless. 👍🏻
Thanks for the video. Can a system like this account for users deleting their account? If the user was removed from the bitmap, it would mess with the indexing.
@@TechDummiesNarendraL If you had 10 users in your bitmap and you wanted to remove the 5th user, your bitmap would a length of 9. However the IDs of user 6,7,8,9 would then become 5,6,7,8.
@@girig6421 It is expected that the user-id is auto-incremented. We can keep separate bitmap of user_deleted. So, total-active-users = total-logged-in-users - total-deleted-users. The use-case mentioned by Tech Dummies is for statistics purpose.
@@girig6421 User ids should all be unique or strictly monotonically increasing. Else the design becomes practically unusable, viz., wrt (historical) metrics.
Great video , keep doing it
I giggle everytime I see thumbnail for this video xD. Great sense of humor and content.
This channel is a gold mine
Thank you for this amazing video. :D This bitmap looks like bloom filter :D
Naren, DAU and MAU are key metrics online copmanies calculate using Data warehouses. You just simplified the calculations for that significantly! Nice explanation.
Good One!
THIS COMMENT IS FOR MY PERSONAL REFERENCE. TO UNDERSTAND PROPERLY WATCH THE FULL VIDEO
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
concept 2:45
bitmaps array 5:23
active user count 6:56
user login everyday 8:17
different device count 10:07
Hi Narendra , That was very helpful. I was thinking of an improvement in there. How about using multiple bit map arrays instead of one single bitmap array ? Say , we can use 200 bitmap arrays for 200 M users and another array of 200 size which will store count of active users in ith bitmap array. This way, we can reduce time complexity by not traversing ith bitmap array if the active users for that bitmap array is 0. Correct me if I am missing any edge case. Thanks.
Great Video Sir, But I have few questions. Where do we store this bitmap? In Db or Memory? If in Db, do DBMS providers provide this functionality inbuilt? If in memory, are there any client libraries providing this? Also, what about persistance/fault tolerance if we are storing in memory?
Thankyou Narendra, very good explanation. You always explain DS as though it's a cake walk. Why have you stopped posting new videos?
Hope you are doing safe and healthy.
I was just thinking, why don't we store the data to day-on-day information in database with 4 bit integer for logged in per day basis or it can even be an entry into a row type DB for the last logged in date with a bit 0/1 per user. Anyways we would need to think of a purging system as we can't retain this for long even in case of bitmap array
1. our user ids might not be auto incrementing integers. in that case, I will need to store the mapping of a user id with auto incrementing integer first and for every request have to check the mapping first to make updates in bitmaps. I guess we can use something like in memory datastore like redis for that.
2. Also, we could have used count min sketch to get approximate numbers for these analytics.
Thoughts?
hmmm i don't know about that.
Imagine the concurrency problem you would have to update this map as the logins are happening like crazy!
Also, How would you efficiently traverse the array to update the bit correspondent to the user logging in?
Perhaps I missed this, but how would you actually traverse the array to find all active users? Would you have to do a linear time scan and count how many '1's are there ?
Yes.
correct me if i got it wrong, this bitmap array index went with the assumption that we would've users ids as integers right? But that's not the case in most dbs, we use mostly the uuid's then how would we correctly map/say array bitmap index corresponds to this user id? do we use some kind of hash function? if yes, then how we'd go write the unique hash function as it has collision problems ? Thanks
One question about the userId, as we know some system will use UUID as user id, in this case how could we set the order of the user in the bitMap?
We can have a separate mapping for that.
how would we handle case where userid is alpha numeric? How would we map a userid into a particular index uniquely?
The last ID of your Bitmap is 999.999.999, not 1 Billion.
For archive purposes this is storage efficient, but a simple SQL-DB wouldn't take much more. You can make relations and store its statistics this way, you don't have to store the entire user data every day. Also how is the system able to determine if the user is the same user or a different user? Where is the information stored? I still need a solid SQL DB, because I need to store key information to identify the user non the less. But other than that, yes, bit shifting is very fast to compute active users from specific days. As soon as you have to plot those, it gets slower, though.
Good catch about 999..., Yes you are right we need a persistent db to store who is who, this is about collecting metrics in lighting fast with less overhead. I am sure not many people will use this in there daily life. But some systems do implement. Eg: Routers to find the unique combination of source vs destination IP address for various security implementation. citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.724&rep=rep1&type=pdf
@@TechDummiesNarendraL Thanks for reply, yes this is a good reason and would've been a good example in your video as well as the clarification that this is not a replacement for a database to address interested beginners in this environment. Good video nonetheless. 👍🏻
Liked the video, 1 question: what if my userIds are not integers but guids? any efficient way for the same?
You can perform 2 things:
1. Use Row Number if there is any (recommended)
2. Use a Hash Function
How come iteration over 1 million bits is taking only 100ms?
we will need to load of all the 125 MB data of last n days in memory to count active users?? can we do it more efficiently??
Hyperloglog can do this in a more space-efficient way (1-64 KB in size). However, it would sacrifice a little bit of accuracy.
Thanks for the video. Can a system like this account for users deleting their account? If the user was removed from the bitmap, it would mess with the indexing.
Not really, even if the user is deleted the IDs of the existing users remains same.
@@TechDummiesNarendraL If you had 10 users in your bitmap and you wanted to remove the 5th user, your bitmap would a length of 9. However the IDs of user 6,7,8,9 would then become 5,6,7,8.
@@girig6421 🧐 why do you change user id for users whenever some user deleted his account?
@@girig6421 It is expected that the user-id is auto-incremented. We can keep separate bitmap of user_deleted. So, total-active-users = total-logged-in-users - total-deleted-users. The use-case mentioned by Tech Dummies is for statistics purpose.
@@girig6421 User ids should all be unique or strictly monotonically increasing. Else the design becomes practically unusable, viz., wrt (historical) metrics.
Do bitmap have only 0 & 1 ?
Bitmaps are made of bits and a bit can only have 1 or 0
@@TechDummiesNarendraL *binary.
this is useful but how come it can be used in real world example?
Analytics probably, "Get top K items purchased today", you could do all this in Redis, instead of an MR/Spark Job.
Count = French for: I'm already, a damn terroriste! How are we, dummies then? ie = #4, in XP- Orient!