I have spent a few hours watching many people just to understand how deduplication works... and this simple video is the best video I have ever seen.
You have explained the whole thing in just a few minutes so any layman like me can understand.
Thank you very much.
Very simple and straight explanation. Good job.
I know this video is old, but it is very well done! Thanks! I hope it is okay that I share this to help train storage/DC sales people who need help on this topic so they truly can 'get it.' Will always make sure you get credit for sure! Great work.
Awesome summation! I'm in the process of selecting a SAN solution and this simple video added a great piece of knowledge to my overall understanding!
Thanks!
In the MS world, each file keeps a pointer to its blocks, called a reparse point, stored in place of the original file. And the place on the server where all the unique blocks get stored is called the chunk store... awesome video.
This guy did his undergrad in journalism woah
Wow. You have explained this so well. 👍
Perfect. Thank you.
Perfect, so simple to understand. Thank you.
Clear and concise explanation - thanks
Still relevant to this day. Remember that kids ;)
P.S. You also have DB deduplication, e.g. via a memory cache, so it can happen in other parts of the software or in a network. Not only on disk.
Explained in plain English. Well done. Thanks.
why does every video from the early '10s look like it was the 80s. Boy has technology changed us
Awesome explanation!
Thank you. Now I have more understanding.
thank you !!
Great explanation. Thanks
We understand file level, but in block-level deduplication, if any one person changes their data, how does it get stored in the data center? And what if no two people have the same data?
This helped me so much! Great job!!!!
Thanks, Nicely Explained
Thank you. Well done.
good explanation
Hi! The only thing you don't tell: how does deduplication on block data know where each block goes? I mean, 12345 is just a row of numbers for 1) different users and 2) different blocks; so it's not Me=12345M; Ted=12345 ... so, how does deduplication know where each block goes? And how much space does this information require compared to the original block information?
Hope you can explain that, maybe either in a video or with a comment, THANKS!
That's helped a lot
Doesn't a block change if a file within the block changes?
great explanation.
well explained, thank you!
That was Bad Ass.... great explanation and summation
Well done. That is understandable even if you're non-technical like me. Thanks
Thanks, Well Explained.
Nice work. Any recommendation on which backup software is better than the others for block-level dedupe?
Thank you very much!
This is a great explanation. I just have 1 question. Why?
Storage and performance optimization.
Thanks
Thanks for sharing this info...
Great! Thanks
thanks brah
Nice job
Great explanation, thanks, but I have a question regarding block-level deduplication. If 2 guys have the same song and a 3rd guy has a different song, how would block-level deduplication work there? Let's say 2 of them have Coldplay - Yellow and the other guy has Iron Maiden - Fear of The Dark. How would block level work there?
Maybe the file is recognized as a music file, so 1 block saved across all of the files would be the part saying that it IS a music file, but then there would be separate blocks for the artist and songs?
I think a music file was a bad example, only because it makes it hard to visualize the bits being used to dedupe, and most music files are already compressed. There are some other videos that show how it works, but basically the software recognizes repeated patterns of bits in the data being backed up. Using this video's example, let's say block 1 is 0110 in binary and block 2 is 0101. Maybe it's just metadata or file headers that tell the computer it's an MP3 file (I'm not sure, just trying to use an example). This wouldn't change for ANY MP3 file being backed up, so it would be redundant to store a copy of those blocks for every MP3 file being backed up. Block 3 could be 1010 in binary, block 4 1100, and block 5 1001. These could contain the specific audio codec being used, different bit rates, or other components of an MP3 file that vary from file to file. Let's say block 3 says the bitrate is 128 Kbps, block 4 says the bitrate is 160 Kbps, and block 5 says the bitrate is 256 Kbps. The rest of the file is contained in 100s of other blocks, and those blocks will be largely unique and can't be deduped very well (compressed file formats like MP3 are terrible at deduping, and many times the file actually becomes larger). These binary patterns are stored in a dedupe engine used by the software, and every time a specific pattern is recognized, the software notes the location in the file and which stored binary pattern can be inserted at that location.
All 3 files are MP3s, so we don't need to keep saving that part of the file, but we do need to know the other pieces of information to ensure the file is usable when it's restored. Over time, these redundancies can add up to huge amounts of data. We don't need to save blocks 1 and 2 for every file; we simply need to know what block 1 and block 2 look like (0110 and 0101) and what they represent. Then, when the deduplication engine sees these patterns, it knows it can skip backing them up and use a pointer to indicate where the pattern exists in the specific file. I'm far from an expert, that's just my understanding of how the deduplication process works.
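To make the song question concrete, here is a rough Python sketch of block-level dedupe as I understand it. It assumes fixed-size blocks and SHA-256 fingerprints; the `BlockStore` class, the 4-byte block size, and the fake "song" bytes are all made up for illustration and are not how any particular product does it.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny on purpose so the demo is readable


class BlockStore:
    """Toy block-level dedupe store: each unique block is kept once."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.blocks = {}   # fingerprint -> block bytes (stored once)
        self.recipes = {}  # file name -> ordered list of fingerprints

    def put(self, name, data):
        recipe = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Only store the block if this fingerprint is new.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        self.recipes[name] = recipe

    def get(self, name):
        # Rebuild the file by following its pointers in order.
        return b"".join(self.blocks[d] for d in self.recipes[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


# Two people have the same song, a third has a different one.
yellow = b"HEADyellow-audio-data"  # made-up stand-in for Coldplay - Yellow
fear = b"HEADfear-of-the-dark"     # made-up stand-in for Fear of The Dark

store = BlockStore()
store.put("me", yellow)
store.put("ted", yellow)
store.put("bob", fear)

logical_bytes = len(yellow) * 2 + len(fear)
print("logical bytes:", logical_bytes)
print("stored bytes: ", store.stored_bytes())
```

The two identical songs cost nothing extra, and the different song only shares the one block it really has in common (the fake header). If nobody has anything in common, every block is new and you save nothing.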
Deduplication only works on duplicate files not on unique files.
You keep a mapping table, transparent to the user (who can ignore it), that maps the logical view (what you think there is on the backup storage) to the physical view (what you actually have on the storage). Let's take the last case from the vid: you think you have 1234, 1245, 1235, but in fact you have 12345. The mapping table might contain: Block 1 represents the 1st, 5th and 9th logical blocks. Block 2 represents the 2nd, 6th and 10th logical blocks.... Block 5 represents the 8th and 12th logical blocks.
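The mapping table for that last case can be sketched in a few lines of Python. The variable names are just for illustration, and the blocks are the single digits from the video's example.

```python
# What the users think is stored: three files, 1234, 1245, 1235,
# laid out as twelve logical blocks.
logical = list("1234" + "1245" + "1235")

physical = []  # unique blocks actually on the storage, in first-seen order
mapping = {}   # physical block -> 1-based logical positions it represents

for position, block in enumerate(logical, start=1):
    if block not in mapping:
        physical.append(block)
        mapping[block] = []
    mapping[block].append(position)

for block in physical:
    print("Block", block, "represents logical blocks", mapping[block])

# Restoring goes the other way: walk the table and put each block back.
restored = [None] * len(logical)
for block, positions in mapping.items():
    for position in positions:
        restored[position - 1] = block
```

So the table costs a few small integers per logical block, which is tiny next to a real block of several kilobytes.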
Great
What is deduplication?
Grossly underestimated file deduplication: forgot Rabin fingerprints and file chunking?
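For anyone wondering what the chunking point is about, here is a minimal sketch of content-defined chunking. Big assumption: it uses a plain polynomial rolling hash as a stand-in, not a true Rabin fingerprint (which does polynomial arithmetic over GF(2)), and `WINDOW`, `BASE`, `MOD` and `DIVISOR` are arbitrary illustrative values. Real systems also enforce minimum and maximum chunk sizes, which this leaves out.

```python
import random

WINDOW = 16            # bytes covered by the rolling hash (illustrative)
BASE = 257             # polynomial base (illustrative)
MOD = (1 << 31) - 1    # prime modulus (illustrative)
DIVISOR = 32           # cut when hash % DIVISOR == 0


def cut_points(data):
    """Return chunk end offsets chosen from the content itself."""
    if not data:
        return []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    cuts = []
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # take in the newest byte
        if i + 1 >= WINDOW and h % DIVISOR == 0:
            cuts.append(i + 1)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))  # the last chunk runs to the end of the data
    return cuts


def split_chunks(data):
    chunks, start = [], 0
    for end in cut_points(data):
        chunks.append(data[start:end])
        start = end
    return chunks


# Insert one byte at the front and see how many chunks are still reusable.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(2048))
shifted = b"X" + original

chunks_a = split_chunks(original)
chunks_b = split_chunks(shifted)
known = set(chunks_b)
reused = sum(1 for chunk in chunks_a if chunk in known)
print(reused, "of", len(chunks_a), "chunks reused after a 1-byte insert")
```

Because each cut depends only on the last `WINDOW` bytes, the boundaries move with the data, so only the chunk around the insert changes. With fixed-size blocks, a one-byte insert would shift every later block and break the dedupe.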
nice
Mohr Junction
Doesn't have indian accent; watchable.
i suggest a shave and a haircut.
Great explanation. Thanks
Well done. Thank you!
Great explanation. Thanks