An Ironically Long Primer: Deduplication

I’ve got some other posts I want to make that have to do with storage that does deduplication, but I think it’s important that we have the context first. While there are a variety of applications and methods for data deduplication (the Wikipedia article is a good overview), I’ll try to clarify the points that matter from a practical standpoint with regard to how it is implemented in modern storage, and hopefully why you would care! I will probably shoot a hole in the head of the “quick” part, but these are the juicy tidbits you’ll need to make good decisions and hopefully all you’ll need to know about it. I’ll try to synthesize with a TL;DR at the end.

Deduplication (or “dedupe” as the colloquial shorthand) has a lot of uses in data management as a whole. The idea behind dedupe is to eliminate storing data more times than you have to; in essence, we want to avoid writing duplicate copies of the same data. It makes a lot of sense that we would want to keep from repeatedly storing the same information, right? If I already know about something, and can just refer back to it when I need it in context, I can avoid having to reproduce the same information. However, there are some drawbacks to this that will hopefully be clear(er) by the end.

There are different methods to dedupe data, but at this point it’s a relatively well understood technology when it comes to data storage. Some storage or software solutions will try to oversell their capabilities in this arena and make it seem like magic, but it really isn’t and (in most cases) it is table stakes these days! I will caveat that, yes, some solutions definitely do it better (and some do it horribly, tbh). BUT never lose sight of the fact that what you’re trying to do is make pragmatic and economically sound use of your storage.

Dedupe Ratios

Most of the time when we discuss the effectiveness of dedupe, it’s represented by a ratio. That ratio is based on how much effective (or logical) data you can store per usable (or physical) storage unit. So for example, if I tell you that you will get a dedupe ratio of 5:1 then you could store 5TB of data in a 1TB footprint. Pretty great, right? I can buy less storage and store more data.

Now, before we go any further into the technical bits, let’s go back to the practical part. Suppose I tell you I am offering Solution A: 75TB usable with a 3:1 data reduction ratio (225TB effective) for $500. Then someone else offers you Solution B: 60TB usable with a 4:1 data reduction ratio (240TB effective) for $800. Sure, you might store more on the latter solution, but does it make financial sense? The extra $300 gets you roughly a 7% increase in effective capacity for a 60% price increase. For $200 more than Solution B, you could double Solution A and end up with 150TB @ 3:1 (450TB effective), almost 88% more data stored for only 25% more cost than Solution B.

It may sound like a farce with these kinds of numbers, but you’d be surprised how often a nominal increase in efficiency with a substantial increase in price happens in the IT industry!
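
If you want to sanity-check that kind of comparison yourself, here’s a quick Python sketch using the hypothetical Solution A/B figures from above. The only real “formula” is effective capacity = usable capacity multiplied by the reduction ratio; from there, cost per effective TB does the talking.

```python
# Sanity-checking the hypothetical Solution A vs. Solution B numbers from above.

def effective_tb(usable_tb: float, ratio: float) -> float:
    """Effective (logical) capacity = usable (physical) capacity times reduction ratio."""
    return usable_tb * ratio

solutions = {
    "A":          {"usable_tb": 75,  "ratio": 3, "price": 500},
    "B":          {"usable_tb": 60,  "ratio": 4, "price": 800},
    "A, doubled": {"usable_tb": 150, "ratio": 3, "price": 1000},
}

for name, s in solutions.items():
    eff = effective_tb(s["usable_tb"], s["ratio"])
    print(f"Solution {name}: {eff:.0f}TB effective, ${s['price'] / eff:.2f} per effective TB")

# Solution A: 225TB effective, $2.22 per effective TB
# Solution B: 240TB effective, $3.33 per effective TB
# Solution A, doubled: 450TB effective, $2.22 per effective TB
```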

Now back to the technical bits! How do we dedupe data, where/when do we do it, and most importantly should we do it?

When we talk in terms of how we dedupe data, there are a couple of pieces to that.

Level of Granularity

What is it exactly that we want to compare for duplicates? Are we talking high level, where I see a “thing” that is a coherent and recognizable structure of data (like a file) and make sure I don’t have multiple copies of that? It would be a reasonable place to start, sure, but that might be a structure made up of thousands of similar references. Let’s go through an analogy to address the most prevalent levels of data we’d dedupe against: file, block, and byte.

Let’s say I have a library of books. In this library, there are all types of books from fiction to fact and all the fun and interesting crap in between. Now let’s say I don’t have room to bring in more books, and I have a lot of books with multiple copies that people aren’t checking out, so I want to toss out duplicates to make room.

This would be akin to file-level dedupe. I’ve got a fully constructed and coherent (we’re hoping) piece of data, and a duplicate would be a second copy of the same fully constructed piece of data.

Now let’s say I have pared down to just a single copy of every book. I want to go even further and since many of the words in the books are repeated, I really only need to know the words in the book and where they go. So I’ll take all the unique words from each book, put them in one place, then for each individual book I’ll instead use a reference to my index of words.

Now I have the basis for every book, but only skeleton structures telling me what information is needed to reconstruct that book. This would be akin to block-level. It is worth noting that depending on how large of blocks we assess, it might not be every word, but groups of words that appear together, which could make no difference or tons of difference in our efficiency!

For the next iteration, we know that we have intelligible words and references to them to get the information needed to make a book. But we have a bunch of words made of letters that are repeated. So why don’t we just keep the alphabet we need, and reference those letters whenever we need a word or group of words from a particular book? Then maybe we use the genre (fantasy, horror, poetry, etc.) to map common “chunks” together. Then we can look at a block (a word or group of words) and look at the actual content to see what relevant “chunks” we need.

While it’s a bit more complex, the above example is somewhat akin to byte-level dedupe. It is less common than file or block-level, and generally associated with a system that would do post-process dedupe, which I’ll touch on below.

Why do we care, Jordan? Because not all dedupe is created equal! If what you have isn’t lots of duplicate files but lots of duplicate content underlying those files, the level of granularity you implement makes a very real difference in overall effectiveness.

Bottom line: You may not have a choice of what your current system does, and the choice you do have may have little to no impact at all. The most prudent is probably block-level as most storage subsystems will deal in blocks as the smallest intelligible piece of data.
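
To make block-level dedupe concrete, here’s a minimal Python sketch (not any vendor’s actual implementation): carve incoming data into fixed-size blocks, fingerprint each one, store each unique block once, and keep per-file lists of fingerprints so the data can be rehydrated on read.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size blocks; real systems may use variable-length chunking

block_store = {}   # fingerprint -> unique block data (the "physical" copy)
file_index = {}    # file name -> ordered list of fingerprints (the metadata/pointers)

def write_file(name: str, data: bytes) -> None:
    """Split data into blocks and keep only the blocks we haven't seen before."""
    fingerprints = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        # Only store the block if it's new; otherwise just reference the existing copy.
        block_store.setdefault(fp, block)
        fingerprints.append(fp)
    file_index[name] = fingerprints

def read_file(name: str) -> bytes:
    """Rehydrate a file by reassembling its blocks from the store."""
    return b"".join(block_store[fp] for fp in file_index[name])

# Two "files" that differ as whole files but share most of their blocks.
base = b"A" * BLOCK_SIZE * 10
write_file("report_v1.doc", base)
write_file("report_v2.doc", base + b"B" * BLOCK_SIZE)

logical = sum(len(read_file(n)) for n in file_index)
physical = sum(len(b) for b in block_store.values())
print(f"dedupe ratio ~ {logical / physical:.1f}:1")  # ~10.5:1 here; file-level dedupe would have found nothing
```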

Inline vs. Post-Process

Here’s your crappy analogy for this one:
Do I want a water supply line that can be regulated down to the needs of my plumbing so I can fill a glass and not burst the pipes? OR would I rather have a fire hose that fills a big tank, and then I can fill my glass later?

If we are evaluating inline, it is done as the data is coming in. It is inspected, compared to existing data, and then either written or discarded. If it is discarded, that means we already have that same data, and now we just have another reference to it using metadata. A pointer is created, referencing the existing copy of that data, and requiring only minimal space to store that reference.

If it is post-process, all the data is written to the underlying storage, and then at a later time it is inspected for duplicates. The data will be scanned after the fact to find and eliminate duplicate copies, indexing the references as it finds them.
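
At the risk of oversimplifying, here’s a rough sketch of the difference, reusing the hash-index idea from the block store above (real systems obviously juggle queues, caches, and garbage collection on top of this): the distinction is whether the fingerprint check happens before or after the data consumes physical space.

```python
import hashlib

block_store: dict = {}     # fingerprint -> block, as in the block-store sketch above
staging_area: list = []    # stand-in for raw blocks already landed on disk at full size

# Inline: fingerprint and check the index *before* the block consumes physical space.
def inline_write(block: bytes) -> None:
    fp = hashlib.sha256(block).hexdigest()
    if fp not in block_store:
        block_store[fp] = block    # genuinely new data: write it
    # else: duplicate; only a small metadata pointer would be recorded

# Post-process: land everything first, in full, then scan later and collapse duplicates.
def post_process_ingest(block: bytes) -> None:
    staging_area.append(block)     # full-size write, no questions asked

def post_process_scan() -> None:
    while staging_area:
        block = staging_area.pop()
        fp = hashlib.sha256(block).hexdigest()
        block_store.setdefault(fp, block)
    # Space is only reclaimed once the scan runs; until then you need enough
    # physical capacity to hold every incoming write at full size.
```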

Inline dedupe is generally preferred and fairly typical in a modern system. Think back to the scenario above with Solution A and Solution B; what if either solution relies on post-process dedupe? Well, you’d better hope you never have to land more data than you have available in usable (pre-reduction) storage! And even when you can land it, will it dedupe in time so that you don’t fill up as other writes come in?

Say I have 1TB of physical capacity and data that reduces at 10:1, but I’ve already written 500GB to the physical storage, and I have an incoming write of 600GB. Even though it would eventually reduce to 60GB, with post-process it has to land in full first, and if it doesn’t have the space to land, you’re SOL.

Sidenote: It’s not necessarily either/or with inline/post-process. Some solutions will do dedupe inline, but if enough IO is coming in that the overhead of the extra analysis it takes to find duplicates would hurt performance, they may shut it off and come back later for post-process. I’ve seen people land in that “SOL” example I just gave because it worked 99% of the time, until a heavy workload meant the system sacrificed dedupe to keep up with performance and, for lack of a better term, shat the bed.

Why do we care, Jordan? Because dedupe isn’t free! Not just in the $$$ sense I mentioned earlier; that extra analysis takes time and resources (CPU cycles, memory, etc.). Doing it inline means writes might be slower; doing it post-process keeps writes fast but is less space-efficient in the meantime; and depending on how the process is implemented, how much utilization the system sees, and so on, it can affect both performance and efficiency.

Bottom line: Inline dedupe makes the most sense most of the time so that you are guaranteed your data is being reduced (if it will reduce, more on that below). Most storage systems and software will do it inline, but you probably want to make sure it always does it inline or at least know the risks if not.

Dedupe Domains

What is our actual “dedupe domain”? This is a term that refers to the boundaries of what is going to be compared for duplicates. Is it going to be everything ever written or a subset? Going back to the book example, am I just looking at all the books in one shelf to compare? The whole library? Every library in my county/state/the world?

Some systems do global dedupe, meaning everything stored there will be compared to everything else to ensure there are as few duplicate copies as humanly (machinely?) possible. That bears certain implications, like the fact that as our data set grows it could be a monumental task to compare and keep sane as the metadata grows immensely. When we start out: hey, great, I don’t need to store this again, I’ll just point to this other copy and use a metadata marker! But what happens when you do that again. And again. And again and again and again? Suddenly you have metadata and indices all over the place. Reconstructing that data when you need to read something means you have to cross-reference the indices with the actual data, assemble the data into more coherent structures, and return it to the system requesting the data. This is often referred to as the rehydration tax, as you “rehydrate” (reassemble) the deconstructed data.

Another common implementation is to establish some sort of boundary, via a policy or a group of sorts; nowadays this is actually the most common methodology. The reason it is more popular in a lot of implementations is that you’re hedging your bets on which data is likely to be dissimilar. For instance, what is the likelihood that you’d have a duplicate data block between a generic Windows OS disk and an Oracle database? Or, similarly, between an Oracle database and an Exchange mailbox? It’s probably unlikely enough, or the occurrence will be so insignificant, that it’s better to skip comparing them and reduce the overhead required for dedupe.

In this scenario you would do something like create a policy that says “any volume/drive/storage subsystem with this policy should be compared for similar data.” Now you can establish boundaries that make sense and still give you really good data reduction efficiency. You could group by application for primary storage, group by jobs in backup, etc.
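
Conceptually, a dedupe domain just means each policy group gets its own fingerprint index instead of one global one. Here’s an illustrative sketch (not any particular vendor’s policy engine; the volume and policy names are made up):

```python
import hashlib
from collections import defaultdict

# One fingerprint index per dedupe domain instead of a single global index.
domain_indexes: dict = defaultdict(dict)   # domain name -> {fingerprint: block}

# Hypothetical policy mapping: which volumes share a dedupe domain.
volume_policy = {
    "win-vm-01": "windows-os",
    "win-vm-02": "windows-os",
    "oracle-db": "databases",
    "exchange":  "mail",
}

def write_block(volume: str, block: bytes) -> bool:
    """Return True if the block was a duplicate within its own domain."""
    domain = volume_policy[volume]
    index = domain_indexes[domain]
    fp = hashlib.sha256(block).hexdigest()
    if fp in index:
        return True                # duplicate found, but only within this domain
    index[fp] = block              # a block on 'oracle-db' is never compared against 'win-vm-01'
    return False
```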

Why do we care, Jordan? Back to the “dedupe isn’t free” statement; are you gunning for maximum storage space in the smallest footprint, good space with good performance, or the most cost-effective option?

Bottom line: Pick the right system for what you are trying to accomplish, but understand your data and workloads, how they are structured, and how to structure them to get the best efficiency for whatever you have.

Non-reducible Data

We’ve talked a lot about the ways to reduce data, but that’s assuming it will reduce. There’s plenty of data that you aren’t likely to see much (if any) reduction from.

Some examples would be video or image data. It is highly likely that each media file is going to be unique, and it is a stream of data that has to stay contiguous to be intelligible! Sure, there are data blocks/bytes underneath, but what is it that makes those files meaningful? Where the pixels are laid out, what is being said at a given time, etc. Those aren’t discernible things to a storage subsystem. Maybe someday we’ll have a cooler way to do that, but for now you can assume those won’t really reduce. If you have multiple copies of the same file, sure, you may get reduction there.

Similarly, data that is encrypted can be problematic. If it is encrypted before it is written to the storage, the ciphertext will be, by design, effectively unique and random-looking, even when the underlying data is identical. That’s expected behavior from good encryption, but you want to be absolutely sure you’ve not forgotten this critical aspect when you’re counting on a reduction ratio.
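
To illustrate the point, here’s a toy sketch (the dedupe check is just a hash set, not any real array’s logic, and it assumes the third-party cryptography package is installed): identical blocks written in the clear fingerprint the same, but the same blocks encrypted host-side with fresh nonces do not.

```python
import hashlib
import os

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes  # third-party 'cryptography' package

key = os.urandom(32)
block = b"the same chunk of data " * 200       # two writes of identical plaintext

def encrypt(data: bytes) -> bytes:
    nonce = os.urandom(16)                     # fresh nonce per write, as any sane scheme requires
    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    return nonce + enc.update(data) + enc.finalize()

seen = set()
def is_duplicate(data: bytes) -> bool:
    """Toy dedupe check: have we seen this exact fingerprint before?"""
    fp = hashlib.sha256(data).hexdigest()
    if fp in seen:
        return True
    seen.add(fp)
    return False

print(is_duplicate(block), is_duplicate(block))                    # False True  -> second write dedupes
print(is_duplicate(encrypt(block)), is_duplicate(encrypt(block)))  # False False -> nothing to dedupe
```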

A lot of storage solutions can encrypt data at rest so that it is encrypted on the underlying media and decrypted only when the data is accessed/updated, then re-encrypted.

If you are encrypting data for compliance reasons, sometimes having the storage encrypt data at rest is enough to satisfy the requirement, especially in conjunction with a larger defense in depth strategy. Or perhaps in conjunction with something like in-memory encryption on AMD EPYC systems.

Why do we care, Jordan? You don’t want to rely on data reduction to save your bacon if you can’t reduce the data! This is another reason why you should be careful if someone comes in touting a particular reduction ratio; how do they know? Unless they know your data (or you do, to challenge them) you can’t be sure.

Bottom line: Make sure you understand what data you have and how much of it. If you have a lot of unique or streaming data, you aren’t likely to see much in the way of savings. If you have to encrypt data, do it at the storage level if you can so that it gets encrypted after it gets deduped.

Summary

If you made it here without skipping to the end, I hope it was worthwhile and gives you better foundational knowledge of what dedupe actually does and where it might be practical. I made the statement that dedupe isn’t magic, but it can certainly still be nebulous, and it is often surrounded by competing technologies that try to put a “spin” on it to mystify what they’re doing and make it seem valuable. Hopefully this gives you enough information to make a good decision and ask the important questions!

For those of you here for a TL;DR here’s what I’d say:

  • Know what your data looks like: type of data, amount, how/how quickly it changes, and if it will even reduce.
  • Understand how your solution intends to group data, and, knowing that, how best to group your own data for maximum efficiency.
  • Know how your solution processes data, and, I cannot stress this enough, what the performance impact of dedupe is and whether you’re willing to pay the rehydration tax.
  • Above all else, does it make sense to use it? Some applications of dedupe make more sense than others; generally it’s an added benefit that’s considered table stakes in modern storage, but does it achieve the outcome you need it to? Think back to the $/TB example.

I hope this helps you to understand how dedupe works, where it can help, and above all else have the information needed to make prudent decisions about your technology!

I’ll be going through setting up a StoreOnce virtual appliance, whose main purpose is to dedupe storage for backups (one of the great use cases), in a later post if you are interested in seeing how this applies in the real world.