This series of articles details my thoughts and ultimate process for solving data backup. In the first part we set up the background by tossing a phone in a river. I encourage you to start there if you haven’t already read the preceding parts. This part continues by exploring how we can think about data and classify it. As you’ll see below, it is less important to think about what the data is, i.e., pictures versus tax returns, and more important to think about the characteristics it has.
In the previous part I asked you to think about the definitions of local and remote. I also asked you to think about recovery scenarios. I’ve detailed my answers for remote and local below. My thoughts on recovery were covered in the previous part.
While I travel a lot for work, I am not a digital nomad and I have a permanent home-base and an office I can visit. Here are my answers:
For me, remote is going to be cloud storage. I know I don’t have the personal discipline to regularly move drives between my home and my office. I also no longer drive to work, so options like the car trunk are out (and kind of a joke anyway). Cloud storage is going to allow me to really have my data far away from me, possibly even on a different continent.
There are two locals for me. The first is easy: my laptop. The working copy of my data is on my laptop. This is true for the data that will meet the 3-2-1 rule requirements. Some of the variants that were introduced in the second part have fewer local copies.
My other local is going to be one or more external hard drives. I may wind up doing some local mirroring to guard against drive failure. I don’t know for sure, as I haven’t gotten to implementation yet. Today I own three external hard drives, but they are quite old: a 1TB drive from 2010, a 3TB drive from 2012, and a second 3TB drive from 2013. I’ll need to think carefully about these during implementation.
I hope you had some good thoughts about these topics.
There are a lot of ways to classify data. An easy one is by the ease of getting it again from a third party or recreating it. For example, your operating system files are generally trivial to retrieve again from the distributor, so you don’t need to bother backing them up. Your system configuration (or photos, etc.) is almost impossible to get back from a third party and may be impossible to recreate. This classification is a great way to think about what you may want to back up versus what you don’t have to back up; however, it doesn’t help you apply the 3-2-1 rule and its variants.
Another way to look at data is to divide it by value. How important is this data? A simple model is to rank the value of data as:
These dimensions start to provide us with a way of looking at the 3-2-1 rule and its variants. High value data should get the full 3-2-1 treatment. Medium data may need that or could qualify for 2-1-1 treatment if it is infrequently accessed. Low value data is a candidate for the 1-1 rule.
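The mapping above is simple enough to sketch in code. This is just my own illustrative sketch, not part of the article’s process; the tier names and the function are hypothetical, and it assumes the rule choice depends only on value and access frequency, as described:

```python
# Hypothetical sketch: map a data class (value tier plus access frequency)
# to a backup rule, following the guidance above:
#   high value   -> 3-2-1
#   medium value -> 3-2-1, or 2-1-1 if infrequently accessed
#   low value    -> 1-1
def backup_rule(value: str, frequently_accessed: bool) -> str:
    """Return the copies/media/offsite rule for a class of data."""
    if value == "high":
        return "3-2-1"
    if value == "medium":
        return "3-2-1" if frequently_accessed else "2-1-1"
    if value == "low":
        return "1-1"
    raise ValueError(f"unknown value tier: {value!r}")

print(backup_rule("high", False))    # 3-2-1
print(backup_rule("medium", False))  # 2-1-1
print(backup_rule("low", True))      # 1-1
```

In a real implementation the decision would likely have more inputs (size, cost, legal retention requirements), but the point is that a small, explicit mapping makes the strategy easy to audit.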
Storage providers often use terms like “hot”, “warm”, and “cold” when classifying their products. “Hot” products are data storage that is quickly consistent and always available for read, write, and delete. “Warm” products are typically slower to read, but faster to write and delete. They may also have delayed or eventual consistency. “Cold” products are often structured to be effectively write-only. The data may take a long time to reach consistency and is not expected to be needed for read or delete without notice or delays.
I believe these terms apply to data in this manner:
The level of inconvenience you are willing to put up with in a backup/restore solution increases as the data becomes cooler. Retrieving a 10-year-old tax return can take significantly longer than getting back a picture of my grandmother. Not having access to my active work effectively puts me out of business until it is back.
Hot data is a candidate for versioned or archive-style storage, if that is an option or consideration.
Some of my data is automatically saved in other places. For example, photos taken on my phone are automatically uploaded to cloud storage provided by my phone vendor. Spreadsheets I create in an online editor are automatically, and only, saved in the vendor’s cloud storage. I believe this leads to two more data axes:
Data stored by a major provider, such as Google, is probably significantly safer than data stored on a random etherpad run by a random human. For both of these types of data you also need to consider whether you will likely have notice in the case of service shutdown or failure. A major service provider is more likely to provide notice and a transition plan for its services, even free ones.
Classifying your data by attribute will make it much easier to think about the backup strategy to employ. Whether time, money, or energy, backing up data costs something. Let’s minimize that cost.
Read the next part to take the data in each category and apply our goals to it. I’ve included some optional homework below. You may find it useful if you plan to adopt or adapt this for yourself.
Thinking about these data classifications, make a list of what actual data you have in each category. Estimate the size of that data. Is it bigger or smaller than you thought? When thinking about how data is classified, it is often useful to go with your first instinct. Don’t let the size of the data (and the perceived potential costs) distract you right now. Trade-offs, if required, will get made once those costs are fully known.
In this case, “raw” means that I can’t necessarily directly access the file with a traditional file-manipulation tool such as rsync, but may instead be required to use my vendor’s application to access my data. ↩