I have a database that I want to back up. It took me 3 days of work to assemble this database, and I'm not willing to repeat that work if the server dies. So I need to back up my data somehow. Additionally, I manage a number of servers and external databases that I would like to have bulletproof backups for. I want a backup system that can manage scheduled backups easily from a lot of sources, and never fail to restore.
Backup solutions don't all guard against the same failure scenarios. Sometimes a backup guards against user mistakes. The undo system on your computer can be used to restore the previous version of your document if you type something incorrectly. Other times backups are taken so that corrupted data can be recovered. The Vim text editor keeps a
.swp file so that it can recover the contents of a file in case Vim crashes. A backup system might also be used to view past versions of your data. macOS supports this with its Time Machine software. Backups can guard against data loss, whether accidental or malicious. You want to keep your photos safe if your computer catches fire just as much as if you get hit with CryptoLocker ransomware.
In my case, I want to guard against 4 scenarios: user mistakes (if I corrupt my data by accident), software mistakes (if some software corrupts data on my behalf), complete data loss, and any kind of malicious attack. So if my living space burns down and my server is a puddle of molten metal, I want to be able to get my data back. If someone uses an infected computer in my network to delete as much data as they can access, I still want my data.
For some scenarios, like user mistakes and past version recovery, storing the backups locally is perfectly acceptable. But in larger data loss scenarios, that is not good enough. If your server catches fire, you're going to lose your backups in addition to your data. The same applies if your computer is hit with CryptoLocker.
So in my case the backups need to be stored off-machine and off-location. Ideally I want to store duplicate data in at least 2 separate cloud data hosts. Moreover, to guard against the possibility that someone maliciously tries to delete my data for whatever reason, the backup client should not have the ability to delete the backups once they are created on the server side.
There are 2 styles of storage for backups: full snapshot and diff-based. A full snapshot system stores a complete copy of your data every time you do a backup. The upside is that each individual backup is a self-contained restore point. In a diff-based system, you start with a single full copy of your data, and each subsequent backup stores only a "diff" of what changed since the last state. To restore in this setup, you start with the last full snapshot and apply every subsequent diff to it. This involves more code and storage complexity, but it almost always uses significantly less storage space.
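A diff-based restore is essentially a fold over the stored diffs. Here's a minimal sketch, with the assumption (mine, not any particular tool's) that a backup state is a dict mapping file paths to contents:

```python
def apply_diff(state, diff):
    """Apply one diff (changed/added entries plus deletions) to a state."""
    new_state = dict(state)
    new_state.update(diff.get("changed", {}))
    for path in diff.get("deleted", []):
        new_state.pop(path, None)
    return new_state

def restore(snapshot, diffs):
    """Rebuild the latest state: start from the last full snapshot,
    then apply every subsequent diff in order."""
    state = snapshot
    for diff in diffs:
        state = apply_diff(state, diff)
    return state
```

For example, restoring `{"a": "v1", "b": "v1"}` with one diff that changes `a` and another that deletes `b` yields `{"a": "v2"}` — lose any diff in the chain and the restore is wrong, which is exactly the complexity cost mentioned above.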
I don't really care much about this one. The current total size of my backups in Amazon (I have an existing primitive solution) costs less than $1. Daily backups for the next 10 years probably won't put any significant dent in my wallet. Even with full snapshots, it's possible to decrease the density of older backups. And if it becomes a problem, I can change to a diff-based system in the future. You shouldn't solve a problem you don't have, after all. Leave yourself wiggle room to solve it later, but don't solve it ahead of time.
For uploads, you have the same strategies as storage. You can upload full snapshots every time or you can upload a diff of the current state and the last uploaded state.
Snapshot based uploads are really easy to handle in code. Just take a snapshot of everything you care about and upload it. However, they require uploading a full copy of your data every time you do an upload. That might not be a problem if you have a high upload speed, your backup size is small, or you don't back up very frequently. But if you need to frequently upload a lot of data, you might find that the load on your bandwidth or the wait time starts to bother you.
A diff based upload system is a bit trickier. In order to calculate the diff you need access to the last state you uploaded. You could download the last backup, compare it to the current state, and then upload the changes, but this somewhat defeats the bandwidth savings that you wanted in the first place. You can also keep a local summary of the last state you uploaded. This could be a literal copy of the directory or a list of content hashes for each chunk of your backup. Keep in mind that the more data you summarize per content chunk, the less space savings you get from a diff based format. If you have a summary content-hash per file, then you need to upload the full file when it changes.
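The local-summary approach can be sketched with a per-file content-hash manifest. This is a file-granularity sketch (so a changed file gets re-uploaded whole, as noted above), not any particular tool's format:

```python
import hashlib
import os

def file_hash(path):
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its content hash."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_hash(full)
    return manifest

def changed_files(old_manifest, new_manifest):
    """Files to upload: anything new or modified since the saved manifest."""
    return [p for p, h in new_manifest.items() if old_manifest.get(p) != h]
```

After each upload you'd persist the new manifest (e.g. as JSON) so the next run can diff against it without re-downloading anything.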
You can use any combination of snapshots and diffs for uploads and storage. You can upload a diff and then turn that into a full snapshot on the server side by applying the diff to the last snapshot. You can also upload a full snapshot and have your server calculate the diff between the last version and the current version.
My home upload speed is decent (~10 MiB/s) and I only want daily backups, so I'm comfortable with snapshot uploads. And network speed is not an issue in my server environments. Amazon and Digital Ocean both have very snappy networks.
Not all backup solutions support encrypted data, and the ones that do handle it in different ways. For encryption to be worth anything, you need to ensure that your data was never exposed to a non-trustworthy entity in decrypted form. If the encryption is being done at the server layer, you need to make sure that you either own your server side software or completely trust whoever does. It doesn't matter if Dropbox stores your data encrypted if they have the keys to decrypt it, or could simply read a copy of it before encrypting it for storage. And even if you operate your own backup software, you still need to trust the host that your software is running on.
If you need to use someone else's backup server and you don't trust them without reservation, you must do the encryption step before you upload your backup. Keep in mind that this will prevent you from taking advantage of server side calculated diffs, since the server can't read your data at all to calculate a diff.
Local encryption also comes with a complexity cost for you as the user. You need the keys to decrypt your data, and you need to make sure that you don't lose those keys. So now you have to back up your keys somehow, and all your backups are only worth as much as the backup system you use for your keys.
Encryption is a great thing in general, but it's really hard to have bulletproof encryption. And once you have it, it can be a pain to manage. I'll take it if it's easy, but it's not a hard-core requirement for me. Since it doesn't seem to me that it's going to be easy, I'm going to skip it for now. Since I will own my backup server, I can add encryption at a server level at a later date.
NOTE: I'm fully aware of the risks of not encrypting my data, and I know 100% who I'm choosing to trust (Google Drive, Amazon S3, Let's Encrypt, Digital Ocean). I've made this considered decision for myself alone. Do not interpret this as a blanket endorsement of avoiding encryption, or any endorsement of the trustworthiness of the above parties. Think for yourself before you decide to do the same thing.
Origin and scheduling
Backup servers can act in "push" or "pull" mode.
In a "pull" system, the server decides when backups happen. The server reaches into the client systems and pulls out the data that needs to be backed up. The advantage of this approach is that all the schedules for your backups are centralized. You could have a nice UI which allows you to view/change backup schedules for your whole infrastructure. You could also have built in warnings if a backup that was supposed to take place didn't happen. The disadvantage is that your backup server needs fairly high level access to the rest of your system. This vulnerability could make a pull based system inappropriate for some business infrastructures.
In a "push" based system, each client system which needs backups decides for itself when the backup needs to happen. Once they decide, they create their own snapshot or diff and push it to the backup server. The advantage here is that you don't need to give your backup server access to all your production systems. You can also limit the complexity of your server: it is responsible for storing backups and fetching past versions, and that is all.
You can also have a hybrid of the two approaches. The server can still handle scheduling and send a notification to all the clients. Once the clients receive a backup notification, they could push a new version to the server. You could also create a separate scheduling service which has access to the rest of your servers. This separate service could "pull" from all your client servers and "push" into your backup server.
For now I'm going to create just a backup server, and leave the scheduling to the client systems. In the future I might create an external scheduling/pull/push service.
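With scheduling left to the clients, each one only needs to build a snapshot and push it on its own timer (cron, in my case). A minimal sketch of the snapshot side — the upload step, which would just send the archive to the backup server, is omitted here:

```python
import os
import tarfile
import time

def make_snapshot(source_dir, out_dir):
    """Create a timestamped tar.gz snapshot of source_dir and return its path.

    A push-style client would run this on a schedule (e.g. a daily cron
    job) and then upload the resulting archive to the backup server.
    """
    name = "backup-%s.tar.gz" % time.strftime("%Y%m%d-%H%M%S")
    archive_path = os.path.join(out_dir, name)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    return archive_path
```

Because the client only ever creates and uploads archives, this keeps the door open for the "write only" access I want: the server can refuse deletes regardless of what a compromised client asks for.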
Some backup systems support restores as well as backups. For example, you could click a button and your running MySQL database would restore a backup. This is a neat feature, and one that's especially good in business environments where downtime is not acceptable. But in my case I can afford to lose an hour to restoring by hand.
Automated restores are also quite a bit of work to set up and test, and require that your restore software have pretty significant access to the rest of your systems. I want my backup system to be as simple as possible. So I don't care about this feature.
Evaluating the existing options
It would be silly to jump straight into building a home-grown solution without evaluating the existing options. I might find that an existing solution fits my use case exactly, in which case I won't need to write any software at all. Even if none are a perfect match, I might be able to solve part of the problem with a prebuilt solution and the rest with my own software. And at the very least, the existing solutions will be a good summary of the kinds of features I should be thinking about in my own solution.
Tarsnap advertises itself as "Online backups for the truly paranoid". It came very close to being a perfect fit solution for me.
Tarsnap encrypts data before upload. The client is not open-source, but the source is available to read, so you can verify that the client is actually doing the encryption as claimed. So tarsnap doesn't require you to trust them at all, which is nice. It operates on a diff-based upload system, which keeps both your storage size and upload size much lower.
This is really nice because they charge based on the bandwidth and storage space that you use. Usage-based pricing is a fantastic thing in my book. The pricing ($0.25/GB) is a little high compared to what I would pay for duplicate storage in Amazon and Google ($0.06/GB), but on the other hand I don't have to pay anything to keep a server running.
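The back-of-the-envelope comparison, using the per-GB-month prices above and ignoring bandwidth charges and the cost of running my own server (both assumptions of this sketch):

```python
TARSNAP_PER_GB = 0.25  # tarsnap's quoted storage price per GB-month
CLOUD_PER_GB = 0.06    # assumed per-GB-month price at Amazon/Google

def monthly_storage_cost(gb, copies=2):
    """Rough monthly cost of storing `gb` duplicated across `copies` clouds."""
    return gb * CLOUD_PER_GB * copies

# At 10 GB: duplicated cloud storage runs about $1.20/month,
# versus about $2.50/month at tarsnap.
```

Either way the numbers are small enough that price alone doesn't decide anything for me.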
However, while it seems a bit ridiculous that a company based around backups could ever lose a backup, stranger things have happened. I don't like blindly trusting that tarsnap's servers will never lose a backup. The odds of Amazon and Google simultaneously losing the same data seem quite remote, so duplicating backups across the two of them seems safer. Additionally, the tarsnap client supports deleting archives, which violates my desire for "write only" client access.
EDIT: I was wrong about the client being allowed to delete archives. When you generate a tarsnap client key, you can specify that it is a write-only key. This means that the client code is allowed to create tarsnap backups but cannot delete them. This is exactly how it should be.
I really like tarsnap. The pricing is really straightforward and Colin Percival (the author) seems to know what he's doing. It's very likely that I'll use it for something in the future. It just doesn't seem like an exact fit for this situation.
Duplicity is a command line utility that supports creating encrypted, signed archives of your data and pushing diffs of these archives into your storage backend. It supports Amazon S3 and Google Cloud out of the box, so that saves me work.
Duplicity is only an upload/download tool, in the same vein as rsync. It might be useful as part of the solution, but it's not the finished product.
Bacula initially struck me as complex. I had to read their documentation for close to 10 minutes before I even began to grasp what they were offering. They also strike me as a very enterprise focused solution (they talk about tape storage a lot).
Bacula supports backup and restore jobs. It is a pull based system where you have to grant access to a lot of your infrastructure to make it work correctly.
Bacula looks like it's on the decline. New features have bloated the software to the point where you need to be proficient in all its options just to get started. The community version also seems to be crippled compared to the enterprise edition, and apparently it's losing features over time.
This could be a large enterprise solution, but it seems like taking a sledgehammer to a nail. Good software strikes a balance of features and simplicity, and this one overshoots the mark by a lot.
Let's just build the thing
At this point I've spent about 4 hours writing this post and evaluating the options, and that's about as much time as it would have taken me to just build a backup server.
So let's just get started (in the next post).