Backup Strategies

Why Consider Backup Strategies?

Keeping backup copies of your work is essential to prevent data loss, especially in today’s digital and highly connected world. The causes of data loss range from hardware failures (man-made or natural) and accidental deletion to damage caused by a power outage, vandalism or theft of hardware by burglars, and data encrypted by ransomware. One example of damage is a single flipped bit on disk, which older file systems without checksums (FAT16, FAT32 and ext2) cannot detect. Insurance policies may also require additional measures to ensure the safety of local and distributed work.

While storing multiple copies of data is relatively simple, backing up and restoring your data reliably can be quite challenging. Backups can look good at first sight but be unusable for a number of reasons: the full backup may be corrupt because something in the I/O subsystem damaged it, the storage medium may have died, or, even worse, it may be half-dead (only partly readable). You might no longer have the hardware to read the backups (e.g., a floppy disk drive, tape drive or CD-ROM drive). Also, if one backup in a chain of log backups is corrupt, restoring cannot continue past that point in the chain. It is also not uncommon that data is simply forgotten, so that the backup is incomplete. This leads to dramatic scenes and, according to Murphy’s law [1], it is always the most important data that is missing.

This article gives you a short introduction to backups and the corresponding strategies [18]. Factors to take into account are the storage space needed to keep the snapshots, the speed of backup and restore, and the ability to trace changes in your backup set.

Kinds Of Backup

There are different kinds of backups. You can divide them by strategy, by data format to be backed up, and by location. Next, we will have a closer look at these categories.

By Strategy

Full Backup

A full backup is a strategy in which all the files and folders selected for the backup are actually backed up. The selection is based on the file structure and does not include deleted files and folders. It is commonly used as the initial backup to create a complete starting point.

A full backup is mostly used in combination with subsequent incremental or differential backups. After several incremental or differential backups, it is common to start over with a fresh full backup to ensure that the data sets are complete, even if one of the incremental or differential backups was incomplete (this is known as multi-level backup).

Bear in mind that you need enough disk space for full backups, as each one requires as much space as the data to be backed up. The smaller and less complex the project, the easier this is. Nevertheless, this method has two weaknesses:

  • Unchanged data is stored more than once in all instances of full backups
  • Removed data is not part of the new full backup. This backup set acts as a complete snapshot and contains all the files and folders from this specific date.
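A full backup as described above can be sketched in a few lines of Python. This is a minimal sketch and not a replacement for a real backup tool; the function name full_backup and the naming scheme of the target folder are assumptions made for illustration.

```python
import shutil
import time
from pathlib import Path


def full_backup(source: Path, backup_root: Path) -> Path:
    """Copy the whole source tree into a new, timestamped backup set."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"full-{stamp}"
    # copytree copies every file and folder that currently exists in source.
    # Deleted files are simply absent, and unchanged files are copied again,
    # which is why every full backup needs the full amount of space.
    shutil.copytree(source, target, symlinks=True)
    return target
```

Each call produces an independent snapshot, so the required space grows with every run.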

Incremental Backup

This method only stores the changes from one backup set to the next. Compared to a full backup it is quite fast, and unchanged data is not touched. No files are duplicated, but both the original backup set and all the incremental backup sets are needed in order to see the changes and to apply them in sequence.

According to various software vendors and technologists, there are five major types of incremental backup: synthetic, block-level, byte-level, progressive incremental, and multi-level. Let’s briefly break down these terms.

Synthetic means that the backup is not based on the original data but on a previous backup set. Block-level means that changes are tracked per data block on the storage medium, whereas byte-level tracks changes based on the contents of the files (file level). Progressive incremental is based on an initial full backup followed only by incremental backups, with no further full backups in between. In contrast, a multi-level backup combines full and incremental backups in a fixed pattern. For example, a full backup is followed by six incremental backups, and this pattern is repeated every week [3]. The result is a backup for every day of the week, with a full backup made regularly, for example every Sunday.
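The weekly multi-level pattern from the example can be expressed as a small scheduling rule. The following Python sketch assumes that the full backup runs on Sundays; the function name backup_kind is hypothetical.

```python
import datetime


def backup_kind(day: datetime.date) -> str:
    """Return which kind of backup the weekly pattern calls for on this day."""
    # date.weekday() counts Monday as 0, so Sunday is 6.
    return "full" if day.weekday() == 6 else "incremental"
```

A job scheduler could call such a function once per day to decide which backup job to start.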

Restoring the data sets requires the last full backup, and applying each incremental backup one after the other in the right order. It is necessary that the entire backup chain is intact in order to apply all the changes made between the full backup and the day of recovery. Otherwise, the history of your data is interrupted, and newer backups cannot be applied.
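To illustrate why the whole chain is needed, here is a minimal file-level sketch in Python. It is one possible way to implement the idea and makes several assumptions: changes are detected by comparing SHA-256 digests, each backup set stores its changed files in a data folder next to a manifest.json, and all function names are hypothetical. A full backup is simply an incremental backup against an empty previous state.

```python
import hashlib
import json
import shutil
from pathlib import Path


def scan(source: Path) -> dict:
    """Map each relative file path to a digest of its contents."""
    return {
        p.relative_to(source).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(source.rglob("*")) if p.is_file()
    }


def incremental_backup(source: Path, backup_set: Path, previous: dict) -> dict:
    """Store only files that are new or changed compared to `previous`."""
    current = scan(source)
    for rel, digest in current.items():
        if previous.get(rel) != digest:
            target = backup_set / "data" / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source / rel, target)
    backup_set.mkdir(parents=True, exist_ok=True)
    # The manifest lists every file that existed, so deletions can be replayed.
    (backup_set / "manifest.json").write_text(json.dumps(current))
    return current


def restore(chain: list, destination: Path) -> None:
    """Apply a full set and its incremental sets, oldest first."""
    destination.mkdir(parents=True, exist_ok=True)
    for backup_set in chain:
        manifest = json.loads((backup_set / "manifest.json").read_text())
        data = backup_set / "data"
        if data.exists():
            shutil.copytree(data, destination, dirs_exist_ok=True)
        # Drop files that had already been deleted when this set was made.
        for p in sorted(destination.rglob("*")):
            if p.is_file() and p.relative_to(destination).as_posix() not in manifest:
                p.unlink()
```

If one set in the chain is missing, the files it contained can never reach the destination, which is the interruption of history described above.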

Differential Backup

This method sits between a full backup and an incremental backup. A differential backup set contains all the changes made since the last full backup, i.e. every differential backup is just a level-one incremental backup. As a result, it is much faster than a full backup, but the further it is from the corresponding full backup, the longer it takes and the more data has to be transferred. In terms of storage space it is more efficient than full backups, since each differential run copies only the files that changed since the last full backup. However, this amount also grows over time until a new full backup is made.

Restoring the current data sets requires the last full backup and then applying the latest differential backup set. This is expected to take much less time than applying the corresponding chain of incremental backups.
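The trade-off between the two strategies can be made concrete with a small calculation. The numbers below are invented for illustration, and the sketch assumes that every day changes a different set of files, so that the changes simply add up.

```python
from itertools import accumulate

# Hypothetical amount of data (in MB) changed on each day after the full backup.
daily_changes_mb = [120, 80, 200, 50]

# An incremental set holds only the changes since the previous backup.
incremental_sizes = list(daily_changes_mb)

# A differential set holds all changes since the last full backup,
# so its size is the running total of the daily changes.
differential_sizes = list(accumulate(daily_changes_mb))


def sets_needed(strategy: str, day: int) -> int:
    """Number of backup sets to read when restoring the state of `day` (1-based)."""
    # Incremental: the full backup plus every incremental set up to that day.
    # Differential: the full backup plus the single latest differential set.
    return 1 + day if strategy == "incremental" else 2
```

On day four the incremental set is the smallest to write (50 MB versus 450 MB), but a restore has to read five sets instead of two.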

Reverse Incremental Backups

This method is also known as Reverse Delta Backup [16]. As described in [17], “the most recent backup is always filled and older backups are reconstituted by merging all the deltas starting with the nearest future filled backup and working backwards.” The most recent backup is thus always a full backup, which minimizes the time needed to restore the entire dataset.

Mirroring

With this method, your data is mirrored to one or more additional disks. You can achieve this with a Redundant Array of Independent Disks (RAID) [5] or a ready-made Network Attached Storage (NAS) device [6], which usually comes with four to six hard drives.

To be clear: mirroring is not a backup and does not protect you from losing data. Mirroring adds storage redundancy and thus helps in the case of a hard drive failure. If you delete a file on the main disk, it is immediately deleted on the mirrored disk as well, because the disks are permanently synchronized.

By Data Format

As pointed out earlier, a backup can be file-based or based on content and structure. The file-based approach works on the entries in the filesystem, such as files, directories, and links.

The simplest way is to synchronize two storage areas using a command line tool like rsync [12] or a graphical or web-based tool like BackupPC [10] or SeaFile [11]. These tools are run interactively but can also be configured to run regularly as a background process. For Git repositories, a combination of the commands git clone and git push helps [9]. Tracking files and their changes in a Git repository and mirroring it elsewhere is one way to do a file- and data-structure-based backup:

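# Create a bare mirror clone of the primary repository
$ git clone --mirror https://primary_repo_url/primary_repo.git
$ cd primary_repo.git
# Register the secondary repository as an additional remote
$ git remote add --mirror=fetch secondary https://secondary_repo_url/secondary_repo.git
# Update from the primary, then push all branches to the secondary
$ git fetch origin
$ git push secondary --all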

The structure-based approach covers both data and structure and is common for backing up databases. As an example, PostgreSQL [7,8] offers the commands pg_dump, pg_dumpall and pg_restore to back up and restore both the data and the structure of databases and tables. By default, the output of pg_dump and pg_dumpall is a text file that contains the SQL commands and the data necessary to recreate the database’s structure and to insert its content:

--
-- PostgreSQL database dump
--

SET statement_timeout = 0;
SET lock_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;

--
-- Name: plpgsql; Type: EXTENSION; Schema: -; Owner:
--

CREATE EXTENSION IF NOT EXISTS plpgsql WITH SCHEMA pg_catalog;

--
-- Name: EXTENSION plpgsql; Type: COMMENT; Schema: -; Owner:
--

COMMENT ON EXTENSION plpgsql IS 'PL/pgSQL procedural language';

SET search_path = public, pg_catalog;
SET default_tablespace = '';
SET default_with_oids = false;

--
-- Name: employee; Type: TABLE; Schema: public; Owner: frank; Tablespace:
--

CREATE TABLE employee (
        employeeid integer,
        lastname character varying(50),
        firstname character varying(30),
        locationid integer
);
...
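A dump like the one above could also be created from a script. The following Python sketch wraps pg_dump in a small function; the database name company and the output file name are placeholders, and the sketch assumes that pg_dump is installed and that the connection settings come from the environment.

```python
import subprocess


def pg_dump_command(database: str, outfile: str) -> list:
    """Build a pg_dump call that writes a plain-text SQL dump to outfile."""
    return ["pg_dump", "--format=plain", f"--file={outfile}", database]


def dump_database(database: str, outfile: str) -> None:
    # check=True raises CalledProcessError if pg_dump exits with an error,
    # so a failed dump does not go unnoticed.
    subprocess.run(pg_dump_command(database, outfile), check=True)
```

Calling dump_database("company", "company.sql") would write the dump to company.sql. A plain-text dump like this is restored by feeding it to psql, whereas pg_restore works on the non-text archive formats of pg_dump.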

By Location

You can store your backup at different locations. Each choice has its advantages and disadvantages.

  • Local Backup: The storage medium is kept close at hand or in the same building as the source. It is common to store the data on an additional hard drive that is part of a local server or attached to your system, on a CD or DVD, or on a USB device. This kind of backup allows easy access but does not protect you from local damage like fire and theft.

  • Offsite Backup: The storage medium is kept at a different place than the source. It is common to use a different geographic location such as your home, a computing center, or a bank safe deposit box. This backup method requires transport to the other place but protects you from local incidents.

  • Online Backup: The data is backed up to a remote place that is permanently connected to the source via a network. This can be a remote server in a computing center or a cloud service. This method allows access to the data from everywhere, as well as a restore directly to a different place or to a newly set up machine. This makes you independent of the place where the source is located.

  • Cloud-to-Cloud Backup: More and more data is stored on remote systems in clouds. As customers, we do not know whether the SaaS providers have backups at all. We just assume it and trust the service provider. Furthermore, getting a copy of our remotely stored data can be tricky because most cloud services act as black boxes. If your backup policy requires backup data to be stored at a different place, you will have to ask your SaaS provider about its geographical redundancy. Depending on the answer, it can mean that you are no longer allowed to store your data in that cloud. Also, to access your data, check whether your SaaS partner provides an API and whether it works properly. To be on the safe side, working with a backup service provider is essential because handling your data properly is their business, and that is what you pay for.

All backups that are not at the same geographical location as the original data, including offsite and online backups, are also summarised as geo-redundant backups.

What To Back Up

The answer to this question depends on the system to be backed up. On regular UNIX/Linux desktop systems, there are a few folders you should consider [2]. For a more detailed list, we recommend having a look at the Linux Backup Guide [13]:

  • /home – the directories for regular users
  • /etc – the configuration files
  • /var – for database structure and content, log files, and webserver content
  • A package list of installed software. For Debian GNU/Linux and Ubuntu, the two commands dpkg -l and apt list --installed output a list of installed packages.

Binaries of software that can easily be reinstalled do not need to be backed up. For databases, use the corresponding tools to dump the data stored in them. A file-based backup of a database on disk does not suffice, unless you can flush the entire database to disk and freeze it until the file backup is finished.
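Packing the folders listed above into a single archive could look like the following Python sketch. The function name archive_directories is an assumption, and on a real system reading /etc or /var in full typically requires root privileges.

```python
import tarfile
from pathlib import Path


def archive_directories(directories: list, archive: Path) -> list:
    """Pack the given directories into one gzip-compressed tar archive."""
    stored = []
    with tarfile.open(archive, "w:gz") as tar:
        for directory in directories:
            directory = Path(directory)
            if directory.is_dir():
                # arcname keeps the path relative, e.g. "etc" instead of "/etc".
                tar.add(directory, arcname=str(directory).lstrip("/"))
                stored.append(str(directory))
    return stored
```

The returned list shows which directories actually went into the archive, which helps to spot a folder that was skipped because it does not exist.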

Validating Your Backup

To make sure that all the data is backed up and nothing is forgotten, it helps to first have a list of files and folders. The next step is to check the integrity of your backed-up data. Here, tools like rsync, cksum [14], crcsum [15] and sha512sum come into play. These tools help to compare local and remote files based on checksums and hash values. The tools cksum and sha512sum are part of the standard toolset of Debian GNU/Linux (coreutils package [19]).
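The same kind of comparison can be scripted. The following Python sketch computes SHA-512 digests with the standard hashlib module and reports files that are missing or differ in the backup; the function names are hypothetical, and the sketch assumes that both the source and the backup are reachable as local folders.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path) -> str:
    """Hash a file in chunks so large files do not have to fit into memory."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source: Path, backup: Path) -> list:
    """Return relative paths that are missing or differ in the backup."""
    problems = []
    for original in sorted(source.rglob("*")):
        if not original.is_file():
            continue
        rel = original.relative_to(source)
        copy = backup / rel
        if not copy.is_file() or sha512_of(original) != sha512_of(copy):
            problems.append(rel.as_posix())
    return problems
```

An empty result means that every file in the source has an identical counterpart in the backup.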

And what about restoring? This step is often forgotten. It is a “must” to check your backup regularly and to verify that the data can be restored in full. It has happened more than once that files were forgotten, or were never even tracked in a Git repository. We will have a closer look at validating a backup in one of the follow-up articles.

Which Backup Strategy We Use

We focus on copy-on-write (COW). At the file system level we work with BTRFS. We create a snapshot that copies only the data that has changed. Data that did not change is just referenced. This is efficient, very quick, and scalable. With BTRFS we also get automatic data compression and do not have to care about other optimization measures.
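The principle behind such snapshots can be shown with a toy model in Python. It is only an illustration of the idea and says nothing about how BTRFS is implemented internally; the class CowVolume and its methods are invented for this example.

```python
class CowVolume:
    """Toy model of a copy-on-write volume holding named, immutable blocks."""

    def __init__(self):
        self._blocks = {}

    def write(self, name: str, data: bytes) -> None:
        # A write stores a new block and leaves the old one untouched,
        # so snapshots that still reference the old block stay valid.
        self._blocks[name] = data

    def read(self, name: str) -> bytes:
        return self._blocks[name]

    def snapshot(self) -> dict:
        # Copies only the references to the blocks, not the blocks themselves.
        return dict(self._blocks)
```

Taking a snapshot costs almost nothing because no data is copied. Additional space is needed only for blocks that are written after the snapshot was taken.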

In addition, we have a second level and keep backups of our backups: a copy of the low-level file system snapshot is synchronized with an AWS cloud service.

Conclusion

Backing up your data is essential. There are a number of different strategies to accomplish that goal. No matter which one you choose, always have a copy of your data. Better safe than sorry.

Acknowledgements

The authors would like to thank Gerold Rupprecht and Axel Beckert for their help and critical remarks when writing this article.