On Reliability: Backups, Restores, and Recovery
by John Sellens

John Sellens has recently joined the Network Engineering group at UUNET Canada in Toronto after 11 years as a system administrator and project leader at the University of Waterloo.
This paragraph is, of course, where I remind you of the basic tenets of reliability: service levels, risk evaluation, costs of failures, appropriateness for your environment, etc. One nice thing about this article's focus is that almost everyone will agree on the necessity for proper backups, so your justification document/business case may be a much easier sell this time. Let's define what we (well, I, actually) mean by backups, restores, and recovery. In this article I'm going to be concentrating on data backups to some secondary storage medium. The most common medium for doing backups is magnetic tape (which comes in a wide variety of types), but some situations call for alternatives, such as regular old hard disk (or DASD for you big iron fans), floppy disks (and their removable media cousins, such as Iomega's Zip and Jaz units), CD-ROMs, optical disks, paper tape, and punched cards (though the latter two have fallen somewhat out of favor in recent years). A "restore" is the process of retrieving a file (or a few files) from the backup media. And "recovery" is what happens when something goes very wrong (fire, flood, earthquake, theft, presidential scandal, etc.) and you need to put everything back in order. I'll also mention "archival storage." An archive is very much like a backup, except that it's intended to be kept for the long term, perhaps forever. Again, the medium used varies, depending on cost, security, and retrieval considerations. I've mentioned "archival storage" primarily so that you'll know what I'm not talking about. I should mention, just for the record, why it's important to do backups. In a sense, it's not the backups that are important; it's the restores and recoveries. Most people have had the experience of accidentally deleting or corrupting an important file, and most system administrators have been faced with users who have also done just that. 
And it's not just human error that can corrupt files; as hard as it may be to believe, some software actually does have bugs. Although disks are more reliable now than in the past, sooner or later a disk is going to fail, burn, or get stolen, and you're going to need to undertake a disk recovery. Think of backups as insurance: the mere presence of reliable backups not only prevents your disks from failing; reliable backups also keep you from getting fired the next time there's a flood in your machine room.

What's important about backups? Backups should be current, consistent, and complete. Current and complete are easy to define: you need to do backups on a regular and reliable schedule (usually daily), and you need to back up all the files and directories that you intended to back up. Consistency is a little less obvious: your backup system should be able to deal with files that are changing during a backup. An easy example of a changing file is a large file used for database storage, which could easily change between the time you start reading the file to back it up and the time you've finished reading the file, thereby giving you a nice-looking but useless backup. Backups also need to be available when you need them; more on that later. (For a good discussion of various backup-related issues, see [1] in the LISA V proceedings, which also contains a number of other backup-related papers.)

One of our reliability tools is "automation," and backups are a perfect task to be automated: they're done regularly, they're boring (but necessary), and they're very important. Whether you use a home-brewed shell script, some freely available software, or an expensive commercial product, your goals are the same: get the backup done, and get it done right. Let's examine the different components of a backup system (software, hardware, physical location, and media handling) and consider how they contribute to reliability.
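To make the consistency concern concrete, here is a minimal sketch of one way a homegrown script might notice that a file changed while it was being copied. It is an illustration only, not how dump or any particular backup product works: the function names are hypothetical, and comparing size and modification time before and after the read is a heuristic that can miss changes.

```python
import os
import shutil


def file_signature(path):
    """Return (size, modification time in ns) as a cheap change detector."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def copy_if_unchanged(src, dst):
    """Copy src to dst; return True only if src looked unchanged meanwhile.

    Hypothetical helper: a False result means the copy in dst may be
    inconsistent and should be retried or flagged in the backup report.
    """
    before = file_signature(src)
    shutil.copyfile(src, dst)
    after = file_signature(src)
    return before == after
```

A wrapper script could retry a file a few times when this returns False and report the file as suspect if it never settles. A busy database file may never settle, which is one reason database backups usually need special-purpose tools.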
Software

Most operating systems these days come with some form of backup software, some of it more (or less) suitable for the purpose than others (see [2] for a review of the good and the bad). The best of the "stock software" lot is typically the dump/restore combination, which is not perfect but tends to do a "reasonable" job.

Many sites have written homegrown wrappers around the stock commands, with varying degrees of complexity. You can try to deal with labelled tapes, tape switching, unattended execution, tape cataloging, etc., but if your disks are small enough (or your tapes big enough) your backup script can be very simple. For a small, isolated UNIX machine, I once set up a backup "system" that involved a clerk putting the day's tape in the tape drive, signing on as "backup" (which ran a backup script as its shell), dropping the previous day's tape in the campus mail to an off-site location, and going home, leaving the backup running. It was dead simple, but appropriate for the situation. At the other extreme of homegrown software, I've seen systems that track tape numbers and maintain a flat-file index of which filesystems from which hosts are on which tapes. But unless your needs are very simple, I'd recommend against rolling your own and reinventing the wheel.

If cost is a concern, I would recommend that you investigate the Amanda Backup Manager, available from the University of Maryland [3]. It's built around standard utilities (such as dump/restore, GNU tar, etc.), does labelled tapes, has some support for jukeboxes and tape changers, maintains a database, and works on a wide variety of systems and across the network. It is very good software and definitely worth a look. There are other freely available backup systems that you might want to have a look at, including the "Ohio State" backup software [4].

And as for commercial software, there is a wide variety to choose from; pick up any industry magazine and check the ads.
The commercial products typically provide a GUI management interface, online indexing of files, user-initiated restores, support for more types of hardware, etc. Commercial packages vary in features, supported operating systems, hardware support, security, polish, and price. Most have reasonable scheduling capabilities; some use "standard" tape formats, and some use proprietary formats. Have a look and see what fits your needs best.

And while we're talking about software, let's talk about "live" vs. "offline" backups. By that I mean: do you run your backups against a machine running in normal "time-sharing" mode, or do you shut down to "single-user" mode and kill off unneeded processes to ensure that nothing changes on a filesystem while you're backing it up? The answer is, it depends (yes, I know that's a cop-out). If you can identify a time of day when your systems are likely to be lightly loaded, and your backup software can handle filesystem changes in a "reasonable" fashion, you'll likely want to do live backups and leave your systems up and running while they're being backed up [5].

This would be a good place to talk about special-purpose software for database backups, except that that's a bigger, more complicated subject than I want to cover here. And I could also talk about special filesystem support for backup ease (snapshots, locking, etc.), but I won't.

As for software reliability, the relevant issues are pretty much the automation, the ease of use, and the robustness of the software that you choose. You'll want to automate as much as possible, but with something as important as backups, you'll want to have positive confirmation that your backups are running properly (by reviewing logs, mail messages, etc., on a daily basis).

Hardware

The hardware that you choose for doing your backups affects your ease of use, media cost and reliability, and ease of replacement in case of failure.
Most backups are made to some form of magnetic tape. In the old days, we relied on good old nine track reel-to-reel tape, but there aren't many people investing in that technology these days. In the UNIX environment, two of the most common tape formats are 8mm (popularized by Exabyte) and DLT, with a variety of other types (DAT, AIT, etc.) also in use. A comparison of tape formats is more than I want to get into here, but I will mention a few things for you to consider when choosing a tape format:
The next consideration is aggregation and automation. By this I mean the choice between single tape drives and tape jukeboxes or autochangers (of small or large capacity). This depends a lot on how much data you need to back up, how long you want to keep it available, how often you expect to need to restore data, and how you expect your needs to change in the future. In many cases, the extra cost of a small jukebox (ten tapes or so) will be well worth it in terms of ease of use. And closely related to aggregation and automation is the question of how many drives you need (or want) to have. How much capacity do you need? Do you need multiple drives running in parallel in order to get all your backups done in the time available? Do you need to be able to duplicate your tapes to guard against media failure or to take off-site? Do you want to be able to do restores on one drive while doing backups on the other? And finally, if you do choose a jukebox, will the tape drives still be usable when the tape-changing mechanism breaks (as it almost certainly will sometime)? When we were looking for a new backup system in 1996, we ended up choosing a DLT jukebox with two drives and room for 250 tapes, with the ability to expand both the number of drives and the number of tapes. (We actually bought two of them.) This may seem like a giant system, but it's actually only a mid-size in the world of backups, and in review, it seems to have been a good choice for our situation. And finally, you'll need a machine to run your tape drives. Choose a machine that fits with your other machines, and consider dedicating it to the task. It may seem a waste for a nice machine to be sitting idle all day, just to wake up and write a few tapes in the middle of the night; but for something as important as backups, it's often nice to have a secure, limited-access machine that you can dedicate to the process. 
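The question above about needing multiple drives in parallel comes down to simple arithmetic, sketched below. All of the figures are hypothetical placeholders rather than measurements of any real drive, and the sketch assumes each drive sustains its full rate for the whole backup window, which real networks and filesystems rarely allow.

```python
import math


def drives_needed(data_gb, drive_mb_per_s, window_hours):
    """Estimate how many drives must run in parallel to finish in time.

    data_gb:        total data to write per backup run (1 GB = 1000 MB here)
    drive_mb_per_s: sustained throughput of one drive (an assumed figure)
    window_hours:   length of the nightly backup window
    """
    if data_gb < 0 or drive_mb_per_s <= 0 or window_hours <= 0:
        raise ValueError("inputs must be positive (data_gb may be zero)")
    # Capacity of one drive over the whole window, in GB.
    per_drive_gb = drive_mb_per_s * 3600 * window_hours / 1000
    return math.ceil(data_gb / per_drive_gb)


# Hypothetical example: 200 GB per night, 1.5 MB/s per drive, 8-hour window.
# One drive moves 1.5 * 3600 * 8 / 1000 = 43.2 GB, so 200 / 43.2 rounds up to 5.
print(drives_needed(200, 1.5, 8))
```

Treat the result as a lower bound: leave headroom for growth, for restores that must run during the backup window, and for any tape duplication you plan to do.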
Those of you who have been paying attention will have noticed that I didn't mention other media, such as magnetic or optical disk. I'll contend that those alternatives are appropriate for backups in only very special situations and that you'll already know if they are something that you should consider.

Physical Location

Where are you going to locate your backup server? Does it need to be physically close to your desk, near your servers, in a nice locked room with fire-suppression gear? Do you need easy access to it to swap out tapes? What does your network look like? Do you have adequate bandwidth for your backups in more than one place? I mention these questions to get you thinking about the physical and network security of your backups and backup system. Remember that a backup system makes a nice attack target because it will contain all your data. And what happens when you have a fire in your machine room and all your servers, including the backup server (and the tapes), melt?

And don't forget about a nice UPS for your backup system. You may not be able to do any backups during a power failure (because your other machines or networks might be unavailable), but at least your backup server won't get corrupted by a sudden power outage.

When we bought those jukeboxes in 1996, we were able to put one in a building across campus that didn't already contain any machines of interest. We dedicated a pair of fibers to a fast Ethernet connection, built a small air-conditioned room, installed an intruder alarm, and locked the backup server and jukebox in there. That way, we ended up with off-site backups without having to remember to move tapes about.

Media Handling

The main considerations for media handling are how to get your tapes off-site and how to get your duplicated/cloned tapes into a location different from the originals. Many people overlook the need to get their backups physically away from the original disks.
A fire, flood, or fire axe-wielding computer hater could put you out of business. I'll make my point with a short story, wherein I learned the necessity of off-site backups. I used to do some programming for a university professor on a PC in his office as part of a major, multiyear, externally funded research project. I came in one day, and the IBM PC AT (it was a long time ago) was gone, along with every 5 1/4" diskette in the office, including the backups. Fortunately, we had another set of backup diskettes that were fairly recent at the professor's home. Without those, we would have been in major trouble. (Most people learn a "backup lesson" at some point. I'm just lucky that mine was more abstract than most.)
Testing and Recovery Practice

The classic UNIX backup horror story involves multiple filesystems being backed up to a single tape, a hapless system administrator who accidentally specified the rewind instead of the nonrewind tape device, and a company president who just accidentally deleted a very important file.

There are two main reasons for testing your backup system. The first is to ensure that you're actually creating good backups that you can restore from, and that you're backing up the files and directories that you actually intended to back up. Write a script to step through your tapes to check the dump headers on each file and generate a report. Pick some files at random from various machines, and make sure that you can find them on your backups.

The second reason for testing and practice is to ensure that, when the emergency comes, you know what to do and how to do it. When the root disk on your main central server gives up the ghost, make sure that you can rebuild it from your backups.

This is a convenient place to note that sometimes commercial products that generate backups in "native" formats are a real blessing. Many operating systems let you easily restore a dump file onto a new blank disk. But if you're using backup software with a proprietary tape format, you may have to do a complete OS installation, install the backup software, and only then start doing the actual restore. (I'll point out that it's convenient to be able to attach a new disk to some other running machine, do the restore there, and then install the new disk in the broken machine.)

Next Time

Next time I plan to talk a little bit about disaster recovery and the kinds of things that you will need to consider when thinking about what to do if a disaster ever strikes your organization.

Notes

[1] Steve Shumway, "Issues in On-line Backup," LISA V proceedings, San Diego, 1991, pp. 81-87.
[2] Elizabeth Zwicky, "Torture-testing Backup and Archive Programs: Things You Ought to Know But Probably Would Rather Not," LISA V proceedings, San Diego, 1991, pp. 181-185.

[3] <ftp://ftp.cs.umd.edu/pub/amanda/> contains everything about Amanda, including copies of "The Amanda Network Backup Manager" by James da Silva and Ólafur Gudmundsson from LISA VII, 1993, and "Performance of a Parallel Network Backup Manager" by da Silva, Gudmundsson, and Daniel Mossé from the 1992 Summer USENIX Technical Conference.

[4] <ftp://ftp.cis.ohio-state.edu/pub/backup/> for the "Ohio State" backup software, including Steve Romig's paper from LISA IV.

[5] You may wish to consider a full, offline backup before you do OS upgrades or hardware changes. If something goes wrong, it can be very comforting to know that you've got a nice safe backup nearby.
Last changed: 13th April 1998 efc