SAGE ;login: - On Reliability

You and Your Users

John Sellens has recently joined the Network Engineering group at UUNET Canada in Toronto after 11 years as a system administrator and project leader at the University of Waterloo.

This time around, let's discuss user community interaction and how it relates to reliability. Recall that we're providing services, and that users are, of course, the reason we provide those services. How can we use our user interaction to improve reliability, communicate that increased reliability is one of our goals, and help our users become part of the reliability equation? (I'll mention that I'm writing this on an airplane, and as a user of the airline's services, I very much want to know that they care about reliability, and I'm more than willing to do my part.) Let's look at the question three ways: communication to users; communication from users; and education, training, and publications.

Communication to Users

A long time ago I learned that in a number of situations it is not sufficient simply to do your job -- you must also be seen to be doing your job. By that I mean that sometimes you must be obvious about what you are doing (and why) while you're doing it. Consider, for example, a security guard -- the fact of being visible in itself can act as a deterrent and reduce the likelihood of "an incident."

Reliable system administration is another one of those tasks that are enhanced by visibility. For example, if a system is obviously being run in an organized and disciplined manner, are users more likely to act that way themselves, and thereby bring us closer to our goals? How can we be obvious about what we're doing, and why, and how can users help? A documented, predictable computing environment will be perceived as far more reliable than an undocumented, unpredictable one.

The standard method for successful presentations is to tell your audience what you're about to tell them, tell them, and then tell them what you've told them. There's a convenient parallel for communication to your user community:

Tell them what you expect to do for them and what you expect from them.
Follow through and work toward your commitments, keeping them informed of your progress (or lack thereof) and of any failures or incidents that might occur.
Provide statistics, incident reports, and plans for improvement as you progress.

Tell Ya What I'm Gonna Do

Advance communications are often going to be the largest component of your "formal" user communication and are likely to tend toward the "static" rather than the "dynamic." You should outline (in greater or lesser detail as your situation demands) what services you will (plan to) provide -- for example, centralized file service, printing, user consulting, and authentication services. The list of services will presumably have been arrived at through a process of user consultation, executive fiat, or divine inspiration (or a combination of all three), tempered by your experience and expertise and suggestions on what might be most appropriate and useful in your organization. These are your "service offerings."

Next, you should document your goals for performance and availability, both for machines and networks and for guaranteed response and repair times. These are your "service-level agreements." For machines and networks, you would typically look to such metrics as percentage uptime, response times, and network latency guarantees. For example, you could state that your network file server will be unavailable less than an hour a month (roughly 99.9% uptime), that round-trip packet times between major points on your network will be less than 10 ms, or that there will always be at least 5 GB of available disk space (e.g., for image processing). For response and repair times, some examples might be next-business-day response for installing new personal network connections, accounts created within three hours, file restores within eight hours, first response to problem reports within two hours, etc. The important thing is to make sure that your goals and guarantees are aligned with the business needs of your organization. Figure out what services and activities are most important to your users and then determine how you can organize your resources (human and machine) to best balance those needs in delivering your services.
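When choosing an uptime figure, it can help to translate the percentage into an actual downtime allowance and back again. The following is a minimal sketch, not a standard tool; the function names and the 30-day month are assumptions made for illustration. Under that assumption, 99.9% uptime allows about 43 minutes of downtime a month, while a full hour of downtime works out to about 99.86%.

```python
from datetime import timedelta


def downtime_budget(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the downtime allowed within `period` at the given uptime target."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period * ((100.0 - uptime_percent) / 100.0)


def achieved_uptime(downtime: timedelta, period: timedelta) -> float:
    """Return the uptime percentage implied by `downtime` within `period`."""
    if downtime > period:
        raise ValueError("downtime cannot exceed the period")
    return 100.0 * (1.0 - downtime / period)


if __name__ == "__main__":
    month = timedelta(days=30)  # assumption: a 30-day month
    print("Budget at 99.9%:", downtime_budget(99.9, month))
    print("Uptime with one hour down:",
          round(achieved_uptime(timedelta(hours=1), month), 2))
```

Running the same calculation for each candidate target makes it easier to see whether a stated goal is one you can realistically staff for.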

The last component in setting expectations is policies and practices. Reasons to establish service and usage policies include:

conformance with, and the ability to serve, organizational goals (e.g., reserving CPU capacity for product developers working to get the next release out the door)
making it possible to meet service-level agreements (e.g., reserving two hours on Sunday so that maintenance doesn't interrupt activities during the week)
ensuring that users aren't interfering with services to others (e.g., no Quake during office hours)

As a service provider, you will be better off if you proactively define the policies and practices you will use to deliver your services. That is, it's better to define your own goals before some other less "reasonable" goals are imposed on you. You will typically want to consider such things as:

standard change/maintenance windows (e.g., Saturday mornings, Tuesday nights from 11:00 pm until 3:00 am, etc.)
emergency repair procedures, and what constitutes an "emergency"
standard operating hours for services such as the help desk and hardware repairs
a policy for off-hours response and a method of deciding what can wait and what has to be fixed immediately
a method for responding to unexpected "incidents," security-related and otherwise
escalation procedures in case goals aren't met or problems are more substantial than they first appeared
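Once maintenance windows are written down, scripts that apply changes can check them before acting. The sketch below is one possible approach, assuming a hypothetical table of weekly windows keyed by weekday; the Saturday hours are invented, since the text only says "Saturday mornings." The Tuesday window shows the one subtle case, a window that runs past midnight.

```python
from datetime import datetime, time

# Hypothetical window table: weekday (Monday=0) -> (start, end) in local time.
# An end time earlier than the start means the window runs past midnight.
WINDOWS = {
    1: (time(23, 0), time(3, 0)),  # Tuesday 11:00 pm until 3:00 am Wednesday
    5: (time(6, 0), time(12, 0)),  # Saturday morning (hours are an assumption)
}


def in_maintenance_window(when: datetime, windows=WINDOWS) -> bool:
    """Return True if `when` falls inside one of the weekly windows."""
    today = when.weekday()
    yesterday = (today - 1) % 7
    now = when.time()

    # A window that starts today.
    if today in windows:
        start, end = windows[today]
        if start <= end:
            if start <= now < end:
                return True
        elif now >= start:  # wraps past midnight; we are in the evening part
            return True

    # A window that started yesterday and wrapped past midnight.
    if yesterday in windows:
        start, end = windows[yesterday]
        if start > end and now < end:
            return True

    return False
```

A change script could refuse to run (or demand an explicit emergency override) whenever this check returns False.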

You will also probably want to outline possible remedies or repercussions for those times when you fail to meet your stated goals and guarantees. For an internal service organization, these are more likely to involve public "humiliation," bad performance reviews, or lousy parking spots in the company lot. For an external service provider (e.g., an ISP or a computing service bureau), the result of a failure to meet the goals and guarantees is likely to involve money.

Finally, you will need to outline the policies that your user community is expected to follow. These policies contribute to reliability by (we hope) freeing you from having to deal with malicious or unexpected acts by your users and establishing a base of understanding between users and service providers. The better the understanding between the two groups, the easier it is to provide a reliable and understood service. You will likely want to cover such areas as:

security, password sharing, snooping, sniffing, hacking, etc.
virus protection, and under what circumstances it is acceptable to add software to your systems and networks
protection of the organization's equipment (e.g., no latté with more than three sugars allowed within two feet of a keyboard)
disk-space limits, CPU hogging, and other resource-consumptive areas of contention
what may or may not be connected to the network, and where (inside or outside the firewall, etc.)

Change, Status, and Failure Reports

These reports should be part of ongoing, day-to-day communications with your user community. The two most important attributes of these reports are timeliness and completeness -- you need to provide the information that your users want, need, or deserve at the most appropriate time.

Change reports outline a planned change to your systems or networks and its impact (or, even better, lack of impact) on your users, and summarize or confirm its implementation. These reports are important so that your users can plan their activities around expected service disruptions, and so that they can understand and prepare for changes to software or interfaces. Reports would typically outline the expected date and time of the change, its impact, and how to obtain additional information. Internally, you should also document (in advance) how to implement that change, how to test the implementation, and, most important, how to back out of the change if necessary. Change reports should be issued far enough in advance that users have adequate preparation and warning time, but not so far in advance that they are forgotten.

Status reports are the ongoing mechanism by which your services can be measured, both for trend analysis and to allow for evaluation against your service-level agreements. You would typically track as many of your defined metrics as possible in log files, graphs, printed reports, Web pages, summary numbers, and so forth. I recommend MRTG[1] as a terrific way to graphically track virtually anything against the calendar -- network traffic, uptime, users, disk space, news or mail traffic, routing-table size, numbers of outstanding trouble tickets, and on and on. If you've got the time and you can automate the collection and reporting (and it won't adversely affect your systems), track as many metrics as you can think of, even if you don't publish them all. The more information you have, the easier your troubleshooting and capacity planning will be. It's common in customer service organizations to track various service metrics (time to resolution, phone queue time, calls per hour, etc.). Many (most?) system administration organizations could benefit from more proactive status reporting.
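Automated collection does not have to be elaborate. As a rough illustration (this is not how MRTG works, and the file layout and function names here are assumptions), a small script run on a schedule could sample one metric and append it to a log file for later graphing or reporting:

```python
import csv
import shutil
import time
from pathlib import Path

FIELDS = ["timestamp", "metric", "value"]


def sample_disk_free(path: str = "/") -> dict:
    """Take one sample of free disk space (in bytes) for `path`."""
    usage = shutil.disk_usage(path)
    return {
        "timestamp": int(time.time()),
        "metric": f"disk_free_bytes:{path}",
        "value": usage.free,
    }


def append_sample(log_file: Path, sample: dict) -> None:
    """Append one sample to a CSV log, writing the header on first use."""
    new_file = not log_file.exists()
    with log_file.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(sample)
```

Calling `append_sample(Path("metrics.csv"), sample_disk_free("/"))` every few minutes would build up the kind of history that trend analysis and capacity planning depend on. Other metrics can follow the same pattern with their own sampling functions.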

Failure reports are, obviously, something that we would all like to keep to a minimum. They are the method by which you report system or network "incidents" to your customers, and (ideally) document your plans to ensure that such failures are reduced or eliminated in the future. They can also serve as a symbolic way for you to "take responsibility" for an outage and demonstrate your commitment to improving your service. Don't feel you need to wait until an outage is resolved to issue a failure report; your users will appreciate your openness and consideration and will also be less likely to contact the help desk about expected up times if you've already posted bulletins. The latter can be quite useful if you're with a small organization and the people responsible for answering the phones are the same people who are busily trying to fix the problem.

Statistics, History, and Revisiting the Future

This is where analysis and prediction come into play. Using the information that you have gathered -- along with projected changes in usage, information on new projects, and normal usage increases -- you can generate current and historical statistics, identify trends, and predict the future. And when you've done it once, you will also be able to use it to repredict the future and compare your predictions to what actually happened. The capacity-planning uses of this information are obvious, and these reports are also a good way to demonstrate your professionalism to your users (and your management!).
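As a simple illustration of trend-based prediction, the sketch below fits a straight line to past disk usage by ordinary least squares and estimates when the disk would fill. The function names are made up for this example, and a straight line is an assumption that real usage often violates, so treat the result as a rough planning aid rather than a forecast.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days: Sequence[float], used_gb: Sequence[float],
                    capacity_gb: float) -> Optional[float]:
    """Estimate days remaining after the last sample until usage hits capacity.

    Returns None when usage is flat or shrinking, since no fill date exists.
    """
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None
    full_day = (capacity_gb - intercept) / slope
    return max(0.0, full_day - days[-1])
```

Rerunning the estimate as new samples arrive, and comparing earlier estimates with what actually happened, is one way to "repredict the future" as described above.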

Communication Initiated by Users

In order to be able to correct problems, deal with "incidents," and enable some ongoing improvement of your processes, you need to provide ways for your user community to get you the information you need (and the information that they need you to have). I've defined this communication as user-initiated in an attempt to gently remind you that communication must be two-way -- you (of course) have a duty to respond (preferably in a timely and effective way).

It typically makes sense to think of two sets of user-initiated communication:

the ongoing day-to-day problem reports, help requests, and requests for enhancement (and, if you're very good, thank you notes for a job well done)
the periodic status reports, general reviews, direction or policy guidelines, and overall satisfaction indicators that help determine your direction and activities

The first can be expected to come from almost anyone within your organization (and sometimes from outside your organization too), while the second are more likely to come from organizational management, steering committees, user groups, and the like, as well as your own surveys and inquiries.

The most common communication method is (of course) the telephone. Most of us have probably experienced those calls out of the blue reporting some major problem, with an expectation that we will drop everything and solve it immediately. But as I review the communication process, I'll also review the alternative communication methods that you should consider and/or support.

Let's assume that you have some number of people responsible for dealing with user or customer queries (the "help desk"), and divide the communication process into three stages: the request, tracking and resolution, and your response. (These three stages apply to both sets of user-initiated communication, but you will of course adjust your reaction and process depending on the type of communication.)

The Request

When someone needs help, or wishes to report a problem or request an enhancement, they need some "reporting method." Consider these alternatives:

Phone: a "well-known" generic problem-reporting phone number, with responsibility for answering it distributed among your help-desk staff in some reasonable fashion (queue, dedicated "on-call" person, round robin, whoever isn't busy, etc.). It is, of course, convenient if there is mnemonic value to the number (e.g., company extension 4357 -- HELP), but the important things are that it exists and it's publicized. Note that some telephone systems can provide call statistics for your help-desk calls, which is handy when you're trying to prove that you need more staff.

Email: a well-known email alias -- most of the same considerations apply here as for telephone contacts.

Newsgroup postings: a local newsgroup used to report problems or request enhancements in larger environments. (I've seen this used in a university.) This can cause the users to feel a little bit like they're sending their problem floating off like a message in a bottle, but with prompt response and tracking it can be effective. A useful side effect of this method is that it makes it possible for other users to reply, which can lessen your overall support load. This is most likely to occur in an environment like a university where large numbers of people (i.e., the students) tend to be enthusiastic generalists with a low per-hour cost and some amount of free time, or at least time available for procrastination.

Office: a consulting office or physical help desk, where users or customers can walk in and (one hopes) get helped while they wait. This is also a great place for distributing printed documentation.

Web form: the obvious method for the late 1990s, which could generate email, trouble tickets, or (potentially) even voice mail (but why would you bother?), or just about anything else. But make sure that the form is easily findable on (or from) your organization's internal Web site.

Hallway chat: Try to avoid this one, because you'll never remember all the details, you'll forget about it entirely, or something else will go wrong and get in the way of addressing the original request. (The even more problematic versions of this include the barstool chat, the running out the door "by the way, I've got a question," and the offhand comment in the washroom.) I always make it a practice to say something like "Sure, we'll get right on it, can you send some mail to `request' summarizing the situation?"

Regardless of the method used to make the request, the better and more complete the information you get with the request, the easier it will be to solve the problem; and the better you are at solving problems, the more reliable your systems and services will be. Consider the use of some form of problem-reporting form or checklist to help improve the quality of your initial data collection.

Tracking and Resolution

Once you receive a request, you must use some method to keep track of it to make sure that it doesn't get forgotten. Regardless of how simple or how sophisticated your tracking system is, it will probably involve the following six activities:

Recording - the initial information received from the user
Delegation - assigning the task to someone
Tracking and note-taking - during the investigation of the problem
Resolution - an indication that the problem or request has been fixed or addressed
Reporting - notification to the user of the resolution, or of ongoing efforts if the request is still in progress
Archiving - a work record and (buzzword alert) "knowledge base" for future reference

We often don't analyze the process in such depth, but even those little pink telephone message slips can be used as the mechanism to implement those six activities. You might, however, appreciate the added features offered by even the most rudimentary automated tracking systems.
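To show how little is needed to cover the six activities, here is a minimal in-memory sketch. It is not modeled on any particular tracking product; the class, field, and method names are all invented for illustration, and a real system would add persistent storage and email notification.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Ticket:
    number: int
    requester: str
    summary: str                                      # Recording
    opened: datetime
    assignee: Optional[str] = None                    # Delegation
    notes: List[str] = field(default_factory=list)    # Tracking and note-taking
    resolved: Optional[datetime] = None               # Resolution
    reports: List[str] = field(default_factory=list)  # Reporting

    def add_note(self, when: datetime, text: str) -> None:
        self.notes.append(f"{when.isoformat()} {text}")

    def report(self, text: str) -> str:
        """Build a status message for the requester and keep a copy."""
        message = f"[#{self.number}] {self.summary}: {text}"
        self.reports.append(message)
        return message


class TicketLog:
    """Holds every ticket; the resolved ones double as the archive."""

    def __init__(self) -> None:
        self._tickets: Dict[int, Ticket] = {}

    def record(self, requester: str, summary: str, opened: datetime) -> Ticket:
        number = len(self._tickets) + 1
        ticket = Ticket(number, requester, summary, opened)
        self._tickets[number] = ticket
        return ticket

    def delegate(self, number: int, assignee: str) -> None:
        self._tickets[number].assignee = assignee

    def resolve(self, number: int, when: datetime, text: str) -> str:
        ticket = self._tickets[number]
        ticket.add_note(when, text)
        ticket.resolved = when
        return ticket.report(f"resolved: {text}")

    def open_tickets(self) -> List[Ticket]:
        return [t for t in self._tickets.values() if t.resolved is None]

    def archive(self) -> List[Ticket]:                # Archiving
        return [t for t in self._tickets.values() if t.resolved is not None]
```

The ticket number returned by `record` is what gives the requester something to quote later, and `open_tickets` is the to-do list that still has to be reviewed by a person.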

The key benefits of a problem- and request-tracking system include:

Reassurance for the user or customer by the assignment of a ticket number, or by query and status tools and messages if they're available. It provides an indication that you take user/customer response seriously.
An ongoing work record, which is useful for keeping track of changes, for standard answers and fixes, and as a customer service log. It's far easier to convince management that you need more people if you have reliable statistics to back up your claim of being overworked.
Provision of a to-do list and a mechanism to ensure that nothing gets lost. But make sure that you review the list -- it won't help if you just record the request and then forget all about it.
Enabling outstanding requests to be escalated for more focused attention (even if it's just your boss reviewing the two-month-old requests and asking you pointed questions).
A conversation/interaction history, which makes it far easier to pass a task on to someone else and get them up to speed.
Aid in assuring a measure of consistency in the way you respond to requests, which will lead to more effective and efficient support.
A training tool for staff. If they can review what others have done before them, they can do better themselves.

All of these improve your reliability.
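Two of these benefits, escalation and workload statistics, fall out of the tracking data almost for free. The sketch below assumes a hypothetical `Request` record with open and resolve times; the 14-day threshold is an arbitrary example, not a recommendation.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional, Sequence


class Request(NamedTuple):
    number: int
    opened: datetime
    resolved: Optional[datetime]  # None while the request is still open


def needs_escalation(requests: Sequence[Request], now: datetime,
                     max_age: timedelta = timedelta(days=14)) -> List[Request]:
    """Return open requests older than `max_age`, oldest first."""
    stale = [r for r in requests
             if r.resolved is None and now - r.opened > max_age]
    return sorted(stale, key=lambda r: r.opened)


def mean_time_to_resolution(requests: Sequence[Request]) -> Optional[timedelta]:
    """Average time from open to resolve, or None if nothing is resolved yet."""
    durations = [r.resolved - r.opened
                 for r in requests if r.resolved is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

A weekly report built from these two functions gives your boss the list of old requests to ask pointed questions about, and gives you the numbers to support a staffing request.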

Your Response

I mentioned the need to report and reply to the requester, but it's worth repeating. The purpose of all this effort is to reply to the requester, providing advice, a fixed bug, a plan, a report, or even a statement or apology that you're unable to help (because of resource constraints, different areas of responsibility, and sometimes impossible problems). Your response should be timely, complete, and correct, and it's almost always worthwhile to provide interim status reports if it will take a nontrivial amount of time to respond to a request. Consider the method you use to deliver your response; choose among email, paper, phone, or face to face, depending on the problem, the response, and the person who made the initial request.

Education, Training, and Publications

One of the best ways to make yourself (or your group) more effective is to help your users and customers to help themselves whenever possible. A one-time investment in training materials (with a small amount of ongoing maintenance and revision) can provide a lasting positive effect on the effectiveness of all involved (service provider and service consumer). To tie this more clearly to reliability: a more effective user community will make fewer errors, which leads to a more reliable organization. And a system administrator who has fewer user requests to handle (because of a better user education, documentation, and training program) will be better able to deal with the systems themselves, the long-term planning, and the problem avoidance that all contribute to higher reliability.

Consider some of the following mechanisms for getting the word out:

Email or Newsgroup Postings

These are probably most appropriate for notices, brief announcements, etc., and usually are not very effective as an educational tool. One exception I have seen (as mentioned above) is the use of a local newsgroup for posting and resolving problems, which often provides a good learning environment for the innocent bystanders reading the group for entertainment.

UNIX Man Pages and Other Traditional Help Systems

As more computing activity happens in a GUI workstation environment, the traditional text-based help systems and man pages are losing some relevance for end users. But these methods are still very important for system administration and other behind-the-scenes activities, and also for people working in a command-line environment. And it is of course possible to put a Web front end on traditional help text.

Web Pages

It's probably a fair bet that the vast majority of internal systems documentation is being put on the Web now. For client-server, general process documentation, help desk, desktop support, and the like, there's probably no better alternative.

Local Guides

Local booklets and user guides can be very useful, as long as you have a reasonably sized topic area that doesn't need constant updating. An example might be a user and security policies document. While you would almost certainly want to make an online version available, there's still a lot of value in printed materials. I'll mention the SAGE "Short Topics in System Administration" series as a successful and effective example of local community guides (as long as you're willing to agree that SAGE is primarily a community).

One-page Guides

A number of universities and large organizations have developed very effective single-page topic guides, intended to be handed out from the help desk or consulting office. These are typically intended to be almost complete references to every (local) thing you need to know, on topics such as setting up PPP, introduction to email, and how to print. When someone comes in to ask a question, you can provide the answer and also send them away with prewritten instructions.

Classes and Tutorials

Of course, sometimes there is no good substitute for good old-fashioned face-to-face learning, in a classroom or workshop setting. These are probably more effective for larger, more involved topics, or for areas in which it's important to get a fast start. A common example is when major new applications are put into place (such as a new purchasing system), and people have to be up and running almost immediately.

Any training or information mechanism will take time to put together and get going. Resist the urge to put it off for another day. If you can't find the time yourself, hire a contractor or a third-party firm, or buy prepackaged course materials. The time you save may be your own.

Next Time

In the next, and most likely last, "On Reliability" article, I plan to cover certain aspects of security and review how your security policies and practices affect the reliability of your systems. As always, if I've left something out, or I've forgotten your favorite topic, I'd enjoy hearing from you.

References

[1] Tobias Oetiker, "MRTG -- The Multi Router Traffic Grapher," The Twelfth Systems Administration Conference (LISA '98) Proceedings, December 6-11, 1998, pp. 141-147.

A modest collection of links to problem- and request-tracking systems is at <http://www.net/~jsellens/tracking/>.

22 Mar. 1999 jr
Last changed: 23 Mar. 1999 jr
