USENIX ;login:

fundamentals of troubleshooting TCP/IP

Tim Hoff is a network administrator. He has earned Cisco's CCNA and CCDA certifications.

As a senior system and network administrator, I have frequently observed a lack of understanding of computer networks in general and TCP/IP in particular. The problem is most acute in help-desk and first-level desktop support. Understanding TCP/IP will enable help-desk and first-level support to isolate network problems more quickly. Correct and efficient problem identification reduces the duration of outages, thereby minimizing lost-productivity costs and improving customer satisfaction.

Consider the following scenario. Around 7:30 a.m. a user in accounting calls the help desk to report that the financial database is down. The user is very anxious to get access, because month-end reports are due to the CFO by 10:00 a.m. The help desk hangs up and sends a 911 page to the on-call database administrator to check the database server. This may be the right thing to do. But what if the network connection between the user's PC and the server is broken? In that case, the DBA is annoyed (not as much as by a page four hours earlier, but still annoyed) and time has been wasted. The problem still isn't fixed, so someone has to be paged to check the network. But there may not be one single person for this. There might be someone who deals with routers and someone else responsible for the PC network card and patch cable. How can a technician determine whether or not the problem is network-related?

The purpose of this article is to provide a basic framework for help-desk and first-level support to understand the basics of TCP/IP networks and give them a process to help identify the probable causes of problems. I'll start with a description of end-node configuration and TCP/IP addressing. End-nodes are devices such as PCs, printers, and servers. For convenience, I will simply refer to PCs. Then I'll outline a flowchart of general troubleshooting steps and common tools that are available on Microsoft Windows PCs (and many other operating systems).

TCP/IP Configuration

When a PC is attached to a TCP/IP network, several parameters must be set. First, every device needs a unique IP address. IP addresses are 32 bits long and are partitioned into a network portion and a host portion. Second, the PC needs a subnet mask, whose job is to identify the network portion explicitly. Finally, the PC needs the address of a default gateway or router. Whenever the PC cannot talk directly to the intended destination, it sends the data to a router. The router's job is to forward packets toward the final destination.

IP addresses are frequently written in a form known as dotted-decimal or dotted-quad. The 32-bit IP address is broken into four groups of eight bits (octets), and each group is written as a decimal number. In binary, eight bits can represent values between 0 and 255. With an eight-bit host portion, valid host numbers fall in the range of 1 to 254: a host portion of all zeros names the network itself, and a host portion of all ones (255 in decimal) is the broadcast address. For example, 192.168.30.255 means all nodes on the 192.168.30.0 network, assuming classful addressing is used.
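Dotted-decimal notation is only a readable spelling of a single 32-bit number. The following is a minimal Python sketch of the conversion in both directions; the function names are my own and are not part of any standard tool.

```python
def dotted_to_int(address):
    """Convert a dotted-decimal string such as '192.168.30.255' to a 32-bit integer."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(o < 0 or o > 255 for o in octets):
        raise ValueError("not a dotted-decimal IPv4 address: %r" % address)
    value = 0
    for octet in octets:
        value = (value << 8) | octet  # shift in eight bits at a time
    return value


def int_to_dotted(value):
    """Convert a 32-bit integer back to dotted-decimal notation."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


# Show the broadcast example from the text as 32 binary digits.
print(format(dotted_to_int("192.168.30.255"), "032b"))
```

The last eight binary digits printed are all ones, which is what marks the address as a broadcast on a network with an eight-bit host portion.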

Originally, the Internet Protocol provided huge, large, and big networks referred to as class A, B, and C respectively. A class A network used the first octet (or eight bits) for the network number, leaving 24 bits for the host number. This provides for 16,777,214 host addresses (2 to the 24th power, minus the all-zeros and all-ones values). Class B and C networks used the first two and three octets respectively. This scheme works very well with dotted-decimal notation.
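The host count for each class follows from the number of host bits. A short Python sketch of the arithmetic (the helper name is hypothetical):

```python
def usable_hosts(host_bits):
    """Number of host addresses available with the given number of host bits."""
    # The all-zeros value (the network itself) and the all-ones value
    # (the broadcast address) cannot be assigned to a host.
    return 2 ** host_bits - 2


for name, network_bits in (("A", 8), ("B", 16), ("C", 24)):
    print("Class", name, "hosts per network:", usable_hosts(32 - network_bits))
```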

Unfortunately, classful addressing could not keep up with the Internet's tremendous growth. As a result, people have been forced to switch to classless addressing and routing, known as CIDR (Classless Inter-Domain Routing). Instead of using fixed increments of 8, 16, and 24 bits for the network number, any length between 1 and 30 bits can be used. The bits assigned to the network portion must begin with the leftmost bit and be contiguous. For example, 255.255.192.0 is a valid subnet mask of length 18 bits. However, 255.255.193.0 and 255.255.208.0 are invalid because there are gaps in the bits of the mask when written in binary form.
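The contiguity rule is easy to check mechanically. The sketch below, in Python, writes a mask out in binary and rejects it if a zero appears before the last one bit. The function name is an assumption of mine, and the check covers contiguity only; it does not enforce the 1-to-30 range.

```python
def prefix_length(mask):
    """Return the number of leading one bits in a dotted-decimal subnet mask.

    Raises ValueError if the one bits are not contiguous from the left.
    """
    octets = [int(part) for part in mask.split(".")]
    if len(octets) != 4 or any(o < 0 or o > 255 for o in octets):
        raise ValueError("not a dotted-decimal mask: %r" % mask)
    bits = "".join(format(o, "08b") for o in octets)
    ones = bits.rstrip("0")  # drop the trailing host bits
    if "0" in ones:
        raise ValueError("mask %s has a gap: %s" % (mask, bits))
    return len(ones)


for mask in ("255.255.192.0", "255.255.193.0", "255.255.208.0"):
    try:
        print(mask, "is valid, length", prefix_length(mask))
    except ValueError as error:
        print("invalid:", error)
```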

It is still common for people to refer to class A, B, and C networks. However, this is usually just a way of saying 8, 16, and 24 bits respectively. Actual classes are determined by the following rules: If the first bit is a zero, then it's a class A address; if the first bit is one and the second bit is zero, it's a class B address; if the first two bits are both one and the third bit is a zero, then it's a class C address. Table 1 shows the possible address ranges based on these rules.

               Start            End

Class A        1.0.0.0          126.0.0.0
Class B        128.0.0.0        191.255.0.0
Class C        192.0.0.0        223.255.255.0

Table 1. Traditional Network Classes
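The leading-bit rules can be written as a few bit tests on the first octet. This is a Python sketch with a hypothetical function name; it returns "other" for addresses outside the three classes described here.

```python
def address_class(address):
    """Classify a dotted-decimal address by the leading bits of its first octet."""
    first_octet = int(address.split(".")[0])
    if (first_octet & 0b10000000) == 0:
        return "A"  # first bit is zero
    if (first_octet & 0b11000000) == 0b10000000:
        return "B"  # first bit one, second bit zero
    if (first_octet & 0b11100000) == 0b11000000:
        return "C"  # first two bits one, third bit zero
    return "other"  # first three bits are all one


for sample in ("10.1.2.3", "172.16.24.13", "192.168.30.1"):
    print(sample, "is class", address_class(sample))
```

Note that the bit test alone also places 0.x.x.x and 127.x.x.x in class A. Table 1 starts at 1 and stops at 126 because those two ranges are reserved; 127.0.0.1 is the loopback address.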

Classful addresses implied a subnet mask. Today, a subnet mask must be specifically configured. The role of the subnet mask is "neighbor determination." Consider a PC with IP address 172.16.24.13 that wants to send a packet to 172.16.30.5. If the two PCs are on the same subnet, then the sender and receiver can communicate directly. Otherwise, the sender must hand the packet off to a router, specified by the default gateway, for delivery to the final destination.

The IP address 172.16.24.13 falls into the range of class B networks. With the implied 16-bit mask, the sender and receiver would be on the same network (i.e., 172.16.0.0). However, the sender is configured with a subnet mask of 255.255.255.0. The sender applies this mask to its own IP address using a binary-AND operation to extract the "network portion." It performs the same computation on the destination address. If the two results match, then the destination is a neighbor. In this case, however, 172.16.24.0 and 172.16.30.0 are different, so the sender must use its default gateway.
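The neighbor test can be reproduced in a few lines. This Python sketch uses the standard ipaddress module only to turn dotted-decimal strings into integers; the function names are my own.

```python
import ipaddress


def network_portion(address, mask):
    """AND an address with a mask and return the result in dotted-decimal form."""
    value = int(ipaddress.IPv4Address(address)) & int(ipaddress.IPv4Address(mask))
    return str(ipaddress.IPv4Address(value))


def is_neighbor(source, destination, mask):
    """True if both addresses have the same network portion under the sender's mask."""
    return network_portion(source, mask) == network_portion(destination, mask)


# The example from the text, first with the configured 24-bit mask,
# then with the 16-bit mask that class B would have implied.
for mask in ("255.255.255.0", "255.255.0.0"):
    print(mask,
          network_portion("172.16.24.13", mask),
          network_portion("172.16.30.5", mask),
          is_neighbor("172.16.24.13", "172.16.30.5", mask))
```

The same pair of addresses is a neighbor under one mask and not under the other, which is why a wrong subnet mask on a PC can make some local machines unreachable while everything else appears to work.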

The final step is to resolve the IP address into a data link address. For those familiar with the OSI seven-layer model: IP operates at layer three, the network layer. The network layer is capable of routing traffic between networks. The data link layer, layer two, handles physical addressing. It determines which devices attached to a wire listen to a given transmission. Layer-two addresses are known as MAC or burned-in addresses. Every network adapter must have a unique 48-bit address assigned by the manufacturer.

IP addresses are translated into MAC (hardware) addresses for transmission over the local Ethernet or token-ring network. The mechanism by which the sender discovers the MAC address is called ARP (Address Resolution Protocol). If the destination is a neighbor, then a frame is created using the recipient's MAC address as the destination field. Otherwise, the router's MAC address is used. In either case, the IP packet header will include the recipient's IP address in the destination field.
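To make the split between the two layers concrete, here is a Python sketch that chooses the frame and packet destinations for one outgoing packet. The ARP table is a plain dictionary, and the gateway address and MAC addresses in it are invented for illustration. A real stack would broadcast an ARP request for a missing entry, whereas this sketch simply raises KeyError.

```python
import ipaddress


def same_subnet(a, b, mask):
    """Compare the network portions of two addresses under one mask."""
    m = int(ipaddress.IPv4Address(mask))
    return (int(ipaddress.IPv4Address(a)) & m) == (int(ipaddress.IPv4Address(b)) & m)


def address_frame(src_ip, dst_ip, mask, gateway_ip, arp_table):
    """Pick the layer-two and layer-three destinations for one outgoing packet."""
    next_hop = dst_ip if same_subnet(src_ip, dst_ip, mask) else gateway_ip
    return {
        "frame_dst_mac": arp_table[next_hop],  # MAC of the neighbor or the router
        "packet_dst_ip": dst_ip,               # always the final recipient
    }


# Made-up ARP table: one default gateway and one neighbor.
ARP_TABLE = {
    "172.16.24.1": "00:00:0c:aa:bb:01",
    "172.16.24.20": "00:a0:c9:aa:bb:02",
}

print(address_frame("172.16.24.13", "172.16.30.5",
                    "255.255.255.0", "172.16.24.1", ARP_TABLE))
```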

Finally, the domain-name system (DNS) needs to be mentioned. Technically, DNS is not required for IP to work. However, in practice the Internet would grind to a halt without it. Human beings are simply better at remembering names than numbers. Can you imagine how much fun it would be to surf the Web if you had to remember addresses like 204.71.200.74, 131.106.3.253, and 209.249.27.175 instead of www.yahoo.com, www.usenix.org, and ftp.download.com? DNS provides a distributed database that maps host names and IP addresses. The downside is that applications often fail when names cannot be resolved. Since most software is written to work with either names or addresses, temporarily substituting IP addresses for host names can be a useful debugging technique and also a workaround for DNS problems.
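The debugging technique of substituting an address for a name can be sketched in Python as follows. The function name and its return convention are assumptions of mine; the sketch relies only on the standard socket and ipaddress modules.

```python
import ipaddress
import socket


def resolve(target):
    """Return (address, used_dns) for a host name or an IP address literal.

    address is None when the name cannot be resolved.
    """
    try:
        # An address literal needs no lookup, so DNS is out of the picture.
        return str(ipaddress.ip_address(target)), False
    except ValueError:
        pass
    try:
        return socket.gethostbyname(target), True
    except OSError:
        # Covers resolver failures such as an unknown host.
        return None, True


print(resolve("131.106.3.253"))
```

If an application fails when given a name but works when given the address, the evidence points at name resolution rather than at the network path or the server.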

Troubleshooting Methodology

TCP/IP networks are based on connectionless datagrams. A datagram can be thought of as a postcard and the network as the postal service. A postcard has sections for the recipient's address and a brief message. The sender drops the postcard in a nearby mailbox, where it is picked up by the postal service. The postal service forwards the message from office to office on its way to the final destination. Two pieces of mail may travel different routes or arrive in a different order from that in which they were sent. If for some reason the card cannot be delivered, the post office may stamp the reason on it and return it to the sender. Sometimes the mail just disappears, possibly forever.

Figure 1. Flowchart of general troubleshooting steps

Networks behave like the postal service, only much faster. However, in networking there are more options for discovering which messages are getting through and which are not. The flowchart can be used to help identify the most likely cause of connectivity problems.

Before starting on the process below you should verify that the PC has the configuration settings discussed earlier. Depending upon your site, these parameters may be set manually or you may be using DHCP (Dynamic Host Configuration Protocol). You should also verify that the TCP/IP software was correctly installed and initialized on the machine. You can do this by pinging either the PC's IP address or 127.0.0.1. This loopback address is reserved to refer to the local machine.
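The loopback test can also be scripted. The following Python sketch builds the ping command for the local platform and reports whether it exited successfully. The function names are hypothetical; the count option is -n on Windows and -c on most other systems.

```python
import platform
import subprocess


def ping_command(host, count=4):
    """Build the ping command line for the local operating system."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def ping(host, count=4):
    """Run ping and return True if it exits with status 0."""
    result = subprocess.run(
        ping_command(host, count),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling ping("127.0.0.1") should return True on a machine whose TCP/IP software is installed and initialized. Treat the exit status as a rough signal only, because its exact meaning depends on the ping implementation; when in doubt, read the program's output.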

The ping program is a standard utility for testing network reachability. The name is often expanded as "packet internetwork groper," although the program was originally named after the sound of a sonar pulse. It works by sending an "ICMP echo request" packet to the destination. If the destination receives an echo request, it is supposed to transmit an echo reply to the sender. The ping application usually displays a status message in response to each echo request. The message is usually "Reply from w.x.y.z," with a round-trip time and the TTL of the reply, or "Request timed out." Occasionally, you will see status messages indicating "destination unreachable" or "TTL expired in transit." These are clues that a network link may be down or that there may be a routing loop. The exact error messages will vary depending on the implementation of ping on your computer.
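When collecting ping results in a script, it helps to sort each output line into one of these categories. The Python sketch below matches on the phrases quoted above; because wording differs between implementations, the phrase list is an assumption that you should adapt to the ping on your own systems.

```python
def classify_ping_line(line):
    """Sort one line of ping output into a rough category."""
    text = line.lower()
    # Check the error phrases first: some implementations prefix them
    # with "Reply from <router>".
    if "unreachable" in text:
        return "unreachable"
    if "ttl expired" in text or "time to live exceeded" in text:
        return "ttl_expired"
    if "timed out" in text:
        return "timeout"
    if "reply from" in text or "bytes from" in text:
        return "reply"
    return "other"


for sample in ("Reply from 192.168.30.1: bytes=32 time<10ms TTL=128",
               "Request timed out.",
               "Reply from 10.0.0.1: TTL expired in transit."):
    print(classify_ping_line(sample), "<-", sample)
```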

Like many network applications, ping understands host names and IP addresses. If you try to ping a host name that cannot be resolved using DNS (or some other lookup), the ping software will usually indicate an unknown host or bad IP address. This may be the result of a problem with the DNS server itself, or it may be that the network connection to the DNS server is down. If the name server responds to ping attempts, then nslookup can be used to verify that the DNS process is alive.

If the destination does not respond to ping attempts, the next step is to determine whether or not the destination is a neighbor. If it is on the same subnet, it may be a cable, hub, or switch problem (i.e., layers one and two of the OSI model) or the PC may be hung or powered off. If the destination is not local, then the problem may be related to the router. However, because the default gateway is also a neighbor, it may still be a hardware problem.
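The decisions described so far can be collected into one function. This Python sketch follows the prose of this section rather than the figure, so the order of the checks and the wording of the results are my own assumptions. Each argument is the outcome of a test that a technician has already run.

```python
def likely_cause(loopback_ok, name_resolves, destination_replies,
                 destination_is_neighbor, gateway_replies):
    """Suggest where to look next, given the results of the basic tests."""
    if not loopback_ok:
        return "TCP/IP software or configuration on the PC"
    if not name_resolves:
        return "DNS server, or the network path to it"
    if destination_replies:
        return "network path is working; refer to the application group"
    if destination_is_neighbor:
        return "cable, hub, or switch, or the destination is hung or powered off"
    if not gateway_replies:
        return "cable, hub, or switch, or the default gateway itself"
    return "a router or link beyond the default gateway"


# The accounting scenario: the database server is on another subnet and
# does not answer, but the default gateway does.
print(likely_cause(True, True, False, False, True))
```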

If the default gateway responds to pings, then the traceroute utility can be used to explore the data path between the PC and the destination. Traceroute, which is also known as tracert and trace depending on your operating system, makes use of the time-to-live field of the IP header. Each time that a router processes a packet it decrements the TTL field. When the field reaches zero the router sends back an ICMP message informing the sender that the packet expired in transit. Traceroute starts out with a TTL of one and increments the counter until the final destination responds or some maximum value is reached. Even if you do not have access to the router for viewing its configuration, you can discover the path that the packets are traveling and where the path ends or loops back on itself.
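The TTL mechanism is easiest to see in a simulation. The Python sketch below sends no packets at all: the path is a list of made-up addresses, with the routers first and the destination last, and the default maximum of 30 is an assumption.

```python
def send_probe(path, ttl):
    """Simulate one probe along path and return (responder, reached_destination)."""
    for router in path[:-1]:
        ttl -= 1  # each router decrements the TTL
        if ttl == 0:
            return router, False  # this router reports that the packet expired
    return path[-1], True  # the probe survived every router


def trace(path, max_ttl=30):
    """Collect the responders for TTL values 1, 2, 3, ... as traceroute does."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        responder, reached = send_probe(path, ttl)
        hops.append(responder)
        if reached:
            break
    return hops


# Two routers between the PC and the destination (all addresses invented).
print(trace(["172.16.24.1", "10.0.0.1", "172.16.30.5"]))
```

Each probe gets one hop farther than the previous one, so the list of responders is the path in order. In a real trace, a list that stops short of the destination or keeps repeating the same routers shows where the path ends or loops.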

Summary

Troubleshooting network problems can be challenging. This article provides a framework to guide the reader through some basic diagnostic tests. Each test provides more information about what is working and what is not. Running the tests is straightforward. The skill lies in assembling the individual pieces into an accurate picture of the situation. Developing diagnostic skills requires practice.

The process described in this article covers the fundamentals. You can become more effective by understanding your local environment. If your organization uses managed hubs or switches, these devices may also have IP addresses. Use ping to test connectivity to these as well as to the default gateway. Also, most sites use an addressing scheme that makes router addresses predictable. For example, our routers and switches have host addresses in the range of 1 to 30, assigned starting with 1. Talk with the network administrators about diagnostic tests that are appropriate for your environment.
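If your site follows a convention like ours, the addresses worth pinging can be generated rather than remembered. This Python sketch lists host numbers 1 through 30 on a subnet; the function name, the default range, and the sample subnet are all assumptions to replace with your site's own scheme.

```python
import ipaddress


def infrastructure_candidates(network, first=1, last=30):
    """List the addresses where routers and switches are expected on a subnet."""
    base = int(ipaddress.ip_network(network).network_address)
    return [str(ipaddress.ip_address(base + offset))
            for offset in range(first, last + 1)]


candidates = infrastructure_candidates("192.168.30.0/24")
print(candidates[0], "through", candidates[-1])
```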

Remember that when people report a problem they tend either to state the problem in terms of the high-level task they are trying to perform or to claim that a specific component is broken. Our hypothetical user reported that the "financial database was down." It is easy to accept the user's statement as true and forward the call to the DBA. However, taking the time to determine what is still working will help to identify which problems are network-related and which problems need to go to the appropriate application group.