Saturday, August 24, 2013

Network emergencies

Sometimes things go wrong
Even when you've built a good network, there is always something outside your control that happens unexpectedly and that you could not have foreseen. And when things go wrong, they go horribly wrong.

Try to think of the last time you had an emergency in your network. Initially nobody understands what has happened or where. Everybody is guessing and working frantically to figure out what's going on and how to fix it. The people who have to talk to the network users are desperately trying to get any information they can pass on, and in doing so they disturb your work and your line of thought. Everybody is on edge and frustrated.

This way of "dealing" with emergencies is unproductive, frustrating and inefficient. If you or your company don't have a simple, easy-to-remember plan for dealing with emergencies, this pattern is going to repeat itself every time there is an emergency. If you have ever taken a first aid course, you know a simple, easy-to-remember plan that takes you through an emergency situation: 1. Stop the accident; 2. Give life-saving first aid; 3. Call for help; 4. Give normal first aid. You need something similar for network emergencies.

In this post I'll try to give you my idea of a plan. It should work regardless of whether you are in a big company or you are the only network administrator.

The Plan:
1: Verify the problem
2: Identify the problem
3: Save the users
4: Correct the problem

This plan is easy to remember and it's the shortest path to dealing with any network problem. I'll go through the steps, what they mean and why they are in this order.

Users are notoriously bad at describing what the problem is. You get users saying "I can't connect to X" or "My network is down". This can be anything from a bad cable to a non-responsive DNS server. You need to ask specific questions that verify what the user is saying, or you need to be able to reproduce the problem yourself. It is absolutely imperative to verify the problem as the first step. If you don't know where to look, conflicting messages will have you running in different or wrong directions.
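As a rough illustration of reproducing "I can't connect to X", here is a minimal Python sketch that separates a name resolution failure from a failed TCP connection. The function name check_target and the shape of the result are my own assumptions for this example, not a standard tool, and a real check would depend on what X actually is.

```python
import socket


def check_target(host, port, timeout=3.0):
    """Try to reproduce "I can't connect to X" in two separate steps."""
    result = {"host": host, "port": port, "dns_ok": False, "tcp_ok": False, "error": None}

    # Step 1: does the name resolve at all?
    try:
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"name resolution failed: {exc}"
        return result
    result["dns_ok"] = True

    # Step 2: can we open a TCP connection to any of the resolved addresses?
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            result["tcp_ok"] = True
            result["error"] = None
            return result
        except OSError as exc:
            result["error"] = f"connection failed: {exc}"
    return result
```

Running the same check from your own desk and from a machine near the user tells you whether you are looking at a naming problem or a connectivity problem, which is exactly the kind of fact you want before moving on.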

Once you have verified what the users are experiencing, you need to identify the problem. Identifying the problem is very different from identifying the cause of the problem, even though identifying the problem sometimes also reveals the cause. For instance, knowing that the DNS server is down is different from knowing that there has been a power cut in the server room. One way of identifying a problem is to look for common denominators. If you haven't verified the problem at this stage, you don't have the common denominators, or you might have some false ones. Common denominators could be geography, providers, equipment and so on.
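Looking for common denominators can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the report fields (site, provider, switch) and the 80 percent threshold are made up, and you would use whatever attributes you actually collect when verifying reports.

```python
from collections import Counter


def common_denominators(reports, min_share=0.8):
    """Return (attribute, value, share) for values seen in at least min_share of reports."""
    if not reports:
        return []
    counts = Counter()
    for report in reports:
        for attribute, value in report.items():
            counts[(attribute, value)] += 1
    total = len(reports)
    shared = [
        (attribute, value, count / total)
        for (attribute, value), count in counts.items()
        if count / total >= min_share
    ]
    # Most widely shared attribute values first.
    return sorted(shared, key=lambda item: item[2], reverse=True)


# Hypothetical verified reports; the field names are invented for this example.
reports = [
    {"site": "Building A", "provider": "ISP-1", "switch": "sw-a1"},
    {"site": "Building B", "provider": "ISP-1", "switch": "sw-b2"},
    {"site": "Building C", "provider": "ISP-1", "switch": "sw-c1"},
]
print(common_denominators(reports))  # [('provider', 'ISP-1', 1.0)]
```

Here the only thing all three reports share is the provider, so that is where to look first. Feed it unverified reports and you get the false common denominators mentioned above.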

When you have identified the problem, you can start saving the users. You can reroute them if all connections going through a specific node are down, or you can send out an emergency generator if the power is down, and so on. You might not be able to save all the users, but save the ones you can. This takes a lot of the stress away and leaves you time to work on the last step.

The last step is of course to correct the problem. You should now know what is going on and where, and you should be under as little pressure as possible to get the users back online. If you haven't identified the cause of the problem in the previous steps, now is the time.

You might often be tempted to start at step 3 or 4, because they seem the obvious place to start. You must discipline yourself to stick with the plan. Experience has shown me numerous times that this is the shortest path to the solution.

Other tools
Whether you are alone or part of a team, there are some organizational tools you need to help you along. You need a coordinator and a communication plan. If you are alone, you are the coordinator, but if you're in a team, it is imperative that no one is in doubt as to who is coordinating the work. Otherwise you end up in a situation where all the technical people delve deep into their computers, each trying to solve exactly the same problem, without communicating with each other.
The organization you're in determines who will be the coordinator. Normally it starts with the person who receives the first error report and is passed on "upwards" to some leader. The important thing is that no one is ever in any doubt as to who is coordinating the work, that anyone can get hold of the coordinator, and that all external communication goes through this coordinator. Having the communication go through one person gives everybody else working on the problem the time and space they need to solve it quickly. It also eliminates conflicting information, since there is only one source.

Speaking of communication: it is extremely frustrating to be on the "outside" of a network emergency, either as a user or as a non-technical manager. But communication is also important between the people trying to solve the problem. If you don't communicate your findings or the fixes you have made to everybody else, chances are that someone else is doing exactly the same thing as you and wasting their time.

Here is a plan for handling information. You can adapt it to your needs:
1: Information is in one place only
2: New information relating to the plan above must be made available in this place
3: The coordinator is responsible for external information
4: The coordinator ensures that important external people are informed
5: Keep receiving user reports and give the users the latest external information
6: When the problem is solved, inform any involved external people about this
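If you want to see how little is needed to follow these rules, here is a small Python sketch of a single shared incident log. It is only an illustration under my own assumptions: the class name IncidentLog, its fields and the example names are invented, and in practice a shared document, a ticket or a chat channel can play the same role.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    """One place for all information about an ongoing emergency (rules 1 and 2)."""
    coordinator: str
    entries: list = field(default_factory=list)

    def add(self, author, text, external=False):
        # Rule 3: only the coordinator publishes external information.
        if external and author != self.coordinator:
            raise PermissionError("external updates go through the coordinator")
        self.entries.append({
            "time": datetime.now(timezone.utc),
            "author": author,
            "text": text,
            "external": external,
        })

    def latest_external(self):
        # Rule 5: what to tell a user who reports the problem right now.
        for entry in reversed(self.entries):
            if entry["external"]:
                return entry["text"]
        return None


# Hypothetical usage with made-up names.
log = IncidentLog(coordinator="alice")
log.add("bob", "DNS server in the server room is not answering")
log.add("alice", "We are investigating a name resolution outage", external=True)
print(log.latest_external())
```

Everybody writes their findings to the same log, but only the coordinator's external updates are what the people answering the phones pass on.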

I hope these tools can help you be better prepared for your next network emergency.