Saturday, August 24, 2013

Network emergencies

Sometimes things go wrong
Even when you've built a good network, there is always something outside your control that happens unexpectedly and that you could not have foreseen. And when things go wrong, they go horribly wrong.

Try to think of the last time you had an emergency in your network. Initially nobody understands what has happened or where. Everybody is guessing and working frantically to figure out what's going on and how to fix it. The people who have to talk to the network users are desperately trying to get any information they can pass on, and in doing so they interrupt your work and your line of thought. Everybody is on edge and frustrated.

This way of "dealing" with emergencies is unproductive, frustrating and inefficient. If you or your company don't have a simple, easy-to-remember plan for dealing with emergencies, this pattern is going to repeat itself every time there is an emergency. If you have ever taken a first aid course, you know a simple, easy-to-remember plan that will take you through an emergency situation: 1 Stop the accident; 2 Give life saving first aid; 3 Call for help; 4 Give normal first aid. You need something similar for network emergencies.

In this post I'll try to give you my idea of a plan. This should work regardless of whether you are in a big company or you are the only network administrator.

The Plan:
1: Verify the problem
2: Identify the problem
3: Save the users
4: Correct the problem

This plan is easy to remember and it's the shortest path to dealing with any network problem. I'll go through the steps, what they mean and why they are in this order.

Users are notoriously bad at describing what the problem is. You get users saying "I can't connect to X" or "My network is down". This can be anything from a bad cable to a non-responsive DNS server. You need to ask specific questions that verify what the user is saying, or you need to be able to reproduce the problem. It is absolutely imperative to verify the problem as the first step. If you don't know where to look, conflicting messages will have you running in different or wrong directions.

Once you have verified what the users are experiencing, you need to identify the problem. Identifying the problem is very different from identifying the cause of the problem, even though identifying the problem sometimes also reveals the cause. For instance, knowing that the DNS server is down is different from knowing that there has been a power cut to the server room. One way of identifying a problem is to look for common denominators. If you haven't verified the problem at this stage, you don't have the common denominators, or you might have some false ones. Common denominators could be geography, providers, equipment and so on.
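As a rough illustration of looking for common denominators, here is a minimal Python sketch. It assumes the verified user reports have been written down as simple records; the field names (site, provider, switch) and the 80% threshold are made up for the example and are not part of the plan itself.

```python
from collections import Counter

def common_denominators(reports, min_share=0.8):
    """Return (attribute, value) pairs shared by at least min_share of the reports."""
    counts = Counter()
    for report in reports:
        for key, value in report.items():
            counts[(key, value)] += 1
    threshold = min_share * len(reports)
    return sorted(pair for pair, n in counts.items() if n >= threshold)

# Hypothetical verified reports; the field names are illustrative only.
reports = [
    {"site": "Copenhagen", "provider": "ISP-A", "switch": "sw-01"},
    {"site": "Copenhagen", "provider": "ISP-A", "switch": "sw-02"},
    {"site": "Aarhus", "provider": "ISP-A", "switch": "sw-07"},
]

print(common_denominators(reports))  # [('provider', 'ISP-A')]
```

In this made-up case every affected user shares the same provider, so that is where to look first. Unverified reports would pollute the counts, which is why verification comes before this step.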

When you have identified the problem, you can start saving the users. You can reroute them if all connections going through a specific node are down, or you can send out an emergency generator if the power is down, and so on. You might not be able to save all the users, but save the ones you can. This will take a big load of stress away and leave you time to work on the last step.

The last step is of course to correct the problem. You should now be in a situation where you know what is going on and where, and you should also be under the smallest possible amount of pressure to get the users back online. If you haven't identified the cause of the problem in the previous steps, now is the time.

You might often be tempted to start at step 3 or 4, since they seem the obvious place to start. You must discipline yourself to stick with the plan. Experience has shown me numerous times that this is the shortest path to the solution.

Other tools
Whether you are alone or part of a team, there are some organizational tools you need to help you along. You need a coordinator and a communication plan. If you are alone, you are the coordinator, but if you're in a team, it is imperative that no one is in doubt as to who is coordinating the work. Otherwise you end up in a situation where all the technical people delve deep into their computers, each trying to solve exactly the same problem, without communicating with each other.
The organization you're in determines who will be the coordinator. Normally it starts with the person who receives the first error report and is passed on "upwards" to some leader. The important thing is that no one is ever in any doubt as to who is coordinating the work, that anyone can get hold of the coordinator, and that all external communication goes through this coordinator. Having the communication go through one person gives everybody else working on the problem the time and space they need to solve it quickly. It also eliminates conflicting information, since there is only one source.

Speaking of communication: it is extremely frustrating to be on the "outside" of a network emergency, either as a user or as a non-technical manager. But communication is also important between the people trying to solve the problem. If you don't communicate your findings or the solutions you have made to everybody else, chances are that someone else is doing exactly the same thing as you and wasting their time.

Here is a plan for handling information. You can adapt it to your needs:
1: Information is in one place only
2: New information relating to the plan above must be made available in this place
3: The coordinator is responsible for external information
4: The coordinator ensures that important external people are informed
5: Keep receiving user reports and give the users the latest external information
6: When the problem is solved, inform any involved external people about this

I hope these tools can help you be better prepared for your next network emergency.

Monday, June 17, 2013

Refraction


Actually I was going to write a different post, but then Edward Snowden chose to show the World what the people with tinfoil hats have been saying all along.

After the revelation of PRISM, many people have hit the keyboard suggesting to companies and individuals what they can do to protect their privacy online. The suggestions that are appearing range from the well considered to the ridiculous.
I've seen articles on IT news sites that basically suggest you pull your data home from the cloud and tighten security for employee usage of computers. These people haven't really grasped the extent of PRISM and see it as an opportunity to impose even more production-restricting rules on their employees. This will in the best case do nothing to ensure privacy and in the worst case send even more data in the direction of the NSA (see my other post).
Other people have sensibly suggested that you encrypt your traffic, use non-centralized services and hide your identity online.
These are all great suggestions, but I'm afraid that if people don't understand the scope of privacy, they might get caught in a trap of false security. A bit like locking your car with the windows open.

That's why I'm writing this post to give you a framework of privacy. Something that can help you understand the risks, understand which privacy tools do what and help you decide which options are right for you. I want to use PRISM to refract privacy and break it up into its constituent components.

This framework is abstract. It's simple and it gives you a way of thinking about online communication. It does not provide any definite solutions, although I will try to give you some at the end.

Framework
For any online communication, there are four kinds of participants:

The initiator
The transporter
The facilitator, and
The receiver

There can be multiples of each and the facilitator isn't always present.

The initiator is the person, software or device which initiates a communication. I deliberately don't use the word "sender", because a device sending a reply to a communication is not an initiator.

The transporter is the device, software or company, which transports your communication to either the facilitator or receiver, or both.

The facilitator is a device, software or company facilitating the kind of communication that is going on. Usually on the application layer or higher.

The receiver is the person, company, software or device receiving the communication.

To give you an idea of what I'm talking about I'll give you an example: You send an email to some friends.

The initiator is you or your email client, depending on your frame of reference. The transporters would be your ISP, their interconnect provider and your friends' ISPs. The facilitators are your SMTP server and their IMAP server. The receivers are your friends and whoever they have hanging out around their computers at the time they receive the mail.

I hope this example gave you an understanding of the framework. Remember, it's abstract, so you can view this in any way or at any level you want.
The initiator could be you, your employees, the spyware on your computer, your operating system and many more. The transporters are the ISPs, mobile operators, wireless providers, networking equipment, the office network, anything that moves your communication. If your communication isn't directly with the receiver, the facilitator is anything that makes it possible, like an FTP server, Facebook, YouTube, a web server, some cloud service, your VoIP operator or your cellular operator facilitating an SMS service. The receiver can be anyone you intended to view or read your communication and anyone they share it with. This can be someone you're chatting to, or all the people who viewed what you put on Pinterest.

How to use the framework
If you want to secure your privacy, you need to secure it with all participants. This is where most people leave the windows of the car open. They might be encrypting their traffic all the way to the FTP server, but then overlook that the server itself has been compromised.
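To make the "all participants" check concrete, here is a small Python sketch of the framework. The class name, the `protected` flag and the example participants are my own illustrative assumptions; the framework itself only defines the four roles.

```python
from dataclasses import dataclass

ROLES = ("initiator", "transporter", "facilitator", "receiver")

@dataclass
class Participant:
    name: str
    role: str
    protected: bool  # True if you trust it or some privacy measure covers it

def open_windows(participants):
    """Return the names of participants that no privacy measure covers."""
    for p in participants:
        if p.role not in ROLES:
            raise ValueError(f"unknown role: {p.role}")
    return [p.name for p in participants if not p.protected]

# Hypothetical FTP upload, matching the example in the text.
ftp_upload = [
    Participant("your FTP client", "initiator", True),
    Participant("your ISP", "transporter", True),      # traffic is encrypted in transit
    Participant("FTP server", "facilitator", False),   # the server itself may be compromised
    Participant("colleague downloading the file", "receiver", True),
]

print(open_windows(ftp_upload))  # ['FTP server']
```

Listing the participants like this for each kind of communication makes it obvious where the windows are still open, even when the transport is encrypted.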

When you start thinking about your communication with this simple framework, you easily see that implementing more restrictions on employees doesn't really help.

Solutions
Now I promised you some solutions. I'll provide a few here, but these will also be in the abstract. I'm not recommending any solution over another, since you have to decide which one best suits your situation and needs. Each solution can be implemented in many different ways.

Almost all solutions have a side effect, which you also must take into account. These include: Loss of bandwidth, loss of personalization, need for more computing power and loss of realtime backup. Just to mention a few.

Trust
Now you might not consider this a solution at all. I consider it the best solution of all, since it comes with no side effects at all. Unfortunately it is also the toughest to achieve. But if you can trust all participants in the communication, you have achieved 100% privacy. This is most likely never going to happen, but if you can trust one or two participants, you have already achieved a lot of privacy.

Hide
This is "the classic" solution to privacy. You encrypt your data, so no one other than the receiver can read it. While this goes a long way in privacy protection, your privacy is still vulnerable at the receiver. Furthermore, hiding your data might prevent transporters and facilitators from seeing it, but it does not prevent them from getting access to your metadata. They can still see who you are communicating with, when, how much and from where.

Anonymize
This is a really effective technique to secure your privacy. If you can disguise your identity (and maybe the receiver can too), you can prevent facilitators and transporters (and sometimes receivers) from collecting metadata on you. Anonymization can be done by relaying, proxying or tunneling your data between various places. But remember, a proxy or tunnel service is also another facilitator!

Overload
While technically not actual privacy, information or data overload makes it much more difficult for anyone eavesdropping on your communications to find the "real" communication, and it significantly raises the cost of doing so. This is an easy way to keep "small fish" out of your communications, but it won't do much to protect your privacy from the NSA, which is funded by the American taxpayers.

Decentralization
Another option is to cut out the facilitator completely. If there is no central server hosting your service, then it can't collect your private information. Many P2P technologies exist that allow you to share files, chat, transfer money etc. without a central facilitating server. One of the drawbacks is that you have no server to restore your data from.

All of the above
As I said in the beginning, different solutions apply to different parties in the communication. If you can combine the best ones of the above (and others that I've forgotten?) to suit your needs, you have a pretty good solution for your privacy.


So before you randomly start beefing up your privacy, take a look at it through this framework and see if your efforts really will make a difference.

Saturday, April 20, 2013

Plan B

Do you have a plan B?
Of course you do! You have redundant power, redundant equipment with redundant routing engines on redundant sites and redundant uplinks from different providers on different transport mediums. So you're covered. Then there is no need for you to read this.

As network architects and engineers we want to build resilient solutions that can sustain damage and problems occurring within our known paradigm. We all know the everyday things that can go wrong because we've seen them before. We know we can lose power, equipment can malfunction and cables can be disconnected. We know how frustrating these events can be and the "panic" that can erupt, not to mention the potential financial losses associated with them. We take them into account when we design a solution. This is natural and correct thinking based on experience. But is it a Plan B?

Without neglecting their importance, the resilient solutions are mainly in place to make our (working) lives more normal. We don't like to get a call at 3 o'clock on a Saturday morning, spend the rest of the weekend fixing stuff and then spend all Monday listening to our boss yelling. We see these events from within our own paradigm. We want to be sure that "our stuff" isn't broken, so that we don't have to go out and fix it. This is great, but it isn't a Plan B. A Plan B has to have a paradigm within the organization. It has to be able to handle disruptions outside of our control.

Staying within our own realm of knowledge and capabilities, what is the worst thing that could happen? What would bring down the entire organization to a point where it might not recover?
If you still come up with "power failure" give it 5 more minutes.

There is no standard answer to this question since all organizations are unique. But I'll try to give you some options to choose from.
- No Internet connection for 2 weeks.
- No access to a certain service for 2 weeks.
- Your customers are not able to get/access your service/product for 2 weeks.
- Your customers are not able to pay for your services for 2 weeks.
- No communication with the outside world for 2 weeks.
- Your non-Internet-connected systems are taken down for 2 weeks.
- A third party gets access to your (non-Internet-connected) systems without your knowledge.

Would any of these be more fatal for your organization than a 3 hour power failure?

"Yes, but that's not going to happen", I can hear you say - really?

Last month there was a so-called "Internet war" going on between CyberBunker and Spamhaus. Allegedly it was a 300 Gbps DDoS attack, even though I haven't seen any evidence to support this. Real or not, this is not an impossible scenario, so have you considered what would happen if your organization's network or services were in harm's way of such an attack for days or weeks?

In 2008 Pakistan Telecom accidentally blocked access to YouTube for most of Asia. The block was intended to be national only but affected two-thirds of the global Internet population. Not being able to access YouTube might be harmless for most organizations, but what if it was a different service - maybe PayPal? Also you need to remember that this was a simple misconfiguration of a router. Imagine someone doing a deliberate attack!

Many countries and regions around the world are dependent on one or two major uplinks. That is, their entire access to the rest of the Internet is dependent upon one or two physical cables or one or two uplink providers. During the Arab Spring we have seen several countries having their Internet access to the outside world cut off. This is possible when the uplinks are limited, either in the number of physical cables or in the number of providers. You might live in a relatively free country, but you might also be in a geographical region that is serviced by only one or two uplink providers. If those providers fail, you lose your connection. This can be a simple misconfiguration by the provider, like removing a route or an ASN.

There have been incidents in the US where people were not able to buy gas at certain stations that had no Internet connection. Not because the pumps at the station needed a connection to pump gas, but because all the payment systems (including cash payment) did not operate, since they relied on a connection.

Some people make secure networks, like for instance Europol, which are physically separated from the Internet in order to gain security and avoid any of the "bad stuff" that takes place on the Internet. This is a false sense of security. If you are running the same protocols that run the Internet, you are vulnerable to the same problems. Your infrastructure doesn't even have to be running TCP/IP to be hit. Stuxnet jumped to SCADA systems, which weren't connected to the Internet at all and weren't running TCP/IP.

As the Internet gets more complex, more services rely on it, and more governments and other organizations try to break or interfere with the basic protocols running it, the likelihood of events like the ones mentioned above increases.

What can you do?
Make a Plan B!
This isn't something you can do alone. This has to take place with people from your entire organization. And, as with so many things, the process of making the plan is far more important than the actual outcome.
You can't know in advance what might happen or when, but you can identify the most crucial spots in your organization, the ones that can receive damage beyond repair. Simply identifying them and knowing them is the most crucial part of the process.
Once you know them, you can think about how you can cope, should damage or disruptions ever occur. This doesn't have to be expensive. Look for the simplest and most low-tech or no-tech solution possible.
To get you started: Identify and examine the processes and services your organization relies on. Look around your organization. If there is no Internet connection, will you be able to use the phone? Receive payment from customers? Pay your employees? What happens if anyone gains access to non-Internet-connected parts of your infrastructure?

Once you have come up with a Plan B - test it! 
I can't stress this enough. If you do not test it, you don't know if it's going to work and you might be living with a false sense of security. Even if it is as basic as installing a backup power system - pull the plug on a live environment (don't do this in peak business hours) and verify that it works.

Tuesday, March 5, 2013

The networks you don't want

This might be a bit of old news, but hopefully you can avoid some troubleshooting headaches with this post.
As the shortage of IPv4 addresses increased, IANA reclassified several /8 networks from "Reserved" to "Unallocated" around 2008 and started allocating them to RIRs in the following years. In the last year or two the RIRs have begun to allocate them to their LIRs, and the LIRs have begun to allocate them to their users.
The networks in question are 1.0.0.0/8, 2.0.0.0/8, 5.0.0.0/8, 23.0.0.0/8, 27.0.0.0/8, 31.0.0.0/8, 36.0.0.0/8, 37.0.0.0/8, 39.0.0.0/8, 42.0.0.0/8, 100.0.0.0/8-113.0.0.0/8, 173.0.0.0/8-185.0.0.0/8, 197.0.0.0/8 and 223.0.0.0/8.

This has caused some issues on the Internet, which can be divided into 3 categories: "Existing traffic", "Other uses" and "Security".

The issue with existing traffic was investigated when APNIC was assigned the 1.0.0.0/8 network. Before they started assigning it to their LIRs, they found out that on average 160 Mbps of traffic to this network existed, with bursts of up to 850 Mbps. Now, one might think that there is always some traffic to any unallocated network on the Internet, so they also made a benchmark against one of their existing unallocated networks. For this network an average of 10 Kbps of traffic existed, with hardly any traffic above 100 Kbps. It's not clear what the traffic to 1.0.0.0/8 is, but it's most likely leakage from private networks.

The second issue, "Other uses", became very real for users of Hamachi in 2012. They were using the 5.0.0.0/8 network for their VPN service. Once real 5/8 addresses started showing up on the Internet in June 2012, this service stopped working properly and users had to change their network setup.

The final issue is the one found in the security of some network devices like routers and firewalls. Many devices have a bogon list, which automatically filters out unwanted addresses. These are either blocked or not routed. When the bogon list isn't up to date, the devices block legitimate traffic.

The biggest issue of the above is the last one. This can potentially block users from accessing certain sites on the Internet, which in turn gives the user a bad experience and generates less traffic on the site.

What can you do?
Whether or not you have been assigned these addresses, you should check your own network equipment, especially firewalls, for bogon lists. Most equipment comes with a predefined list from the vendor and most vendors update their lists with new software releases. Even if you update the software, you should check against the currently allocated addresses, which can be found on the IANA website. Any lists you have created yourself must always be kept up to date with the IANA list. This shouldn't be a big task, since IANA has no more addresses to allocate.
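Checking a bogon list by hand is tedious, so here is a minimal Python sketch of the idea using the standard `ipaddress` module. The `NOW_ALLOCATED` list only contains a handful of the networks mentioned in this post and the firewall list is invented; for real use you would build the list from the current IANA registry instead.

```python
import ipaddress

# A few of the formerly reserved /8 networks named in this post (not a complete list).
NOW_ALLOCATED = [
    ipaddress.ip_network(n)
    for n in ("1.0.0.0/8", "2.0.0.0/8", "5.0.0.0/8", "23.0.0.0/8", "223.0.0.0/8")
]

def stale_bogon_entries(bogon_list):
    """Return bogon entries that overlap address space that is now allocated."""
    stale = []
    for entry in bogon_list:
        net = ipaddress.ip_network(entry)
        if any(net.overlaps(allocated) for allocated in NOW_ALLOCATED):
            stale.append(entry)
    return stale

# Hypothetical filter list exported from a firewall.
firewall_bogons = ["10.0.0.0/8", "5.0.0.0/8", "192.168.0.0/16", "1.0.0.0/8"]

print(stale_bogon_entries(firewall_bogons))  # ['5.0.0.0/8', '1.0.0.0/8']
```

Any entry this reports is blocking legitimate traffic and should be removed from the filter. The private ranges (10.0.0.0/8, 192.168.0.0/16) are correctly left alone.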
If you have been assigned previously reserved address space, you should use it carefully. Before you even start using the addresses, try to assign them to a PC and dump the incoming traffic. If you get a lot of traffic (above 100 Kbps), you should contact the ISP or upstream that is sending you the traffic and tell them that someone is using your IP space.
Wherever possible you should use these addresses for infrastructure. Try to avoid using them where end-users would get them assigned, either directly or indirectly, on endpoints. You could use them in your VoIP infrastructure but should avoid using them on PCs or for NAT. If you still need to use them for these purposes and you encounter sites which you can't access, the only option you have is to contact that site and ask them to check their network setup for bogon lists, since this is the most likely cause of the problem. Unfortunately some sites use low quality firewalls, where the vendor doesn't update the bogon list or where the administrator can't turn it off or doesn't know how to. It can often be difficult to explain the problem to these administrators, and the best argument you can use is that they are losing out on a lot of traffic, because they are unintentionally blocking millions of users.
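If you want to compare a traffic dump against the 100 Kbps rule of thumb, the arithmetic can be sketched in a few lines of Python. This assumes you have already extracted a timestamp and a size for each captured packet; the function name and the sample values are hypothetical.

```python
def average_kbps(packets):
    """packets: list of (timestamp_seconds, size_bytes). Returns the average rate in Kbps."""
    if len(packets) < 2:
        return 0.0
    times = [t for t, _ in packets]
    duration = max(times) - min(times)
    if duration <= 0:
        return 0.0
    total_bits = sum(size for _, size in packets) * 8
    return total_bits / duration / 1000

# Hypothetical capture: 250,000 bytes received over 10 seconds.
capture = [(0.0, 100_000), (5.0, 100_000), (10.0, 50_000)]

rate = average_kbps(capture)
print(rate, rate > 100)  # 200.0 True
```

A short capture like this one says little on its own. Measure over a longer period before you contact anybody upstream.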



Sunday, February 17, 2013

The standard user

In my last post I wrote about a breach in security at Europol. A network engineer stored a backup of device configurations on his private NAS, which wasn't protected and was publicly accessible from the Internet. The configurations contain the passwords for the devices. The focus of my last post was on the general lack of security implemented by users, but that was not what shocked me the most about this incident. What I found disturbing was that the devices are configured with a standard user and password.

I've seen this in networks countless times: every device is configured with the same standard user and password. When someone needs access to the devices, they are given the username and password, and security is left to the integrity of individuals. It seems to be the general rule that networks are configured this way.

This is wrong in so many ways. What happens, when a user no longer should have access to the devices? This can happen either when the user no longer is employed by the company or is assigned a different job role within the company. What if an external consultant needs access to a single device? He will now have the password for all devices and have access even when he isn't working as a consultant for the company any more.

Some people argue, that their devices are only accessible from within their network and that they have protected the network with a firewall. This doesn't prevent an employee, who changes roles within the company, from keeping their access or sharing it with other employees. It does not prevent an external consultant from gaining access to all devices. Nor does it keep track of who logged in to a device or changed its configuration.
Have you ever thought about what to do if the password is compromised? How many devices do you need to log into to change it? How do you know that you didn't miss a device? How will you redistribute the new password to all the users who need access?
What about the firewall protecting your network? How do you log into that? Does it also use a standard user? Maybe it's a different one and the passwords are only known to a select few, but that doesn't change the fundamentals of the problem.
You wouldn't implement a standard user on your computers, why would you do it on your network devices?

The solution to this is extremely simple and virtually free. Just like computers authenticate against Active Directory or LDAP, almost all network devices support a central authentication service, normally RADIUS or TACACS+. With an authentication server, every user gets their own username and password. You can group users to limit access to certain parts of your devices and limit access to certain devices. When a user's access rights change or a user should no longer have access to the network, it is simple to change or delete the user on the server so that the changes immediately take effect in the network. When users have individual logins, you can keep track of which users log in to what devices and of the changes they make.
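The following Python sketch is not RADIUS or TACACS+. It is only a toy model of what a central authentication server gives you: individual users, groups that limit which devices they can reach, one place to revoke access, and a log of who logged in where. All user, group and device names are invented.

```python
# Hypothetical central database: user -> group, group -> devices the group may access.
USERS = {"alice": "core-team", "bob": "access-team"}
GROUPS = {
    "core-team": {"core-rtr-1", "core-rtr-2", "fw-1"},
    "access-team": {"access-sw-1", "access-sw-2"},
}
AUDIT_LOG = []

def login(user, device):
    """Allow the login only if the user's group covers the device; log every attempt."""
    allowed = device in GROUPS.get(USERS.get(user), set())
    AUDIT_LOG.append((user, device, allowed))
    return allowed

def revoke(user):
    """Removing one central entry cuts the user's access to every device at once."""
    USERS.pop(user, None)

print(login("alice", "fw-1"))   # True
print(login("bob", "fw-1"))     # False
revoke("alice")
print(login("alice", "fw-1"))   # False
```

Compare this with the standard user: there, revoking access for one person means changing the password on every device and handing the new one out to everybody else.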

While there are commercial servers available, both RADIUS and TACACS+ servers can be downloaded and used for free. FreeRADIUS is, as the name implies, a free RADIUS server. It even comes with the Dialup Admin web interface for easy administration. The TACACS+ daemon from Shrubbery Networks is a free TACACS+ server. Neither of these is resource intensive, so you don't need a dedicated server to run them.

The short time you invest in setting up proper authentication for your network devices will be repaid with peace of mind and easy administration.

Monday, February 11, 2013

Convenient security

I recently saw a documentary about security. Specifically about the new e-printers, universally accessible NAS drives and security cameras. According to the documentary some 7,000 e-printers, 14,000 NAS drives and thousands of security cameras are publicly accessible without a password over the Internet. And they only investigated one vendor of each product! They copied forgotten passports in scanners, accessed passwords for a secure network at Europol (stored on a NAS) and turned off the security cameras in a shop. They estimated that about 80% of all security breaches are caused by users, not attackers!



Who is to blame?
The documentary aimed at placing the blame with the producer of the devices. I don't agree with that. If you (the user) don't bother to figure out how to lock your car and leave it open, you don't blame the producer. If a network engineer makes a backup of network configurations on his personal NAS and doesn't secure it, so that Europol's passwords are freely accessible on the Internet, you can't blame the NAS producer. At least a network engineer should know better.

Why don't users secure their devices?
Unless they are working with security, users generally will choose convenience above security. Contrary to what some employers think, most employees actually want to do their job. When faced with security barriers at work, both implementations and regulations, most employees will work around them if they seriously prevent them from getting their work done.

This isn't just true for users in big enterprises, it is also true for SOHO users, who might not have an IT department. They purchase IT equipment that will make their work easier, not increase security. You can think of this in another way. When was the last time you checked the security features of a car you purchased? I'm not talking about the airbags. Did you ever check the quality of the locks or car alarms? Do you know how easily people can steal your car? If you're like me, you probably don't know these things. You assume that when you lock your car it's "safe" and you buy insurance. You look at other features like mileage, cruise control, Bluetooth adapters and so on. Why should this be different when people purchase IT equipment? They look at storage capacity, wireless connectivity, Internet accessibility and so on.

Even IT professionals will choose convenience over security. I can give you some real life examples.
Many years ago, I was a volunteer in an organization where, among other things, we needed to publish a booklet. The organization had computers available for the volunteers and these were locked down rigorously by a security obsessed IT administrator (he later started a security company). You couldn't even install a new font on these machines. They were impossible to work on. So I disconnected a PC and plugged my own laptop in (not many people had laptops back then). I now had access to the entire network, printers and everything. Now we could create and publish our booklet. Another more recent example was when I worked at a company that, like many, had a 20 MB e-mail attachment limit. I needed a vendor to send me materials above 20 MB. How to do that? I called the IT dept. and it turned out the company had a large file exchange portal. I needed to create the vendor as a user on the portal and send him a link where he could log in. He could then upload the file, and I could go to an internal page and download it...or I could just ask him to send it to my private mail account. You guess which option I chose.

So how come companies spend a lot of money on security and we still have these problems?
It's the way companies view security. It's a paradigm of "washing hands". The CTO asks the IT department "are we secure?" The IT dept. says "yes, we have implemented the best firewalls and the strictest security policies for the users". If there is a security breach, the IT dept. can say someone broke the policies. The CTO can say we have done everything within our available resources (i.e. we need more money) and so on. Now everybody has washed their hands.

The users pose the biggest security threat to themselves and the network. We need a strategy to deal with that. Unfortunately, current strategies reflect the existing paradigm of "washing hands".

In general two strategies have been used to increase security. One has been to increase security measures and the other has been to educate users. I believe neither can achieve the goal.

"The more you tighten your grip, the more star systems will slip through your fingers."
- Princess Leia

This is exactly what happens when you tighten security. At some point employees will feel that security is something they need to overcome in order to do their work, and once that happens it is virtually impossible to protect against widespread security breaches inside the network. If the only tool you have is tighter security, you increase the problem every time you try to fix it.

Other people advocate that we should educate the users, so that they understand the security risks and know which precautions they must take. While it is always good to explain to the users, why certain security measures are in place, I don't believe that we can educate our way out of the problem. Education requires one main ingredient: The desire to learn. If the users don't have any interest in learning about security, the education will be a waste of time.

So what can we do?
In an age of BYOD, I believe that we should give the users what they want - convenience! If we want to secure users and our networks, we need to make the users work for us, by working for the users. Let me give you an example.
If we have a big company with a large IT department, we could supply the employees with all their IT needs. If an employee would like a NAS, he could buy it from the IT department, who would then set it up and support it. If the price is the same as in the shops, he's getting an additional service (support) for free, and the benefit for the company is that we can secure the device and know that any data stored on it is safe.
This is just an example, but if we start thinking in this way, we can find solutions that work in our particular situations. We need to think of the users as part of the solution, not part of the problem.

Monday, January 28, 2013

ISND, PART 2

Following up on my previous post on ISND (International Standard of Network Documentation). This post is about how you can do good network diagrams.
But first I have to correct myself, based on the feedback I got on my previous post.
I wrote that "builders don't get a 10 page report describing how the house should be built" and have been corrected. In the construction business, there is written documentation. There is documentation on what materials to use, which codes to follow and general descriptions. If you include the public codes, there's a lot of written documentation.
We are lucky in that our industry doesn't have public codes regulating how networks should be built. There are "codes" in the private domain, which a customer may refer to when requesting a solution but no public ones.
So to correct myself, I'm not suggesting that you don't produce any written documentation at all!
You should.
But you need to separate what goes into your written documentation and what goes into your diagrams. The diagram is your blueprint, your building instruction: it tells you what to build. Your written documentation is the detail: it tells you how to build it.

Since this post is about diagrams, I've decided that I might make a follow-up on this one, to talk about the written documentation.


To recap from my last post, I stated that you need to make different diagrams for different purposes. Here I will give you some examples of how to do that. This is obviously not going to be a complete standard. I don't have the time and space to write that here, and you couldn't be bothered to read it. Instead I will try to give you an idea of the concepts you should be using and which ones to avoid, with some examples to get you started.

Keeping the architect's drawings in mind, we start by dividing our network diagrams into two categories. The design sketch is our High Level Design (HLD), and the technical drawings are our Low Level Designs (LLDs).

The purpose of the HLD is to convey an idea, a vision, to a client. It shows the big picture. These are the things you sketch on a whiteboard using your classic iconography, and most of you should be familiar with this. But remember: Your HLD contains no technical information. Don't be tempted to add IP addresses, VLAN numbers and so on. Your HLDs should be readable on A3 size paper, so split them up. Just like the architect has sketches of different parts of the house, you too should have diagrams of different parts of the network. When you do HLDs you must think about your client. He should be able to understand them and recognize the elements that he has requested for the design.

The audience of your LLD is the people who are going to build the network or extend it in the future. Here you must remember to keep everything in diagrams. Often people attach a text to the diagrams explaining how it works. This should not be necessary if you do your diagrams correctly and put the detailed information into the written documentation.

In your LLD you always need to make diagrams for your Layer 1, Layer 2 and Layer 3. In addition, you can make other functional diagrams that are needed for your design, such as rack diagrams, power diagrams, routing diagrams, security diagrams and so on. Which ones are required depends on your design. You should be in no doubt as to where to put what information.
Remember to put in a legend explaining your symbols.
Any information you need to add should be drawn on the elements as basic technical information, such as IP addresses, device names and so on. If you need to reference detailed information in the written documentation, do that through the legend as well.
In the LLD you need to get away from the idea that you can depict a device as a single icon with lines coming out of it. You use those for your HLD. There is so much going on inside and between devices that you need to make them big. Use icons in the top corner to show what kind of device it is, but make lots of room.
Don't be afraid to put devices inside devices. Use decent sized interfaces that can contain information. Put interfaces inside interfaces if you need to. Use an interface shape that clearly shows ingress and egress. When putting devices, interfaces or areas inside each other, think about which element is a part of something bigger.
If you find yourself typing the same information over and over again, change it to a symbol and put it in the legend.
Use combinations of color, shading, corners, line types and thickness as symbols.
Don't be afraid to use the space you need. Have you ever seen a blueprint for a house that fits on a piece of A4?
If your design becomes crowded, you need to group things together and put them in separate diagrams. When you do this, you need to be consistent in your grouping. Make a rule, like: "If there are more than 3 devices doing the same thing connected to another device, they will be grouped."
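
A grouping rule like this is easy to apply consistently if you write it down as a small check. The following is only a minimal Python sketch of the example rule above. The function name, the tuple layout and the label format are assumptions made for illustration, not part of any standard.

```python
from collections import defaultdict

# Threshold from the example rule: "more than 3 devices".
GROUP_THRESHOLD = 3


def group_devices(links, threshold=GROUP_THRESHOLD):
    """Apply the grouping rule to a list of devices.

    links: iterable of (device_name, role, parent_name) tuples, where
    role describes what the device does and parent_name is the device
    it connects to (hypothetical layout, chosen for this sketch).

    Returns a list of diagram elements: individual device names, or one
    group label for every set of same-role devices on the same parent
    that exceeds the threshold.
    """
    buckets = defaultdict(list)
    for name, role, parent in links:
        buckets[(role, parent)].append(name)

    elements = []
    for (role, parent), names in buckets.items():
        if len(names) > threshold:
            # Too many to draw individually: collapse into one group.
            elements.append(f"{role} x{len(names)} @ {parent}")
        else:
            elements.extend(sorted(names))
    return elements
```

With four access switches and one firewall on the same core device, the switches collapse into a single group element and the firewall stays on its own. The grouped devices then get their own separate diagram.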

Here is an overview of what could go into various diagrams and how you can depict it. It's not complete, but from the context you should be able to figure out what information goes where and how you can depict it.

Layer 1 diagrams should depict everything physical. You want to show how it's wired together. This could contain your:
Devices: Depict the type as a symbol and the name as text. Use a distinct shape for devices, like a rectangle.
Physical interfaces: Depict the media as a symbol and the name as text. If you use special MTU, duplex and speed, depict that with color and numbers; otherwise put it in the legend.
Cables: Use colors for types, and color the ends of the line differently to show subtypes.
Locations: Group things together in their physical location. Use a distinct border shape for locations.

Layer 2 diagrams should depict your domains, regardless of what they traverse. The reader should be able to follow a domain everywhere it goes, inside and outside of devices. The reader also needs to know the transport method of your domain: is it a VLAN, an L2 VPN or a (virtual) switch? This could contain your:
Devices: Any device, physical or virtual, see Layer 1.
Interfaces: Physical and logical interfaces, use iconography to represent the type (LAG, redundant ethernet and so on). Place interfaces on the edge of your device if they interconnect with other devices, or on the inside of your device if they exist only logically there.
VLANs: Inside and outside your devices. Different VLANs can be shown by color. If you need to represent both S-VLANs and C-VLANs together, place a thick S-VLAN at the interface and split it into several C-VLANs on the interconnecting part. Use a distinct line type for VLANs. You can put the ID either in the legend or in the middle of the line.
Spanning Tree: Interfaces can be marked with different states, either by color or symbols.
MPLS: Isn't technically layer 2, but is required to represent other layer 2 domains. MPLS can be represented as a virtual device surrounding the MPLS area, with the MPLS interfaces connecting through that area. Use a distinct line type and distinct shading and shape for your area.
L2 Circuits: Can be represented as a virtual device with two interfaces inside the MPLS device described above. You can then interconnect these interfaces with their real interfaces on the physical devices.
L2 VPN: Can be represented in the same way as L2 circuits, as its own device inside the MPLS device, with VLAN representation, just like on physical devices.

Layer 3 diagrams should depict your subnets and only elements participating on the IP layer. The reader should be able to see how to get from one subnet to another. This could contain your:
Devices: Only layer 3 devices, logical or physical
Interfaces: Logical and physical, but only layer 3 interfaces.
Subnets: Use cloud or bus depictions. Put the subnet in CIDR format there. The connecting interfaces should only have the remainder of the IP addresses displayed. If you use a bus depiction, place the gateway at the end of the bus or mark it out (in bold) on a cloud depiction.
L3 VPN: Are represented as devices inside the MPLS device described in Layer 2 in the same manner as L2 VPNs.
VRRP: Can be represented as a logical interface without a device. Connect one side to the participating interfaces and the other to the subnet.
DHCP: Put a logical device with your DHCP information containing the appropriate interfaces inside the physical device.
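
To illustrate the subnet labelling described above, here is a minimal Python sketch using the standard `ipaddress` module. It is IPv4 only, and it assumes that "the remainder" means the octets that are not completely fixed by the prefix. That interpretation, and the function name `interface_label`, are assumptions for this example.

```python
import ipaddress


def interface_label(subnet_cidr: str, interface_ip: str) -> str:
    """Return the short label to draw on an interface.

    The subnet itself is written once (in CIDR format) on the cloud or
    bus, so the interface only needs the octets that the prefix does
    not fully determine.
    """
    network = ipaddress.IPv4Network(subnet_cidr)
    address = ipaddress.IPv4Address(interface_ip)
    if address not in network:
        raise ValueError(f"{address} is not in {network}")

    # Number of whole octets fixed by the prefix; always keep at least
    # the last octet so the label is never empty.
    fixed_octets = min(network.prefixlen // 8, 3)
    octets = str(address).split(".")
    return "." + ".".join(octets[fixed_octets:])


if __name__ == "__main__":
    print(interface_label("192.168.10.0/24", "192.168.10.5"))
    print(interface_label("10.0.16.0/20", "10.0.17.5"))
```

For a /24 the label is just the last octet, while a /20 needs the last two octets to be unambiguous. The check that the address actually belongs to the subnet also catches typos before they end up in a diagram.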

These are the three basic diagrams that you should always have. In addition, you can create functional diagrams to suit your purposes. Here are some examples.

Security diagrams should depict where and how security is handled. This example assumes you're dealing with a layer 3 firewall, but it can just as easily be applied to a layer 2 firewall. I know NAT is in here, which technically has nothing to do with security. If NAT is your only "security" element, you could include it in the Layer 3 diagram instead. The diagrams could contain your:
Devices: Any device performing a firewall/security function.
Interfaces: Participating in a firewall/security function.
Subnets: All subnets affected by a firewall/security function.
Zones: Represent these as colored cones. With the top of the cone placed on the interface creating the zone.
Policies: Create a logical container for policies inside the device. Connect it to the zones with arrowhead indicating the direction of the policy.
Rules: Create a logical container inside your policies or devices. Give it an icon indicating its action and place the affected ports/addresses in the container. If there is too much information to put here or in the legend, refer to the written documentation.
NAT: Static NAT is represented as a line with arrowheads at each end between the public and private IP addresses. Source NAT is represented as a line with an arrowhead at the NATing address, between the public IP and the subnet. Destination NAT is represented as a line with an arrowhead at the NATed end, between the public and private address/port.

Routing diagrams should tell the reader about the overall routing: what peers and areas exist, and how we get from one area or subnet to another. This could contain your:
Devices: Any device performing a routing action, physical or logical.
Interfaces: Any interface participating in a routing action.
BGP: Include all devices inside a colored ASN area. Use lines to represent peerings. iBGP peerings can have the ASN color and eBGP peerings can have two colors representing the ASNs.
OSPF: Include all devices inside a colored OSPF area. Use lines to represent the OSPF neighbors. Use shading to indicate different kinds of areas.
Static routes: Put an arrow with the route outside the egress interface.
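
The BGP coloring convention above can be written down as a small rule. This is a minimal Python sketch; the function name `peering_colors`, the color table and the ASN values are made up for illustration.

```python
def peering_colors(local_asn, remote_asn, asn_colors):
    """Decide how to draw a BGP peering line.

    asn_colors maps each ASN to the color of its area in the diagram
    (hypothetical table, defined by whoever draws the diagram).

    Returns (peering_type, colors): an iBGP peering uses the single
    color of its ASN, an eBGP peering uses the colors of both ASNs.
    """
    if local_asn == remote_asn:
        return "iBGP", [asn_colors[local_asn]]
    return "eBGP", [asn_colors[local_asn], asn_colors[remote_asn]]


if __name__ == "__main__":
    # Example color table using ASNs from the private range.
    colors = {65001: "blue", 65002: "green"}
    print(peering_colors(65001, 65001, colors))
    print(peering_colors(65001, 65002, colors))
```

Keeping the color table in one place also gives you the content for the legend.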

This is by no means a complete standard and you don't need to do the representations in the way they are described here. If a different representation works better for you, use that instead. I hope it has given you a better understanding of how to do network diagrams and that you have been inspired.
If it sounds like a lot of work, think about why you are doing your documentation.