Whether you like it or not, life is a never-ending set of SLAs (service level agreements): the SLA you have in place to collect your kids from school at 3:15 pm, the escalation that will occur if you are not waiting outside the school gate by 3:30 pm, or the job that expects you to arrive by 9 am. The management of an IT network is no different, except that we often forget to put SLAs in place to improve operational efficiency. There are three types of SLAs when managing networks.
An operational SLA, an infrastructure SLA and a service availability SLA. Operational SLAs deal with the execution of service delivery. Infrastructure SLAs deal with the systems in place to manage devices in a break/fix situation. Finally, service availability SLAs cover the applications and services which operate over the infrastructure. In each of these SLAs, one or more components are likely to be outsourced to a supplier, which in turn will have its own set of SLAs in place with your organisation.
The three types of SLAs and how to achieve them.
Operational Service Level Agreements
Every successful network management system should have a specific minimum set of SLAs with the end user. If you are the administrator of the IT network, the end user is your customer, regardless of whether they pay a monthly subscription fee to use your service. Even if these SLAs are not contractually binding in any way, they should be present, if only as a measurement of your own department's performance.
It is also important to set realistic SLAs. If there is only one support specialist available on a given day, each subsequent call cannot be given the same level of treatment. SLAs need to be staggered and dependent upon the available resources. A typical operational SLA may consist of the following components:
1. Hours of operation.
2. The number of times the phone will ring before being answered.
3. The maximum amount of time in a call queue before speaking to an operator.
4. The amount of time a ticket sits in a trouble queue before there is a response from a specialist.
5. The amount of elapsed time before a case is updated.
6. Case escalation time and response, if required.
7. Resolution and closure.
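The staggered targets described above can be sketched as data and checked against a ticket's age. This is a minimal illustration; the priority names and minute values are my own assumptions, not figures from any particular NOC.

```python
from dataclasses import dataclass

# Hypothetical staggered response targets in minutes, by priority.
# These numbers are illustrative only.
RESPONSE_TARGETS = {"critical": 15, "high": 60, "normal": 240}

@dataclass
class Ticket:
    priority: str
    minutes_in_queue: int

def sla_breached(ticket: Ticket) -> bool:
    """True if the ticket has sat in the trouble queue longer
    than its staggered response target allows."""
    return ticket.minutes_in_queue > RESPONSE_TARGETS[ticket.priority]

print(sla_breached(Ticket("critical", 20)))  # 20 > 15 -> True
print(sla_breached(Ticket("normal", 20)))    # 20 <= 240 -> False
```

Staggering the targets this way makes the dependence on available resources explicit: when staffing drops, the table changes, not the measurement logic.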
Most managed service organisations will produce internal weekly reports on the operation of the network operations centre (NOC) itself and weekly/monthly reports on the end user department or operation.
Avoiding ad-hoc Service Operations
I often see large institutions or organisations manage service delivery on their complex networks without any change management or audit process. Someone will come over to the support staff's desk and say: “I need this port opened between these two sites to do XYZ.” The IT guy logs in to the firewall console, makes the change, and the job is done. This may be the quickest way, but there is no audit trail of the change, and no way for others in the department to know what the purpose of the change was or whether it could affect anything else on the system.
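Even a lightweight change log supplies the audit trail that the ad-hoc firewall change above lacks. The sketch below is a minimal, hypothetical record structure; the field names and the `record_change` helper are illustrative, not taken from any specific change-management tool.

```python
from datetime import datetime, timezone

change_log = []  # in practice this would be a database or ticketing system

def record_change(who, device, description, reason):
    """Append an auditable record so others in the department can see
    who changed what, on which device, and why."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "who": who,
        "device": device,
        "description": description,
        "reason": reason,
    }
    change_log.append(entry)
    return entry

record_change("jsmith", "edge-firewall-1",
              "opened TCP/8443 between site A and site B",
              "required for the XYZ application")
print(len(change_log))  # 1
```

With a record like this in place, the question "could this affect anything else on the system?" at least has a starting point for an answer.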
When considering infrastructure SLAs, it is important to remember that the resilience and redundancy built into a network provide a predictable model of availability. Some organisations build so much redundancy into the initial design of their network that the availability of the service approaches 99.999% uptime, a figure obtained by combining the availability of all components, each with a high degree of redundancy.
It is important to realise that uptime and availability are not the same thing in this regard, since it is possible for the infrastructure to be up but not available due to a malfunction. The amount of unexpected downtime for "five nines" availability is typically measured in minutes per year. Most manufacturers will also publish a mean time between failures (MTBF) figure, which states that if the equipment is operated within certain environmental boundaries (temperature, vibration, power cycle operations and insertions), the failure rate can be factored into the overall availability figure.
A good infrastructure design should include an availability matrix for the first five years of operation. It should be a combination of the MTBF, the uptime implied by the design (redundant components, network links, replicated systems) and the availability of spare parts. If you have a recently installed network and your networking vendor has not provided this information, ask for it.
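The arithmetic behind these figures can be sketched in a few lines. The MTBF and repair-time values below are illustrative; the formula A = MTBF / (MTBF + MTTR) is the classic steady-state availability calculation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability: float) -> float:
    """Convert an availability target into expected annual downtime."""
    return (1 - availability) * MINUTES_PER_YEAR

def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# "Five nines" works out to roughly 5.3 minutes of downtime per year.
print(round(downtime_minutes_per_year(0.99999), 1))  # 5.3

# Illustrative component: 100,000-hour MTBF, 4-hour repair time.
print(round(availability_from_mtbf(100_000, 4), 6))
```

Run against your vendor's published MTBF figures, this gives a first-order sanity check on any availability matrix they supply.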
Infrastructure SLAs are predictable, meaning that resolution becomes as simple as break/fix; the SLA is often simply the availability of spare parts or replacement units. Corporates usually choose to outsource infrastructure SLAs to service delivery organisations because of their back-to-back SLAs, nationwide availability and the reduction in risk for the organisation. Smaller organisations usually order additional equipment, which is stored and pre-configured on site for later deployment if required.
Mean Time to Repair (MTTR)
Once the fault is assigned to a specialist, the MTTR SLA becomes the next milestone. If the fault is indeed hardware infrastructure related, then the resolution time is relatively short. However, if the issue is network service related, a process of isolating whether it is application or network related must take place before investigating. For example, let us take a situation where outbound e-mail stops working in an organisation.
The Troubleshooting Workflow May Start Like This:
Is the internet working – yes.
Call the internet service provider (ISP); is it experiencing any outbound issues – no.
Is the e-mail server reachable (ping) – yes.
Are there any error messages on the e-mail server – yes ("SMTP service not responding").
Is there a shortage of memory – no.
Is there sufficient disk space – yes.
Does restarting the service work – yes.
Open support case with the e-mail application provider.
Collect logs and diagnostics for support representative.
Wait for problem to re-occur.
Receive and apply software patch for Mail Server.
Close third party support case.
Update support ticket and close case (after a period of stability).
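The early steps of a workflow like this can be sketched as ordered checks, each either eliminating a layer or identifying where to dig deeper. The check functions below are hypothetical stubs standing in for real probes (ping, service status queries and so on).

```python
def diagnose(checks):
    """Run checks in order; return the name of the first failing layer,
    or None if every layer looks healthy."""
    for name, is_healthy in checks:
        if not is_healthy():
            return name
    return None

# Stubbed results mirroring the example workflow above.
checks = [
    ("internet link",         lambda: True),   # ISP reports no outbound issues
    ("mail server reachable", lambda: True),   # ping succeeds
    ("SMTP service",          lambda: False),  # "SMTP service not responding"
]

print(diagnose(checks))  # SMTP service
```

Ordering the checks from network upwards reflects the point made below: it is cheaper to eliminate the network than the application, so the network goes first.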
In the above example the fault was application related, but the diagnostic steps started with the network. This is usually the case with application faults, because it is easier to eliminate the network hardware as a source of trouble than the application. The situation is also fairly typical of application related issues in that the fault was intermittent: service was restored about halfway through the troubleshooting process.
However, from a service perspective the fault still existed; in fact, we had no evidence of our own that the problem was resolved. From a user perspective, the SLA may have been satisfied when service was restored, before the fault had been properly resolved. The MTTR in this case is the total time elapsed between the initial diagnostic activity and the application of the software patch. If the patch had failed and the fault recurred after a week, the MTTR would need to be adjusted accordingly.
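Measured this way, MTTR runs from the first diagnostic activity to the fix that actually held, not to the temporary service restart. A quick sketch with illustrative timestamps:

```python
from datetime import datetime

# Illustrative timestamps for the e-mail example above.
diagnosis_started = datetime(2024, 3, 1, 9, 0)
service_restarted = datetime(2024, 3, 1, 11, 0)   # temporary restoration only
patch_applied     = datetime(2024, 3, 4, 14, 0)   # the fix that held

# MTTR is measured to the permanent fix, not the restart.
mttr = patch_applied - diagnosis_started
print(mttr)  # 3 days, 5:00:00
```

Had the patch failed and the fault recurred, `patch_applied` would move forward to whenever a fix finally held, and the MTTR would grow with it.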
Service Availability SLAs
The third type of SLA is probably the most important as it relates directly to the end users' overall experience and expectation. A good example of this is the evolution of the telephone within a corporation. When private PBX solutions first became available in the 1970s, the telephone company installed equipment in the customer's basement and placed telephones on each user's desk.
The corporate telephone was, for all intents and purposes, part of the telephone company's own infrastructure and attracted the same aggressive SLAs for availability. What this meant to end users was that each time they lifted the receiver, they expected to hear a dial tone. Telephone companies achieved this high service availability by installing generators, batteries and military specification components with a wide operating temperature range, plus redundant connections to the local exchange. This was effectively an infrastructure SLA held with the customer.
After the break-up of the central telephone company in many countries, the designations of network termination equipment (NTE) and customer premises equipment (CPE) were created. For the first time there was a demarcation point where the telco's area of SLA responsibility (NTE) finished and the customer's (CPE) began. Customers could, for the first time, connect their own switching equipment directly to the telephone exchange. However, they were now responsible for the end user SLA.
Enterprise class PBXs soon filled the gap left by the telephone company, and soon after, internet protocol (IP) telephony took over as the next generation of critical data services handled by the enterprise itself. The deployment of IP telephony has a considerable initial overhead in terms of hardware redundancy, QoS (quality of service) assurance and network services (DHCP, DNS, TFTP). Lifting the receiver today and hearing a dial tone relies on the combination of a network service, hardware infrastructure and a software application.
The end user expectation, based on history, is that the implied SLA expected from any telephone (with the exception of cellphones) is generally about 99.9999%. This makes the establishment of service availability SLAs even more critical than hardware availability SLAs, because applications and network services are more unpredictable than the break/fix nature of networking hardware.
Some network management systems provide the ability to track the response time of applications. This allows granular SLAs to be established on the expected performance of applications. This type of SLA is known as an application KPI (key performance indicator).
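An application KPI of this kind can be sketched as a window of response-time probes evaluated against a threshold. The sample values, the 300 ms threshold and the 90% pass ratio below are all illustrative assumptions, not figures from any particular management system.

```python
# Illustrative response-time samples in milliseconds from a synthetic probe.
SLA_THRESHOLD_MS = 300
samples_ms = [120, 140, 135, 600, 150, 145, 130, 155, 140, 125]

def kpi_met(samples, threshold_ms, pass_ratio=0.9):
    """KPI passes if at least `pass_ratio` of probes beat the threshold.
    A ratio target tolerates the odd slow transaction without masking
    a sustained degradation."""
    within = sum(1 for s in samples if s <= threshold_ms)
    return within / len(samples) >= pass_ratio

print(kpi_met(samples_ms, SLA_THRESHOLD_MS))  # 9 of 10 within threshold -> True
```

Using a ratio rather than a hard maximum is one way to keep the KPI granular: a single 600 ms outlier does not breach the SLA, but a pattern of them does.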
SLAs and Escalation
An SLA is worthless unless there is an escalation process attached to it. SLAs are time based, and if that boundary is broken, a process of escalation that everyone in the organisation knows about must happen. I have worked for organisations where a critical weekend SLA was missed and the case was escalated to the department manager, who was on vacation, then to a vice president who was out of phone coverage, and finally to the organisation's chief executive officer (CEO). When an SLA is escalated that far up the food chain, you know the response is not going to be favourable for the team working that shift!
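A time-based escalation chain like the one in that story can be sketched as tiers keyed by how long the breach has been outstanding. The tiers and the minute thresholds below are illustrative, not a recommendation for any particular organisation.

```python
# Hypothetical escalation chain: (minutes since SLA breach, who to notify).
ESCALATION_CHAIN = [
    (30,  "shift supervisor"),
    (120, "department manager"),
    (240, "vice president"),
    (480, "CEO"),
]

def who_to_notify(minutes_since_breach: int):
    """Return every tier whose escalation threshold has been passed."""
    return [role for mins, role in ESCALATION_CHAIN
            if minutes_since_breach >= mins]

# Five hours into an unresolved weekend breach:
print(who_to_notify(300))
# ['shift supervisor', 'department manager', 'vice president']
```

Writing the chain down as data, rather than leaving it in people's heads, is precisely what makes it a process everyone in the organisation knows about.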
You can Summarise the Three Types of SLAs for IT Management as Follows:
Operational Service Level – Held with the customer or end user of the system or solution.
Infrastructure Service Level – Held with the hardware design, redundancy, and the support organisation looking after it.
Service Availability Service Level – The end user's perception of service availability; a combination of infrastructure and application availability, often measured using KPIs.