

Although SNMP (Simple Network Management Protocol), introduced in 1988, was seen as a short-term answer to managing devices on IP networks, it has been extremely successful. Network management began in the late 1980s, when computer networks were becoming increasingly complex and difficult to manage.

There were several network protocol standards during this time. TCP/IP, used in the UNIX operating system, was the mainstay of the academic community.

Novell's proprietary IPX protocol was used mainly by enterprises, and AppleTalk was used by specialist design bureaus and desktop publishing.

There was a need for a network management solution that could be transported over different protocol stacks. It was also important that it did not add to the overhead of the network, and that it was small enough to be embedded into computer hardware and software without impacting the performance of the device or service itself.

The best known and most widely used network management protocol is SNMP.

Today, you will find SNMP embedded in virtually all enterprise class computing equipment including:


Servers - Power supply failure, RAID disk failure, CPU, memory, processes.

PCs - Disk space, OS errors, CPU utilisation, memory

Switches - Utilisation, CPU, memory, interface errors

Routers - Utilisation, CPU, memory, interface errors

UPS - Battery health/temperature/input & output voltage, load

Printers - Toner/ink levels


SNMP consists of three components:

1. A managed device or service.

2. An agent which resides on the managed device itself. It is important to remember that in the case of a server, for example, you may have an SNMP agent which interacts with both the actual hardware (RAID, temperature, power) and the operating system (CPU, memory and disk).

3. A Network Management System (NMS). The NMS is like a language interpreter. Initially when a new device is added to the NMS, limited communication can take place because the interpreter doesn't know how to speak the device's language.

This is where the vendor specific management information base (MIB) comes into play.

The MIB is like a code book. It describes each managed object and places it in a hierarchical, tree-like structure, where every object is addressed by an object identifier (OID). Together, these objects cover all the specific attributes of the device you are managing.

Each equipment vendor will publish specific MIBs for monitoring the device. Depending on the complexity of the device, the number of available OIDs will vary.

The notation in which these MIBs are written is called ASN.1 (Abstract Syntax Notation One).

How Does it Work

When the NMS wants to find a value for something contained in the vendor MIB, for example the temperature of a Cisco switch, it sends a get request for the value of the following OID (1.3.6.1.4.1.9.9.91.1.1.1.1.4).

[Figure: MIB tree from the root down to the Cisco management branch]

Notice how the tree starts from root and adds numbers to build the complete object identifier (I only included the tree structure to where the Cisco management part starts). If the object exists then a value is returned to the NMS.

Because the NMS contains the Cisco MIB code book, it knows the returned value is the temperature of the switch. If the object does not exist, an error message is returned instead.
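To make the tree idea concrete, here is a minimal Python sketch that labels the well-known arcs at the start of the OID used above. It only names the path down to the Cisco enterprise branch; everything below that is defined in Cisco's own MIBs and is left numeric. The function names are made up for this illustration and do not come from any SNMP library.

```python
# Names for the well-known arcs leading to Cisco's enterprise subtree.
# Arcs below that point are defined in Cisco's MIBs and stay numeric here.
KNOWN_ARCS = ["iso", "org", "dod", "internet", "private", "enterprises", "cisco"]
KNOWN_PREFIX = (1, 3, 6, 1, 4, 1, 9)


def parse_oid(text):
    """Turn a dotted OID string into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def describe_oid(text):
    """Label the arcs that match the known prefix; keep the rest as numbers."""
    oid = parse_oid(text)
    labels = []
    for depth, arc in enumerate(oid):
        if depth < len(KNOWN_PREFIX) and oid[: depth + 1] == KNOWN_PREFIX[: depth + 1]:
            labels.append(f"{KNOWN_ARCS[depth]}({arc})")
        else:
            labels.append(str(arc))
    return ".".join(labels)


def is_under(oid_text, subtree_text):
    """Return True if the OID sits inside the given subtree."""
    oid, subtree = parse_oid(oid_text), parse_oid(subtree_text)
    return oid[: len(subtree)] == subtree


TEMPERATURE_OID = "1.3.6.1.4.1.9.9.91.1.1.1.1.4"
print(describe_oid(TEMPERATURE_OID))
print(is_under(TEMPERATURE_OID, "1.3.6.1.4.1.9"))
```

The first line printed shows how each number adds one more branch to the path, and the second confirms that the temperature object lives inside the Cisco part of the tree.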

This flexibility allows equipment vendors to include any type of managed object they want, as long as they also publish the updated MIB to support the object. This provides universal support across every SNMP management system.


Versions

There are three versions of SNMP.

SNMP version 1 is not secure: read and write access is controlled by simple community strings sent across the network in clear text. It was developed before network security was a widespread concern.

SNMP version 2c is the most widely used version today.

It allows a lot more information to be sent in each SNMP poll request, which makes it useful for transferring response time information back to an NMS. However, it still has security concerns.

SNMP version 3 adds encryption, authentication and payload integrity checking. Many NMS products support all three versions simultaneously, although many devices only support versions 1 and 2c.

It is important to note that SNMP version 2 is incompatible with SNMP version 1. If the NMS doesn't support both versions, an SNMP proxy must be used to translate between them.

In order to assist with this situation, most vendors are now supporting all three versions of the SNMP stack.

Monitoring & Traps

SNMP provides two very separate services: Monitoring and Traps.

With monitoring, we poll (query) the device at a regular interval to determine the status of one or more monitored services, such as CPU utilisation.

This allows the NMS to plot a graph over time that shows, for example, the maximum and minimum throughput for a given time period.

To reduce the load on the polled device, SNMP statistics are collected at a fixed polling interval; 15 minutes is common. This means that if an event like a traffic spike occurs within the polling interval, you won't know about it until the next poll cycle is complete.
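The effect of the polling interval is easy to see with some arithmetic. The sketch below assumes the NMS reads a 32-bit octet counter at each poll and turns the difference into an average rate; the traffic figures are invented for illustration. A one minute burst at 90 Mbit/s inside a 15 minute interval is flattened into a much lower average.

```python
COUNTER32_MODULUS = 2 ** 32  # assumption: the interface counter is a 32-bit octet counter


def average_bps(first_octets, second_octets, interval_seconds):
    """Average bits per second between two polls, allowing for one counter wrap."""
    delta = (second_octets - first_octets) % COUNTER32_MODULUS
    return delta * 8 / interval_seconds


# Hypothetical link: 14 minutes at 1 Mbit/s followed by a 1 minute burst at 90 Mbit/s.
quiet_octets = 1_000_000 // 8 * 14 * 60
burst_octets = 90_000_000 // 8 * 60

first_poll = 4_200_000_000  # chosen so the counter wraps before the second poll
second_poll = (first_poll + quiet_octets + burst_octets) % COUNTER32_MODULUS

rate = average_bps(first_poll, second_poll, 15 * 60)
print(f"average over 15 minutes: {rate / 1e6:.2f} Mbit/s")
```

The graph for this interval would show roughly 6.93 Mbit/s, with no sign that the link briefly ran at 90 Mbit/s.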

Traps, however, are generated by the device itself. The device (router, switch or PC) can be programmed to send a trap alert if a threshold is exceeded.

For example, if the amount of free disk space on a server reaches less than 10%, the SNMP service can send an alert warning to the NMS before the disk is full.

Get & Set

In order to make alarms useful for the environment you are monitoring, threshold variables have to be set on the device. This is achieved using an SNMP SET command.

Normally, the OID and the new value of the variable you are interested in changing are sent in the SET command.

Nearly all SNMP management systems can do this from a GUI, so you won't need to enter complicated strings at the command line.

Similarly, the SNMP GET retrieves the values of the variables in the system. Normally each device comes preconfigured with many defaults which may not be suitable for the monitoring environment.

This could include parameters such as errors, throughput, temperature etc. The MIB allows the NMS to translate the information polled from the device into useful management statistics.
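The relationship between GET, SET and traps can be sketched with a toy in-memory agent. This is not a real SNMP implementation and sends nothing over the network; the class, its method names and both OIDs are hypothetical and exist only to show how a threshold written with SET changes when a trap is raised.

```python
DISK_FREE_PERCENT = "1.3.6.1.4.1.99999.1.1"    # hypothetical OID
DISK_FREE_THRESHOLD = "1.3.6.1.4.1.99999.1.2"  # hypothetical OID


class ToyAgent:
    """In-memory stand-in for an SNMP agent on a server."""

    def __init__(self, values, writable):
        self._values = dict(values)
        self._writable = set(writable)

    def get(self, oid):
        """Return the value stored at the OID, or raise if it does not exist."""
        if oid not in self._values:
            raise KeyError(f"no such object: {oid}")
        return self._values[oid]

    def set(self, oid, value):
        """Change a value, but only for objects the agent exposes as writable."""
        if oid not in self._writable:
            raise PermissionError(f"not writable: {oid}")
        self._values[oid] = value

    def pending_traps(self):
        """Return the trap messages the agent would send for its current state."""
        free = self._values[DISK_FREE_PERCENT]
        threshold = self._values[DISK_FREE_THRESHOLD]
        if free < threshold:
            return [f"disk free {free}% is below threshold {threshold}%"]
        return []


agent = ToyAgent(
    {DISK_FREE_PERCENT: 8, DISK_FREE_THRESHOLD: 5},
    writable=[DISK_FREE_THRESHOLD],
)
before = agent.pending_traps()       # preconfigured threshold of 5%: no trap yet
agent.set(DISK_FREE_THRESHOLD, 10)   # the NMS raises the threshold to 10%
after = agent.pending_traps()        # 8% free is now below the threshold
print(before, after)
```

With the preconfigured default the agent stays silent even though the disk is nearly full, which is why the defaults usually need adjusting for the environment being monitored.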

Standard MIB Libraries

The Internet Engineering Task Force (IETF) administers the SNMP standards. This ensures a certain degree of interoperability between different vendors.

This common interoperability is contained in two standards known as MIB-I and MIB-II. Within this pair of standard libraries is a set of performance monitoring objects called RMON.

RMON uses the concept of a probe to collect performance statistics from network devices. The probe removes the work of correlating and tabulating the data from the device itself.

It is designed to be operated continuously to provide trend metrics over time. Initially, the RMON probe itself was a separate device which reported back to the NMS.

However, with more powerful processing in modern computer hardware, the functionality is integrated into the NMS itself.

The problem with RMON is that it requires an embedded agent in the device being monitored. It can also be processor intensive, depending on the amount of processing performed by the device.

Equipment vendors will often only implement 2 or 3 of the RMON group features into their products for this reason.

The following is a list of the standard tests which the RMON library provides:

Packets dropped
Packets sent
Broadcast packets
Multicast packets
CRC errors - (Cyclic redundancy check) This happens when the far-end reports the checksum didn't match the payload sent or there is a cabling fault corrupting the idle flag pattern. You can get CRC errors even if there is no data being transmitted. You need to check the cabling and network cards.
Runts - number of packets dropped because they are smaller than the minimum size on the interface
Giants - number of packets dropped because they exceed the maximum size on the interface
Fragments - packets shorter than 64 bytes that also fail the CRC check
Packet size distribution - counts of packets which fit into various size ranges, like 65 - 127 bytes or 128 - 255 bytes
Jabbers - From the old days when HUBs ruled the earth
Collisions - Before switches came and removed collisions
Counters for packets

 

Types of Tests

Most types of NMS depend on the device being reachable using a simple PING test. If the device replies, then the test was successful.

This type of active test is helpful to determine whether basic connectivity exists and whether the connection is reliable.

PING can provide a response time metric, measured as the round-trip time (RTT) of the ICMP echo request and reply. This type of test is known as a synthetic test as it is not a measurement of the actual traffic running through the device.

A simulated payload is used to calculate the RTT.

Most NMS products use a dashboard that represents the periodic PING results as coloured lights: green (successful), yellow (warning) or red (fault).
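As a rough sketch of how such a dashboard might decide on a colour, the function below maps one PING result to a light. The 200 ms warning threshold and the host names are assumptions for illustration; real products let you tune these per device.

```python
def ping_status(response_ms, warning_ms=200.0):
    """Map one PING result to a dashboard colour.

    response_ms is the measured response time in milliseconds,
    or None when no reply arrived before the timeout.
    """
    if response_ms is None:
        return "red"
    if response_ms > warning_ms:
        return "yellow"
    return "green"


# Hypothetical results from one polling cycle.
results = {"core-switch": 1.8, "branch-router": 340.0, "file-server": None}
for host, response_ms in results.items():
    print(host, ping_status(response_ms))
```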

Auto-Discovery

Nearly all commercial SNMP polling tools include an auto-discovery tool that sends out a variety of host discovery queries, such as PING, NetBIOS, ARP and SNMP, across the subnet ranges on your network.

It will also attempt to build a network map diagram representing your network. Some of these tools even integrate with external mapping products to provide a geographical dispersion of hosts.

However, in nearly every network monitoring situation less is better. It may look cool to see 200 PCs automatically discovered and allocated into a subnet branch, but few auto-discovery tools ever build the complete network correctly, and most require some form of manual intervention.

Also, you may find that your IDS detects the NMS server as a security threat as it sweeps the network, and the server may get locked out of the network.

Obviously link monitoring and alarms will be critical for servers and network components. However, be wary of a yellow warning alarm indicating that a ping response time is delayed. This usually indicates congestion rather than a failing interface.

Where SNMP is Most Useful

SNMP + ICMP (Ping) is most useful for monitoring infrastructure (availability of devices like switches and routers) and servers. It is readily available, relatively simple to configure and accurate.

It will tell you about throughput, interface errors and packet drops on a switch interface and CPU and memory utilisation on other devices.

SNMP Management Vendors

Since SNMP is embedded into nearly all enterprise vendor equipment, management systems for SNMP are readily available.

They can be available free or be very expensive. As with everything in this world, there is no free lunch. The more you pay the more features you get.

A free SNMP polling solution may only support limited functionality and reporting without any technical support.

On the other hand, a comprehensive solution from one of the big name vendors will integrate a full fault management solution, provide support for many different types of devices, allow remote device configuration and will include flow based reporting.

Limitations

SNMP depends on a polling interval. If the polling interval is short and there is a large number of devices, then SNMP polling and trap traffic may flood the network, and the NMS may start reporting incorrect results and triggering alarms on false traps.

SNMP runs over UDP (port 161 for polling and port 162 for traps), which is an unreliable transport. Packet loss becomes more likely if the polling interval is too short and there is a large number of devices on the network.

It is possible for a trap alert sent by a device to the NMS host to be lost and never acted upon. Keeping the polling interval as wide as possible is the best practice.
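A back-of-the-envelope calculation shows why the interval matters. The sketch below assumes the worst case of one request per polled object, and the device and object counts are made up; tools that batch several objects into one request will send fewer packets.

```python
def requests_per_second(devices, objects_per_device, interval_seconds):
    """Average SNMP requests per second, assuming one request per polled object."""
    return devices * objects_per_device / interval_seconds


# Hypothetical estate: 500 devices with 40 polled objects each.
for interval in (60, 300, 900):
    load = requests_per_second(500, 40, interval)
    print(f"{interval:>4} s interval: {load:.1f} requests/s")
```

Moving from a 1 minute to a 15 minute interval cuts the request rate by a factor of 15 for the same set of devices.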

  • Failure alarms aren't reported in real time.

  • SNMP does not have any visibility into applications, so it can't report on the applications themselves.

  • Vendor MIB libraries become obsolete as vendors upgrade device software to provide additional functionality, which means the MIB library has to be upgraded as well.

 

NetFlow

NetFlow was designed as an improvement over the standards-based SNMP protocol, which, although reliable and accurate, could not provide any insight into conversations or the protocols used.

Furthermore, with the emergence of quality of service (QoS), a mechanism was needed to report on the different QoS levels present.

Cisco created the first commercial NetFlow version back in 1997, which it introduced as software in its high-end routers.

It uses the concept of a NetFlow collector, which is responsible for analysing and correlating the flows from the flow exporter device, typically a router or switch.

Later, NetFlow was adopted by vendors like Juniper (J-Flow) and Foundry (sFlow), and a standard was eventually adopted by the IETF, which became IPFIX.

NetFlow uses the concept of the cache entry that includes status information for all flows active in the export device. The cache builds this information by first processing the initial packet in a flow using the normal switching path.

A record is kept in the NetFlow cache for each active flow. Key fields in each NetFlow record identify the flow, and the records are exported to a flow collector for further correlation.

Flow records are created by matching packets that share the same flow characteristics and then counting the bytes and packets per flow.

The flow information is exported to a flow collector at a regular interval using flow timers. Each collector maintains a history table of the flows switched within the export device. NetFlow export typically amounts to approximately 1 to 2 percent of the total traffic switched by the router.

Every packet is counted when using the non-sampled NetFlow mode. This gives a detailed view over time of traffic entering and leaving the export device.

sFlow

Whilst NetFlow is a software solution embedded into Cisco IOS routers and switches, sFlow is a hardware-based alternative to NetFlow, embedded in a chip on many other network vendors' products.

Whilst Cisco's NetFlow specification can use both sampled and non-sampled modes, sFlow supports sampled metrics only. This means only 1 in n packets is forwarded to the flow collector.

Statistical calculations are then performed on the sampled traffic to improve the accuracy of the flow export.

Vendors which use sFlow include Juniper, Foundry, HP and Extreme Networks.

The biggest benefit of sFlow technology is that it removes the burden of processing the flow traffic from the CPU of the switch or router. Most third party flow collectors support both Cisco NetFlow and the sFlow standard.

A flow is a unidirectional data stream. Packets are considered part of the same flow when they share all of the following fields:

Source IP address

Destination IP address

UDP or TCP source port, can be 0 for other protocols

UDP or TCP destination port, can be 0 for other protocols

IP Protocol type

SNMP Interface index number

IP TOS (Type of Service)
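The way these seven fields define a flow can be sketched as a small cache keyed on them. This is a simplified illustration, not how any router implements its cache; the field names, the packet records and the interface and TOS values are all made up.

```python
from collections import namedtuple

# The seven key fields listed above, under illustrative names.
FlowKey = namedtuple(
    "FlowKey", "src_ip dst_ip src_port dst_port protocol if_index tos"
)


def account(cache, packet):
    """Add one packet to the flow cache, creating the flow entry if needed."""
    key = FlowKey(*(packet[field] for field in FlowKey._fields))
    entry = cache.setdefault(key, {"packets": 0, "bytes": 0})
    entry["packets"] += 1
    entry["bytes"] += packet["length"]
    return key


def make_packet(src_ip, dst_ip, src_port, dst_port, length):
    """Build a TCP packet record with made-up interface and TOS values."""
    return {
        "src_ip": src_ip, "dst_ip": dst_ip,
        "src_port": src_port, "dst_port": dst_port,
        "protocol": 6, "if_index": 2, "tos": 0, "length": length,
    }


cache = {}
request = account(cache, make_packet("10.0.0.5", "192.0.2.10", 51000, 443, 300))
account(cache, make_packet("10.0.0.5", "192.0.2.10", 51000, 443, 500))
reply = account(cache, make_packet("192.0.2.10", "10.0.0.5", 443, 51000, 1400))

for key, counters in cache.items():
    print(key.src_ip, "->", key.dst_ip, counters)
```

The two request packets land in one flow entry, while the reply creates a second entry because a flow only covers one direction of the conversation.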

In later versions of NetFlow, user defined fields were made available, extending the protocol to support other vendor specific attributes such as MPLS tags and BGP neighbours.

Initially, NetFlow was only available on high-end devices because of its impact on the CPU and hardware resources of lower performance devices.

The router was responsible for monitoring and storing all the flows on pre-configured interfaces, exporting them to the collector once the conversation or flow was complete.

Complicated cache flow algorithms are used on the router or switch device to determine whether a flow is part of an existing flow, or is completely new and requires a new flow entry in the table.

CPU utilisation is impacted the most when it comes to NetFlow. This is because as the number of flows increases, the amount of processing power required to sort and order the flows also increases.

High-end routers and switches feature a dedicated NetFlow module which removes the processing burden completely from the host.

There is very little benefit in enabling or disabling other NetFlow features such as exporting to multiple destinations or collecting BGP peer information.

Tests by Cisco indicate that these features don't seem to affect the processing overhead. Complex algorithms are used to determine when to export a flow.

The cache is a finite resource on a router or switch. By default, a flow that stays active beyond the flow export interval is exported after 30 minutes, and heuristics are used to aggressively age out groups of flows simultaneously before the cache is saturated.

Obviously this is an extreme example since most "well behaved flows" are short and will send a TCP FIN at the end of the flow, which indicates to the cache the flow has finished.
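The export decision can be sketched as a few simple checks per cache entry. The 30 minute active timeout is the default mentioned above; the 15 second inactive timeout is an assumed value, and the flow records and function name are invented for this illustration. Real devices add further heuristics for when the cache is close to full.

```python
ACTIVE_TIMEOUT = 30 * 60  # seconds; long-lived flows are exported after 30 minutes
INACTIVE_TIMEOUT = 15     # seconds; assumed value for flows that have gone quiet


def export_reason(flow, now):
    """Return why a flow should leave the cache, or None to keep it."""
    if flow["saw_fin"]:
        return "finished"
    if now - flow["last_seen"] >= INACTIVE_TIMEOUT:
        return "inactive"
    if now - flow["first_seen"] >= ACTIVE_TIMEOUT:
        return "active timeout"
    return None


now = 2000  # current time in seconds, on an arbitrary clock
flows = {
    "short web request": {"first_seen": 1990, "last_seen": 1999, "saw_fin": True},
    "idle session": {"first_seen": 1900, "last_seen": 1980, "saw_fin": False},
    "long backup": {"first_seen": 100, "last_seen": 1999, "saw_fin": False},
    "new transfer": {"first_seen": 1995, "last_seen": 1999, "saw_fin": False},
}
for name, flow in flows.items():
    print(name, "->", export_reason(flow, now))
```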

Sampled NetFlow

Sampled NetFlow significantly reduces the processing burden from the router acting as a flow exporter.

Instead of maintaining a complete table of all the flows present on the device, it samples the interface traffic at regular intervals and provides a sample of the traffic running on the device.

The collector needs to be configured with the sample rate used so it can compensate for flow discrepancies.

A typical configuration would be a 1:100 packet sample rate. With this sample rate, approximately 80% of flows would be included in the NetFlow cache. However, this also reduces the accuracy of the reporting.

Sampled NetFlow is a compromise between performance and accuracy.

The question remains though, for capacity planning purposes, how important is accuracy? Since all the reported protocols will be skewed by the same sample rate, the protocol distribution will be fairly accurate.
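Here is a minimal sketch of how a collector can compensate for a 1:100 sample rate by multiplying what it sees by the rate. It uses deterministic every-n-th-packet sampling and a perfectly regular, made-up traffic mix so the result is repeatable; in this artificial case the estimate matches exactly, whereas real traffic would only be approximated.

```python
from collections import Counter

SAMPLE_RATE = 100  # the 1:100 rate; the collector must know this value


def sample_one_in_n(packets, n):
    """Keep every n-th packet (deterministic, for a repeatable example)."""
    return packets[n - 1 :: n]


def bytes_per_protocol(packets, scale=1):
    """Sum packet lengths per protocol, scaled up by the sample rate."""
    totals = Counter()
    for protocol, length in packets:
        totals[protocol] += length * scale
    return totals


# Made-up traffic mix: 4 HTTP, 2 DNS and 1 SSH packet, repeated 1000 times.
pattern = [("http", 1200)] * 4 + [("dns", 100)] * 2 + [("ssh", 400)]
packets = pattern * 1000

sampled = sample_one_in_n(packets, SAMPLE_RATE)
actual = bytes_per_protocol(packets)
estimate = bytes_per_protocol(sampled, scale=SAMPLE_RATE)

print(len(packets), "packets seen,", len(sampled), "packets sampled")
for protocol in actual:
    print(protocol, actual[protocol], estimate[protocol])
```

The router only has to account for 70 packets instead of 7000, yet the scaled totals still show the same split between protocols.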

Using NetFlow in conjunction with SNMP will provide accurate overall utilisation statistics and a fairly comprehensive view of the network protocols.

Limitations

NetFlow can add a processing overhead on low end routers and switches. Also, it may not be possible to enable NetFlow on old devices with a large number of services and a lot of traffic.

The reports generated by NetFlow are an estimate of network traffic. Switching between different time interval views (1 minute or 15 minutes) may yield different results due to the sampling rate used.

NetFlow is UDP based. Therefore, if the network is congested and the flow collector doesn't receive the flow transmission, it is lost completely. (Cisco developed an SCTP transport to overcome this limitation in later versions).

When comparing SNMP and NetFlow collection, NetFlow will generally show less data than its SNMP MIB-II counterpart, for a number of reasons.

Some data types are not supported by NetFlow. A protocol such as the address resolution protocol (ARP), although part of the IP protocol family, is not classified as an actual IP packet and is not included in NetFlow collection.

Cisco CDP (Cisco Discovery Protocol) packets are another example. This type of traffic is usually a lot less than other traffic types but it will still skew the values in an application detail report.

Layer 2 packet headers are not counted in NetFlow because it typically counts just complete IP datagrams. An ethernet header, for example, is stripped off before the NetFlow processing and is not added into the octets count for that packet. This can contribute significantly to a difference.

For example, if the header that is removed is 18 bytes (MAC+LLC+CRC) and the average frame size on the link is 300 bytes, that is over 5% difference between the two.
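The same calculation can be repeated for other frame sizes. This small sketch uses the 18 byte header from the example above; the other frame sizes are chosen only to show how the gap grows as frames get smaller.

```python
def header_share(header_bytes, frame_bytes):
    """Fraction of each frame counted by SNMP but not by NetFlow."""
    return header_bytes / frame_bytes


for frame_bytes in (64, 300, 1500):
    share = header_share(18, frame_bytes)
    print(f"{frame_bytes:>5} byte frames: {share:.1%} difference")
```

For the 300 byte average frame this works out to 6%, and the difference is far larger on links dominated by small frames.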

NetFlow timers make lining up samples difficult. The SNMP MIB-II counts are updated immediately or on a very short cycle (< 1 sec).

The NetFlow counts are dependent on the timers. Even with very short timers, 15 sec inactive or 1 min active, there is still a potential 1 min gap between data on the interface and data sent to the collector.

Increasing the timers makes aligning the samples more difficult. In this situation, you’ll often see a lag between the two graphs. A spike on the router's link graph may only appear, at some level, in the next sample on the corresponding collector application view.

This is a common cause of NetFlow spikes showing higher than the router link.

NetFlow uses sampled data. The random nature of what packet is sampled can skew the data.

Statistically, the sampled data should move very close to the true values as the number of frames and the flow rates increase. This is an important consideration when evaluating data from NetFlow based solutions.


By Craig Sutherland 

 
