To understand how Delivery monitoring works, there are a few topics you should be acquainted with. These include:

  • Network traffic - describes how traffic flows through a network.
  • TruPath - describes how TruPath, AppNeta’s network performance monitoring technology, works.
  • ICMP, UDP, and TCP - describes the three protocols used to measure network performance.
  • QoS - describes what QoS is and how it affects traffic flows.
  • Data and voice traffic - describes the difference between data and voice traffic from a performance monitoring perspective.
  • Network performance metrics - describes the various performance metrics that are collected and calculated.

Network traffic

As data packets flow through a network between two endpoints, they encounter a number of network devices (for example, routers, switches, firewalls, and load balancers). Each of these devices needs to store and forward the packets as they are passed along. The amount of traffic passing through a device at a given time, and the priority of that traffic, affects how long (if at all) a given packet will be queued on that device. Once a queue starts to fill, packets in that queue wait their turn to be forwarded. This waiting causes a delay between packets when they are received at their destination. If there is so much traffic on a network device that one of its queues fills up, any additional packets destined for that queue are dropped - causing data loss. The rate at which a given queue can repeatedly fill and drain without data loss is effectively the maximum capacity of the device for traffic using that queue. The lowest capacity device on the network path between the two endpoints determines the capacity of that path.
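
To illustrate the store-and-forward behavior described above, the following sketch models a single device queue that drains at a fixed rate. The queue limit, drain rate, and burst size are arbitrary values chosen for the example; this is a simplified illustration, not a model of any particular device:

    from collections import deque

    QUEUE_LIMIT = 100      # packets this queue on the device can buffer
    DRAIN_RATE  = 10       # packets the device forwards per time step
    BURST_SIZE  = 25       # packets arriving per time step (more than it can drain)

    queue, dropped, delays = deque(), 0, []

    for step in range(50):
        # Arrivals: packets are queued until the queue is full, then dropped.
        for _ in range(BURST_SIZE):
            if len(queue) >= QUEUE_LIMIT:
                dropped += 1               # queue full -> data loss
            else:
                queue.append(step)         # remember when the packet arrived
        # Forwarding: the device drains at a fixed rate; waiting adds delay.
        for _ in range(min(DRAIN_RATE, len(queue))):
            delays.append(step - queue.popleft())

    print(f"dropped={dropped}, max queuing delay={max(delays)} steps")

Because arrivals outpace the drain rate, queuing delay grows until the queue is full, after which every excess packet is dropped - the same behavior that limits the capacity of a real device.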

TruPath™

TruPath™, AppNeta’s patented network performance monitoring technology, is the heart of Delivery monitoring. It probes a network using short bursts of packets (called “packet trains”) and waits for the replies. It uses information like the time the packets take to go from a source to a target and back, the delay between packets on their return, packet reordering, and the number of packets lost, to directly measure key network performance metrics (round-trip time (RTT), latency, jitter, and data loss), and to infer others (total and utilized capacity). At the same time, it can determine if there are Quality of Service (QoS) changes along the network path. All of this network performance information is sent to APM for analysis and presentation.

To obtain this information, TruPath employs two distinct instrumentation modes: Continuous Path Analysis™ (CPA) and Deep Path Analysis™ (DPA), also known as Diagnostics. CPA mode runs continuously and, every 60 seconds (by default), places roughly 20-50 packets onto the network and analyzes the replies. If a network dysfunction is detected (for example, higher than acceptable data loss), TruPath first confirms the dysfunction is present (by sampling every six seconds for ten samples) and then, once it is confirmed, automatically shifts to DPA mode and runs a diagnostic test that probes not only the target but also all devices on the network path from the source to the target. In this mode, as many as 400-2000 packets can be sent in a series of packet trains in order to delve into the cause of the performance issue. As part of the diagnostic test (and every five minutes during CPA), the route taken by all protocol types (ICMP, UDP, and TCP) is determined. All the information collected in both CPA and DPA modes is sent back to APM.

Commonly used packet sequences are 1, 5, 10, 20, 30, and 50 packets in length. Because the packet sequences are very short, the overall load on the network is kept very low, typically averaging 2 Kbps for CPA and only 10-200 Kbps during a DPA diagnostic test. For very slow links or networks with other restrictions, such as a small MTU, TruPath automatically adjusts its traffic loads to minimize network impact even further.
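
The switch from CPA to DPA can be pictured as a simple control loop. The sketch below illustrates the idea only; the loss threshold, the confirmation criterion (all ten samples agreeing), and the send_packet_train and run_diagnostics callables are hypothetical placeholders, not AppNeta's implementation:

    import time

    LOSS_THRESHOLD   = 0.02    # assumed "acceptable" loss limit for this example
    CONFIRM_SAMPLES  = 10      # confirmation samples, taken six seconds apart
    CONFIRM_INTERVAL = 6       # seconds between confirmation samples
    CPA_INTERVAL     = 60      # default CPA monitoring interval, in seconds

    def monitor(send_packet_train, run_diagnostics):
        """send_packet_train() returns an object with a .loss attribute (0-1);
        run_diagnostics() probes every device on the path (DPA mode)."""
        while True:
            result = send_packet_train()              # CPA: ~20-50 packets
            if result.loss > LOSS_THRESHOLD:
                # Confirm the dysfunction before escalating to DPA.
                confirmed = 0
                for _ in range(CONFIRM_SAMPLES):
                    time.sleep(CONFIRM_INTERVAL)
                    if send_packet_train().loss > LOSS_THRESHOLD:
                        confirmed += 1
                if confirmed == CONFIRM_SAMPLES:      # criterion assumed for illustration
                    run_diagnostics()                 # DPA: 400-2000 packets
            time.sleep(CPA_INTERVAL)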

Single-ended and dual-ended network paths

TruPath can be employed in either a single-ended configuration or a dual-ended configuration. In the single-ended configuration, a single AppNeta monitoring point is required to run the TruPath software. It acts as one endpoint (the source) of the network path being monitored. The other endpoint (the target) can be any TCP/IP device. ICMP echo requests are sent to the target and ICMP echo replies are returned. The advantage of a single-ended configuration is that only one monitoring point is required. The disadvantage is that any network characteristics that are direction dependent (for example, the differences in capacity in each direction on an asymmetrically provisioned link) cannot be detected.

[Image: single-ended-setup.svg]

Figure 1: Single-ended configuration

In the dual-ended configuration, one monitoring point is the source and another is the target. Both ICMP and UDP packets are used for monitoring - ICMP to the intermediate network devices and UDP to the monitoring point at the far end. The advantage is that you get a more accurate picture of your network performance as independent measurements are taken from source to target and from target to source. This enables you to determine, for example, the network capacity in each direction. The target monitoring point can be one of your AppNeta Enterprise Monitoring Points or an AppNeta WAN Target, depending on your needs.

[Image: dual-ended-setup.svg]

Figure 2: Dual-ended configuration

ICMP, UDP, and TCP

TruPath uses three common protocols to gather network performance metrics: ICMP, UDP, and TCP.

ICMP is a control message protocol used by network devices to send error messages and operational information. It is not typically used to transfer data between systems. AppNeta monitoring points running Delivery monitoring use ICMP echo request and echo response packets (“ping” packets) for the majority of continuous monitoring - collecting network performance metrics. ICMP is also used to expose QoS marking changes during diagnostic tests, and as part of a traceroute for determining the route ICMP packets take from a source monitoring point to a target.

UDP is a core Internet protocol used to transport data. It is connectionless with very little protocol overhead. Apart from checksums for data integrity, there is no error checking, and there is no guarantee of packet delivery, packet ordering, or duplicate protection. UDP is used for applications where error checking and correction are either not necessary or are provided by higher-level protocols or the application itself. It is typically used by time-sensitive applications where losing packets is preferable to spending time retransmitting lost packets. Applications that use UDP include real-time voice and video streaming and Voice over IP (VoIP). Delivery monitoring uses UDP for continuous monitoring on dual-ended paths, to expose QoS marking changes during diagnostic tests, and as part of a traceroute for determining the route UDP packets take from a source monitoring point to a target.

TCP is also a core Internet protocol used to transport data. It differs from UDP in a few ways. It is a connection-oriented protocol that provides a reliable, ordered, and error-checked way to transfer data over an IP network. It is used for applications where reliability is more important than reduced latency. Applications that use TCP include WWW browsing, file transfer, email, and video streaming services like YouTube and Netflix. Delivery monitoring uses TCP as part of a traceroute for determining the route TCP packets take from a source monitoring point to a target.
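
To illustrate how a protocol-specific route can be discovered, the sketch below implements a classic UDP-based traceroute: UDP probes are sent with increasing TTL values, and the ICMP Time Exceeded replies from intermediate devices reveal each hop. This is a generic example (the destination port is the conventional traceroute default and reply matching is simplified), not AppNeta's traceroute implementation; the raw ICMP socket requires administrator privileges:

    import socket
    import time

    def traceroute(target, max_hops=30, port=33434, timeout=2.0):
        dest = socket.gethostbyname(target)
        hops = []
        for ttl in range(1, max_hops + 1):
            # Raw ICMP socket to catch Time Exceeded / Port Unreachable replies;
            # plain UDP socket to send the probe with a limited TTL.
            rx = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
            tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
            rx.settimeout(timeout)
            tx.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
            start = time.time()
            tx.sendto(b"", (dest, port))          # empty UDP probe
            try:
                # Matching the reply to this specific probe is omitted for
                # brevity; a real implementation would inspect the ICMP payload.
                _, addr = rx.recvfrom(512)
                rtt_ms = (time.time() - start) * 1000
                hops.append((ttl, addr[0], rtt_ms))
                if addr[0] == dest:               # reached the target itself
                    break
            except socket.timeout:
                hops.append((ttl, None, None))    # this hop did not reply
            finally:
                tx.close()
                rx.close()
        return hops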

QoS

Networks today handle many different traffic types including file transfers, WWW browsing, email, VoIP, video conferencing, and streaming media, each with different characteristics and requirements. For example, file transfers must not lose data but delays between packets in the transfer are not a problem. On the other hand, VoIP traffic is very sensitive to delay and jitter - the variation in packet delay. Quality of Service (QoS) is the mechanism used to manage packet loss, delay, and jitter by categorizing traffic types and then handling them appropriately.

Using QoS, traffic flows can be prioritized such that, for example, delay-sensitive traffic can be allocated dedicated bandwidth and separate queuing on a network device so that it passes through the device more quickly than delay-insensitive traffic. To this end, different traffic types can be marked with Differentiated Services Code Point (DSCP) values so that they can be categorized and handled appropriately by network devices.

Some DSCP markings have agreed-upon meanings and others do not. For example, DSCP 0 (the default value) means forward with “best effort”, whereas DSCP 46 (EF) means “high priority expedited forwarding”. However, it is up to the network administrators responsible for individual network devices to configure their devices to respect the different markings and treat them appropriately. Because honoring these values is not mandatory, there can be variations in how traffic is handled at various hops along a network path through the Internet. In some cases, DSCP markings are even changed as packets pass through a hop. This is a potential cause of poor quality for delay-sensitive traffic.

Specifying DSCP markings on test packets sent by TruPath enables you to determine how traffic that uses those markings is treated by the network. In addition to seeing potentially different performance metrics than traffic with unmarked packets, you will also be able to see which hops (if any) are changing the markings. You can specify the test packet DSCP markings during network path creation.
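
For reference, marking a packet typically amounts to setting the DS field of its IP header. The sketch below shows one common way to do this for a UDP packet on most platforms; the target address and port are hypothetical placeholders, and this illustrates the mechanism rather than how TruPath sets its markings:

    import socket

    DSCP_EF = 46    # Expedited Forwarding

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # DSCP occupies the upper six bits of the ToS/DS byte, so shift left by two.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF << 2)
    sock.sendto(b"probe", ("192.0.2.10", 5000))   # hypothetical target and port
    sock.close()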

Data and voice traffic

Delivery monitoring provides tools to evaluate network performance for both data and voice traffic. The primary difference between the two is that voice traffic has smaller payloads with wider packet spacing. The exact signature of test packet sequences sent for voice measurements depends on the VoIP codec selected during network path configuration. By default, TruPath collects measurements using both data and voice test packets. It is also possible to change this (by changing the Target Type) when a network path is created or after it is created.
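
As a concrete illustration of the difference, consider a common narrowband codec such as G.711, which encodes audio at 64 kbps and is typically packetized every 20 ms (the numbers below follow from those assumptions):

    bitrate_bps     = 64_000     # G.711 audio bit rate
    packet_period_s = 0.020      # 20 ms of audio per packet
    payload_bytes   = bitrate_bps * packet_period_s / 8   # 160 bytes of audio payload
    packets_per_sec = 1 / packet_period_s                 # 50 small packets per second

The codec selected during path configuration determines these numbers, which is why the signature of voice test packet sequences varies with the codec.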

Network performance metrics

TruPath provides a number of network performance metrics that are displayed as charts in APM.

Round-trip time and Latency

Probably the most basic Delivery metrics collected are round-trip time (RTT) and latency. RTT is the time it takes for a packet to go from a source to a target and back. Latency, the time it takes for a packet to go from a source to a target, is calculated as one half of the RTT of the fastest packet in a packet train.
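
For example, given the round-trip times measured for one packet train (hypothetical values below), the reported latency would be derived from the fastest packet:

    rtts_ms    = [42.1, 40.8, 45.6, 41.2, 44.0]   # hypothetical per-packet RTTs
    rtt_ms     = min(rtts_ms)                      # fastest packet: 40.8 ms
    latency_ms = rtt_ms / 2                        # one-way latency: 20.4 ms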

High latency values have a detrimental effect on applications that use TCP and time-sensitive applications that use UDP. For TCP, the effect of latency is compounded due to the way its congestion control mechanism works. This results in a major decrease in TCP throughput. Modern video streaming services like YouTube and Netflix use TCP. For time-sensitive applications that use UDP (for example, real-time voice and video streaming and Voice over IP (VoIP)), large latencies can introduce both conversational difficulty and packet loss.

There are several ways latency gets introduced into your data stream. The first is propagation delay. This is the time it takes for a signal to propagate across a link between one device and another. In general, the farther apart the devices are, the greater the propagation delay. The second is queuing delay. Queuing delay is introduced when a network device is congested and can’t route incoming packets immediately upon ingress. Finally, there is handling delay. This is the time it takes to put a packet on the wire. Generally, this is negligible compared to the other two.
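
The rough magnitudes below (assumed values, for illustration only) show how these three delay sources compare for a 1000 km fiber link with a 10 Mbps bottleneck:

    distance_km    = 1000
    signal_km_s    = 200_000                           # ~2/3 the speed of light in fiber
    propagation_ms = distance_km / signal_km_s * 1000  # ≈ 5 ms

    packet_bits = 1500 * 8
    link_bps    = 10_000_000                           # 10 Mbps link
    handling_ms = packet_bits / link_bps * 1000        # ≈ 1.2 ms to put the packet on the wire

    # Queuing delay depends on congestion at each device and can range from
    # near zero to many milliseconds; it is the most variable of the three.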

Data and voice jitter

Jitter, also known as packet delay variation, is a measure of variation in latency. Jitter affects time-sensitive applications that use UDP but does not affect applications using TCP. For example, real-time voice and video streaming are affected by jitter because each packet produced contains a tiny sample of the source media. In order to accurately recreate the media at the receiving end, those packets must arrive at a constant rate and in the correct order. If they do not, the audio may be garbled, or the video may be fuzzy or freeze. All networks introduce some jitter because each packet in a single data stream can potentially experience different network conditions. For example, packets can take different paths or experience different queuing delays. Severe jitter is almost always caused by network congestion, a lack of QoS configuration, or misconfigured QoS.

Delivery monitoring provides both data and voice jitter metrics. Data jitter is that measured when packet trains emulating data traffic are used. Voice jitter is that measured when packet trains emulating voice traffic (smaller payloads with wider packet spacing) are used.
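
One common way to quantify jitter is the average absolute difference between the latencies of consecutive packets. The sketch below uses hypothetical values and illustrates the concept, not necessarily the exact statistic Delivery monitoring reports:

    latencies_ms = [20.4, 21.1, 20.2, 23.8, 20.6]   # hypothetical one-way delays
    diffs        = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    jitter_ms    = sum(diffs) / len(diffs)           # 2.1 ms for these samples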

Data and voice loss

Packet loss, whether the packets are data or voice, is simply a measure of the number of packets that did not make it to their intended destination. Packet loss can occur for a variety of reasons including traffic congestion along the network path, an overloaded network device, bad physical media, flapping routes, flapping load balancing, and name resolution issues.

The effect of packet loss can range from insignificant to critical depending on its severity and the application experiencing it. For example, with applications that use TCP, light data loss will generally go unnoticed because TCP detects the issue and has the lost packets retransmitted. That said, heavy data loss can cause many retransmissions and can significantly impact throughput; users would notice slow response times. For applications that use UDP (VoIP, for example), lost packets are not retransmitted, so the loss may or may not have a significant effect on the conversation depending on how much loss is experienced.

Delivery monitoring provides both data and voice loss metrics. Data loss is that measured when packet trains emulating data traffic are used. Voice loss is that measured when packet trains emulating voice traffic (smaller payloads with wider packet spacing) are used.

Capacity

Total capacity is the highest transmission rate that you can achieve between a sender and a receiver. In Delivery monitoring, total capacity is calculated based on measurements from numerous test iterations containing various packet patterns. The calculation takes into account variations in latency and cross traffic. Available capacity is the part of the total capacity that is available for use. Utilized capacity is the part of the total capacity that is in use.

Note that capacity and bandwidth are different for our purposes. Bandwidth is the transmission rate of the physical media. It is the number quoted by your ISP but it does not take into consideration any protocol overhead or queuing delays. Capacity, on the other hand, takes these into consideration. Given this, the bandwidth number is typically higher than capacity. Capacity, however, is a better representation of how application data experiences the network.

Measuring capacity

TruPath uses packet dispersion analysis to calculate capacity. To understand how this works, imagine two packets of equal size sent back-to-back with no other traffic on the line. We’re interested in the spacing between those packets by the time they reach the target. The packet dispersion is the time between the arrival of the last byte of the first packet and the last byte of the second packet.

To calculate the total capacity of the path in bits per second, we divide the packet size (in bits) by the dispersion (in seconds). The dispersion value used to determine total capacity is the minimum dispersion observed over a series of packet trains. Available capacity is calculated using the average dispersion over a series of packet trains, taking lost packets into consideration. Utilized capacity is the total capacity minus the available capacity.

These calculations are for dual-ended paths. For single-ended paths the one-way dispersion is half the difference of the round-trip times of each ping/response pair.
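
A worked example with assumed numbers makes the relationship concrete (the loss adjustment to available capacity is omitted for simplicity):

    packet_bits = 1500 * 8        # 1500-byte test packets
    min_disp_s  = 0.000120        # minimum dispersion observed: 120 µs
    avg_disp_s  = 0.000160        # average dispersion observed: 160 µs

    total_capacity_bps     = packet_bits / min_disp_s                      # 100 Mbps
    available_capacity_bps = packet_bits / avg_disp_s                      # 75 Mbps
    utilized_capacity_bps  = total_capacity_bps - available_capacity_bps   # 25 Mbps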

The measurement process is iterative:

  • Measurements are taken every minute.
  • Each iteration consists of multiple packet trains, each with a specific packet size and number of packets (up to 50) per train.
    • Sending multiple packet trains reduces the effect of packet loss.
    • Sending large packets guarantees queuing. If packets did not get queued, there would be no dispersion and capacity would be overestimated.
  • The initial packet size is the path MTU (PMTU).
    • PMTU is the largest packet size a path can handle without fragmentation.
  • APM considers the best option for total capacity measurement to be a packet train with the largest packet size that experiences no packet loss.

Anything that affects the round-trip time of test packets, including low-bandwidth links, congested links, and operating system effects, is accounted for in the capacity measurement. These are the same effects your application data experiences.

Mean Opinion Score (MOS)

The Mean Opinion Score (MOS) is an estimate of the rating a typical user would give to the sound quality of a call. It is expressed on a scale of 1 to 5, where 5 is perfect. It is a function of loss, latency, and jitter. It also varies with voice codec and call load. If audio codec G.722.1 is selected for a session, an MOS score will not appear.
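
For context, the sketch below shows the standard E-model conversion from an R-factor to a MOS (ITU-T G.107). The R-factor itself is derived from impairments due to loss, latency, jitter, codec, and call load; how APM computes it is not shown here, and this generic mapping is an illustration rather than AppNeta's exact formula:

    def r_to_mos(r):
        # Standard E-model mapping from R-factor (0-100) to MOS (1-4.5).
        if r <= 0:
            return 1.0
        if r >= 100:
            return 4.5
        return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

    r_to_mos(93.2)   # ≈ 4.4, roughly the best achievable for a narrowband codec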