To understand how Delivery monitoring works, there are a few topics you should be acquainted with. These include:

  • Network traffic - describes how traffic flows through a network.
  • TruPath - describes how TruPath, AppNeta’s network performance monitoring technology, works.
  • ICMP, UDP, and TCP - describes the three protocols used to measure network performance.
  • QoS - describes what QoS is and how it affects traffic flows.
  • Data and voice traffic - describes the difference between data and voice traffic from a performance monitoring perspective.
  • Network performance metrics - describes the various performance metrics that are collected and calculated.

Network traffic

As data packets flow through a network between two endpoints, they encounter a number of network devices (for example, routers, switches, firewalls, and load balancers). Each of these devices needs to store and forward the packets as they are passed along. The amount of traffic passing through a device at a given time, and the priority of that traffic, determine how long (if at all) a given packet will be queued on that device. Once a queue starts to fill, packets on that queue wait their turn to be forwarded. This waiting causes a delay between packets when they are received at their destination. If there is so much traffic on a network device that one of its queues fills up, any additional packets destined for that queue are dropped - causing data loss. The rate at which a given queue can repeatedly fill and drain without data loss is effectively the maximum capacity of the device for traffic using that queue. The lowest capacity device on the network path between the two endpoints determines the capacity of that path.
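
To make the queuing behavior concrete, here is a minimal sketch with hypothetical numbers (not AppNeta code) showing how a device that receives packets faster than it can forward them first delays traffic and then, once its queue is full, drops it:

```python
from collections import deque

FORWARD_RATE = 10     # packets this device can forward per time slot (hypothetical)
QUEUE_LIMIT = 100     # packets the device can queue before dropping (hypothetical)

queue = deque()
dropped = 0

# Offer 15 packets per time slot to a device that can only forward 10.
for time_slot in range(100):
    for _ in range(15):                      # arriving packets
        if len(queue) < QUEUE_LIMIT:
            queue.append(time_slot)          # queued: waits its turn (delay)
        else:
            dropped += 1                     # queue full: packet dropped (data loss)
    for _ in range(min(FORWARD_RATE, len(queue))):
        queue.popleft()                      # forwarded this time slot

print(f"still queued: {len(queue)}, dropped: {dropped}")

# The path's capacity is set by its lowest-capacity device:
device_rates_mbps = [1000, 100, 50, 1000]    # hypothetical per-device capacities
path_capacity_mbps = min(device_rates_mbps)  # 50 Mbps
```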

TruPath™

TruPath™, AppNeta’s patented network performance monitoring technology, is the heart of Delivery monitoring. With minimal network load or impact, it allows you to continuously monitor your network to quickly detect problems and see changes over time.

TruPath probes a network using short bursts of packets (called “packet trains”) and waits for the replies. It uses information like the time the packets take to go from a source to a target and back, the delay between packets on their return, packet reordering, and the number of packets lost, to directly measure key network performance metrics (Round-trip time (RTT), Latency, Jitter, and Data loss), and to infer others (Total and Utilized capacity). At the same time, it can determine if there are Quality of Service (QoS) changes along the network path. All of this network performance information is sent to APM for analysis and presentation.

To obtain this information, TruPath employs two distinct instrumentation modes: Continuous Path Analysis™ (CPA) and Deep Path Analysis™ (DPA), also known as Diagnostics. CPA mode runs continuously and, every 60 seconds (by default), places roughly 20-50 packets onto the network to the target and analyzes the replies. If a network dysfunction is detected (for example, higher than acceptable data loss), TruPath first confirms the dysfunction is present (by sampling every six seconds for ten samples) and then, once it is confirmed, automatically shifts to DPA mode and runs a diagnostic test that probes not only the target, but all devices on the network path from the source to the target. In this mode, as many as 400-2000 packets can be sent in a series of packet trains in order to delve into the cause of the performance issue. As part of the diagnostic test (and every five minutes during CPA), the route taken by each protocol type (ICMP, UDP, and TCP) is determined. All the information collected in both CPA and DPA modes is sent back to APM.
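
The shift from CPA to DPA can be pictured as a simple control loop built from the intervals described above. This is an illustrative sketch only, not AppNeta's implementation; the loss threshold and the confirmation rule (averaging the samples) are assumptions:

```python
import random
import time

LOSS_THRESHOLD = 0.05      # assumed "higher than acceptable" loss rate (5%); illustrative
CPA_INTERVAL_S = 60        # default CPA sampling interval
CONFIRM_INTERVAL_S = 6     # confirmation sampling interval
CONFIRM_SAMPLES = 10       # confirmation sample count

def send_packet_train():
    """Stand-in for a CPA probe: returns the fraction of packets lost."""
    return random.random() * 0.1   # simulated loss, for illustration only

def run_diagnostics():
    """Stand-in for a DPA diagnostic test that probes every hop on the path."""
    print("DPA: running diagnostic test against all devices on the path")

def monitor(cycles=3):
    for _ in range(cycles):
        if send_packet_train() > LOSS_THRESHOLD:           # CPA detects a dysfunction
            # Confirm before escalating: sample every 6 seconds, 10 samples.
            samples = []
            for _ in range(CONFIRM_SAMPLES):
                samples.append(send_packet_train())
                time.sleep(CONFIRM_INTERVAL_S)
            if sum(samples) / len(samples) > LOSS_THRESHOLD:   # assumed confirmation rule
                run_diagnostics()                          # shift to DPA mode
        time.sleep(CPA_INTERVAL_S)
```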

Commonly used packet sequences are 1, 5, 10, 20, 30, and 50 packets in length. Because the packet sequences are very short, the overall load on the network is kept very low, typically averaging 2 Kbps for CPA and only 10-200 Kbps during a DPA diagnostic test. For very low-speed links or networks with other restrictions, such as a small maximum MTU size, TruPath automatically adjusts its traffic loads to minimize network impact even further.

Single-ended and dual-ended network paths

TruPath can be employed in either a single-ended configuration or a dual-ended configuration. In the single-ended configuration, a single AppNeta Monitoring Point is required to run the TruPath software. It acts as one endpoint (the source) of the network path being monitored. The other endpoint (the target) can be any TCP/IP device. ICMP echo requests are sent to the target and ICMP echo replies are returned. The advantage of a single-ended configuration is that only one Monitoring Point is required. The disadvantage is that any network characteristics that are direction dependent (for example, the differences in capacity in each direction on an asymmetrically provisioned link) cannot be detected.

Diagram showing packets of variable size and spacing traversing a single-ended path, ending up missing and unordered.

Figure 1: Single-ended configuration

In the dual-ended configuration, one Monitoring Point is the source and another is the target and UDP packets (rather than ICMP packets) are used for monitoring. The advantage of dual-ended paths is that you get a more accurate picture of your network performance as independent measurements are taken from source to target and from target to source. This enables you to determine, for example, the network capacity in each direction. The target Monitoring Point can be one of your AppNeta Enterprise Monitoring Points or an AppNeta WAN Target, depending on your needs.

Diagram showing packets of variable size and spacing traversing a dual-ended path in both directions. The reverse direction is initiated by the target Monitoring Point.

Figure 2: Dual-ended configuration

ICMP, UDP, and TCP

TruPath uses three common protocols to gather network performance metrics: ICMP, UDP, and TCP.

ICMP is a control message protocol used by network devices to send error messages and operational information. It is not typically used to transfer data between systems. AppNeta Monitoring Points running Delivery monitoring use ICMP echo request and echo response packets (“ping” packets) for the majority of continuous monitoring - collecting network performance metrics. ICMP is also used to expose QoS marking changes during diagnostic tests, and as part of a traceroute for determining the route ICMP packets take from a source Monitoring Point to a target.

UDP is a core internet protocol used to transport data. It is connectionless with very little protocol overhead. Other than checksums for data integrity checking, there is no error checking. Also, there is no guarantee of packet delivery, packet ordering, or duplicate protection. UDP is used for applications where error checking and correction are either not necessary or are provided by higher-level protocols or the application using it. It is typically used by time-sensitive applications where losing packets is preferable to spending time retransmitting lost packets. Applications that use UDP include real-time voice and video streaming and Voice over IP (VoIP). Delivery monitoring uses UDP for continuous monitoring on dual-ended paths, to expose QoS marking changes during diagnostic tests, and as part of a traceroute for determining the route UDP packets take from a source Monitoring Point to a target.

TCP is also a core internet protocol used to transport data. It differs from UDP in a few ways. It is a connection-oriented protocol that provides a reliable, ordered, and error-checked way to transfer data over an IP network. It is used for applications where reliability is more important than reduced latency. Applications that use TCP include WWW browsing, file transfer, email, and video streaming services like YouTube and Netflix. Delivery monitoring uses TCP as part of a traceroute for determining the route TCP packets take from a source Monitoring Point to a target.
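
To illustrate how a route can be determined per protocol, the sketch below performs a TTL-based traceroute using the scapy library (assumed to be installed; raw packet access typically requires root privileges). It is not AppNeta's implementation, but it shows why the route is traced separately for ICMP, UDP, and TCP - devices along the path may treat each protocol differently:

```python
import socket
from scapy.all import IP, ICMP, UDP, TCP, sr1   # requires scapy and root privileges

def trace(target, probe, max_hops=20):
    """Return hop addresses toward target for the given probe layer (ICMP/UDP/TCP)."""
    dst = socket.gethostbyname(target)
    hops = []
    for ttl in range(1, max_hops + 1):
        reply = sr1(IP(dst=dst, ttl=ttl) / probe, timeout=2, verbose=0)
        if reply is None:
            hops.append("*")          # no reply at this TTL
        else:
            hops.append(reply.src)
            if reply.src == dst:      # reached the target
                break
    return hops

# The same destination traced with each protocol; devices along the way may
# route, prioritize, or filter each protocol differently.
print(trace("example.com", ICMP()))                     # ICMP echo requests
print(trace("example.com", UDP(dport=33434)))           # UDP (classic traceroute port)
print(trace("example.com", TCP(dport=80, flags="S")))   # TCP SYN probes
```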

QoS

Networks today handle many different traffic types including file transfers, WWW browsing, email, VoIP, video conferencing, and streaming media, each with different characteristics and requirements. For example, file transfers must not lose data but delays between packets in the transfer are not a problem. On the other hand, VoIP traffic is very sensitive to delay and jitter - the variation in packet delay. Quality of Service (QoS) is the mechanism used to manage packet loss, delay, and jitter by categorizing traffic types and then handling them appropriately.

Using QoS, traffic flows can be prioritized such that, for example, delay-sensitive traffic can be allocated dedicated bandwidth and separate queuing on a network device so that it passes through the device more quickly than delay-insensitive traffic. To this end, different traffic types can be marked with Differentiated Services Code Point (DSCP) values so that they can be categorized and handled appropriately by network devices.

Some DSCP markings have agreed-upon meanings and others do not. For example, DSCP 0 (the default value) means forward with “best effort”, whereas DSCP 46 (EF) means “high-priority expedited forwarding”. However, it is up to the network administrators responsible for individual network devices to configure their devices to respect the different markings and treat them appropriately. Because honoring these values is not mandatory, there can be variations in how traffic is handled at various hops along a network path through the internet. In some cases, DSCP markings are even changed as packets pass through a hop. This is a potential cause of poor quality for delay-sensitive traffic.

Specifying DSCP markings on test packets sent by TruPath enables you to determine how traffic that uses those markings is treated by the network. In addition to seeing potentially different performance metrics than traffic with unmarked packets, you will also be able to see which hops (if any) are changing the markings. You can configure the test packet DSCP markings on a network path in a number of ways.
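
One of the simplest illustrations of DSCP marking in practice (not how TruPath itself is configured - path DSCP settings are made in APM) is marking packets sent from an ordinary UDP socket. On Linux, the DSCP value occupies the upper six bits of the IP TOS byte:

```python
import socket

DSCP_EF = 46                      # Expedited Forwarding (high-priority)
TOS = DSCP_EF << 2                # DSCP sits in the top six bits of the TOS byte

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS)   # Linux; other OSes differ

# Packets sent from this socket now carry DSCP 46 and are queued according to
# however each device along the path is configured to treat that marking.
sock.sendto(b"probe", ("192.0.2.10", 3239))   # hypothetical target address and port
```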

Data and voice traffic

Delivery monitoring provides tools to evaluate network performance for both data and voice traffic. The primary difference between the two is that voice traffic has smaller payloads with wider packet spacing. The exact signature of test packet sequences sent for voice measurements depends on the VoIP codec selected during network path configuration. By default, TruPath collects measurements using both data and voice test packets. It is also possible to change this (by changing the Target Type) when a network path is created or after it is created.
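
For example, a codec such as G.711 produces a 64 kbps audio stream that is typically packetized every 20 ms, which fixes both the payload size and the packet spacing. A quick illustrative calculation (the exact test signature TruPath uses depends on the codec selected in APM):

```python
# G.711 example: 64 kbps audio, packetized every 20 ms.
codec_rate_bps = 64_000
packetization_ms = 20

payload_bytes = codec_rate_bps / 8 * (packetization_ms / 1000)   # 160 bytes of audio per packet
packets_per_second = 1000 / packetization_ms                     # 50 packets per second

print(payload_bytes, packets_per_second)   # small payloads, widely spaced packets
```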

Network performance metrics

TruPath provides a number of network performance metrics that are displayed as charts in APM. They can be classified as either primary metrics or derived indicators:

Data and voice loss

Packet loss, whether the packets are data or voice, is simply a measure of the number of packets that did not make it to their intended destination. Packet loss can occur for a variety of reasons including traffic congestion along the network path, an overloaded network device, bad physical media, flapping routes, flapping load balancing, and name resolution issues.

The effect of packet loss can range from insignificant to critical depending on its severity and the application that is experiencing it. For example, with applications that use TCP, light data loss will generally go unnoticed because TCP detects the issue and has the lost packets retransmitted. That said, heavy data loss can cause many retransmissions and can significantly impact throughput; users would notice slow response times. For applications that use UDP (VoIP, for example), lost packets are not retransmitted, so the loss may or may not have a significant effect on the conversation depending on how much loss is experienced.

Delivery monitoring provides both data and voice loss metrics. Data loss is that measured when packet trains emulating data traffic are used. Voice loss is that measured when packet trains emulating voice traffic (smaller payloads with wider packet spacing) are used.

Round-trip time (RTT) and Latency

Probably the most basic Delivery metrics collected are round-trip time (RTT) and latency. RTT is the time it takes for a packet to go from a source to a target and back (the RTT chart in APM shows the average RTT over the selected time period). Latency, the time it takes for a packet to go from a source to a target, is calculated as one half of the RTT of the fastest packet in a packet train.
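
A minimal sketch of these two calculations over a single packet train (hypothetical values, not AppNeta's implementation):

```python
# Round-trip times (seconds) measured for one packet train (hypothetical values).
rtts = [0.052, 0.047, 0.049, 0.061, 0.048]

avg_rtt = sum(rtts) / len(rtts)   # what the RTT chart aggregates over a time period
latency = min(rtts) / 2           # half the RTT of the fastest packet in the train

print(f"average RTT: {avg_rtt * 1000:.1f} ms, latency: {latency * 1000:.1f} ms")
```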

High latency values have a detrimental effect on applications that use TCP and time-sensitive applications that use UDP. For TCP, the effect of latency is compounded due to the way its congestion control mechanism works. This results in a major decrease in TCP throughput. Modern video streaming services like YouTube and Netflix use TCP. For time-sensitive applications that use UDP (for example, real-time voice and video streaming and Voice over IP (VoIP)), large latencies can introduce both conversational difficulty and packet loss.

There are several ways latency gets introduced into your data stream. The first is propagation delay. This is the time it takes for a signal to propagate across a link between one device and another. In general, the farther apart the devices are, the greater the propagation delay. The second is queuing delay. Queuing delay is introduced when a network device is congested and can’t route incoming packets immediately upon ingress. Finally, there is handling delay. This is the time it takes to put a packet on the wire. Generally, this is negligible compared to the other two.
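
To put rough numbers on propagation delay (an illustrative calculation only): a signal travels through optical fiber at roughly two thirds the speed of light in a vacuum, about 200,000 km/s, so distance alone sets a floor on latency.

```python
# Approximate one-way propagation delay over optical fiber.
FIBER_SPEED_KM_PER_S = 200_000          # ~2/3 the speed of light in a vacuum

def propagation_delay_ms(distance_km):
    return distance_km / FIBER_SPEED_KM_PER_S * 1000

print(propagation_delay_ms(4000))   # ~20 ms one way for a 4,000 km link
print(propagation_delay_ms(100))    # ~0.5 ms one way for a 100 km link
```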

Data and voice jitter

Jitter, also known as packet delay variation, is a measure of variation in latency. Jitter affects time-sensitive applications that use UDP but does not affect applications using TCP. For example, real-time voice and video streaming are affected by jitter because each packet produced contains a tiny sample of the source media. In order to accurately recreate the media at the receiver end, those packets must arrive at a constant rate and in the correct order. If they do not, the audio may be garbled, or the video may be fuzzy or freeze. All networks introduce some jitter because each packet in a single data stream can potentially experience different network conditions. For example, packets can take different paths or experience different queuing delays. Severe jitter is almost always caused by network congestion, a lack of QoS configuration, or misconfigured QoS.
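
A common way to quantify jitter is as the average variation in delay between consecutive packets. The sketch below uses that simple definition with hypothetical values; RFC 3550 defines a smoothed variant used by RTP, and AppNeta's exact calculation may differ:

```python
# One-way delays (ms) observed for consecutive packets in a stream (hypothetical).
delays_ms = [20.1, 20.3, 19.8, 25.6, 20.2, 20.0]

# Jitter as the mean absolute difference between consecutive packet delays.
diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
jitter_ms = sum(diffs) / len(diffs)

print(f"jitter: {jitter_ms:.2f} ms")   # ~2.4 ms in this example
```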

Delivery monitoring provides both data and voice jitter metrics. Data jitter is that measured when packet trains emulating data traffic are used. Voice jitter is that measured when packet trains emulating voice traffic (smaller payloads with wider packet spacing) are used.

Capacity

There are a number of capacity-related metrics. They are derived from measurements taken over numerous test iterations containing various packet patterns.

  • Total capacity - the peak transmission rate observed in the last three hours by TruPath. The calculation takes into account variations in latency and cross traffic.
  • Available capacity - the part of the Total capacity that is available for use.
  • Utilized capacity - the part of the Total capacity that is in use. It is calculated as Total capacity minus Available capacity.

Note that capacity and bandwidth are different for our purposes. Bandwidth is the transmission rate of the physical media. For internet connections, it is the number quoted by your ISP. Capacity is an end-to-end measurement (from the source Monitoring Point to the target) and is determined by the most constricted part of the path from source to target. Given this, the bandwidth number is typically higher than capacity. Capacity, however, is a better representation of how application data experiences the network.

How does AppNeta measure capacity?

In order to provide continuous measurements with minimal impact, TruPath uses packet dispersion analysis to calculate capacity rather than loading the path like PathTest or a Speedtest. To understand how this works, imagine two packets of equal size are sent back-to-back with no other traffic on the line. We’re interested in the distance between those packets by the time they reach the target. The packet dispersion is the time between the arrival of the last byte of the first packet and the last byte of the second packet.

We calculate capacity (in bits per second) as follows (see the sketch after this list):

  • Total capacity - Divide the packet size (in bits) by the dispersion (in seconds). The dispersion value used is the minimum dispersion observed over a series of packet trains.
  • Available capacity - Divide the packet size (in bits) by the average dispersion of a series of packet trains, taking lost packets into consideration in the calculation.
  • Utilized capacity - The Total capacity minus the Available capacity.
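
A minimal sketch of these calculations from dispersion measurements (hypothetical numbers; loss handling omitted; not AppNeta's implementation):

```python
# Dispersion values (seconds) between back-to-back 1500-byte packets,
# collected over a series of packet trains (hypothetical measurements).
packet_size_bits = 1500 * 8
dispersions_s = [0.00013, 0.00012, 0.00015, 0.00014, 0.00016]

total_capacity_bps = packet_size_bits / min(dispersions_s)                # minimum dispersion
avg_dispersion_s = sum(dispersions_s) / len(dispersions_s)                # loss ignored in this sketch
available_capacity_bps = packet_size_bits / avg_dispersion_s              # average dispersion
utilized_capacity_bps = total_capacity_bps - available_capacity_bps

print(f"total: {total_capacity_bps / 1e6:.0f} Mbps, "
      f"available: {available_capacity_bps / 1e6:.0f} Mbps, "
      f"utilized: {utilized_capacity_bps / 1e6:.0f} Mbps")
```
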
Measurement process

The measurement process is iterative and continuous:

  • Every minute, multiple packet trains are sent, each with a specific packet size and number of packets (up to 50) per train.
    • The initial packet size is the path MTU (PMTU) - the largest packet size a path can handle without fragmentation.
    • Sending large packets guarantees queuing. If packets did not get queued, there would be no dispersion and capacity would be overestimated.
    • Sending multiple packet trains reduces the effect of packet loss.
  • The Total capacity measurement is derived from the packet train with the largest packet size that experiences no packet loss.
  • Anything that affects the round-trip time of test packets, including low-bandwidth links, congested links, and operating system effects, is accounted for in the capacity measurement.

Why am I not seeing the capacity I expect?

If the capacity values being reported are not what you are expecting, there are several possibilities to consider.

Shared media environments

TruPath measures the capacity available to small bursts of traffic rather than that available to sustained traffic loads. This enables TruPath to run every minute without affecting business traffic; however, there are environments where this can produce capacity values different from what you might expect:

  • Subscriber access environments (for example, FTTH (Fiber), cable, satellite):
    • Higher than expected capacity on the downlink: this is often the case because providers allow small bursts to pass at higher rates than sustained traffic. For sustained traffic, providers ‘rate-limit’ bandwidth to what you have contracted for.
    • Lower than expected capacity on the uplink: this is often the case because providers only allocate bandwidth on demand, and if background demand is low, TruPath’s low demand will not trigger the allocation of more bandwidth.
  • Wireless environments - Because wireless is a shared, half-duplex medium, transmit and receive traffic must compete with each other, and all other stations sharing the local wireless LAN must compete for transmission time. As a result, the effective capacity measured by TruPath tends to be lower than the instantaneous data rates supported by the Wi-Fi equipment.

To see the capacity under a sustained traffic load, you can use PathTest or a Speedtest.

Capacity and bandwidth are different

Bandwidth is the transmission rate of the physical media link between your site and your ISP. The bandwidth number is what the ISP quotes you. Capacity is the end-to-end network layer measurement of a network path - from a source to a target. Link-layer headers and framing overhead reduce the rated bandwidth to a theoretical maximum capacity. This maximum is different for every network technology. Further reducing capacity is the fact that NICs, routers, and switches are sometimes unable to saturate the network path, and therefore the theoretical maximum can't be achieved. 'Saturate' means the ability to transmit packets at line rate without any gaps between them. All switches can run at line rate for the length of time that a packet is being sent, but some are unable to send the next packet without any rest in between. This determines the 'switch capacity'. APM provides a range for Total capacity that you can expect given the physical medium and modern equipment with good switching capacity.

The following table shows the expected capacity of various link types:

Standard | Standard link speed | L1 + L2 overhead | Theoretical total capacity | Optimal total capacity
DS0 or ISDN | 64 Kbps | 3.9% | 61.5 Kbps | 61.5 Kbps
ISDN dual channel | 128 Kbps | 3.9% | 123 Kbps | 123 Kbps
T1 (HDLC+ATM) | 1.544 Mbps | 11.6% | 1.365 Mbps | 1.325-1.375 Mbps
T1 (HDLC) | 1.544 Mbps | 3.5% | 1.49 Mbps | 1.40-1.49 Mbps
E1 | 2.0 Mbps | 3.5% | 1.93 Mbps | 1.86-1.95 Mbps
T3 | 45 Mbps | 3.5% | 43.425 Mbps | 42.50-43.45 Mbps
10M Ethernet half-duplex | 10 Mbps | 2.5% | 4.875 Mbps | 4.8-4.9 Mbps
10M Ethernet full-duplex | 10 Mbps | 2.5% | 9.75 Mbps | 9.7-9.8 Mbps
100M Ethernet half-duplex | 100 Mbps | 2.5% | 48.75 Mbps | 48.5-49.0 Mbps
100M Ethernet full-duplex | 100 Mbps | 2.5% | 97.5 Mbps | 90-97.5 Mbps
Gigabit Ethernet full-duplex | 1 Gbps | 2.5% | 975 Mbps | 600-900 Mbps
Note: Total capacity is based on the assumption that traffic will flow in both directions. Therefore, you can expect the total capacity for half-duplex links to be roughly half of what it would be with full-duplex.
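
The theoretical values in the table follow directly from the overhead percentage plus, for half-duplex links, the halving described in the note. An illustrative check against a few table rows:

```python
def theoretical_capacity_mbps(link_speed_mbps, overhead_pct, half_duplex=False):
    """Link speed reduced by L1 + L2 overhead; halved again for half-duplex links."""
    capacity = link_speed_mbps * (1 - overhead_pct / 100)
    return capacity / 2 if half_duplex else capacity

print(theoretical_capacity_mbps(10, 2.5, half_duplex=True))   # 4.875 Mbps (10M Ethernet half-duplex)
print(theoretical_capacity_mbps(100, 2.5))                    # 97.5 Mbps (100M Ethernet full-duplex)
print(theoretical_capacity_mbps(1.544, 3.5))                  # ~1.49 Mbps (T1, HDLC)
```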

Consider the target

Some devices make better targets than others. Choosing a good target is important in order to get good measurements.

Asymmetric links, if measured using single-ended paths, will show the capacity of the slower of the uplink and downlink directions. This can be misleading. Measuring a link using a dual-ended path will show the capacity of each direction. If you are unsure whether you have an asymmetric link, setting up a dual-ended path (for example, to an AppNeta WAN Target) will allow you to determine this.

Persistent low capacity condition

When a low capacity condition is persistent rather than transient, it is caused by a network bottleneck, not by congestion. The bottleneck can be at any point on the path between the source and the target. To determine if the link to your ISP is the bottleneck, create an additional network path to an AppNeta WAN target. Using Route Analysis, confirm that the only common part of the two paths is to the ISP. If this is the case, and the capacity measurements are the same, then the bottleneck is likely the link to your ISP. Otherwise, the bottleneck is somewhere else on the path.

Capacity chart shows no capacity

This can be due to sustained packet loss. See Packet loss is present.

Packet loss is present

Capacity is measured by sending multiple bursts of back-to-back packets every minute (as described in TruPath). To measure total capacity, at least one burst must come back with zero packet loss. If that is not the case, then the capacity measurement is skipped for that interval. If packet loss is intermittent, the result is a choppy Capacity chart. If packet loss is sustained, the Capacity chart will show no capacity while the packet loss is present.

Confirm with PathTest

If none of the previous subsections is applicable to your situation, you can use PathTest or a Speedtest to corroborate the capacity measurements. Remember that these are load tests and they measure bandwidth, not capacity.

  • If the PathTest or Speedtest results support the capacity measurements, it is possible that you’re not getting the proper provisioning from your ISP.
  • If the PathTest or Speedtest results do not support the capacity measurements, contact AppNeta Support so we can help you investigate further.

To see what capacity measurements in a shared media environment might look like, consider a path from a Monitoring Point in a home office with a fiber optic (PON) internet connection (where the subscribed bandwidth is 150 Mbps up and 150 Mbps down) to an AppNeta WAN target. The following screenshot shows the capacity charts for this example:

Screenshot of Capacity charts with the outbound side showing a Total capacity of 30 Mbps and the inbound side showing a Total capacity of close to 500 Mbps.

Where we’d expect to see a Total capacity of 150 Mbps up and 150 Mbps down, we actually see about 30 Mbps up (left side) and 489 Mbps down (right side).

The Total capacity on the uplink is lower than 150 Mbps because bandwidth is dynamically assigned as required on this type of link and, in this case, the link is being lightly used and so does not require more.

The Total capacity on the downlink is higher than 150 Mbps because there is relatively little sustained downlink traffic. The downlink is a higher capacity pipe (possibly 2.4 Gbps or 10 Gbps in this case) shared between several subscribers. Unlike the uplink, the downlink will have bandwidth assigned all the time, it’s just split between subscribers. However, to prevent one subscriber from dominating for too long there is a rate limiter. The rate limiter allows some traffic through at the peak rate (~500 Mbps in this case) and only “kicks in” if a higher traffic load is sustained. AppNeta’s test traffic will tend to “slip” through the rate limiter and measure the peak available bandwidth rather than the rate limited bandwidth.

In order to test the uplink and downlink capacity of a link with a sustained load, you can use PathTest or a Speedtest. For example, running PathTest in the fiber optic environment above, we see the results we’d expect. In this case, the measured bandwidth is close to 180 Mbps in each direction:

Screenshot of PathTest results showing about 180 Mbps in each direction.

For cases where you need to measure capacity over time, you can use rate-limited monitoring. Rate-limited monitoring is similar to PathTest in that it loads the network while testing, but instead of a single measurement, it makes measurements at regular intervals over time. Contact AppNeta Support to enable rate-limited monitoring.

Mean Opinion Score (MOS)

The Mean Opinion Score (MOS) is an estimate of the rating a typical user would give to the sound quality of a call. It is expressed on a scale of 1 to 5, where 5 is perfect. It is a function of loss, latency, and jitter. It also varies with voice codec and call load.
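
There are several ways to estimate MOS from these inputs. The sketch below uses a commonly cited simplification of the ITU-T E-model R-factor; it is illustrative only and is not the algorithm APM uses:

```python
def estimate_mos(latency_ms, jitter_ms, loss_pct):
    """Simplified E-model approximation: R-factor from delay and loss, then MOS."""
    effective_latency = latency_ms + 2 * jitter_ms + 10        # weight jitter plus fixed overhead
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40
    else:
        r = 93.2 - (effective_latency - 120) / 10
    r -= 2.5 * loss_pct                                         # penalty per percent packet loss
    r = max(0, min(100, r))
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)      # standard R-factor to MOS mapping

print(round(estimate_mos(latency_ms=20, jitter_ms=2, loss_pct=0.0), 2))    # ~4.4: good quality
print(round(estimate_mos(latency_ms=150, jitter_ms=30, loss_pct=3.0), 2))  # lower score under delay, jitter, and loss
```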