- How long was a path in violation?
- High capacity
- Low capacity
- Sustained packet loss
- APM and ISP capacity numbers differ
- Recognizing oversubscription
- Interhop analysis
A number of issues can occur on a network path. This page describes some of those issues and what you can do about them.
How long was a path in violation?
On the Events page, use the search to filter down to a single path, then sort by event time. Pair each violation with its matching clear (this is easier when only a few conditions violated during the filtered time range), then use the two timestamps to calculate the duration of the violation manually.
There are no downloadable reports that provide this information, and the Events page only shows the last 7 days of events. To analyze a longer time range, go to the Path Performance page and hover over the event markers, or review your email notifications if they were set up.
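The manual pairing described above can also be scripted once the event timestamps have been copied out of the Events page. The following is a minimal sketch only: the tuple layout, the condition names, and the `violation` / `clear` labels are assumptions made for illustration, not an APM export format.

```python
from datetime import datetime, timedelta

def violation_durations(events):
    """Pair each violation with the next clear for the same condition.

    `events` is a list of (timestamp, condition, kind) tuples, where kind is
    "violation" or "clear". This layout is hypothetical: it is simply what you
    might transcribe by hand from the Events page.
    """
    open_violations = {}  # condition -> time the violation started
    durations = []        # (condition, start, duration)
    for when, condition, kind in sorted(events):
        if kind == "violation":
            # Keep the earliest start if the same condition violates twice.
            open_violations.setdefault(condition, when)
        elif kind == "clear" and condition in open_violations:
            start = open_violations.pop(condition)
            durations.append((condition, start, when - start))
    return durations

# Example data (invented for illustration).
events = [
    (datetime(2024, 1, 5, 9, 0), "data loss", "violation"),
    (datetime(2024, 1, 5, 9, 10), "latency", "violation"),
    (datetime(2024, 1, 5, 9, 45), "data loss", "clear"),
    (datetime(2024, 1, 5, 11, 10), "latency", "clear"),
]
for condition, start, duration in violation_durations(events):
    print(condition, start.isoformat(), duration)
```

Violations that have not yet cleared are left out of the result, which matches the manual approach of only pairing events that have both timestamps.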
High capacity
If total capacity measurements are well above what you expect, it is usually because the link to your ISP is physically capable of greater speed, but your ISP uses a traffic engineering technique called ‘rate limiting’ to clamp you to the amount specified in your SLA. Transactional data and control data are usually allowed through at full capacity because they are short bursts of traffic, but sustained data transfers such as streaming media trigger the rate limiter.
Because Delivery monitoring is extremely lightweight, it might not trigger the rate limiter either. As a result, you see the entire capacity of the link rather than the amount your ISP has provisioned for you. If you have another monitoring point at the target, you can use PathTest to load the network enough to trigger the rate limiter and determine the rate-limited capacity of your link. Another option is to use rate-limited monitoring.
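To see why a lightweight probe can observe the full link speed while a sustained transfer does not, it can help to model the rate limiter as a token bucket. This is a sketch under assumptions: a token bucket is one common way to implement rate limiting, but the mechanism your ISP actually uses is not known here, and the link speed, provisioned rate, and burst allowance below are invented numbers.

```python
def achieved_mbps(duration_s, line_rate_mbps, limit_mbps, bucket_mbit):
    """Average throughput of a sender that transmits at line rate for
    duration_s through a token bucket that starts full.

    The bucket holds bucket_mbit megabits of burst allowance and refills at
    limit_mbps, so at most bucket_mbit + limit_mbps * duration_s megabits can
    pass during the transfer.
    """
    offered_mbit = line_rate_mbps * duration_s
    allowed_mbit = bucket_mbit + limit_mbps * duration_s
    return min(offered_mbit, allowed_mbit) / duration_s

# Hypothetical link: 100 Mbps physical, 20 Mbps provisioned, 1 Mbit burst allowance.
short_burst = achieved_mbps(0.005, 100, 20, 1)  # a few milliseconds of probe packets
sustained = achieved_mbps(10, 100, 20, 1)       # a 10-second loaded transfer
print(f"short burst: {short_burst:.1f} Mbps, sustained: {sustained:.1f} Mbps")
```

In this model the short burst fits inside the burst allowance and sees roughly the physical link speed, while the sustained transfer converges on the provisioned rate. That is the behavior a load test such as PathTest is meant to expose.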
Low capacity
There are several reasons why total capacity might be lower than expected. The first might be your expectations, if you’re used to measuring bandwidth. Capacity is always an end-to-end measurement, whereas the bandwidth provisioned by your ISP almost certainly refers to one or a few links. That aside, total capacity for any link or set of links is always less than bandwidth, because capacity is a network-layer measurement while bandwidth is a physical-layer measurement. Link-layer headers and framing overhead reduce the rated capacity to a theoretical maximum, which is different for every network technology.
Capacity is further reduced because NICs, routers, and switches are sometimes unable to saturate the network path, so the theoretical maximum can’t be achieved. To ‘saturate’ a path means to transmit packets at line rate without any gaps between packets. All switches can run at line rate for the length of time that a single packet is being sent; the trick is to send the next packet without a pause in between. This capability is referred to as ‘switch capacity’, and it is practically impossible to control for even within your own administrative domain, let alone across the public Internet.
Considering both of these factors, the latter being variable, APM offers the range of total capacity that you can expect given the physical medium and modern equipment with good switching capacity.
Half-duplex links: Total capacity is based on the assumption that traffic will flow in both directions. Therefore, you can expect the total capacity for half-duplex links to be roughly half of what it would be with full-duplex.
|Standard|Standard link speed|L1 + L2 overhead|Theoretical total capacity|Optimal total capacity|
|---|---|---|---|---|
|DS0 or ISDN|64 Kbps|3.9%|61.5 Kbps|61.5 Kbps|
|ISDN dual channel|128 Kbps|3.9%|123 Kbps|123 Kbps|
|T1 (HDLC+ATM)|1.544 Mbps|11.6%|1.365 Mbps|1.325-1.375 Mbps|
|T1 (HDLC)|1.544 Mbps|3.5%|1.49 Mbps|1.40-1.49 Mbps|
|E1|2.0 Mbps|3.5%|1.93 Mbps|1.86-1.95 Mbps|
|T3|45 Mbps|3.5%|43.425 Mbps|42.50-43.45 Mbps|
|10M Ethernet half-duplex|10 Mbps|2.5%|4.875 Mbps|4.8-4.9 Mbps|
|10M Ethernet full-duplex|10 Mbps|2.5%|9.75 Mbps|9.7-9.8 Mbps|
|100M Ethernet half-duplex|100 Mbps|2.5%|48.75 Mbps|48.5-49.0 Mbps|
|100M Ethernet full-duplex|100 Mbps|2.5%|97.5 Mbps|90-97.5 Mbps|
|Gigabit Ethernet full-duplex|1 Gbps|2.5%|975 Mbps|600-900 Mbps|
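The theoretical column in the table appears to follow from the link speed, the L1 + L2 overhead, and the half-duplex rule described earlier. The sketch below reproduces that arithmetic. The formula is inferred from the table values rather than stated by APM, and the function name is hypothetical.

```python
def theoretical_capacity(link_speed, overhead_pct, half_duplex=False):
    """Link speed minus L1 + L2 overhead, halved for half-duplex links.

    The result is in the same unit as link_speed (Kbps, Mbps, ...).
    """
    capacity = link_speed * (1 - overhead_pct / 100)
    return capacity / 2 if half_duplex else capacity

print(round(theoretical_capacity(10, 2.5), 3))                     # 10M Ethernet full-duplex
print(round(theoretical_capacity(100, 2.5, half_duplex=True), 3))  # 100M Ethernet half-duplex
print(round(theoretical_capacity(45, 3.5), 3))                     # T3
```

The optimal range in the last column cannot be derived this way, because it also depends on how well the equipment along the path can saturate the link.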
Once your expectations are in line with what APM measures, make sure you choose a good target, because some devices are better than others.
Capacity can also be misleading on a single-ended path. When a link has different upload and download speeds, as many home broadband connections do, a single-ended path only shows the slower of the two. For example, if you have 50 Mbps download and 5 Mbps upload on your home DSL connection, a single-ended path only shows a capacity of 5 Mbps. Always use dual-ended paths for asymmetric links, and if you see measurements that don’t look right, at least set up an additional dual-ended path to verify that asymmetry isn’t the issue.
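The effect of asymmetry on a single-ended path comes down to taking the smaller of the two directions. A minimal sketch, using a hypothetical helper name and the 50/5 Mbps example from above:

```python
def single_ended_capacity(download_mbps, upload_mbps):
    """A single-ended path reports only the slower of the two directions."""
    return min(download_mbps, upload_mbps)

# The home connection from the example: 50 Mbps down, 5 Mbps up.
print(single_ended_capacity(50, 5))
```

A dual-ended path reports both directions separately, so neither value is hidden behind the other.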
Next, it is important to note that when low capacity is a persistent rather than a transient condition, it is caused by a bottleneck, not by congestion. The bottleneck can be at any point in the path, not just the first or last mile; it could be far away on the public Internet. To verify which is the case, create an additional path to an AppNeta WAN target, verifying through the path route that a different route is taken. If the capacity measurements are the same, then the bottleneck is likely the link to your ISP. Otherwise, the bottleneck is somewhere else on the path, and the capacity you’re seeing is accurate.
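The comparison described above can be written down as a small decision rule. This is an illustrative sketch: the function name is hypothetical, and the 10% tolerance used to decide whether two measurements count as ‘the same’ is an assumption, not an APM threshold.

```python
def locate_bottleneck(original_mbps, wan_target_mbps, routes_differ, tolerance=0.10):
    """Compare capacity on the original path with a path to a WAN target.

    tolerance is the relative difference (assumed, 10% by default) below
    which the two capacity measurements are treated as equal.
    """
    if not routes_differ:
        return "inconclusive: choose a WAN target reached over a different route"
    larger = max(original_mbps, wan_target_mbps)
    if abs(original_mbps - wan_target_mbps) <= tolerance * larger:
        return "bottleneck is likely the link to your ISP"
    return "bottleneck is elsewhere on the path"

print(locate_bottleneck(18.5, 19.0, routes_differ=True))
print(locate_bottleneck(18.5, 92.0, routes_differ=True))
```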
Are you seeing corresponding packet loss? Once every minute, capacity is measured by sending multiple bursts of back-to-back packets, as described in the TruPath section. To measure total capacity, at least one burst must come back with zero packet loss; otherwise, capacity is skipped for that interval. With intermittent packet loss this leads to a choppy graph, and with sustained packet loss you’ll see capacity bottom out.
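The zero-loss rule explains the gaps in the capacity graph. The sketch below models it with invented data: each interval carries a capacity estimate and the number of packets lost in each burst. How APM derives the estimate itself is not modeled here, and all names are hypothetical.

```python
def measurable(burst_losses):
    """True if at least one burst in the interval came back with zero loss."""
    return any(lost == 0 for lost in burst_losses)

def plot_points(intervals):
    """One value per interval: the estimate, or None where capacity is skipped."""
    return [estimate if measurable(losses) else None
            for estimate, losses in intervals]

# (capacity estimate in Mbps, packets lost per burst) for four one-minute intervals.
intervals = [
    (95.0, [0, 0, 0]),  # clean interval
    (94.0, [2, 0, 1]),  # some loss, but one clean burst: still measured
    (93.0, [1, 3, 2]),  # every burst lost packets: capacity skipped
    (95.5, [0, 1, 0]),
]
print(plot_points(intervals))
```

Intervals that yield `None` are the gaps that make the graph look choppy under intermittent loss.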
If all of the above checks out, the next thing you want to do is run PathTest, to corroborate the low capacity measurements. Remember that this is a load test that measures bandwidth, not capacity.
- If PathTest supports the capacity measurements, then it is possible that you’re not getting the proper provisioning from your ISP.
- If the PathTest result is incongruent with your capacity readings, you should open a support ticket so we can help you further investigate.
Sustained packet loss
If a path shows sustained packet loss, look at its latest diagnostic to understand where the loss is occurring:
- If the loss is occurring at the last hop, make sure that firewall/endpoint protection at the target allows ICMP.
- If the loss is occurring mid-path, make sure routing policies are not de-prioritizing ICMP, and access control lists are not blocking ICMP. Mid-path firewalls might also be impacting end-to-end performance with respect to bursty data.
You can also look at other diagnostics on the same path to check whether the results are consistent, that is, whether they identify the same hop as the problem. Another option is to look at the diagnostics of other paths that use the same hop.
In any case, ICMP limitations might not affect production traffic. Set up a dual-ended path to test whether other protocols are affected.
APM and ISP capacity numbers differ
There are times when the network capacity numbers returned by APM do not match those from a speed test provided by your ISP. If this is the case, try the following:
- Confirm that the speed test run by the ISP is effectively using the same source and target as your test.
- Use dual-ended monitoring (testing a path between two AppNeta monitoring points). Dual-ended monitoring measures network capacity in both directions (source to target and target to source), similar to speed tests. Testing each direction independently allows you to account for asymmetry in the network path. For example, upload and download rates may be different and may take different routes. Single-ended monitoring can only determine the capacity in the direction with the lowest capacity.
- Run PathTest. Carriers use a variety of techniques for shaping and policing network traffic, some of which are only clearly evident under load. PathTest does not use lightweight packet dispersion, but rather generates bursts of packets which may trigger carrier shaping technologies. For this test, set up PathTest as follows:
  - In APM, navigate to Delivery > Path Plus.
  - In the PathTest Settings pane:
    - Set Protocol to UDP. UDP and ICMP packets are treated differently by network equipment: UDP packets are treated as data traffic, whereas ICMP packets are treated as control traffic.
    - Set Direction to Both (Sequential).
    - Set Duration as appropriate (default 5 seconds).
    - Set Bandwidth to Max.
  - Click Run Test.
Note: For cases where you need to measure capacity over time, you can use rate-limited monitoring. Rate-limited monitoring is similar to PathTest in that it loads the network while testing, but instead of a single measurement, it makes measurements at regular intervals over time. Open a support ticket with Customer Care to enable rate-limited monitoring.
Recognizing oversubscription
Oversubscription is a technique your ISP uses to sell the full bandwidth of a link to multiple customers. It’s a common practice and usually not problematic, but if it is impacting performance, you’ll see it first in your utilized capacity measurements.
The first thing you want to do is corroborate capacity measurements with RTT, loss, and jitter. If there are no corresponding anomalies, then whatever triggered the high utilization isn’t really impacting performance. If there are, you’ll then use Usage monitoring to check for an increase in network utilization.
High utilized capacity coupled with no increase in flow data is a classic sign of oversubscription, and it’s time to follow up with your ISP.
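The checks in this section form a short decision sequence, sketched below. The function and its boolean inputs are hypothetical; you would answer each one by reviewing the corresponding charts. The branch for the case where flow data did increase is an inference from the text above, not a documented APM rule.

```python
def assess_utilization(high_utilized_capacity, performance_anomalies, flow_data_increased):
    """Walk through the oversubscription checks in order.

    performance_anomalies: RTT, loss, or jitter anomalies that coincide with
    the high utilized capacity.
    flow_data_increased: Usage monitoring shows more traffic from your own network.
    """
    if not high_utilized_capacity:
        return "no action: utilized capacity is normal"
    if not performance_anomalies:
        return "no action: high utilization is not impacting performance"
    if flow_data_increased:
        # Inferred branch: your own traffic explains the utilization.
        return "review your own traffic: flow data increased"
    return "possible oversubscription: follow up with your ISP"

print(assess_utilization(True, True, False))
```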
Interhop analysis
Whenever a bottleneck is encountered along a path in the direction of a diagnostic test, total capacity decreases, regardless of the physical capacity of the path beyond that point. This is consistent with an application’s experience of a network path. From time to time, however, intermediate hops might seem inconsistent with the end-to-end path to a target. You might also observe messages for these intermediate hops such as ‘CPU-limited response - Total Capacity depressed’, ‘Inconsistent behaviors observed at this hop …’, and ‘High utilization detected’, yet the end-to-end path appears normal. For example, the total capacity might appear lower than expected in the middle of the network: an intermediate hop may respond with only 19 Mbps while the target hop is responding at 78 Mbps. How could a device that can only handle 19 Mbps pass 78 Mbps to a downstream device? The answer requires an understanding of how modern routers, switches, and firewalls are designed.
Several CPUs and ASICs are combined within a single device. When APM directs test packets to the target, they pass through the router’s ASICs. Often one ASIC is associated with each router port. ASICs are designed to be very fast at Layer 3 routing, and most can do some basic onboard queuing and filtering, but rarely are they capable of handling all routing functions. Specialized functions are passed to a router management CPU, which is shared by all ASICs. When APM tests directly to a router hop, the ASIC typically redirects the test packets to the router management CPU. In some cases this path is slower or more congested than the main network path through the router. Going back to our original example, we can better understand what APM is reporting: the router is capable of passing 78 Mbps through its ASICs, but we are only able to achieve 19 Mbps when testing to the router’s management CPU.
With this basic router architecture in mind, here are a few points to consider when examining inter-hop responses:
- If you are measuring lower-speed links, you can use routers and switches as targets. For example, if you are testing a 10 Mbps link you may target a router, provided that the router is capable of responding faster than the speed of the link.
- Watch for routers that are reporting high utilization, even though end-to-end utilization is low. Typically this indicates that ACLs are redirecting large amounts of network traffic from the ASICs to the router’s management CPU. Inevitably you will find that the show cpu command will reveal that the router’s CPU is busy. If you find a router in this condition, we recommend reordering or removing ACLs, or replacing the router with an appropriate firewall or traffic shaper. Also see ‘High utilization detected’.
- In some cases the bottleneck between ASICs and the router management CPU becomes significant to applications, and therefore should be taken into consideration. Although traffic is typically concentrated between ASICs, you may find that the management CPU will handle some types of traffic due to ACLs, broadcasts, multicast, and fragmentation. If the traffic that is handled by the router management CPU is important to your application, you’ll need to devise a workaround.
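When reviewing a diagnostic, one way to apply the reasoning in this section is to flag every intermediate hop that reports less capacity than the target. Traffic that reaches the target at a given rate must have passed through each earlier hop at least that fast, so a lower figure at a hop most likely reflects its management CPU response path rather than its forwarding capacity. The sketch below uses a hypothetical data layout and invented hop names.

```python
def cpu_limited_hops(hop_capacities_mbps, target_capacity_mbps):
    """Return the intermediate hops that report less capacity than the target.

    hop_capacities_mbps is a list of (hop_name, capacity_mbps) pairs for the
    intermediate hops only, in path order.
    """
    return [hop for hop, capacity in hop_capacities_mbps
            if capacity < target_capacity_mbps]

# The example from this section: one hop responds at 19 Mbps, the target at 78 Mbps.
hops = [("hop 1", 95.0), ("hop 2", 19.0), ("hop 3", 90.0)]
print(cpu_limited_hops(hops, target_capacity_mbps=78.0))
```

Hops flagged this way are candidates for a ‘CPU-limited response’ explanation. They do not by themselves indicate a problem on the end-to-end path.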