Analyze Results

The following sections describe the network path violations and web path violations generated within AppNeta Performance Manager (APM), how they are triggered, their typical causes, and what to do if you see one. To investigate further, you can determine the scope and source of network issues and the scope of user experience issues.

Network path violations

Connectivity (network path)

Network path connectivity refers to the ability of the Monitoring Point to receive a layer 2 response from the path target.

How it is triggered:

Network path connectivity alerts are triggered when a network path target does not respond to the Monitoring Point. The alert can be triggered immediately or [X mins] after connectivity is lost.

Typical causes:

  • Infrastructure between the Monitoring Point and the target is down.
  • The target is down.
  • The target does not respond to ICMP (single-ended path) or UDP port 3239 (dual-ended path).
  • A firewall is not configured correctly.

Further investigation:

  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically a network issue localized to the source location or region for those paths.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

Data Jitter

Jitter, also known as packet delay variation, is a measure of variation in latency. Jitter affects time-sensitive applications that use UDP but does not affect applications using TCP. Data jitter is measured when packet trains emulating data traffic are used for monitoring.
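
If it helps to see the idea concretely, the sketch below (Python) computes jitter as the average change in latency between consecutive packets in a train. This is an illustration of the concept only, not necessarily the exact calculation APM performs, and the latency values are hypothetical.

    # Illustration only: jitter expressed as variation in per-packet latency.
    # Not necessarily the exact calculation APM performs.
    def jitter_ms(latencies_ms):
        """Average absolute change in latency between consecutive packets (ms)."""
        diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
        return sum(diffs) / len(diffs) if diffs else 0.0

    # Hypothetical packet train latencies (ms): one delayed packet creates jitter.
    print(jitter_ms([20.1, 20.3, 24.8, 20.2, 20.4]))  # ~2.4 ms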

How it is triggered:

Data Jitter alerts are triggered when Data Jitter is greater than [X ms] for [Y mins].

Typical causes:

  • Network congestion
  • Lack of QoS configuration
  • Misconfigured QoS
  • Network devices changing QoS markings

Further investigation:

  • Review the network path performance charts for the path to look for changes in Data Jitter and determine when they occurred. If needed, expand the timeline to include the time leading up to the alert. Correlate this with network changes made at the same time to determine a potential root cause.
  • Make sure that test packets are configured to use the QoS markings used by the traffic being emulated.
  • Review diagnostic tests run during the violation event.
    • Check the QoS column on the Data Details tab. Make sure that the QoS markings are consistent (not being changed or dropped) along the path.
    • Check the Data Jitter and Latency columns on the Data Details tab. Look for significant jumps in Data Jitter or Latency along the path indicating a source of congestion. Note that the cause of data jitter can be at the first hop reporting the jump, the previous hop, or any infrastructure in between the two.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically a network issue localized to the source location or region for those paths.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

Data Loss

Packet loss, whether the packets are data or voice, is simply a measure of the number of packets that did not make it to their intended destination. Packet loss can occur for a variety of reasons including traffic congestion along the network path, an overloaded network device, bad physical media, flapping routes, flapping load balancing, and name resolution issues.

The effect of packet loss can range from insignificant to critical depending on its severity and the application that is experiencing it. For example, with applications that use TCP, light data loss will generally go unnoticed because TCP detects the issue and retransmits the lost packets. That said, heavy data loss can cause many retransmissions and can significantly impact throughput; users would notice slow response times. For applications that use UDP (VoIP, for example), the loss may or may not have a significant effect on the conversation, depending on how much loss is experienced, because UDP packets are not retransmitted.

Delivery monitoring provides both data and voice loss metrics. Data loss is measured when packet trains emulating data traffic are used; voice loss is measured when packet trains emulating voice traffic (smaller payloads with wider packet spacing) are used.
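
For reference, loss is typically expressed as a percentage of the packets sent during a test window and then compared against the alert threshold. The sketch below (Python) uses hypothetical packet counts and a hypothetical threshold; the real values come from your test and alert profile.

    # Illustration only: loss as a percentage of packets sent, compared against
    # an alert threshold. All numbers here are hypothetical; the real threshold
    # and duration come from your alert profile ([X%] for [Y mins]).
    packets_sent = 400
    packets_received = 380
    loss_pct = 100.0 * (packets_sent - packets_received) / packets_sent
    threshold_pct = 3.0
    print(f"loss = {loss_pct:.1f}%, violating = {loss_pct > threshold_pct}")
    # loss = 5.0%, violating = True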

How it is triggered:

Data Loss alerts are triggered when [X%] of data packets are lost in tests occurring for [Y mins].

Typical causes:

  • Traffic congestion along the network path
  • An overloaded network device
  • Bad physical media
  • Flapping routes
  • Flapping load balancing
  • Name resolution issues
  • MTU mismatch
  • Firewall protecting against DDoS resulting in a loss plateau (for example, exactly 50% loss)
  • The selected target does not respond well to test packets

Further investigation:

  • Review the network path performance charts for the path to look for changes in Data Loss and determine when they occurred. If needed, expand the timeline to include the time leading up to the alert. Correlate this with network changes made at the same time to determine a potential root cause.
    • If you see a jump in Data Loss but no Voice Loss, this is highly indicative of an MTU problem.
  • Review diagnostic tests run during the violation event. Check the Data Loss column on the Data Details tab. Look for a hop reporting non-zero data loss that continues to be reported at all subsequent hops along the route. Note that the cause of data loss can be at the first hop reporting data loss, the previous hop, or any infrastructure in between the two.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically a network issue localized to the source location or region for those paths.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

Latency

Latency is the time it takes for a packet to go from a source to a target and is calculated as one half of the Round Trip Time (RTT) of the fastest packet in a packet train.
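
Expressed as code, the calculation described above looks roughly like the following sketch (Python); the RTT values are hypothetical.

    # Latency as described above: half the RTT of the fastest packet in the train.
    def latency_ms(rtts_ms):
        """One-way latency estimate from a packet train's round trip times (ms)."""
        return min(rtts_ms) / 2.0

    print(latency_ms([41.8, 40.2, 55.0, 43.1]))  # 20.1 ms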

High latency values have a detrimental effect on applications that use TCP and time-sensitive applications that use UDP. For TCP, the effect of latency is compounded due to the way its congestion control mechanism works. This results in a major decrease in TCP throughput. Modern video streaming services like YouTube and Netflix use TCP. For time-sensitive applications that use UDP (for example, real-time voice and video streaming and Voice over IP (VoIP)), large latencies can introduce both conversational difficulty and packet loss.

How it is triggered:

Latency alerts are triggered when Latency is greater than [X ms] for [Y mins].

Typical causes:

  • Network congestion
  • Routing error or routing change (for example, re-routing a latency-sensitive voice path over a VPN rather than directly over the internet could increase latency)

Further investigation:

  • Review the network path performance charts for the path to look for changes in Latency and determine when they occurred. If needed, expand the timeline to include the time leading up to the alert. Correlate this with network changes made at the same time to determine a potential root cause.
    • If you see a jump in both Latency and RTT, this is indicative of a route change.
  • Review the Route Visualization to confirm that the path is over the correct network infrastructure.
  • Review diagnostic tests run during the violation event. Check the Latency column on the Data Details tab. Look for unexpected jumps in Latency along the path, potentially indicating a heavily used network device. Note that the cause of the unexpected jump in latency can be at the first hop reporting the jump, the previous hop, or any infrastructure in between the two.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically a network issue localized to the source location or region for those paths.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

MOS

The Mean Opinion Score (MOS) is an estimate of the rating a typical user would give to the sound quality of a call. It is expressed on a scale of 1 to 5, where 5 is perfect. It is a function of loss, latency, and jitter. It also varies with voice codec and call load. MOS is calculated based on ITU G.107. MOS is often used as a catch-all indicator of voice performance degradation in place of separate jitter, loss, and other thresholds.
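
For context, ITU-T G.107 (the E-model) first combines the various impairments (delay, loss, codec effects, and so on) into a transmission rating factor, R, and then maps R to a MOS value. The sketch below (Python) shows only that final, standard R-to-MOS conversion; the R-factor computation itself is the complex part and is not reproduced here, and the example R values are hypothetical.

    # Illustration only: the standard E-model (ITU-T G.107) mapping from the
    # R-factor to MOS. Computing R from latency, loss, jitter, codec, and call
    # load is the hard part and is not shown here.
    def r_to_mos(r):
        if r <= 0:
            return 1.0
        if r >= 100:
            return 4.5
        return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

    print(round(r_to_mos(93.2), 2))  # ~4.41: a clean, low-latency path
    print(round(r_to_mos(70.0), 2))  # ~3.6: noticeable impairment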

How it is triggered:

MOS alerts are triggered when MOS is less than [X] for [Y mins].

Typical causes:

  • Traffic congestion along the network path
  • An overloaded network device
  • Bad physical media
  • Flapping routes
  • Flapping load balancing
  • Name resolution issues
  • Routing error or routing change to a path with one of the issues listed above

Further investigation:

  • Review the network path performance charts for the path to look for drops in MOS and determine when they occurred. If needed, expand the timeline to include the time leading up to the alert. Correlate this with network changes made at the same time to determine a potential root cause.
    • If MOS drops, we recommend investigating voice metrics like Voice Loss and Voice Jitter first to determine what impacted the MOS change.
  • Review the Route Visualization to confirm that the path is over the correct network infrastructure.
  • Review diagnostic tests run during the violation event. Check the MOS column on the Voice Details tab. Look for significant drops in MOS along the path, potentially indicating a heavily used network device. Note that the cause of the drop on MOS can be at the first hop reporting the drop, the previous hop, or any infrastructure in between the two.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically a network issue localized to the source location or region for those paths.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

QoS Change

QoS (Quality of Service) markings on packets are used to prioritize traffic (for example, voice and video traffic). For priority traffic, if these markings are altered by a device in the network, a poor user experience can occur.
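
As background, these markings are carried in the DSCP field, the top six bits of the IP header's TOS/Traffic Class byte (voice is commonly marked EF, DSCP 46). The sketch below (Python) shows how a raw TOS byte maps to a DSCP value; the sent/received values are hypothetical and simply illustrate a marking being stripped in transit.

    # Illustration only: DSCP is the top six bits of the TOS/Traffic Class byte.
    # A device that rewrites this byte changes the QoS marking seen downstream.
    COMMON_DSCP = {46: "EF (voice)", 34: "AF41 (video)", 0: "CS0 (best effort)"}

    def dscp_from_tos(tos_byte):
        return tos_byte >> 2

    sent_tos, received_tos = 0xB8, 0x00  # hypothetical: EF sent, stripped in transit
    for label, tos in (("sent", sent_tos), ("received", received_tos)):
        dscp = dscp_from_tos(tos)
        print(label, dscp, COMMON_DSCP.get(dscp, "other"))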

How it is triggered:

QoS change alerts are triggered when QoS markings are altered by the network. See Alerting on QoS Changes for details.

Typical causes:

A network device that actively changes, removes, or doesn’t honor QoS markings.

Further investigation:

  • Review diagnostic tests run during the violation event. Check the QoS column on the Data Details tab. Look for changes in QoS markings along the path. Note that the cause of the QoS change can be at the first hop reporting the change, the previous hop, or any infrastructure in between the two.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically a network issue localized to the source location or region for those paths.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

RTT

Round Trip Time (RTT) is the time it takes for a packet to go from a source to a target and back (the RTT chart in APM shows the average RTT over the selected time period). Latency, the time it takes for a packet to go from a source to a target, is calculated as one half of the RTT of the fastest packet in a packet train.

High latency values have a detrimental effect on applications that use TCP and time-sensitive applications that use UDP. For TCP, the effect of latency is compounded due to the way its congestion control mechanism works. This results in a major decrease in TCP throughput. Modern video streaming services like YouTube and Netflix use TCP. For time-sensitive applications that use UDP (for example, real-time voice and video streaming and Voice over IP (VoIP)), large latencies can introduce both conversational difficulty and packet loss.

How it is triggered:

RTT alerts are triggered when RTT is greater than [X ms] for [Y mins].

Typical causes:

  • Network congestion
  • Routing error or routing change (for example, re-routing a latency-sensitive voice path over a VPN rather than directly over the internet could increase latency and RTT)

Further investigation:

  • Review the network path performance charts for the path to look for changes in RTT and determine when they occurred. If needed, expand the timeline to include the time leading up to the alert. Correlate this with network changes made at the same time to determine a potential root cause.
    • If you see a jump in both Latency and RTT, this is indicative of a route change.
  • Review the Route Visualization to confirm that the path is over the correct network infrastructure.
  • Review diagnostic tests run during the violation event. Check the RTT column on the Data Details tab. Look for unexpected jumps in RTT along the path, potentially indicating a heavily used network device. Note that the cause of the unexpected jumps in RTT can be at the first hop reporting the jump, the previous hop, or any infrastructure in between the two.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically a network issue localized to the source location or region for those paths.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

Voice Jitter

Jitter, also known as packet delay variation, is a measure of variation in latency. Jitter affects time-sensitive applications that use UDP but does not affect applications using TCP. Voice jitter is measured when packet trains emulating voice traffic (smaller payloads with wider packet spacing) are used for monitoring.

How it is triggered:

Voice Jitter alerts are triggered when Voice Jitter is greater than [X ms] for [Y mins].

Typical causes:

  • Network congestion
  • Lack of QoS configuration
  • Misconfigured QoS
  • Network devices changing QoS markings

Further investigation:

  • Review the network path performance charts for the path to look for changes in Voice Jitter and determine when they occurred. If needed, expand the timeline to include the time leading up to the alert. Correlate this with network changes made at the same time to determine a potential root cause.
  • Make sure that test packets are configured to use the QoS markings used by the traffic being emulated.
  • Review diagnostic tests run during the violation event.
    • Check the QoS column on the Data Details tab. Make sure that the QoS markings are consistent (not being changed or dropped) along the path.
    • Check the Voice Jitter and Latency columns on the Voice Details tab. Look for significant jumps in Voice Jitter or Latency along the path indicating a source of congestion. Note that the cause of voice jitter can be at the first hop reporting the jump, the previous hop, or any infrastructure in between the two.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically a network issue localized to the source location or region for those paths.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

Voice Loss

Packet loss, whether the packets are data or voice, is simply a measure of the number of packets that did not make it to their intended destination. Packet loss can occur for a variety of reasons including traffic congestion along the network path, an overloaded network device, bad physical media, flapping routes, flapping load balancing, and name resolution issues.

The effect of packet loss can range from insignificant to critical depending on its severity and the application that is experiencing it. For example, with applications that use TCP, light data loss will generally go unnoticed because TCP detects the issue and retransmits the lost packets. That said, heavy data loss can cause many retransmissions and can significantly impact throughput; users would notice slow response times. For applications that use UDP (VoIP, for example), the loss may or may not have a significant effect on the conversation, depending on how much loss is experienced, because UDP packets are not retransmitted.

Delivery monitoring provides both data and voice loss metrics. Data loss is measured when packet trains emulating data traffic are used; voice loss is measured when packet trains emulating voice traffic (smaller payloads with wider packet spacing) are used.

How it is triggered:

Voice Loss alerts are triggered when [X%] of voice packets are lost in tests occurring for [Y mins].

Typical causes:

  • Traffic congestion along the network path
  • An overloaded network device
  • Bad physical media
  • Flapping routes
  • Flapping load balancing
  • Name resolution issues
  • Firewall protecting against DDoS resulting in a loss plateau (for example, exactly 50% loss)
  • The selected target does not respond well to test packets

Further investigation:

  • Review the network path performance charts for the path to look for changes in Voice Loss and determine when they occurred. If needed, expand the timeline to include the time leading up to the alert. Correlate this with network changes made at the same time to determine a potential root cause.
  • Review diagnostic tests run during the violation event. Check the Voice Loss column on the Voice Details tab. Look for a hop reporting non-zero voice loss that continues to be reported at all subsequent hops along the route. Note that the cause of voice loss can be at the first hop reporting voice loss, the previous hop, or any infrastructure in between the two.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically a network issue localized to the source location or region for those paths.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

Web path violations

Apdex Score

Apdex is an industry-standard method for reporting and comparing application performance in terms of end user experience. Apdex uses a simple formula to calculate user satisfaction. The result, the Apdex ‘score’, is a single number between 0 and 1, where 1 indicates that a user would be completely satisfied with the application response time. APM presents the Apdex score as a percentage from 0% to 100%. Note that Apdex is averaged over a 2-hour window, so the alert can take longer to clear than other alerts. See Apdex for further details.
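
The underlying formula counts samples as satisfied (at or below the target time T), tolerating (between T and 4T), or frustrated (above 4T): Apdex = (satisfied + tolerating/2) / total. The sketch below (Python) applies it to hypothetical response times; the target T and the averaging window used by APM are configuration details.

    # Illustration only: the standard Apdex formula applied to hypothetical
    # response times. APM's target time and 2-hour averaging window are
    # separate configuration/behavior details.
    def apdex(samples_s, t_seconds):
        satisfied = sum(1 for s in samples_s if s <= t_seconds)
        tolerating = sum(1 for s in samples_s if t_seconds < s <= 4 * t_seconds)
        return (satisfied + tolerating / 2.0) / len(samples_s)

    print(apdex([2.1, 3.5, 6.0, 9.2, 17.5], t_seconds=4.0))  # 0.6 -> shown as 60%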

How it is triggered:

Apdex Score alerts are triggered when the rolling average of Apdex scores falls below [X] for [Y] tests.

Typical causes:

  • Application performance issues (for example, slow web app server).
  • Network performance problems (for example, network congestion, DNS issues).

Further investigation:

  • Review the web path performance charts for the web path that generated the alert. If needed, adjust the chart timeline to include the time leading up to the alert.
  • Check the DNS chart for issues with DNS server availability or slow response.
  • Check the End User Experience and Milestone Breakdown charts looking for end user experience time components (network, server, browser) or milestones taking more time than expected.
    • If the network time is longer than expected, check the network path performance charts for network problems.
    • Otherwise, review individual web tests during the issue to identify the resource(s) causing the component or milestone to take longer than expected.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically an issue localized to the source location or region for those paths.
      • Check whether the script on the violating path(s) has changed.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target (web app) or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

Connectivity (web path)

Web path connectivity refers to the ability of a user on a browser (or script emulating a user) to access a web app. Web path connectivity alerts are generated when the script running on the Monitoring Point cannot connect to the target web app.
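
If you want a quick manual check from outside APM, a plain TCP connection attempt to the web app exercises the same DNS resolution and TCP handshake the script depends on. The sketch below (Python) uses a hypothetical hostname and assumes HTTPS on port 443.

    # Illustration only: a manual DNS + TCP reachability check against the web
    # app. The hostname is hypothetical; adjust the port if the app is not on 443.
    import socket

    def can_connect(host, port=443, timeout=5):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:  # DNS failures, refusals, and timeouts land here
            print(f"connect failed: {exc}")
            return False

    print(can_connect("webapp.example.com"))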

How it is triggered:

Web path connectivity alerts are triggered when the script cannot connect to the target web app over TCP for [X] tests.

Typical causes:

  • Network connectivity is lost.
  • There is a problem resolving the web app IP address using its hostname (DNS issue).
  • The web app is not running.
  • There is a web app infrastructure problem (for example, authentication services are down).
  • There is a routing problem.

Further investigation:

  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have lost connectivity to the web app, there is typically an issue localized to the source location or region for those paths.
      • Check whether the script on the violating path(s) has changed.
      • Check network path connectivity for the Delivery paths between the same source and target.
      • If network connectivity is lost, check the Route Visualization for the network path(s), looking specifically at the TCP route (the route the web app traffic takes), to see where and when the break in connectivity occurred.
    • If many or all paths to the same target have lost connectivity to the same web app, there is typically a problem at the target (web app) or its network infrastructure.
      • Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.
  • Try accessing the web app from your browser (to confirm that the target web app is available). Typical issues include:
    • There is a problem resolving the web app IP address using its hostname. This is a DNS issue (see the DNS chart).
    • The web app is not running.
    • There is a web app infrastructure problem (for example, authentication services are down).

HTTP Error

HTTP errors are responses to HTTP requests that indicate a problem. An HTTP Error alert can be triggered if an HTTP Status of 4xx (client error) or 5xx (server error) is returned.
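
To confirm the behavior independently of APM, you can fetch the URL yourself and look at the status code. The sketch below (Python, standard library) does that for a hypothetical URL; anything in the 4xx/5xx range surfaces as an HTTPError.

    # Illustration only: manually checking the HTTP status a URL returns.
    # The URL is hypothetical.
    from urllib.request import urlopen
    from urllib.error import HTTPError

    def http_status(url):
        try:
            with urlopen(url, timeout=10) as resp:
                return resp.status   # successful (non-4xx/5xx) responses land here
        except HTTPError as err:
            return err.code          # 4xx/5xx responses raise HTTPError

    print(http_status("https://webapp.example.com/login"))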

How it is triggered:

HTTP Error alerts are triggered when the HTTP status returned by a script execution is 4xx (client errors) or 5xx (server errors) for [X] tests.

Typical causes:

  • Incorrect target URL used in the script or other script related problems.
  • Web app is down but its infrastructure is still up.
  • Changes to the web app are not reflected in the script.

Further investigation:

  • Review the script and the web app.
    • Has the monitoring script changed? Check the historical web path status timeline for a script change event (indicated by a purple diamond) at the time of the HTTP Error alert. Review script changes and update as appropriate. For Selenium scripts, review the Resolving Common Issues page for hints on common scripting problems.
    • Has the web app changed? If so, review changes and update script as appropriate.
    • Is the web app available? Confirm that it can be accessed from a browser. If not, inform the team responsible for it.
    • Is it out of service due to scheduled maintenance? If so, wait for maintenance to complete.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically an issue localized to the source location or region for those paths.
      • Check whether the script on the violating path(s) has changed.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target (web app) or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

HTTP Status

When a request is made to a target web app, its response contains an HTTP status. HTTP Status alerts occur when an unexpected HTTP status is returned. This alert threshold is rarely used and tends to have specific applications. For example, you might expect a 403 status when accessing a protected resource and want to be alerted if anything different is ever returned.

How it is triggered:

HTTP Status alerts are triggered when an unexpected HTTP status is returned by the target for [X] tests.

Typical causes:

  • A change to the script or web app.

Further investigation:

  • Review the script and the web app.
    • Is the expected HTTP status correct? If not, determine why the wrong HTTP status is being returned.
    • Has the script changed? If so, review script changes and update as appropriate. For Selenium scripts, review the Resolving Common Issues page for hints on common scripting problems.
    • Has the web app changed? If so, review changes and update script as appropriate.
    • Is the web app available? Confirm that it can be accessed from a browser. If not, inform the team responsible for it.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically an issue localized to the source location or region for those paths.
      • Check whether the script on the violating path(s) has changed.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target (web app) or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

Page Load Time

Page Load Time is the time it takes for a web page to render. Page Load Time alerts occur when the time for a page to load is longer than the threshold specified for a period of time.

How it is triggered:

Page Load Time alerts are triggered when the time for a page to load is longer than [X ms] for [Y] tests.

Typical causes:

  • Web app issues. Slowness on the web app server.
  • Script or web app change.
  • Network congestion.
  • Routing issues. Using the wrong (slower) route.
  • DNS issues. Slow DNS response.

Further investigation:

  • Review the web path performance charts for the web path that generated the alert. If needed, adjust the chart timeline to include the time leading up to the alert.
  • Check the DNS chart for issues with DNS server availability or slow response.
  • Check the End User Experience and Milestone Breakdown charts looking for end user experience time components (network, server, browser) or milestones taking more time than expected.
    • If the network time is longer than expected, check the network path performance charts for network problems.
    • Otherwise, review individual web tests during the issue to identify the resource(s) causing the component or milestone to take longer than expected.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically an issue localized to the source location or region for those paths.
      • Check whether the script on the violating path(s) has changed.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target (web app) or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

Script Error

A Script Error indicates a problem with a script or web app. For example, an element id specified in a script is not present on the page being accessed.

How it is triggered:

Script Error alerts are triggered by a problem with the script or the target web app for [X] tests.

Typical causes:

  • An element referenced in the script is not present on the page being accessed.
  • A command used in the script is not valid or is used improperly.
  • Invalid credentials are used in the script.

Further investigation:

  • Review the script and the web app.
    • Has the script or the web app changed? If so, review the changes and update the script as appropriate. For Selenium scripts, review the Resolving Common Issues page for hints on common scripting problems.
    • Confirm that the elements, commands, and credentials referenced in the script are still valid.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically an issue localized to those paths, such as a script problem.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target (web app).

Transaction Time

Transaction time is the time it takes for a script test to complete.

How it is triggered:

Transaction Time alerts are triggered when the time for the script to complete is longer than [X ms] for [Y] tests.

Typical causes:

  • Web app issues. Slowness on the web app server or services that the app is dependent on (for example, an authentication provider, CDN provider, or in-site video links).
  • Script or web app change.
  • Network congestion.
  • Routing issues. Using the wrong (slower) route.
  • DNS issues. Slow DNS response.

Further investigation:

  • Compare multiple web paths that target the same app. Similar spikes in Transaction Time across all web paths targeting that app indicate slowness on the web app server or on services that the app depends on.
  • Check for Page Load Time alerts on the same web path that generated the Transaction Time alert. One or more slow page loads could cause the transaction time to be longer than expected.
  • Review the script. If the script uses a waitFor command and the elements it is waiting for are either not present or are not loading, the transaction will take longer than expected.
  • Review the web path performance charts for the web path that generated the alert. If needed, adjust the chart timeline to include the time leading up to the alert.
  • Check the DNS chart for issues with DNS server availability or slow response.
  • Check the End User Experience and Milestone Breakdown charts looking for end user experience time components (network, server, browser) or milestones taking more time than expected.
    • If the network time is longer than expected, check the network path performance charts for network problems.
    • Otherwise, review individual web tests during the issue to identify the resource(s) causing the component or milestone to take longer than expected.
  • Check whether other paths to the same target exhibit the same symptoms
    • If only one or a small number of paths to the same target have the same violation, there is typically an issue localized to the source location or region for those paths.
      • Check whether the script on the violating path(s) has changed.
    • If many or all paths to the same target have the same violation, there is typically a problem at the target (web app) or its network infrastructure. Check the Route Visualization to determine which network hops were common to all violating paths prior to and during the violation event.

Determine the scope and source of a network path alert

Step 1: Determine how widespread the problem is

To assess the problem scope, compare the number of violating paths from the same source Monitoring Point or to the same target to see which is largest. Select a procedure below based on when the violation occurred.

Violation is current

  1. In APM, navigate to Delivery > Network Paths.
  2. Filter by violating network paths. GIF of creating a Status = Violated filter.
  3. Filter by the source Monitoring Point listed in the alert notification (for example, ‘Boston-MA-r90’). Note the number of matching paths (for example, 5). GIF of creating a source Monitoring Point filter.
  4. Remove the source Monitoring Point filter.
  5. Filter by the target listed in the alert notification (for example, ‘global.tr.skype.com’). Note the number of matching paths (for example, 2). GIF of creating a target filter.
  6. Repeat the search using the filter that returned the most violating network paths (in this example, the filter by source Monitoring Point) to focus on the issue causing the most path violations.
  7. Confirm that the network paths are violating for the same reason (for example, QoS Change) by hovering over the status icon of several paths. Screen shot showing that hovering over a status icon displays status details. In this case the violation is 'QoS Change'.
  8. Ctrl-click (or Cmd-click) several paths with the same type of violation that triggered the original alert. The path performance charts for each path are displayed in separate tabs.

Violation happened this week

  1. In APM, navigate to Delivery > Events.
  2. Find violation events of the same type that occurred around the same time.
    1. Filter by the source Monitoring Point or target known to have had a problem (for example, ‘Boston-MA-r90’).
    2. Filter by ‘Event Type = Alert Condition’.
    3. Sort events by ‘Event Time’. GIF of the Event Distribution: Past 7 Days page with filters created for source Monitoring Point and 'Event Type = Alert Condition', and then ordering by Event Time.
    4. Note the violation events of the same type (for example, Data Loss) that occurred around the same time by comparing time stamps. Screen shot showing how to identify related events.
  3. Ctrl-click (or Cmd-click) several paths with the same type of violation that triggered the original alert. The path performance charts for each path are displayed in separate tabs.

Violation happened more than a week ago

  1. In APM, navigate to Delivery > Network Paths.
  2. In the Search field, enter the name of the source Monitoring Point or target known to have had a problem as well as any other search criteria. Screen shot of the Network Paths page with 'test-mp' highlighted in the Search box.
  3. Ctrl-click (or Cmd-click) several paths with the same type of violation that triggered the original alert. The path performance charts for each path are displayed in separate tabs.

Step 2: Compare paths

Confirm that the selected paths are violating with the same symptoms at the same time. A comparison report makes this easier.

Compare paths using a comparison report

  1. In APM, navigate to Reports > Report List.
  2. In the Data Performance Comparison or Voice Performance Comparison sections, select the report associated with the violation type (for example, Data Loss).
  3. Edit the report filters to select the network paths you want to compare (those opened in separate tabs in Step 1).
  4. Specify the time range for the report.
  5. Click Update. Note patterns and anomalies common to graphs for different paths. Things to note:
    • How is the problem presenting? Is it in one or both directions? Is it constant or intermittent? Is it random or in a regular pattern?
    • Familiar patterns. For example, a precise cadence implies automation; business hours imply user activity.
    • Connectivity loss (gaps in the graph). Do other comparison reports (for example, Data Loss, Jitter) indicate an issue leading up to the connectivity loss event?
  6. Identify paths with similar patterns that started at the same time, as they are likely due to the same issue. For those paths, keep the tabs opened in Step 1 open. Close the other tabs.

Example

Data loss comparison of four dual-ended paths from the same source Monitoring Point. Screen shot showing data loss for four paths being compared.

Note that all four are showing similar data loss patterns in the outbound direction (above the line) and negligible data loss in the inbound direction (below the line). The data loss on each path also starts and ends at the same time.

Compare paths by comparing path performance charts

Within each tab you opened:

  1. Review the charts related to the violation type (for example, Data Loss chart) to confirm the violation(s) and determine when the problem started. Expand the time range if there is no obvious start to the problem.
  2. Note patterns and anomalies common to charts on different paths over the same time period. Things to note:
    • How is the problem presenting? Which metric (for example, Data Loss) is affected? Is it in one or both directions? Is it constant or intermittent? Is it random or in a regular pattern?
    • Familiar patterns. For example, a precise cadence implies automation; business hours imply user activity.
    • Connectivity loss (gaps in connectivity, shown as black vertical lines). Do other charts (for example, Data Loss, Jitter) indicate an issue leading up to the connectivity loss event?
  3. Keep open the tabs containing paths that have similar patterns that started at the same time, as they are likely due to the same issue. Close the other tabs.

Example

Selected path 1: Screen shot of Data Loss charts for first path being compared. Selected path 2: Screen shot of Data Loss charts for second path being compared. Note commonalities between the charts of path 1 and path 2. They have similar amounts of data loss in the same direction (outbound) starting at the same time.

Step 3: Use selected paths to determine the source of the problem

To find the problem source, look for hops that the impacted paths have in common that aren’t shared by non-impacted paths from the same source Monitoring Point or to the same target.

Background: Determining the source of a network problem

AppNeta Monitoring Points and Delivery monitoring are set up to monitor network traffic on paths between a variety of sources and targets and to generate alerts when conditions on those network paths are outside of norms. A network device (hop) that is causing a problem will present in the same way on all network paths that pass through it, so all paths through that device will violate and generate an alert (for example, a high degree of data loss). Looking at the alerts with a common source (or common target), we need to find the device that is on all failing paths and not on any non-failing paths.

So: if all paths from a source are violating, the issue is likely at a network device close to the source. If all paths to a target are violating, the issue is likely at a network device close to the target. In all other cases, the issue is at a network device somewhere between the source and the target.

For example, given the following network where AppNeta Monitoring Points are at the endpoints (S1, T1, T2, …) and we have network paths being monitored from source S1 to all targets (T1, T2, …), if there is a problem at H2, we should see alerts generated on four paths (S1->T1, S1->T2, S1->T3, and S1->T4), but no alerts on any other paths from S1. We can see that the issue is at the hop that all the paths generating alerts have in common but not in common with the paths not generating alerts - in this case, H2.

Tree diagram showing a bad node in order to see paths with the bad node in common.
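
The same reasoning can be expressed as set operations over the hop lists of violating and non-violating paths, as in the sketch below (Python). The hop lists are hypothetical; in APM you read them from the Route Visualization or from diagnostics.

    # Illustration only: isolate the hop(s) present on every violating path but
    # absent from all non-violating paths. Hop lists are hypothetical.
    violating = {
        "S1->T1": ["H1", "H2", "H3"],
        "S1->T2": ["H1", "H2", "H4"],
        "S1->T3": ["H1", "H2", "H5", "H6"],
        "S1->T4": ["H1", "H2", "H5", "H7"],
    }
    non_violating = {
        "S1->T5": ["H1", "H8", "H9"],
    }

    common_to_violating = set.intersection(*(set(hops) for hops in violating.values()))
    on_any_good_path = set().union(*(set(hops) for hops in non_violating.values()))
    print(common_to_violating - on_any_good_path)  # {'H2'} -> the suspect hop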

Two ways to do this include reviewing routes and/or diagnostics at the time of the problem.

  • Use the routes method for “Connectivity Loss” violations. Also, use it to identify hops common to all violating paths that are not common to any non-violating paths.
  • Use the diagnostics method for everything except “Connectivity Loss” violations to identify the hop where the violation is first seen on each path.

Use routes

  1. On the Network Paths page, filter network paths so that several violating paths (maximum 20) and a couple of non-violating paths from the same source (or target) are shown.
    • To do this, filter by Monitoring Point and Target specifying the source Monitoring Point(s) and target(s) for each path.
  2. Click Show Routes. Screen shot showing how to show routes for filtered network paths.
  3. Specify the Center Time as the time the problem was seen and select a time Range. Screen shot showing how to use the Route Visualization to identify common hops between network paths.
  4. Look for hops that the violating paths have in common that are not in common with the non-violating paths at the time of the problem. These are the suspected hops. You can move the time slider back and forth over the time range to see how the routes change.
  5. Record the hostname and IP addresses of the suspected hops.

Use diagnostics

For each violating path selected:

  1. In the Events pane (not the Events tab), find a Diagnostic Test that was successfully run while the path was violating (after a violation event and before the clear event). This should be around the same time as diagnostics on other paths you select for investigation.
  2. Click the test (represented by a pink circle) then click View. Screen shot showing how to see events on the network path performance charts.
    • The diagnostic test appears.
  3. Select the Data Details tab (or Voice Details tab for voice-based paths).
  4. Click Advanced Mode (if available).
  5. Review the metric in violation (for example, Data Loss). The suspected hop is the first one that shows a non-zero value where the remainder of hops to the target (or source) also show non-zero values. Screen shot showing data loss details in diagnostic test results.
    • In this example, the issue occurred on one of the first four hops. It is seen at hop 4 but it could also be one of the first three hops where we were unable to determine data loss. In this case, looking at diagnostics on other paths and/or using the “routes” method would help clarify the source of the issue.
  6. If the suspected hop is confirmed on other impacted paths, it is the likely source of the problem. If you do not find confirmation, check that selected diagnostics are close to one another in time and were taken when the problem was occurring. It is also possible that there are multiple problems occurring at the same time.
  7. Record the hostname and IP address of the suspected hop.

Exceptions:

  • No diagnostic available - If there is no diagnostic available at the right time, try one of the other violating paths. Alternatively, if the issue is still occurring, trigger a diagnostic manually (continue reviewing other paths while this completes). In general, diagnostics are triggered when a threshold is violated, but they will be postponed or removed from the queue if there are too many diagnostics currently in progress.
  • “Diagnostic Failed” message - The message “Diagnostics Failed - Cannot complete inbound diagnostics because the target Monitoring Point is not in the same organizational hierarchy as the source Monitoring Point.” can be safely disregarded. Outbound diagnostics provide the information necessary to determine the problem source.
  • No measurements showing - A hop may be missing or may not show any data other than an IP address and hostname if it does not respond to ICMP packets either because it is configured that way or because it is too heavily loaded.
  • Measurements not above thresholds - If the metric in violation (for example, Data Loss) does not show values above the alert threshold (for example, > 3%), it is possible that the diagnostic completed after the violation cleared, or that the network condition is intermittent and the diagnostic was taken at a time of improved performance but before the violation cleared. In this case, continue with one of the other selected paths.

Determine the scope of a web path alert

To assess the problem scope, compare the number of violating paths from the same web app group, from the same source Monitoring Point, or to the same target to see which is largest. Select a procedure below based on when the violation occurred.

Violation is current

  1. In APM, navigate to Experience > Web Paths.
  2. Filter by violating web paths. GIF of creating a Status = Error/Violated filter.
  3. Filter by the web app group listed in the alert notification (for example, ‘Gsuite’). Note the number of matching paths (for example, 5). GIF of creating a web app group filter.
  4. Remove the web app group filter (for example, ‘Gsuite’).
  5. Filter by the source Monitoring Point listed in the alert notification (for example, ‘Boston-MA-r90’). Note the number of matching paths (for example, 10). GIF of creating a source Monitoring Point filter.
  6. Remove the source Monitoring Point filter (for example, ‘Boston-MA-r90’).
  7. Filter by the target listed in the alert notification (for example, ‘accounts.google.com’). Note the number of matching paths (for example, 5). GIF of creating a target filter.
  8. Repeat the search using the filter that returned the most violating web paths (in this example, the filter by source Monitoring Point) to focus on the issue causing the most path violations.
  9. Confirm that the web paths are violating for the same reason.
    1. Click the web path identified in the alert notification (same Monitoring Point, Target, Web App Group, and Workflow).
    2. Make note of the violation type(s) in the Web Path Details pane on the right.
    3. Make note of the error message(s) in the Latest Transaction Details pane. Screen shot of Web Timeline page showing the violation types on the right and the error messages within the Latest Transaction Details pane.
    4. Repeat for several other violating web paths in the list to confirm that they are violating for the same reason. This will confirm how widespread the problem is (for example, affecting 10 web paths related to the source Monitoring Point).

Violation happened this week

  1. In APM, navigate to Experience > Events.
  2. Find violation events of the same type at the same time and note the number of matching events.
    1. Filter by ‘Event Type = Web Alert Profile’ (show events triggered when an alert threshold is violated or cleared).
    2. Filter by ‘violated’ (show only violation events).
    3. Filter by the source of the path known to have had a problem (for example, ‘boston’, which appears in the source of the web path that alerted; ‘Boston-MA-r90’ could be used to be more specific).
    4. Sort events by ‘Event Time’.
    5. Note the number of violation events of the same type (for example, Page Load Time) that occurred around the time of the alert by reviewing the ‘Description’ column and the time stamps. GIF of the Experience Events: Past 7 Days page with filters created for 'Event Type = Web Alert Profile', 'violated', then ordering by Event Time.
  3. Repeat the search, filtering by the target and then by the web app group, and note the number of violation events that occurred at the time of interest. This will provide an indication of how widespread the problem is based on source, target, and web app group.

Violation happened more than a week ago

  1. In APM, navigate to Experience > Web Paths.
  2. In the Search field, enter search terms to filter web paths suspected of having the same problem. Screen shot of the Web Paths page with 'test-mp' highlighted in the search box.
  3. Ctrl-click (or Cmd-click) several of the web paths listed. The web path performance charts for each web path are displayed in separate tabs.
  4. For the web path performance charts in each tab, select a time range and look for similar issues at the same time. This will provide an indication of how widespread the problem is.