When you receive an APM network path alert notification, use this workflow to troubleshoot the underlying network problem in APM.

Step 1: Record network path alert notification details.

Make note of the alert notification details:

  • Event Time - the time and date of the violation
  • Monitoring Point - the source monitoring point
  • Target - the network path target
  • Details - the violation details (e.g., Measured Data Loss)
Step 2: Determine how widespread the problem is.

To determine the problem scope, compare the number of violating paths from the same source monitoring point with the number of violating paths to the same target, and see which is larger (a minimal sketch of this comparison appears below). The procedure to use depends on when the violation occurred.
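
The sketch below is a minimal Python illustration of this comparison, assuming you have the list of violating paths with each path's source and target; the records shown are illustrative, not APM data or API output.

    from collections import Counter

    # Each violating path is represented as a (source, target) pair (illustrative data).
    violating_paths = [
        ("Boston-MA-r90", "global.tr.skype.com"),
        ("Boston-MA-r90", "outlook.office365.com"),
        ("Boston-MA-r90", "teams.microsoft.com"),
        ("Boston-MA-r90", "files.example.com"),
        ("Seattle-WA-r45", "global.tr.skype.com"),
    ]

    by_source = Counter(src for src, _ in violating_paths)
    by_target = Counter(tgt for _, tgt in violating_paths)

    top_source, source_count = by_source.most_common(1)[0]
    top_target, target_count = by_target.most_common(1)[0]

    # Focus the investigation on whichever grouping contains more violating paths.
    if source_count >= target_count:
        print(f"Focus on source monitoring point {top_source} ({source_count} violating paths)")
    else:
        print(f"Focus on target {top_target} ({target_count} violating paths)")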

Violation is current

For current violations:

  1. In APM, navigate to Delivery > Network Paths.
  2. Filter by violating network paths.
    GIF of creating a Status = Violated filter.
  3. Filter by the source monitoring point listed in the alert notification (e.g., ‘Boston-MA-r90’). Note the number of matching paths (e.g., 5).
    GIF of creating a source monitoring point filter.
  4. Remove the source monitoring point filter.
  5. Filter by the target listed in the alert notification (e.g., ‘global.tr.skype.com’). Note the number of matching paths (e.g., 2).
    GIF of creating a target filter.
  6. Repeat the search using the filter that returned the most violating network paths (in this example, the filter by source monitoring point) to focus on the issue causing the most path violations.
  7. Confirm that the network paths are violating for the same reason (e.g., QoS Change) by hovering over the status icon of several paths.
    Screen shot showing that hovering over a status icon displays status details. In this case the violation is 'QoS Change'.
  8. Ctrl-click (or Cmd-click) several paths with the same type of violation that triggered the original alert. The path performance charts for each path are displayed in separate tabs.
Violation happened this week

For violations that happened this week:

  1. In APM, navigate to Delivery > Events.
  2. Find violation events of the same type that occurred around the same time.
    1. Filter by the source monitoring point or target known to have had a problem (e.g., ‘Boston-MA-r90’).
    2. Filter by ‘Event Type = Alert Condition’.
    3. Sort events by ‘Event Time’.
      GIF of the Event Distribution: Past 7 Days page with filters created for source monitoring point and 'Event Type = Alert Condition', and then ordering by Event Time.
    4. Note the violation events of the same type (e.g., Data Loss) that occurred around the same time by comparing time stamps (see the sketch after this procedure).
      Screen shot showing how to identify related events.
  3. Ctrl-click (or Cmd-click) several paths with the same type of violation that triggered the original alert. The path performance charts for each path are displayed in separate tabs.
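
To sanity-check this grouping outside the UI, the minimal Python sketch below clusters alert-condition events of the same type that occur within a few minutes of each other. The event records, field layout, and 15-minute window are illustrative assumptions, not an APM export format.

    from datetime import datetime, timedelta

    # Illustrative alert-condition events: (event time, event type, network path).
    events = [
        (datetime(2024, 5, 1, 14, 2), "Data Loss", "Boston-MA-r90 -> global.tr.skype.com"),
        (datetime(2024, 5, 1, 14, 4), "Data Loss", "Boston-MA-r90 -> outlook.office365.com"),
        (datetime(2024, 5, 1, 16, 30), "Jitter", "Boston-MA-r90 -> teams.microsoft.com"),
    ]

    window = timedelta(minutes=15)   # how close "around the same time" means here
    events.sort(key=lambda e: e[0])

    # Group consecutive events of the same type whose timestamps fall within the window.
    groups = []
    for time, etype, path in events:
        if groups and etype == groups[-1]["type"] and time - groups[-1]["last"] <= window:
            groups[-1]["paths"].append(path)
            groups[-1]["last"] = time
        else:
            groups.append({"type": etype, "last": time, "paths": [path]})

    for g in groups:
        if len(g["paths"]) > 1:
            print(f"Related {g['type']} events near {g['last']:%H:%M} on: {', '.join(g['paths'])}")
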
Violation happened more than a week ago

For violations that happened more than a week ago:

  1. In APM, navigate to Delivery > Network Paths.
  2. In the Search field, enter the name of the source monitoring point or target known to have had a problem as well as any other search criteria.
    Screen shot of the Network Paths page with 'test-mp' highlighted in the Search box.
  3. Ctrl-click (or Cmd-click) several paths with the same type of violation that triggered the original alert. The path performance charts for each path are displayed in separate tabs.
Step 3: Confirm that selected paths are violating with the same symptoms at the same time.

There are two ways to do this: use a comparison report, or compare the path performance charts of each path. The comparison report makes the comparison easier.

Compare paths using a comparison report

To compare paths using a comparison report:

  1. In APM, navigate to Reports > Report List.
  2. In the Data Performance Comparison or Voice Performance Comparison sections, select the report associated with the violation type (e.g., Data Loss).
  3. Edit the report filters to select the network paths you want to compare (those opened in separate tabs in Step 2).
  4. Specify the time range for the report.
  5. Click Update.
    • Note patterns and anomalies common to graphs for different paths. Things to note:
      • How the problem is presenting
        • Is the problem in one or both directions? Is it constant or intermittent? Is it random or in a regular pattern?
      • Familiar patterns
        • For example, precise cadence implies automation, business hours implies user activity.
      • Connectivity loss (gaps in graph)
        • Do other comparison reports (e.g., Data Loss, Jitter) indicate an issue leading up to the connectivity loss event?
  6. Identify paths with similar patterns that started at the same time, as these are likely due to the same issue. For those paths, keep the tabs opened in Step 2 open. Close the other tabs.
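
As a rough cross-check of step 6, the sketch below flags the paths whose data loss first exceeds the alert threshold within a short tolerance of one another. The per-path samples, the 3% threshold, and the 15-minute tolerance are illustrative placeholders for the data behind the comparison report.

    from datetime import datetime, timedelta

    # Illustrative per-path data loss samples: path -> list of (timestamp, loss %).
    samples = {
        "path-A": [(datetime(2024, 5, 1, 13, 55), 0.0), (datetime(2024, 5, 1, 14, 0), 8.2)],
        "path-B": [(datetime(2024, 5, 1, 13, 55), 0.1), (datetime(2024, 5, 1, 14, 5), 7.9)],
        "path-C": [(datetime(2024, 5, 1, 13, 55), 0.0), (datetime(2024, 5, 1, 15, 30), 6.4)],
    }

    threshold = 3.0                  # alert threshold for data loss (%)
    tolerance = timedelta(minutes=15)

    # Find the violation onset: the first sample on each path where loss exceeds the threshold.
    onsets = {}
    for path, series in samples.items():
        for ts, loss in series:
            if loss > threshold:
                onsets[path] = ts
                break

    # Paths whose onsets fall within the tolerance of the earliest onset likely share a cause.
    earliest = min(onsets.values())
    related = [path for path, ts in onsets.items() if ts - earliest <= tolerance]
    print("Paths with a common onset:", related)   # -> ['path-A', 'path-B']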

Example

Data loss comparison of four dual-ended paths from the same source monitoring point.
Screen shot showing data loss for four paths being compared.

Note that all four are showing similar data loss patterns in the outbound direction (above the line) and negligible data loss in the inbound direction (below the line). The data loss on each path also starts and ends at the same time.

Compare paths by comparing path performance charts

To compare path performance charts directly, within each tab you opened in Step 2:

  1. Review the charts related to the violation type (e.g., Data Loss chart) to confirm the violation(s) and determine when the problem started. Expand the time range if there is no obvious start to the problem.
  2. Note patterns and anomalies common to charts on different paths over the same time period. Things to note:
    • How the problem is presenting
      • Which metric (e.g., Data Loss) is affected? In one or both directions? Is it constant or intermittent? Is it random or in a regular pattern?
    • Familiar patterns
      • For example, precise cadence implies automation, business hours implies user activity.
    • Connectivity loss (gaps in connectivity, shown as black vertical lines)
      • Do other charts (e.g., data loss, jitter) indicate an issue leading up to the connectivity loss event?
  3. Keep tabs open containing paths that have similar patterns and that started at the same time, as these are likely due to the same issue. Close the other tabs.

Example

Selected path 1:
Screen shot of Data Loss charts for first path being compared.
Selected path 2:
Screen shot of Data Loss charts for second path being compared.
Note commonalities between the charts of path 1 and path 2. They have similar amounts of data loss in the same direction (outbound) starting at the same time.

Step 4: Use selected paths to determine the source of the problem.

To find the problem source, we look for hops that the impacted paths have in common that aren’t shared by non-impacted paths from the same source monitoring point or to the same target.

Background: Determining the source of a network problem

AppNeta monitoring points and Delivery monitoring are set up to monitor network traffic on paths between a variety of sources and targets and generate alerts when conditions on those network paths are outside of norms.
A network device (hop) that is causing a problem will present in the same way on all network paths that pass through it - so all paths through that device will violate and generate an alert (for example, a high degree of data loss). Looking at the alerts with a common source (or common target), we need to find the device that is on all failing paths and not on any non-failing paths.

So:
  • If all paths from a source are violating, the issue will be at a network device close to the source.
  • If all paths to a target are violating, the issue will be at a network device close to the target.
  • In all other cases, the issue is on a network device somewhere between the source and the target.

For example, consider the following network, where AppNeta monitoring points are at the endpoints (S1, T1, T2, …) and network paths are monitored from source S1 to all targets (T1, T2, …). If there is a problem at H2, we should see alerts generated on four paths (S1->T1, S1->T2, S1->T3, and S1->T4), but no alerts on any other paths from S1. The issue is at the hop that all the alerting paths have in common but that is not shared with the paths that are not alerting - in this case, H2.

Tree diagram showing a bad node in order to see paths with the bad node in common.
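
The elimination logic can be expressed as simple set operations. Below is a minimal Python sketch using the example topology above; the hop lists are illustrative and would, in practice, come from the route data for each path.

    # Hops traversed by each monitored path (illustrative, based on the example above).
    routes = {
        "S1->T1": ["H1", "H2", "H3"],
        "S1->T2": ["H1", "H2", "H4"],
        "S1->T3": ["H1", "H2", "H5"],
        "S1->T4": ["H1", "H2", "H6"],
        "S1->T5": ["H1", "H7"],
        "S1->T6": ["H1", "H8"],
    }
    violating = {"S1->T1", "S1->T2", "S1->T3", "S1->T4"}

    # Hops common to every violating path.
    common_to_violating = set.intersection(*(set(routes[p]) for p in violating))

    # Hops that appear on any non-violating path.
    on_non_violating = set().union(*(set(routes[p]) for p in routes if p not in violating))

    suspects = common_to_violating - on_non_violating
    print("Suspected hop(s):", suspects)   # -> {'H2'}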

There are two ways to do this: review the routes or the diagnostics from the time of the problem.

  • Use the routes method for “Connectivity Loss” violations. Also, use it to identify hops common to all violating paths that are not common to any non-violating paths.
  • Use the diagnostics method for everything except “Connectivity Loss” violations to identify the hop where the violation is first seen on each path.
Use routes

To use routes:

  1. On the Network Paths page, filter network paths so that several (maximum 20) violating paths and a couple of non-violating paths from the same source (or target) are shown.
    • To do this, filter by Monitoring Point and Target, specifying the source monitoring point(s) and target(s) for each path.
  2. Click Show Routes.
    Screen shot showing how to show routes for filtered network paths.
  3. Specify the Center Time as the time the problem was seen and select a time Range.
    Screen shot showing how to use the Routes pane to identify common hops between network paths.
  4. Look for hops that the violating paths have in common that are not in common with the non-violating paths at the time of the problem. These are the suspected hops.
    • You can move the time slider back and forth over the time range to see how the routes change.
  5. Record the hostnames and IP addresses of the suspected hops.
Use diagnostics

For each violating path selected:

  1. In the Events pane (not the Events tab), find a Diagnostic Test that was successfully run while the path was violating (after a violation event and before the clear event). This should be around the same time as diagnostics on other paths you select for investigation.
  2. Click the test (represented by a pink circle) then click View.
    Screen shot showing how to see events on the network path performance charts.
    • The diagnostic test appears.
  3. Select the Data Details tab (or Voice Details tab for voice-based paths).
  4. Click Advanced Mode (if available).
  5. Review the metric in violation (e.g., Data Loss). The suspected hop is the first one that shows a non-zero value where the remainder of hops to the target (or source) also show non-zero values (see the sketch after this procedure).
    Screen shot showing data loss details in diagnostic test results.
    • In this example, the issue occurred on one of the first four hops. It is seen at hop 4 but it could also be one of the first three hops where we were unable to determine data loss. In this case, looking at diagnostics on other paths and/or using the “routes” method would help clarify the source of the issue.
  6. If the suspected hop is confirmed on other impacted paths, it is the likely source of the problem.
    • If you do not find confirmation, check that selected diagnostics are close to one another in time and were taken when the problem was occurring. It is also possible that there are multiple problems occurring at the same time.
  7. Record the hostname and IP address of the suspected hop.
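
The hop-selection rule in step 5 can be sketched as follows, assuming a list of per-hop data loss values taken from a single diagnostic (None marks hops that returned no measurement); the numbers are illustrative.

    # Per-hop data loss (%) from a diagnostic, in hop order; None = no measurement.
    hop_loss = [None, None, None, 4.1, 3.8, 4.0, 3.9]

    suspect = None
    for i, loss in enumerate(hop_loss):
        # First hop with measured non-zero loss where every later measured hop is also non-zero.
        if loss and all(later is None or later > 0 for later in hop_loss[i + 1:]):
            suspect = i + 1   # hops are numbered from 1
            break

    # Hops before the suspect that returned no measurement could also be the culprit
    # (as in the example above); cross-check other paths or use the routes method.
    if suspect is not None:
        print(f"Suspected hop: {suspect}")
    else:
        print("No hop matched the rule; check another diagnostic or use the routes method")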

Exceptions:

  • No diagnostic available - If there is no diagnostic available at the right time, try one of the other violating paths. Alternatively, if the issue is still occurring, trigger a diagnostic manually (continue reviewing other paths while this completes).
    • In general, a diagnostic is triggered when a threshold is violated, but it will be postponed or removed from the queue if there are too many diagnostics currently in progress.
  • “Diagnostic Failed” message - The message “Diagnostics Failed - Cannot complete inbound diagnostics because the target monitoring point is not in the same organizational hierarchy as the source monitoring point.” can be safely disregarded. Outbound diagnostics provide the information necessary to determine the problem source.
  • No measurements showing - A hop may be missing or may not show any data other than an IP address and hostname if it does not respond to ICMP packets either because it is configured that way or because it is too heavily loaded.
  • Measurements not above thresholds - If the metric in violation (e.g., Data Loss) does not show values above the alert threshold (e.g., > 3%), it is possible that the diagnostic completed after the violation cleared or that the network condition is intermittent and the diagnostic was taken at a time of improved performance but before the violation cleared. In this case, continue with one of the other selected paths.
Step 5: Document your findings and investigate the suspected device(s).

To finish the process:

  1. Record your findings:
    1. Copy and save a deep link to relevant paths.
    2. Download PDFs.
    3. Attach relevant information to your issue tracking ticket.
  2. Investigate devices at the suspected hop(s) to determine what caused the problem. Keep in mind:
    • If you don’t control the suspected hop(s), pass the information you’ve collected to the service provider that does. Use the Networks view to determine who is responsible for the suspected hop(s).
    • Consider any network changes made around the time the problem started (e.g., hardware changes, software upgrades, or configuration changes).
    • The source of the problem may not always be a single hop; it could be a set of hops at the same location (e.g., sharing the same hostname prefix).
    • The issue can be at the suspected hop, the previous hop, or any infrastructure between the two.
