When you receive an APM network path alert notification, use this workflow to troubleshoot the network problem using APM.

Step 1: Record network path alert information.

Record the following from the alert notification:

  • source monitoring point name
  • target name
  • violation type (e.g. Data Loss)
  • time of the alert
Step 2: Find other violating paths from the same source or to the same target.

The procedure to use depends on when the violation occurred.

Violation is current

For current violations:

  1. In APM, navigate to Delivery > Network Paths.
  2. Click Violated to create a “Status = Violated” filter and show only currently violated network paths.
    Screen shot of Network Paths page with 'Violated' highlighted.
  3. Determine whether there are a significant number of path violations from the source monitoring point (the more violations, the larger the impact; see the grouping sketch after this list):
    1. In the Group By dropdown, select Monitoring Point.
      Screen shot of Network Paths page with 'Group By' dropdown set to 'Monitoring Point'.
    2. Filter by the name of the source monitoring point in question.
      Screen shot of Network Paths page with three steps: '1' cursor in search bar, '2' selecting 'Monitoring Point' as a filter, and '3' specifying the monitoring point name.
    3. Click Apply.
      • The paths listed are all the currently violating paths from the source monitoring point.
  4. If there aren’t a significant number of path violations from the source, determine whether there are a significant number of path violations to the target:
    1. In the Group By dropdown, select Target.
      Screen shot of Network Paths page with 'Group By' dropdown set to 'Target'.
    2. Filter by the name of the target in question.
      Screen shot of Network Paths page with three steps: '1' cursor in search bar, '2' selecting 'Target' as a filter, and '3' specifying the target.
    3. Click Apply.
      • The paths listed are all the currently violating paths to the same target.
  5. Open a selection of paths (at least 3 to 5 paths):
    1. Hover over the status icon on the left to confirm the type of violation (e.g. Data Loss).
      Screen shot of Network Paths page with the hand cursor over a status icon to the left of the selected path and 'Data Loss > 5%' highlighted in the corresponding status dialog.
    2. Ctrl-click (or Cmd-click) several paths with the same type of violation that triggered the original alert. The path performance charts for each path are displayed in separate tabs.
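
If you export the list of currently violated paths (for example, to CSV), the same "group by monitoring point versus group by target" check can be sketched offline. The sketch below is only an illustration; the field names and path data are hypothetical, not an actual APM export schema.

    from collections import Counter

    # Hypothetical export of currently violated paths; field names are
    # illustrative, not the actual APM export schema.
    violated_paths = [
        {"monitoring_point": "test-mp", "target": "10.0.1.20"},
        {"monitoring_point": "test-mp", "target": "10.0.2.20"},
        {"monitoring_point": "branch-mp", "target": "10.0.1.20"},
    ]

    # Count violations grouped by source monitoring point and by target,
    # mirroring the Group By views described above.
    by_source = Counter(p["monitoring_point"] for p in violated_paths)
    by_target = Counter(p["target"] for p in violated_paths)

    print(by_source.most_common())  # e.g. [('test-mp', 2), ('branch-mp', 1)]
    print(by_target.most_common())  # e.g. [('10.0.1.20', 2), ('10.0.2.20', 1)]
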
Violation happened this week

For violations that happened this week:

  1. In APM, navigate to Delivery > Events.
  2. In the Search field, enter the name of the source monitoring point or target known to have had a problem.
    Screen shot of the Events page with 'test-mp' highlighted in the search box.
  3. In the Search field, specify “Event Type = Alert Condition”.
    Screen shot of Events page with three steps: '1' cursor in search bar, '2' selecting 'Event Type' as a filter, and '3' specifying the event type as 'Alert Condition'.
  4. Click Apply.
  5. Click the Event Time column header to sort the events list by event time.
    Screen shot of the Events page with the 'Event Time' column header highlighted.
  6. Find violation events of the same type (e.g. Data Loss) that occurred around the same time by comparing time stamps (see the sketch after this list).
    Screen shot of Events page highlighting four events occurring at the same time with the same condition - Data Loss.
  7. Ctrl-click (or Cmd-click) several paths with the same type of violation that triggered the original alert. The path performance charts for each path are displayed in separate tabs.
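
As a rough offline equivalent of sorting by Event Time and scanning the time stamps, the sketch below picks out alert-condition events of the same violation type that fall within a short window of the original alert. The event fields, path names, times, and the 15-minute window are hypothetical.

    from datetime import datetime, timedelta

    # Hypothetical alert-condition events; the fields shown are illustrative,
    # not the actual APM event schema.
    events = [
        {"path": "test-mp -> 10.0.1.20", "type": "Data Loss", "time": datetime(2024, 5, 6, 9, 14)},
        {"path": "test-mp -> 10.0.2.20", "type": "Data Loss", "time": datetime(2024, 5, 6, 9, 16)},
        {"path": "test-mp -> 10.0.3.20", "type": "Latency", "time": datetime(2024, 5, 6, 9, 15)},
    ]

    # Details of the original alert being investigated.
    alert_type = "Data Loss"
    alert_time = datetime(2024, 5, 6, 9, 15)
    window = timedelta(minutes=15)

    # Events with the same violation type that occurred around the same time.
    related = [e for e in events
               if e["type"] == alert_type and abs(e["time"] - alert_time) <= window]
    for e in related:
        print(e["path"], e["time"])
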
Violation happened more than a week ago

For violations that happened more than a week ago:

  1. In APM, navigate to Delivery > Network Paths.
  2. In the Search field, enter the name of the source monitoring point or target known to have had a problem as well as any other search criteria.
    Screen shot of the Network Paths page with 'test-mp' highlighted in the search box.
  3. Ctrl-click (or Cmd-click) several paths with the same type of violation that triggered the original alert. The path performance charts for each path are displayed in separate tabs.
Step 3: Confirm that selected paths are violating with the same symptoms at the same time.

There are two ways to do this: use a comparison report, or compare the path performance charts of each path. The comparison report makes comparison easier.

Compare paths using a comparison report

To compare paths using a comparison report:

  1. In APM, navigate to Reports > Report List.
  2. In the Data Performance Comparison or Voice Performance Comparison sections, select the report associated with the violation type (e.g. Data Loss).
  3. Edit the report filters to select the network paths you want to compare (those opened in separate tabs in Step 2).
  4. Specify the time range for the report.
  5. Click Update.
    • Note patterns and anomalies common to graphs for different paths. Things to note:
      • How the problem is presenting
        • In one or both directions? Is it constant or intermittent? Is it random or in a regular pattern?
      • Familiar patterns
        • For example, a precise cadence implies automation; a business-hours pattern implies user activity.
      • Connectivity loss (gaps in graph)
        • Do other comparison reports (e.g. Data Loss, Jitter) indicate an issue leading up to the connectivity loss event?
  6. Identify paths with similar patterns that started at the same time, as they are likely due to the same issue. For those paths, keep the tabs opened in Step 2 open. Close the other tabs.

Example

Data loss comparison of four dual-ended paths from the same source monitoring point.
Screen shot of the Data Loss Comparison report showing four paths being compared. All four show similar data loss patterns of about 5% in the outbound direction over three days.
Note that all four are showing similar data loss patterns in the outbound direction (above the line) and negligible data loss in the inbound direction (below the line). The data loss on each path also starts and ends at the same time.

Compare paths by comparing path performance charts

To compare paths using their performance charts, do the following within each tab you opened in Step 2:

  1. Review the charts related to the violation type (e.g. Data Loss chart) to confirm the violation(s) and determine when the problem started. Expand the time range if there is no obvious start to the problem.
  2. Note patterns and anomalies common to charts on different paths over the same time period. Things to note:
    • How the problem is presenting
      • Which metric (e.g. Data Loss) is affected? In one or both directions? Is it constant or intermittent? Is it random or in a regular pattern?
    • Familiar patterns
      • For example, a precise cadence implies automation; a business-hours pattern implies user activity.
    • Connectivity loss (gaps in connectivity, shown as black vertical lines)
      • Do other charts (e.g. data loss, jitter) indicate an issue leading up to the connectivity loss event?
  3. Keep tabs open containing paths that have similar patterns and that started at the same time, as they are likely due to the same issue. Close the other tabs.

Example

Selected path 1:
Screen shot of Data Loss charts with outbound chart showing data loss going from 0% to 5% and continuing for over three days.
Selected path 2:
Screen shot of Data Loss charts with outbound chart showing data loss going from 0% to 5% and continuing for over three days. Same timing and pattern as the previous diagram.
Note commonalities between the charts of path 1 and path 2. They have similar amounts of data loss in the same direction (outbound) starting at the same time.
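
The "similar patterns that started at the same time" check used in both procedures above can be expressed as a simple comparison of violation start times read off the charts. The sketch below is illustrative only; the path names, times, and the 30-minute tolerance are assumptions.

    from datetime import datetime, timedelta

    # Hypothetical violation windows (start, end) read off the charts for each
    # path opened in Step 2; names and times are illustrative only.
    windows = {
        "path-1": (datetime(2024, 5, 3, 8, 0), datetime(2024, 5, 6, 10, 0)),
        "path-2": (datetime(2024, 5, 3, 8, 5), datetime(2024, 5, 6, 10, 2)),
        "path-3": (datetime(2024, 5, 5, 22, 0), datetime(2024, 5, 5, 23, 0)),
    }

    def started_together(a, b, tolerance=timedelta(minutes=30)):
        """Treat two violations as related if their start times are close."""
        return abs(windows[a][0] - windows[b][0]) <= tolerance

    reference = "path-1"
    related = [p for p in windows if p != reference and started_together(reference, p)]
    print(related)  # ['path-2'] -- keep these tabs open; close the others
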

Step 4: Use selected paths to determine the source of the problem.

To find the problem source, we look for hops that the impacted paths have in common that aren’t shared by non-impacted paths from the same source monitoring point or to the same target.

Background: Determining the source of a network problem

AppNeta monitoring points and Delivery monitoring are set up to monitor network traffic on paths between a variety of sources and targets and generate alerts when conditions on those network paths are outside of norms.
A network device (hop) that is causing a problem will present in the same way on all network paths that pass through it, so all paths through that device will violate and generate an alert (for example, a high degree of data loss). Looking at the alerts with a common source (or common target), we need to find the device that is on all failing paths and not on any non-failing paths.

So:
  • If all paths from a source are violating, the issue will be at a network device close to the source.
  • If all paths to a target are violating, the issue will be at a network device close to the target.
  • In all other cases, the issue is on a network device somewhere between the source and the target.

For example, given the following network where AppNeta monitoring points are at the endpoints (S1, T1, T2, …) and we have network paths being monitored from source S1 to all targets (T1, T2, …), if there is a problem at H2, we should see alerts generated on four paths (S1->T1, S1->T2, S1->T3, and S1->T4), but no alerts on any other paths from S1. We can see that the issue is at the hop that all the paths generating alerts have in common but that is not on any of the paths not generating alerts: in this case, H2.

Tree diagram with S1 on the left and T1 through T8 on the right. Paths from source to targets are via H1 through H7. All paths have H1 in common. S1->T1, S1->T2, S1->T3, S1->T4 have hop H2 in common but S1->T5, S1->T6, S1->T7, S1->T8 do not have H2 on their paths.
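
This hop-isolation logic can be written as a couple of set operations. The sketch below uses the example topology above; the hop assignments beyond what the diagram states (H4 through H7) are assumed for illustration.

    # Route of each monitored path as a set of hops, based on the example
    # topology above (hop assignments H4-H7 are assumed for illustration).
    routes = {
        "S1->T1": {"H1", "H2", "H4"},
        "S1->T2": {"H1", "H2", "H4"},
        "S1->T3": {"H1", "H2", "H5"},
        "S1->T4": {"H1", "H2", "H5"},
        "S1->T5": {"H1", "H3", "H6"},
        "S1->T6": {"H1", "H3", "H6"},
        "S1->T7": {"H1", "H3", "H7"},
        "S1->T8": {"H1", "H3", "H7"},
    }
    violating = {"S1->T1", "S1->T2", "S1->T3", "S1->T4"}

    # Hops shared by every violating path...
    common_to_violating = set.intersection(*(routes[p] for p in violating))
    # ...minus any hop that also appears on a non-violating path.
    on_non_violating = set().union(*(routes[p] for p in routes if p not in violating))
    suspected = common_to_violating - on_non_violating

    print(suspected)  # {'H2'}
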

There are two ways to do this: review routes or review diagnostics from the time of the problem.

  • Use the routes method for “Connectivity Loss” violations. Also, use it to identify hops common to all violating paths that are not common to any non-violating paths.
  • Use the diagnostics method for everything except “Connectivity Loss” violations to identify the hop where the violation is first seen on each path.
Use routes

To use routes:

  1. On the Network Paths page, filter network paths so that several violating paths (maximum 20) and a couple of non-violating paths from the same source (or target) are shown.
    • To do this, filter by Monitoring Point and Target, specifying the source monitoring point(s) and target(s) for each path.
  2. Click Show Routes.
    Screen shot of Network Paths page with two steps: '1' Monitoring Point and Target filters set in the search box, and '2' a hand cursor over the 'Show Routes' button.
  3. Specify the Center Time as the time the problem was seen and select a time Range.
    Screen shot of the Routes pane on the Network Paths page showing routes from the source monitoring point to four targets. The common hop between the four paths is highlighted. Center time, range, and the time slider are also highlighted.
  4. Look for hops that the violating paths have in common that are not in common with the non-violating paths at the time of the problem. These are the suspected hops.
    • You can move the time slider back and forth over the time range to see how the routes change.
  5. Record the hostname and IP addresses of the suspected hops.
Use diagnostics

For each violating path selected:

  1. In the Events pane (not the Events tab), find a Diagnostic Test that was successfully run while the path was violating (after a violation event and before the clear event). This should be around the same time as diagnostics on other paths you select for investigation.
    Screen shot of the Events pane on the performance charts page for a network path. A pink dot has the hand cursor on it and is highlighted. The corresponding events list box has a 'View' link highlighted.
  2. Click the test (represented by a pink circle), then click View. The diagnostic test appears.
  3. Select the Data Details tab (or Voice Details tab for voice-based paths).
  4. Click Advanced Mode (if available).
  5. Review the metric in violation (e.g. Data Loss). The suspected hop is the first one that shows a non-zero value where the remainder of hops to the target (or source) also show non-zero values (see the sketch after this procedure).
    Screen shot of the Data Details tab within diagnostics for a path. The 'Data Details' tab is highlighted. Also, the 'Data Loss' column is highlighted. Data loss values are shown on hops 4, 5, and 12 (the last hop). All other hops have data loss showing '-'.
    • In this example, the issue occurred on one of the first four hops. It is first seen at hop 4, but it could also be at one of the first three hops, where we were unable to determine data loss. In this case, looking at diagnostics on other paths and/or using the “routes” method would help clarify the source of the issue.
  6. If the suspected hop is confirmed on other impacted paths, it is the likely source of the problem.
    • If you do not find confirmation, check that selected diagnostics are close to one another in time and were taken when the problem was occurring. It is also possible that there are multiple problems occurring at the same time.
  7. Record the hostname and IP address of the suspected hop.
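
A simplified reading of the rule in step 5 can be sketched as follows: take the first hop that reports loss, provided the final hop (the target side) also reports loss. The per-hop values below are hypothetical and mirror the example described above (loss at hops 4, 5, and 12, with '-' elsewhere).

    # Hypothetical per-hop data loss (%) from one diagnostic; None means the hop
    # did not report a value (shown as '-' in the Data Details tab).
    hop_loss = [None, None, None, 5.2, 5.0, None, None, None, None, None, None, 5.1]

    def first_suspected_hop(losses):
        """Return the 1-based index of the first hop reporting loss, provided the
        final hop (the target side) also reports loss."""
        if not losses or not losses[-1]:
            return None  # the diagnostic did not capture loss end to end
        for i, value in enumerate(losses, start=1):
            if value:
                return i
        return None

    print(first_suspected_hop(hop_loss))  # 4 (though hops 1-3 cannot be ruled out,
                                          # since they reported no measurement)
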

Exceptions:

  • No diagnostic available - If there is no diagnostic available at the right time, try one of the other violating paths. Alternatively, if the issue is still occurring, trigger a diagnostic manually (continue reviewing other paths while this completes).
    • In general, diagnostics are triggered when a threshold is violated, but a diagnostic will be postponed or removed from the queue if there are too many diagnostics currently in progress.
  • “Diagnostic Failed” message - The message “Diagnostics Failed - Cannot complete inbound diagnostics because the target monitoring point is not in the same organizational hierarchy as the source monitoring point.” can be safely disregarded. Outbound diagnostics provide the information necessary to determine the problem source.
  • No measurements showing - A hop may be missing or may not show any data other than an IP address and hostname if it does not respond to ICMP packets either because it is configured that way or because it is too heavily loaded.
  • Measurements not above thresholds - If the metric in violation (e.g. Data Loss) does not show values above the alert threshold (e.g. > 3%), it is possible that the diagnostic completed after the violation cleared or that the network condition is intermittent and the diagnostic was taken at a time of improved performance but before the violation cleared. In this case, continue with one of the other selected paths.
Step 5: Document your findings and investigate the suspected device(s).

To finish the process:

  1. Record your findings:
    1. Copy and save a deep link to relevant paths.
    2. Download PDFs and take screenshots of charts (include path name and time stamps).
    3. Attach relevant information to your issue tracking ticket.
  2. Investigate devices at the suspected hop(s) to determine what caused the problem. Keep in mind:
    • If you don’t control the suspected hop(s), pass the information you’ve collected to the service provider that does. Use the Networks view to determine who is responsible for the suspected hop(s).
    • Consider any network changes made around the time the problem started (e.g., hardware changes, software upgrades, or configuration changes).
    • The suspected source of the problem may not be a specific hop in all cases; it may be a group of hops at the same location (e.g., with the same hostname prefix).
    • The issue can be at the suspected hop, the previous hop, or any infrastructure in between the two.