Advanced SD-WAN Troubleshooting Lab Guide | NHPREP

Admin
March 26, 2026
Tags: SD-WAN troubleshooting, SD-WAN lab, Catalyst SD-WAN, SD-WAN debug, SD-WAN hands-on

Introduction

When a critical SD-WAN overlay goes down in production, every minute of downtime translates to lost revenue and frustrated users. The ability to quickly isolate whether a failure lives in the control plane, data plane, overlay routing layer, or policy configuration separates seasoned engineers from those still climbing the learning curve. This SD-WAN troubleshooting lab guide walks you through the structured methodology and CLI-based tools you need to systematically diagnose and resolve the most challenging issues you will encounter on Catalyst SD-WAN deployments.

SD-WAN architectures introduce a level of abstraction that traditional WAN engineers are not accustomed to. Overlay tunnels, centralized policy distribution, OMP route exchanges, and BFD-based liveness detection all create new failure domains that require a different troubleshooting mindset. Rather than simply checking interface counters and routing tables, you must now think in terms of control connections to orchestrators and controllers, IPsec data plane tunnels between WAN Edge routers, and the policies that shape how traffic flows across the fabric.

This guide is structured around the same categories of faults you will encounter in real-world environments and in hands-on lab scenarios: control connection faults, data plane tunnel connectivity faults, policies and routing issues, and other miscellaneous troubles. By the end of this article, you will have a clear mental framework for approaching any SD-WAN issue, along with the specific CLI commands and verification steps to get from symptom to root cause efficiently.

What Are the Catalyst SD-WAN Components You Need to Understand for Troubleshooting?

Before diving into fault isolation, it is essential to understand the components that make up a Catalyst SD-WAN deployment and how they interact. Each component plays a distinct role, and understanding these roles is the first step in knowing where to look when something breaks.

Component Naming and Legacy References

The Catalyst SD-WAN platform has undergone a naming transition. The newer official names and their legacy equivalents are:

Modern Name | Legacy Name | Role
Catalyst SD-WAN Manager | vManage | Network Management System (NMS)
Catalyst SD-WAN Validator | vBond | Orchestrator
Catalyst SD-WAN Controller | vSmart | Controller

It is important to note that while the marketing names have changed, all CLI outputs and the entire codebase continue to use the legacy names (vmanage, vbond, vsmart). There are no plans to change these internal references. This means that every show command you run, every log message you read, and every debug output you analyze will still reference vManage, vBond, and vSmart. For practical troubleshooting purposes, you should be fluent in both naming conventions but expect to work with the legacy names at the CLI.
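
If your runbooks use the new product names, a small lookup can help when you search CLI output or logs. The following is a minimal sketch in Python; the function name `to_legacy` is made up for illustration, and the mapping simply restates the naming table.

```python
# Marketing name -> legacy name as it appears in CLI output and logs.
LEGACY_NAMES = {
    "Catalyst SD-WAN Manager": "vmanage",
    "Catalyst SD-WAN Validator": "vbond",
    "Catalyst SD-WAN Controller": "vsmart",
}


def to_legacy(text: str) -> str:
    """Replace modern product names in a runbook line with the legacy CLI names."""
    for modern, legacy in LEGACY_NAMES.items():
        text = text.replace(modern, legacy)
    return text


print(to_legacy("Check the tunnel to the Catalyst SD-WAN Controller"))
```

The same mapping works in reverse if you need to translate CLI findings back into the official names for a report.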

The Role of Each Component in Troubleshooting Context

Understanding what each component does helps you narrow down where a fault might originate:

  • vBond (Validator): The initial point of contact for all SD-WAN devices. It orchestrates the authentication and onboarding process. If a WAN Edge router cannot reach vBond, it will never join the overlay.
  • vSmart (Controller): Responsible for the centralized control plane. It distributes OMP routes, policies, and service chain information to WAN Edge routers via DTLS or TLS tunnels. If vSmart connectivity is lost, WAN Edge routers stop receiving routing and policy updates and, once the OMP graceful restart timer expires, discard the routes they had learned.
  • vManage (Manager): The NMS that provides the GUI for configuration, monitoring, and management. While critical for operations, vManage issues are a separate troubleshooting domain that focuses on server-side concerns such as clustering, services, and OS-level problems.

Pro Tip: When troubleshooting SD-WAN issues, the primary focus should be on WAN Edge routers and the CLI troubleshooting tools available on them. vManage NMS troubleshooting (disaster recovery, clustering, server, OS, and services) is an entirely separate discipline and requires different tools and approaches.

How Does the SD-WAN Lab Topology Map to Real-World Troubleshooting?

A well-designed SD-WAN troubleshooting lab mirrors the essential elements of a production deployment. Understanding the topology is critical because it defines the relationships between devices and the paths that control and data traffic take through the network.

VPN Segmentation in the Lab

The lab topology uses VPN segmentation to separate different traffic domains, which is consistent with how production SD-WAN networks are designed:

  • VPN 0 (Transport/Underlay): Carries the WAN transport connections. This is the underlay network that provides IP reachability between SD-WAN devices. All control connections and data plane tunnels are built on top of this foundation.
  • VPN 1 (Service-Side): Carries user traffic for one service segment. This is where enterprise applications and services reside.
  • VPN 2 (Service-Side): Carries user traffic for a second service segment, providing additional segmentation.

The Layered Architecture

The lab topology illustrates how SD-WAN builds layers of functionality on top of the underlay:

  1. Underlay Connectivity: Basic IP reachability between all SD-WAN devices through the transport network (VPN 0).
  2. Control Plane: DTLS/TLS tunnels between WAN Edge routers and vSmart controllers, carrying OMP (Overlay Management Protocol) updates.
  3. Data Plane: IPsec tunnels between WAN Edge routers, providing encrypted transport for user traffic. BFD (Bidirectional Forwarding Detection) runs inside these tunnels to detect path failures.
  4. Overlay Routing and Policies: OMP routes and centralized policies distributed by vSmart that determine how traffic is forwarded across the overlay.
  5. Traffic Forwarding: The actual movement of user packets based on the combined effect of routing tables, policies, and tunnel state.

This layered model is not just theoretical; it directly maps to the categories of faults you will troubleshoot. A failure at a lower layer (such as underlay connectivity) will cascade and affect all higher layers. This is why a structured, bottom-up troubleshooting approach is so effective.

What Is the Structured Approach to SD-WAN Troubleshooting?

Effective SD-WAN troubleshooting follows a structured methodology that mirrors the layered architecture of the solution. Jumping straight to policy debugging when the underlay is broken will waste valuable time. The recommended approach is to work through the layers systematically.

The Five-Layer Troubleshooting Framework

Based on the fault categories that arise in real-world SD-WAN deployments, the troubleshooting framework can be organized into five distinct layers:

Layer | Focus Area | Key Questions
1 | Underlay and Service-Side Connectivity | Can devices reach each other at the transport level? Are service-side interfaces up?
2 | Control Plane Issues | Are DTLS/TLS tunnels to vSmart and vBond established? Is OMP peering up?
3 | Data Plane Tunnel Issues | Are IPsec tunnels between WAN Edge routers up? Is BFD detecting liveness?
4 | Overlay Routing and Policies | Are OMP routes being received and installed? Are policies being applied correctly?
5 | Traffic Forwarding | Is user traffic actually taking the expected path? Are packets being forwarded correctly?

Why Bottom-Up Matters

Consider a scenario where users report that traffic between two branch sites is not flowing. If you immediately start examining policies, you might spend hours analyzing complex route maps and access lists. But if the root cause is a failed IPsec tunnel due to a broken underlay path, all that policy analysis is wasted effort.

By starting at Layer 1 (underlay connectivity) and working your way up, you quickly eliminate or confirm each layer as a potential source of the problem. This approach is especially important in SD-WAN because:

  • Dependencies are strict: Control connections cannot form without underlay reachability. Data plane tunnels cannot form without control connections. Policies cannot take effect without OMP routes.
  • Symptoms propagate upward: A failure at framework Layer 2 (control plane, not OSI Layer 2) will manifest as Layer 3 symptoms (no data plane tunnels) and Layer 5 symptoms (no traffic forwarding).
  • Fix at the lowest broken layer: Fixing the root cause at the lowest layer will often automatically resolve all the symptoms at higher layers.
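
The bottom-up rule can be expressed as a short loop that stops at the first failing layer. This is only a sketch: the layer names follow the five-layer framework, while the check functions are hypothetical placeholders for whatever show commands or probes you actually run.

```python
from typing import Callable, Optional

# Layers in bottom-up order, mirroring the five-layer framework.
LAYERS = [
    "underlay and service-side connectivity",
    "control plane",
    "data plane tunnels",
    "overlay routing and policies",
    "traffic forwarding",
]


def lowest_broken_layer(checks: dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run the checks bottom-up and return the first layer whose check fails."""
    for layer in LAYERS:
        if not checks[layer]():
            return layer  # stop here: every higher layer depends on this one
    return None


# Hypothetical results: the underlay is fine, the control plane is down.
results = {layer: (lambda: True) for layer in LAYERS}
results["control plane"] = lambda: False
results["traffic forwarding"] = lambda: False  # a symptom, not the root cause

print(lowest_broken_layer(results))  # control plane
```

Returning at the first failure is deliberate: a failure reported higher up is treated as a symptom until the lower layer is fixed and the checks are run again.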

Pro Tip: When approaching any SD-WAN issue, resist the urge to jump to the layer where the symptom appears. Instead, start by verifying underlay connectivity and work your way up. Experienced engineers know that the majority of complex-looking SD-WAN problems have simple root causes at the lower layers.

How Do You Troubleshoot SD-WAN Control Connection Faults?

Control connection faults are among the most impactful issues in an SD-WAN deployment. Without functioning control connections, WAN Edge routers cannot receive routing information or policies from the controllers, effectively isolating them from the overlay fabric.

Understanding Control Connections

Control connections in SD-WAN are DTLS (Datagram Transport Layer Security) or TLS (Transport Layer Security) tunnels that connect WAN Edge routers to vBond, vSmart, and vManage. These tunnels carry:

  • OMP updates: Route advertisements, TLOC (Transport Location) information, and service routes
  • Policy distribution: Centralized data policies, app-route policies, and control policies pushed from vSmart
  • Configuration management: Templates and configuration pushes from vManage

Common Control Connection Fault Categories

In a typical SD-WAN troubleshooting lab, you will encounter multiple control connection fault scenarios. These faults represent the kinds of issues that frequently occur in production environments:

  1. Certificate-related failures: SD-WAN uses PKI for device authentication. Expired certificates, mismatched organization names, or incorrect serial numbers will prevent control connections from forming.

  2. Reachability issues: If the WAN Edge router cannot reach vBond on the transport network, the entire onboarding and control connection process fails. This could be due to underlay routing issues, firewall rules blocking the required ports, or incorrect tunnel interface configurations.

  3. Configuration mismatches: Incorrect system IP addresses, site IDs, or organization names in the device configuration will prevent successful authentication and control connection establishment.

  4. Resource constraints: Control connections consume resources on both the WAN Edge and the controllers. In large-scale deployments, hitting connection limits can prevent new devices from joining.
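
One way to use these categories is to map the error codes you see in the connection history onto them. The sketch below is illustrative only: the codes are ones commonly seen in the history's error columns, but the exact set and the grouping are assumptions to verify against your own release, and `categorize` is a made-up helper name.

```python
# Illustrative grouping of connection-history error codes into fault categories.
# The grouping is an assumption for this sketch, not an official mapping.
FAULT_CATEGORY = {
    "CRTVERFL": "certificate",      # certificate verification failure
    "BIDNTVRFD": "certificate",     # board ID (serial) not verified
    "SERNTPRES": "certificate",     # serial number not present
    "DCONFAIL": "reachability",     # DTLS connection failure
    "CTORGNMMIS": "configuration",  # organization name mismatch
}


def categorize(error_code: str) -> str:
    """Return the fault category for an error code, or 'unclassified'."""
    return FAULT_CATEGORY.get(error_code.strip().upper(), "unclassified")


for code in ("DCONFAIL", "CTORGNMMIS", "SOMETHING_ELSE"):
    print(code, "->", categorize(code))
```

Anything that comes back as unclassified is a prompt to read the full history entry rather than guess.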

Verification Commands for Control Connections

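The CLI provides several commands to verify the state of control connections. Because troubleshooting centers on WAN Edge routers, these commands form the core of your diagnostic toolkit. They are shown in their Viptela OS (vEdge) form; on IOS XE-based WAN Edge routers, most take a show sdwan prefix (for example, show sdwan control connections, and show sdwan control connection-history for the history). The last command, show orchestrator connections, is run on vBond itself to see the connections it holds: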

show control connections
show control local-properties
show control connections-history
show orchestrator connections

When examining control connection output, pay attention to:

  • State: Whether the connection is in UP state or stuck in a transitional state
  • Uptime: How long the connection has been established (or how recently it went down)
  • Local and remote IPs: Whether the correct transport addresses are being used
  • Protocol: Whether DTLS or TLS is being used as configured

Pro Tip: Do not spend more than 15 to 20 minutes on any single troubleshooting task. If you find yourself going in circles, step back and verify the layer below. Most control connection faults can be identified within minutes using the right show commands and a systematic approach.

How Do You Troubleshoot SD-WAN Data Plane Tunnel Issues?

Once control connections are established, the next layer to verify is the data plane. Data plane tunnels are the IPsec-encrypted paths between WAN Edge routers that carry actual user traffic.

How Data Plane Tunnels Work

Data plane tunnels in SD-WAN are IPsec tunnels that are automatically negotiated between WAN Edge routers based on TLOC information exchanged via OMP through the vSmart controllers. Key characteristics include:

  • Automatic establishment: Unlike traditional IPsec VPNs, SD-WAN data plane tunnels are automatically established based on OMP TLOC advertisements. You do not manually configure crypto maps or tunnel interfaces for each peer.
  • BFD monitoring: Each data plane tunnel runs BFD (Bidirectional Forwarding Detection) to continuously monitor path quality and detect failures quickly. How quickly depends on the configured BFD hello interval and multiplier; tuned timers can bring detection down to sub-second and trigger fast failover.
  • Multi-transport support: A single WAN Edge router can have tunnels across multiple transports (MPLS, Internet, LTE), with BFD monitoring each path independently.

Common Data Plane Tunnel Faults

Data plane tunnel issues typically fall into two main categories:

  1. Tunnel formation failures: The IPsec tunnel between two WAN Edge routers fails to establish. This could be due to:

    • NAT traversal issues on the underlay
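    • Firewall rules blocking the UDP-encapsulated tunnel traffic (WAN Edge routers use UDP 12346 by default for both DTLS control connections and IPsec tunnels, and may hop to higher ports such as 12366 and 12386)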
    • Incorrect TLOC information being advertised via OMP
    • MTU mismatches causing fragmentation issues
  2. Tunnel degradation: The tunnel is established but experiencing performance issues. BFD detects:

    • High latency on the path
    • Packet loss exceeding configured thresholds
    • Jitter affecting real-time traffic
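
Tunnel degradation comes down to comparing measured loss, latency, and jitter against thresholds. The sketch below shows that comparison; the `PathStats` record and the threshold values are hypothetical, and in practice the limits would come from the SLA classes in your own policies.

```python
from dataclasses import dataclass


@dataclass
class PathStats:
    color: str          # transport color of the tunnel, e.g. "mpls"
    loss_pct: float
    latency_ms: float
    jitter_ms: float


# Hypothetical thresholds for illustration only.
SLA = {"loss_pct": 2.0, "latency_ms": 150.0, "jitter_ms": 30.0}


def violations(stats: PathStats, sla: dict = SLA) -> list:
    """Return the names of the metrics that exceed their threshold."""
    return [metric for metric, limit in sla.items() if getattr(stats, metric) > limit]


paths = [
    PathStats("mpls", loss_pct=0.0, latency_ms=20.0, jitter_ms=2.0),
    PathStats("biz-internet", loss_pct=3.5, latency_ms=80.0, jitter_ms=10.0),
]
for p in paths:
    print(p.color, violations(p) or "within thresholds")
```

A tunnel that is up but reports violations belongs in the degradation category, not the formation category.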

Verification Commands for Data Plane Tunnels

show bfd sessions
show tunnel statistics
show ipsec inbound-connections
show ipsec outbound-connections

When analyzing data plane tunnel issues, always correlate your findings with the control plane state. If control connections to vSmart are down, TLOCs will not be exchanged, and data plane tunnels will not form. This is another reason why the bottom-up approach is essential.

What Are the Key Overlay Routing and Policy Issues in SD-WAN?

With control connections and data plane tunnels verified, the next troubleshooting layer addresses overlay routing and policy issues. This is where the centralized intelligence of SD-WAN comes into play, and it is also where some of the most complex troubleshooting scenarios arise.

OMP Route Distribution

OMP (Overlay Management Protocol) is the routing protocol that runs between WAN Edge routers and vSmart controllers. It carries three types of routes:

  • OMP routes (vRoutes): Prefixes learned from the service-side VPNs on each WAN Edge router
  • TLOCs: Transport locations that identify the tunnel endpoints on each WAN Edge router
  • Service routes: Information about services available at each site (firewalls, IDS/IPS, etc.)

When OMP routes are not being received or installed correctly, users will experience connectivity failures even though the tunnels themselves are up.

Policy-Related Issues

SD-WAN policies are distributed from vSmart to WAN Edge routers and control how traffic is routed and forwarded across the overlay. Policy issues can manifest in several ways:

  • Traffic taking unexpected paths: A data policy might be steering traffic to a specific TLOC or tunnel that is experiencing degradation.
  • Traffic being dropped: An improperly configured access control policy might be blocking traffic that should be allowed.
  • Suboptimal routing: Control policies might be modifying OMP route attributes in ways that cause traffic to take longer paths.

Working with Policy Groups and Config Groups

Modern Catalyst SD-WAN deployments (running version 20.15.1/17.15.1 and later) use policy groups and config groups, which represent the newer "UI 2.0" approach. This replaces the older device templates and feature templates ("UI 1.0") approach.

Configuration Approach | Version | UI Generation | Key Characteristics
Device Templates + Feature Templates | Pre-20.15 | UI 1.0 | Individual template-based configuration
Policy Groups + Config Groups | 20.15.1+ / 17.15.1+ | UI 2.0 | Group-based configuration and policy management

Understanding which configuration approach is in use is important for troubleshooting because the way policies are applied and verified differs between the two approaches. The migration from UI 1.0 to UI 2.0 itself can introduce configuration inconsistencies that lead to troubleshooting scenarios.

Verification Commands for Routing and Policies

show omp routes
show omp tlocs
show omp peers
show policy from-vsmart
show ip route vrf 1

Pro Tip: Policy and routing faults are some of the most challenging scenarios, and some may take significant time to diagnose, even for experienced engineers. Comparing expected behavior against actual behavior, one policy at a time, is the most reliable path to resolution.

How Do You Diagnose Traffic Forwarding Issues in SD-WAN?

Traffic forwarding issues represent the final troubleshooting layer and are often the symptom that triggers the entire investigation. A user reports that they cannot reach a resource, and you need to trace the path from source to destination across the SD-WAN overlay.

The Traffic Forwarding Path

In SD-WAN, a packet from a service-side host traverses several decision points:

  1. Service-side ingress: The packet arrives at the WAN Edge router on a service-side interface (VPN 1 or VPN 2).
  2. Route lookup: The router performs a route lookup in the appropriate VRF routing table.
  3. Policy evaluation: If data policies are configured, the packet is evaluated against policy rules that may modify the next-hop, set DSCP values, or redirect traffic.
  4. Tunnel selection: Based on the route lookup and policy evaluation, the router selects the appropriate IPsec tunnel to reach the destination WAN Edge.
  5. Encapsulation and forwarding: The packet is encapsulated in IPsec and forwarded across the selected tunnel.
  6. Remote decapsulation: The destination WAN Edge router decapsulates the packet and forwards it to the service-side destination.
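
The policy evaluation step is easiest to reason about as ordered, first-match processing. The sketch below models that idea in Python. It is not the router's implementation: the `Sequence` fields are simplified to a destination prefix match, and the default action is a parameter because it depends on how your policy is written.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass
class Sequence:
    number: int
    dest_prefix: str
    action: str                      # "accept" or "drop"
    set_dscp: Optional[int] = None   # optional DSCP rewrite


def evaluate(sequences, dst_ip: str, default_action: str = "drop"):
    """Return (matched sequence number, action, DSCP) for a destination IP."""
    dst = ip_address(dst_ip)
    # Sequences are evaluated in ascending order; the first match wins.
    for seq in sorted(sequences, key=lambda s: s.number):
        if dst in ip_network(seq.dest_prefix):
            return seq.number, seq.action, seq.set_dscp
    return None, default_action, None


# Hypothetical policy for illustration.
policy = [
    Sequence(20, "10.0.0.0/8", "accept"),
    Sequence(10, "10.2.0.0/16", "accept", set_dscp=46),
]
print(evaluate(policy, "10.2.3.4"))
print(evaluate(policy, "192.168.1.1"))
```

When traffic behaves unexpectedly, walking a sample packet through the sequences this way often shows that an earlier, broader sequence matched first.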

Common Traffic Forwarding Faults

Traffic forwarding issues can be caused by problems at any of the preceding layers, but some are specific to the forwarding plane:

  • VRF routing table issues: Routes might be present in OMP but not installed in the VRF routing table due to route redistribution problems.
  • ACL blocking: Implicit or explicit access control lists might be dropping traffic.
  • NAT issues: If NAT is configured on the service side, incorrect translations can break connectivity.
  • QoS dropping: Aggressive QoS policies might be dropping traffic during congestion.

Systematic Forwarding Verification

The key to diagnosing forwarding issues is to verify each step in the forwarding path:

show ip route vrf 1
show ip route vrf 1 <destination-ip>
show policy-firewall stats

Start by confirming the route exists in the local VRF table, then verify that the route points to the correct next-hop (either a local interface for service-side destinations or a tunnel for remote destinations). If the route is correct, check for policies or ACLs that might be affecting the traffic.
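
To make the route check concrete, here is a minimal longest-prefix-match sketch using Python's standard ipaddress module. The table contents and next-hop descriptions are hypothetical; the point is that the most specific matching prefix decides whether traffic goes to a local interface or into a tunnel.

```python
from ipaddress import ip_address, ip_network

# Hypothetical VRF 1 table: prefix -> next-hop description.
VRF1 = {
    "0.0.0.0/0": "tunnel to hub",
    "10.1.0.0/16": "tunnel to branch-1",
    "10.1.20.0/24": "local interface GigabitEthernet2",
}


def lookup(table: dict, dst_ip: str):
    """Return (prefix, next hop) for the longest matching prefix, or None."""
    dst = ip_address(dst_ip)
    matches = [ip_network(p) for p in table if dst in ip_network(p)]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return str(best), table[str(best)]


print(lookup(VRF1, "10.1.20.5"))
print(lookup(VRF1, "10.1.99.1"))
```

If the lookup lands on a less specific prefix than you expected, the more specific route is missing from the VRF table, which points back to OMP or redistribution.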

What Is the Recommended Troubleshooting Workflow for SD-WAN Labs?

Working through SD-WAN troubleshooting scenarios in a lab environment requires a disciplined approach to maximize learning and efficiency. Here is a structured workflow based on best practices for hands-on lab exercises.

Time Management

Each troubleshooting task in a lab scenario should be approached with a time budget. The recommended approach is:

  • Allocate 15 to 20 minutes per task maximum: If you cannot identify the root cause within this window, consult reference materials or solution guides before moving on.
  • Recognize task dependencies: In most lab scenarios, tasks are sequential and dependent. You need to fix all tasks in one section before continuing with the next section. For example, all control connection faults must be resolved before you can effectively troubleshoot data plane tunnel issues.

The Three-Tier Approach to Lab Exercises

Effective lab practice uses a tiered approach that scales with your experience level:

Tier | Approach | Best For
Tasks Only | Attempt each task with no hints or solutions | Experienced engineers who want a challenge
Tasks with Brief Solutions | Quick explanations of faults and fixes, no step-by-step detail | Intermediate engineers who need occasional guidance
Tasks with Full Solutions | Detailed step-by-step troubleshooting walkthroughs | Engineers learning SD-WAN troubleshooting tools and methodology

The recommended strategy is to start with the tasks-only approach and fall back to brief solutions or full solutions only when you get stuck. This builds genuine troubleshooting muscle memory rather than just following procedures.

Lab Task Categories and Distribution

A comprehensive SD-WAN hands-on troubleshooting lab typically covers:

Category | Number of Tasks | Focus
Control Connection Faults | 4 | vBond, vSmart, and vManage control connection issues
Data Plane Tunnel Faults | 2 | IPsec tunnel establishment and BFD issues
Policies and Routing Issues | 3 | OMP, data policies, control policies
Other Troubles | 3 | Miscellaneous issues including service-side problems

This distribution gives the most weight to control connection faults, which sit at the base of the overlay and must be resolved before the later sections can be attempted.

Pro Tip: Do not get discouraged if you cannot solve every task within the allotted time. Some troubleshooting scenarios are intentionally challenging and may require multiple attempts. Even experienced SD-WAN engineers encounter problems that take significant time to diagnose when seeing them for the first time. The goal is to build familiarity with the tools and methodology, not to achieve perfection on the first attempt.

Essential CLI Tools for SD-WAN Debug and Troubleshooting

The WAN Edge router CLI is your primary troubleshooting interface for SD-WAN issues. While the vManage GUI provides monitoring dashboards and high-level visibility, the CLI gives you the granular detail needed to identify root causes.

Control Plane Verification Commands

These commands help you verify that the SD-WAN control plane is functioning correctly:

show control connections
show control local-properties
show control connections-history
show orchestrator connections
show certificate installed
show certificate root-ca-cert

The show control connections command is often the first command you should run. It gives you a quick overview of all control connections, their state, and which controllers the WAN Edge is connected to.

The show control connections-history command is invaluable when connections are flapping. It shows you the history of connection attempts, including failure reasons, which can point you directly to the root cause.
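
When a connection is flapping, counting recent down events makes the pattern obvious. The sketch below assumes you have already extracted the down-event timestamps from the history output; the ten-minute window and the sample timestamps are arbitrary illustrative values.

```python
from datetime import datetime, timedelta


def count_flaps(down_times, window=timedelta(minutes=10), now=None):
    """Count down events that fall within the trailing window ending at 'now'."""
    if not down_times:
        return 0
    now = now or max(down_times)  # default: measure back from the latest event
    return sum(1 for t in down_times if timedelta(0) <= now - t <= window)


# Hypothetical timestamps for illustration.
events = [
    datetime(2026, 3, 26, 9, 0),
    datetime(2026, 3, 26, 9, 55),
    datetime(2026, 3, 26, 9, 58),
    datetime(2026, 3, 26, 10, 0),
]
print(count_flaps(events))
```

Several events in a short window suggest an unstable path or a timer problem rather than a one-time outage.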

OMP and Routing Verification

show omp peers
show omp routes
show omp routes vpn 1
show omp tlocs
show omp summary

OMP commands let you see what routing information the WAN Edge has received from vSmart and what it is advertising. Comparing OMP routes between two WAN Edge routers can quickly reveal whether a routing issue is local or centralized.
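
Comparing the prefixes seen on two routers is a plain set comparison. In this sketch the prefix sets are hypothetical and assumed to be already extracted from each router's OMP output.

```python
def compare_prefixes(edge_a: set[str], edge_b: set[str]) -> dict[str, set[str]]:
    """Split prefixes into those seen on only one router and those seen on both."""
    return {
        "only_on_a": edge_a - edge_b,
        "only_on_b": edge_b - edge_a,
        "common": edge_a & edge_b,
    }


# Hypothetical prefixes received in VPN 1 on two branch routers.
branch1 = {"10.1.0.0/16", "10.2.0.0/16", "10.3.0.0/16"}
branch2 = {"10.1.0.0/16", "10.3.0.0/16"}

result = compare_prefixes(branch1, branch2)
print(sorted(result["only_on_a"]))
```

A prefix missing on only one router points to something local to that router or to a policy that targets it. A prefix missing everywhere points back to the originating site or to the controller.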

Data Plane and Tunnel Verification

show bfd sessions
show tunnel statistics
show ipsec inbound-connections
show ipsec outbound-connections
show interface tunnel <number>

BFD session information is particularly useful because it shows you not just whether a tunnel is up, but also the measured latency, loss, and jitter on each path. This information is critical for diagnosing performance-related issues.

Policy Verification

show policy from-vsmart
show running-config policy
show policy access-list

Policy verification commands let you see exactly what policies have been pushed from vSmart and how they are being applied locally. When traffic is not flowing as expected, comparing the intended policy configuration on vManage with the actual policy installed on the WAN Edge often reveals the discrepancy.
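
If you can export both versions as text, a line diff highlights the discrepancy quickly. The sketch below uses Python's standard difflib; the two policy snippets are invented placeholders, not real policy syntax.

```python
import difflib


def policy_diff(intended: str, installed: str) -> list:
    """Return unified-diff lines between the intended and installed policy text."""
    return list(difflib.unified_diff(
        intended.splitlines(),
        installed.splitlines(),
        fromfile="intended (vManage)",
        tofile="installed (WAN Edge)",
        lineterm="",
    ))


# Placeholder text for illustration only.
intended = "sequence 10\n action accept"
installed = "sequence 10\n action drop"

for line in policy_diff(intended, installed):
    print(line)
```

An empty diff means the policy arrived as intended, so the next place to look is how the policy is matched and applied.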

Software Versions and Their Impact on SD-WAN Troubleshooting

Software version awareness is an often-overlooked aspect of SD-WAN troubleshooting. Different versions introduce new features, change behaviors, and occasionally introduce bugs that affect network operation.

Current Recommended Versions

Modern SD-WAN troubleshooting labs are based on software version 20.15.1/17.15.1, which represents a significant advancement over earlier releases (such as 20.9/17.9). The version numbering convention uses two parallel tracks:

  • 20.x: The vManage/controller software version
  • 17.x: The IOS-XE based WAN Edge software version
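
The parallel numbering can be captured in a one-line mapping. This is only a sketch of the naming convention described above, with a made-up function name; it does not replace checking the official compatibility matrix for the releases you run.

```python
def edge_release_for(controller_version: str) -> str:
    """Map a 20.x controller release to the same-numbered 17.x WAN Edge release."""
    major, rest = controller_version.split(".", 1)
    if major != "20":
        raise ValueError(f"expected a 20.x controller release, got {controller_version!r}")
    return "17." + rest


print(edge_release_for("20.15.1"))  # 17.15.1
```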

UI 2.0 vs UI 1.0 Considerations

One of the most significant changes in recent SD-WAN versions is the transition from UI 1.0 to UI 2.0:

  • UI 1.0 used device templates and feature templates for configuration management
  • UI 2.0 uses policy groups and config groups for a more streamlined approach

This transition has direct implications for troubleshooting:

  1. Configuration verification: The way configurations are structured and applied differs between the two approaches. When verifying configurations on WAN Edge routers, you need to understand which approach was used to push the configuration.
  2. Template migration issues: Organizations migrating from UI 1.0 to UI 2.0 may encounter configuration inconsistencies during the transition.
  3. Documentation alignment: Ensure that any troubleshooting documentation or runbooks you reference are aligned with the UI version in use.

Building Your SD-WAN Troubleshooting Skills Through Hands-On Practice

There is no substitute for hands-on practice when it comes to building SD-WAN troubleshooting skills. Reading about troubleshooting methodology is valuable, but actually working through fault scenarios on real or simulated equipment is what transforms knowledge into competence.

Structured Practice Methodology

When approaching SD-WAN hands-on lab exercises, follow this methodology:

  1. Read the task description carefully: Understand what the expected behavior should be before diving into the CLI.
  2. Verify the current state: Use show commands to understand what is actually happening on the device.
  3. Compare expected vs. actual: Identify the gap between what should be happening and what is happening.
  4. Form a hypothesis: Based on the gap analysis, develop a theory about what might be causing the issue.
  5. Test and verify: Make a targeted change to test your hypothesis. Verify whether the change resolved the issue.
  6. Document your findings: Keep notes on what the fault was, how you identified it, and what fixed it. This builds your personal troubleshooting knowledge base.

Common Pitfalls to Avoid

  • Changing multiple things at once: Make one change at a time and verify after each change. Changing multiple variables simultaneously makes it impossible to identify which change actually fixed (or worsened) the problem.
  • Ignoring the basics: Always verify underlay connectivity before diving into overlay troubleshooting. A surprising number of complex-looking SD-WAN issues are caused by simple underlay problems.
  • Not using connection history: The connection history commands are among the most valuable troubleshooting tools. They show you exactly why a connection failed, which is far more useful than knowing that it is currently down.
  • Forgetting about dependencies: Remember that tasks in a lab are often dependent on each other. Failing to completely resolve an earlier task can cause misleading symptoms in later tasks.

Frequently Asked Questions

What is the best starting point when troubleshooting an SD-WAN issue?

Always start by verifying underlay connectivity and working your way up through the layers: control connections, data plane tunnels, overlay routing, and finally traffic forwarding. The bottom-up approach ensures you identify the root cause at the correct layer rather than chasing symptoms at higher layers. Begin with basic IP reachability on the transport network (VPN 0), then check control connections to vBond, vSmart, and vManage using show control connections.

Why do the CLI outputs still show vManage, vBond, and vSmart instead of the new Catalyst SD-WAN names?

While the official product names have been updated to Catalyst SD-WAN Manager, Catalyst SD-WAN Validator, and Catalyst SD-WAN Controller respectively, all CLI outputs and the entire codebase continue to use the legacy names (vmanage, vbond, vsmart). There are no plans to change these internal references. This applies to all show commands, debug outputs, log files, and configuration elements. You should be familiar with both naming conventions but expect the legacy names in all CLI-based troubleshooting.

What is the difference between UI 1.0 and UI 2.0 in Catalyst SD-WAN?

UI 1.0 refers to the older configuration approach that uses device templates and feature templates for managing device configurations. UI 2.0, the approach used in labs built on version 20.15.1/17.15.1 (config groups themselves first appeared in earlier 20.x releases), uses policy groups and config groups instead and provides a more streamlined and scalable way to manage configuration. When troubleshooting, it is important to know which approach was used because it affects how configurations are structured, applied, and verified on WAN Edge routers.

How much time should I spend on each troubleshooting task in a lab exercise?

The recommended time budget is 15 to 20 minutes per task. If you cannot identify the root cause within this window, consult solution references or brief explanations before moving on. Spending too long on a single task can prevent you from completing the entire lab and learning from the full range of scenarios. Some tasks are intentionally challenging, and even experienced engineers may find certain fault scenarios difficult when encountering them for the first time.

Are SD-WAN lab troubleshooting tasks independent or sequential?

In most comprehensive SD-WAN troubleshooting labs, tasks are sequential and dependent on each other. You need to fix all tasks in one section (for example, all control connection faults) before continuing with the next section (data plane tunnel faults). This mirrors real-world troubleshooting where lower-layer issues must be resolved before higher-layer functionality can be verified. Skipping a task or leaving it partially resolved will likely cause misleading symptoms in subsequent tasks.

What should the troubleshooting focus be on: vManage or WAN Edge routers?

The primary troubleshooting focus should be on WAN Edge routers and the CLI troubleshooting tools and commands available on them. vManage NMS troubleshooting, which covers disaster recovery, clustering, server, OS, and services issues, is an entirely separate troubleshooting domain that requires different tools, different skills, and a different methodology. For overlay network troubleshooting, the WAN Edge CLI provides the most detailed and actionable diagnostic information.

Conclusion

Advanced SD-WAN troubleshooting is a skill that combines deep understanding of the Catalyst SD-WAN architecture with practical, hands-on experience using CLI diagnostic tools. The structured approach outlined in this guide, moving systematically from underlay connectivity through control plane, data plane, overlay routing, and finally traffic forwarding, provides a reliable framework for diagnosing any SD-WAN issue you encounter.

The key takeaways from this SD-WAN troubleshooting lab guide are:

  1. Always work bottom-up: Verify underlay connectivity before investigating overlay issues. Lower-layer faults cascade upward and create misleading symptoms at higher layers.
  2. Know your components: Understand the roles of vBond, vSmart, and vManage, and know that all CLI outputs use the legacy naming convention regardless of official product name changes.
  3. Master the CLI tools: The WAN Edge CLI is your primary troubleshooting interface. Commands like show control connections, show omp routes, show bfd sessions, and show policy from-vsmart form the core of your diagnostic toolkit.
  4. Practice with structured labs: Work through fault scenarios covering all four categories (control connections, data plane tunnels, policies and routing, and miscellaneous issues) to build comprehensive troubleshooting skills.
  5. Manage your time: Budget 15 to 20 minutes per task and use tiered solution references when needed rather than getting stuck on a single problem.
  6. Stay version-aware: Modern deployments using version 20.15.1/17.15.1 with UI 2.0 (policy groups and config groups) differ from older deployments using UI 1.0 (device templates and feature templates).

SD-WAN troubleshooting expertise is one of the most valuable skills you can develop as a network engineer. As organizations continue to migrate from traditional WAN architectures to SD-WAN overlays, the demand for engineers who can quickly diagnose and resolve complex overlay issues will only continue to grow. Invest the time in hands-on practice, build your troubleshooting methodology, and develop confidence with the CLI tools that will serve you throughout your career.

Explore the full range of SD-WAN courses and hands-on labs available at NHPREP to continue building your troubleshooting expertise.