Monitoring & Troubleshooting
"It's slow" is not a diagnosis. To fix a problem, you must first see it. Packet analysis and flow data provide the visibility you need.
1. Polling (SNMP) vs Streaming (Telemetry)
How do we get data out of network devices?
SNMP (Simple Network Management Protocol)
The legacy standard (since 1988). The NMS (Network Management System) polls each device on a fixed interval, commonly every 5 minutes ("What is your CPU load?").
- Pro: Supported by everything (even toasters).
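- Con: Coarse. A spike that starts and ends between two polls is either invisible (gauges such as CPU load) or averaged away (counters such as interface octets). Large or frequent polls are also CPU intensive for the router.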
- Security: v1/v2c use cleartext community strings. v3 adds encryption (AES) and authentication (SHA).
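To make the "you miss spikes" point concrete, here is a minimal sketch of how an NMS typically turns two polls of an octet counter (for example `ifInOctets`, a 32-bit counter) into a bandwidth figure. The function name and the sample values are illustrative assumptions, not taken from any particular NMS.

```python
def counter_rate_bps(prev_octets, curr_octets, interval_s, counter_bits=32):
    """Average bits per second between two polls of an octet counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 2 ** counter_bits
    # Modulo arithmetic absorbs a single counter wrap between polls.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s


# Two polls 300 seconds apart (hypothetical counter values).
avg_bps = counter_rate_bps(1_000_000, 4_750_000, 300)
print(f"{avg_bps:.0f} bit/s")
```

The result is an average over the whole 300-second window, so a 10-second burst inside that window is flattened into the mean. On fast links a 32-bit counter can also wrap more than once between polls, which this arithmetic cannot detect.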
Streaming Telemetry (gRPC / NETCONF)
The modern approach. Instead of waiting to be asked, the router pushes data to a collector, either on a short fixed cadence (often sub-second to a few seconds) or as soon as a value changes.
- Pro: Real-time visibility. Efficient (binary encoding like Protobuf).
- Con: Requires newer hardware and a Time Series Database (InfluxDB/Prometheus).
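The difference between the two models can be illustrated with a small simulation. This is only a sketch: the CPU samples are invented, and `poll` and `on_change` are hypothetical helpers that stand in for an SNMP poller and an on-change telemetry subscription.

```python
def poll(samples, interval_s):
    """Return the (time, value) pairs a fixed-interval poller would see."""
    return [(t, v) for t, v in samples if t % interval_s == 0]


def on_change(samples):
    """Return only the samples whose value differs from the previous one."""
    updates, last = [], None
    for t, v in samples:
        if v != last:
            updates.append((t, v))
            last = v
    return updates


# Hypothetical CPU load, one sample per second for 10 minutes:
# 20% throughout, except a spike to 95% from t=100 to t=109.
samples = [(t, 95 if 100 <= t < 110 else 20) for t in range(600)]

polled = poll(samples, 300)      # what a 5-minute poller sees
streamed = on_change(samples)    # what an on-change subscription sends

print("polled peak:  ", max(v for _, v in polled))
print("streamed peak:", max(v for _, v in streamed))
```

The poller sees only the samples at t=0 and t=300 and never observes the spike, while the on-change stream sends three updates and captures both the rise and the fall.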
2. Flow Data: NetFlow / IPFIX
If SNMP answers "How is the device feeling?", NetFlow answers "Who is talking to whom?". It captures metadata about traffic, not the payload.
NetFlow: Like a phone bill. "Paul called Sarah at 2:00 PM for 5 minutes." (Source IP, Dest IP, Port, Duration). Good for bandwidth analysis and security forensics.
Wireshark (PCAP): Like a wiretap. Captures the actual conversation ("Paul said hello"). Needed for deep troubleshooting (retransmissions, application errors).
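The "phone bill" idea can be sketched as an aggregation step: many packets that share the same addresses, ports, and protocol collapse into one flow record with counters and timestamps. The `FlowKey` type, the record fields, and the sample packets below are illustrative assumptions, not the NetFlow or IPFIX wire format.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def aggregate_flows(packets):
    """Collapse (timestamp, FlowKey, length) packet tuples into per-flow records."""
    flows = {}
    for timestamp, key, length in packets:
        record = flows.setdefault(
            key,
            {"packets": 0, "bytes": 0, "first_seen": timestamp, "last_seen": timestamp},
        )
        record["packets"] += 1
        record["bytes"] += length
        record["first_seen"] = min(record["first_seen"], timestamp)
        record["last_seen"] = max(record["last_seen"], timestamp)
    return flows


# Hypothetical packets: (seconds, flow key, packet length in bytes).
web = FlowKey("10.0.0.5", "192.0.2.10", 51000, 443, "tcp")
dns = FlowKey("10.0.0.5", "192.0.2.53", 40000, 53, "udp")
packets = [(0.0, web, 600), (0.2, dns, 80), (1.5, web, 1400), (4.0, web, 1000)]

for key, record in aggregate_flows(packets).items():
    duration = record["last_seen"] - record["first_seen"]
    print(key.src_ip, "->", key.dst_ip, key.dst_port, record["bytes"], "bytes", f"{duration:.1f}s")
```

Note that the payload never appears anywhere in the record, which is why flow data is cheap to store but cannot explain what was said inside the conversation.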
3. Wireshark Filter Cheat Sheet
Packet captures (PCAP) can be huge. Use display filters to find the needle in the haystack.
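| Scenario | Filter | Notes |
|---|---|---|
| Find Bad TCP | tcp.analysis.flags | Shows retransmissions, duplicate ACKs, zero windows. |
| Slow Server Response | tcp.time_delta > 0.5 | Packets that arrived more than 500 ms after the previous packet in the same TCP stream. Depends on the TCP preference "Calculate conversation timestamps". |
| Specific IP | ip.addr == 192.168.1.10 | Matches the address as either source or destination. |
| DHCP Traffic | dhcp | Use bootp in Wireshark releases before 3.0. |
| DNS Errors | dns.flags.rcode != 0 | Any response code other than "No error" (e.g. NXDOMAIN, SERVFAIL). |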
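The same display filters can be applied from the command line with `tshark`, Wireshark's terminal counterpart, using `-r` to read a capture file and `-Y` to apply a display filter. The sketch below only builds the command; the helper name and the file name `capture.pcap` are assumptions for illustration.

```python
import shlex


def tshark_command(pcap_path, display_filter):
    """Build an argument list that reads a capture and applies a display filter."""
    return ["tshark", "-r", pcap_path, "-Y", display_filter]


cmd = tshark_command("capture.pcap", "dns.flags.rcode != 0")
# shlex.join quotes the filter so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list directly to `subprocess.run(cmd)` avoids shell quoting issues entirely, which matters because display filters frequently contain spaces and comparison operators.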
4. Syslog Severity Levels
Devices generate text logs for events. Knowing the severity level helps you filter noise.
Mnemonic: "Every Awesome Cisco Engineer Will Need Ice cream Daily"
- 0 - Emergency: System unusable.
- 1 - Alert: Immediate action needed.
- 2 - Critical: Critical condition.
- 3 - Error: Error condition.
- 4 - Warning: Warning condition.
- 5 - Notice: Normal but significant. (e.g. line protocol up/down on Cisco IOS; the physical link up/down message is logged at level 3).
- 6 - Informational: Info messages.
- 7 - Debug: Granular details. (Do not enable on production console!).
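On the wire, a syslog message carries its severity inside the leading `<PRI>` value, where PRI = facility × 8 + severity. A minimal sketch of decoding it and filtering by level follows; the sample messages are invented for illustration.

```python
import re

SEVERITIES = [
    "Emergency", "Alert", "Critical", "Error",
    "Warning", "Notice", "Informational", "Debug",
]


def decode_pri(message):
    """Return (facility, severity, severity_name) from a leading <PRI> field."""
    match = re.match(r"<(\d{1,3})>", message)
    if not match:
        raise ValueError("message does not start with a <PRI> field")
    facility, severity = divmod(int(match.group(1)), 8)
    return facility, severity, SEVERITIES[severity]


def is_at_least(message, threshold):
    """True if the message is as severe as the threshold or worse (lower number)."""
    return decode_pri(message)[1] <= threshold


# Hypothetical messages: 189 = 23 * 8 + 5 (facility local7, Notice).
notice = "<189>Line protocol on Interface Gi0/1, changed state to up"
debug = "<191>Verbose routing detail"

print(decode_pri(notice))
print(is_at_least(notice, 5), is_at_least(debug, 5))
```

Because lower numbers are more severe, "log level 5" means levels 0 through 5 are kept and 6 and 7 are dropped.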
5. Active Monitoring: IP SLA
Passive monitoring (SNMP/NetFlow) tells you what happened. Active monitoring (IP SLA) tells you what is happening right now by generating synthetic traffic.
You can configure a router to send a simulated VoIP stream (UDP Jitter Probe) to another device every 60 seconds.
If the Round Trip Time (RTT) exceeds 200ms or Jitter exceeds 30ms, the router can trigger an alert or, by tying the probe to a tracked object, automatically reroute traffic (for example by changing a Policy Based Routing next hop).
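The threshold logic can be sketched as follows. This is a simplification under stated assumptions: jitter is approximated here as the mean absolute difference between consecutive RTT samples, whereas a real UDP jitter probe measures inter-packet delay variation per direction. The function names and sample values are hypothetical.

```python
from statistics import mean


def probe_summary(rtts_ms):
    """Return (average RTT, approximate jitter) for one probe run, in ms."""
    if len(rtts_ms) < 2:
        raise ValueError("need at least two RTT samples to estimate jitter")
    jitter = mean(abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:]))
    return mean(rtts_ms), jitter


def sla_violated(rtts_ms, rtt_limit_ms=200, jitter_limit_ms=30):
    """True if either threshold from the text is exceeded."""
    avg_rtt, jitter = probe_summary(rtts_ms)
    return avg_rtt > rtt_limit_ms or jitter > jitter_limit_ms


print(sla_violated([50, 52, 49, 51]))      # steady, low latency
print(sla_violated([50, 120, 40, 130]))    # low average, high jitter
print(sla_violated([210, 212, 211, 213]))  # steady, but too slow
```

The second run shows why both thresholds matter: its average RTT is well under 200 ms, yet the swing between samples would make a voice call unusable.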
6. The Troubleshooting Flowchart
When "the network is down", follow the OSI model bottom-up.
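- Physical (L1): Is the cable plugged in? Are interface lights on? Check the interface counters: CRC errors point to a bad cable or connector, while late collisions on one side combined with CRC errors on the other point to a duplex mismatch.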
- Data Link (L2): Is the MAC address learned? Is STP blocking the port? Is the VLAN correct?
- Network (L3): Can you ping the gateway? Do you have a route to the destination? Check the ARP table.
- Transport (L4): Is an ACL blocking port 80/443? Is the firewall dropping SYN packets?
- Application (L7): Is DNS resolving? Is the web server returning 500 errors?
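The bottom-up method amounts to running checks in layer order and stopping at the first one that fails, since anything above a broken layer cannot work either. A minimal sketch, with stub checks standing in for real tests (the questions and results below are hypothetical):

```python
def first_failure(checks):
    """Run (layer, question, check) tuples in order; return the first that fails."""
    for layer, question, check in checks:
        if not check():
            return layer, question
    return None  # every layer passed


# Hypothetical stubs; in practice each callable would inspect an interface,
# send a ping, open a TCP connection, or resolve a name.
checks = [
    ("L1", "Is the interface up?", lambda: True),
    ("L2", "Is the port in the correct VLAN?", lambda: True),
    ("L3", "Does the gateway answer ping?", lambda: False),
    ("L4", "Is TCP 443 reachable?", lambda: True),
    ("L7", "Does DNS resolve?", lambda: True),
]

print(first_failure(checks))
```

Here the walk stops at L3, so there is no point investigating the firewall or the web server until the gateway is reachable again.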