Network Observability and Troubleshooting

Explore how to diagnose and resolve network connectivity and DNS resolution issues in complex AWS environments. Learn to use VPC Flow Logs, Transit Gateway route tables, and Route 53 Resolver endpoints to enhance network observability and troubleshoot hybrid connectivity failures effectively.

We'll cover the following...

Capturing traffic with VPC Flow Logs
- Destinations and custom formats
Troubleshooting hybrid connectivity
- Common failure scenarios
  - Diagnostic workflow
Diagnosing DNS and routing failures
- Resolver endpoints and forwarding rules
  - Common hybrid DNS failure patterns
Observability best practices at scale
Conclusion

In large AWS environments, network issues can be difficult to diagnose quickly. A single organization may operate dozens of accounts, hundreds of VPCs, and multiple hybrid connections to on-premises data centers. When a newly attached VPC cannot reach internal servers, or DNS queries fail across accounts or connected networks, the fix is rarely another VPN tunnel, peering link, or other new connection.

A better approach is to observe and validate what is already in place. Architects need to inspect route tables, review traffic metadata, verify security boundaries, and trace DNS resolution paths to identify where communication is breaking down.

This lesson introduces a practical troubleshooting framework for AWS network observability. You’ll learn how to use VPC Flow Logs to capture traffic metadata, choose the right log destination for analysis, troubleshoot hybrid connectivity through Transit Gateway and Direct Connect, and diagnose DNS failures in multi-account and hybrid architectures.

The following diagram illustrates how these observability components fit together in a centralized, multi-account architecture.

This architecture establishes the foundation for every troubleshooting workflow discussed in this lesson. Understanding each component’s role begins with the primary data source: VPC Flow Logs.

Capturing traffic with VPC Flow Logs

VPC Flow Logs capture IP traffic metadata at the VPC, subnet, or elastic network interface (ENI) level. Each flow log record includes details such as source and destination IP addresses, ports, protocol number, packet and byte counts, the action taken (ACCEPT or REJECT), and log status. A key distinction is that Flow Logs capture metadata only, not packet payloads. For packet-level inspection, Traffic Mirroring is the appropriate tool.
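To make the record layout concrete, here is a minimal Python sketch that parses one line in the default (version 2) flow log format. The `FlowRecord` class and `parse_default_record` function are hypothetical names introduced for illustration, and the sample values used with them are made up. The sketch assumes space-separated fields in the default order, with `-` standing in for fields that have no data.

```python
from dataclasses import dataclass
from typing import Optional

# Field order of the default (version 2) flow log record format.
DEFAULT_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical container for the fields most useful in troubleshooting."""
    interface_id: str
    srcaddr: str
    dstaddr: str
    srcport: Optional[int]
    dstport: Optional[int]
    protocol: Optional[int]
    packets: Optional[int]
    byte_count: Optional[int]
    action: str
    log_status: str


def _int_or_none(value: str) -> Optional[int]:
    # NODATA and SKIPDATA records carry "-" in place of numeric fields.
    return None if value == "-" else int(value)


def parse_default_record(line: str) -> FlowRecord:
    """Split one default-format record into typed fields."""
    parts = line.split()
    if len(parts) != len(DEFAULT_FIELDS):
        raise ValueError(
            f"expected {len(DEFAULT_FIELDS)} fields, got {len(parts)}"
        )
    raw = dict(zip(DEFAULT_FIELDS, parts))
    return FlowRecord(
        interface_id=raw["interface_id"],
        srcaddr=raw["srcaddr"],
        dstaddr=raw["dstaddr"],
        srcport=_int_or_none(raw["srcport"]),
        dstport=_int_or_none(raw["dstport"]),
        protocol=_int_or_none(raw["protocol"]),
        packets=_int_or_none(raw["packets"]),
        byte_count=_int_or_none(raw["bytes"]),
        action=raw["action"],
        log_status=raw["log_status"],
    )
```

Filtering parsed records on `action == "REJECT"` and grouping by `dstport` is often a quick first pass when looking for a blocked path.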

To control log volume and cost, VPC Flow Logs also support custom log formats, allowing architects to select only the metadata fields needed for a specific analysis use case.
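As a rough illustration of a custom format, the sketch below builds a log format string from a reduced field list and parses a record written in that format. The `${field}` placeholder syntax and the field names follow the flow log format documentation as I understand it. The helper names, the specific field selection, and the sample record are assumptions made for this example.

```python
from typing import Dict, Sequence

# A reduced field set for connectivity troubleshooting (illustrative choice).
TROUBLESHOOTING_FIELDS = (
    "srcaddr", "dstaddr", "dstport", "protocol", "action", "flow-direction",
)


def build_log_format(fields: Sequence[str]) -> str:
    """Return a custom log format string such as '${srcaddr} ${dstaddr}'."""
    return " ".join("${" + name + "}" for name in fields)


def parse_custom_record(line: str, fields: Sequence[str]) -> Dict[str, str]:
    """Map each space-separated value to its field name, in format order."""
    values = line.split()
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} values, got {len(values)}")
    return dict(zip(fields, values))
```

Because records follow the order of the format string, the same field list has to drive both the flow log definition and whatever parser or table schema reads the output.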

Destinations and custom formats

Flow Logs support three destinations, each suited to a different operational pattern. CloudWatch Logs supports metric filters and alarms, making it the best fit for detecting spikes in rejected traffic soon after records are published. Flow log records are aggregated over a capture window (1 or 10 minutes) before delivery, so expect alerts within minutes rather than seconds. Amazon S3 provides cost-effective batch delivery for long-term retention, and Amazon Athena can run ad hoc SQL queries against the partitioned log data. Amazon Data Firehose (formerly Kinesis Data Firehose) streams records into analytics and SIEM platforms such as Splunk or Amazon OpenSearch Service for continuous security analytics.
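The destination choice shows up directly in how a flow log is created. The sketch below assembles keyword arguments for the boto3 EC2 `create_flow_logs` call without contacting AWS. The function name `flow_log_request` is hypothetical, any ARNs passed to it are placeholders, and the parameter names reflect my reading of the boto3 documentation, so verify them against the current API reference before use.

```python
from typing import Any, Dict, Optional

VALID_DESTINATION_TYPES = {"cloud-watch-logs", "s3", "kinesis-data-firehose"}


def flow_log_request(
    vpc_id: str,
    destination_type: str,
    destination_arn: str,
    delivery_role_arn: Optional[str] = None,
) -> Dict[str, Any]:
    """Build kwargs for ec2.create_flow_logs on a single VPC."""
    if destination_type not in VALID_DESTINATION_TYPES:
        raise ValueError(f"unsupported destination type: {destination_type}")

    request: Dict[str, Any] = {
        "ResourceIds": [vpc_id],
        "ResourceType": "VPC",
        "TrafficType": "ALL",          # capture both ACCEPT and REJECT
        "LogDestinationType": destination_type,
        "LogDestination": destination_arn,
        "MaxAggregationInterval": 60,  # seconds; the alternative is 600
    }

    # Assumption: CloudWatch Logs delivery needs an IAM role that the
    # flow logs service can assume to write to the log group.
    if destination_type == "cloud-watch-logs":
        if delivery_role_arn is None:
            raise ValueError("cloud-watch-logs requires delivery_role_arn")
        request["DeliverLogsPermissionArn"] = delivery_role_arn

    return request


# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("ec2").create_flow_logs(**request)
```

Keeping request construction separate from the API call makes it easy to review or unit test the configuration for each destination before anything is created.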

The following table compares these destinations across latency, cost, typical use case, analysis tooling, and cross-account support:

VPC Flow Log Destinations: Choosing the Right Analysis Path

Destination	Latency to Insight	Cost Profile	Best Use Case	Query/Analysis Tool	Cross-Account Support
CloudWatch Logs	Minutes (after each aggregation interval)	Higher at scale	Alerting and metric filters	CloudWatch Logs Insights / Metric Filters	Yes, via destination policies
Amazon S3	Minutes (batch delivery)	Lowest for large volumes	Long-term retention and compliance	Amazon Athena / QuickSight	Yes, via bucket policies in centralized logging account
Kinesis Data Firehose	Minutes (streamed after aggregation)	Medium (throughput-based)	Streaming security analytics and SIEM integration	Custom consumers / Splunk / OpenSearch	Yes, via a delivery stream in another account