Sometimes the best ideas start with a simple “what if”. Over the years, I’ve had countless conversations with customers around how to monitor and alert on Zscaler Cloud Connectors – especially in cloud-native environments. So one night, I challenged myself: could I build a fully cloud-native monitoring stack using just the native tools from AWS, Azure, and GCP? Spoiler: I could.

This blog is Part 1 of that journey, focused entirely on what the outcome looks like – a CloudWatch dashboard with the key metrics, visualizations, and alerts I’d want to see in production. Think of it as a preview of what’s possible if you’re running Zscaler Cloud Connectors in AWS with Auto Scaling Groups and a Gateway Load Balancer.

While it’s not an official Zscaler solution (or supported by our team), I hope this sparks ideas for your own deployments. And if you’re wondering how to build it – don’t worry, Part 2 will dive into the full configuration, metrics, thresholds, and a CloudFormation template so you can try it out yourself.

Of course, Zscaler already offers native integrations with Nanolog Streaming Service (NSS) to stream Cloud Connector Events and Metrics logs to your on-premises or cloud SIEM – I’ll include some links to the official help portal for that below, but this project is all about cloud-native monitoring!

Nanolog Streaming Service (NSS)

Before we get into this project, please know that Zscaler officially supports streaming Cloud Connector Events and Metrics logs to your existing SIEM using NSS. The key metrics are included and are cloud-agnostic, so you can build a beautiful dashboard and alerts in your own SIEM using this method. Here are a few links in case you weren’t aware of this integration:

- Deploying NSS VMs
- NSS Feeds
- Cloud NSS Feeds

When it comes to Cloud Connectors specifically, there are two feeds:

- Event Logs: focused on status changes for health checks, Zscaler service edge reachability, etc.
- Metrics Logs: focused on Overall CPU, Data Plane CPU, Control Plane CPU, memory, throughput utilization, etc.

Now back to the fun stuff…

What To Monitor (and Why)?

Let’s focus on what matters most in Cloud Connector architecture: the Data Plane.

Cloud Connectors have both a Control Plane and a Data Plane. The Control Plane handles configuration sync, log forwarding, and updates – and traffic keeps flowing even if it goes down.

The real question we want to answer: can Cloud Connectors keep processing workload traffic without impacting performance?

That’s the job of the Data Plane. It securely tunnels traffic to the Zscaler Zero Trust Exchange (ZTE) – either ZIA, ZPA, or directly to the Internet – depending on your policy.

If the Data Plane is healthy, your traffic is safe. That’s why this blog centers on monitoring only the metrics tied to traffic flow and tunnel health.

Key Metrics I Focused On

Every deployment is different, but here’s what I chose to monitor – and why. These are grouped by Zscaler Location, meaning any number of Connectors behind the same GWLB:

- EC2 Health and Availability: track total Cloud Connectors and their status to quickly identify failures.
- Auto Scaling Group Utilization: gauge how close you are to max capacity and when scaling might be needed.
- Data Plane CPU Utilization: critical for understanding throughput – high CPU means you are nearing traffic limits.
- GWLB Target Group Health Checks: directly tells you whether tunnels are up and traffic is flowing.
- Data Plane Bytes (In/Out): visualize how much data is processed through the Connectors – optional, but insightful.
- GWLB Bytes Processed: useful for spotting patterns or changes in workload traffic over time.
- NAT Gateway Bytes (In vs Out): helpful for understanding data transfer patterns and potential cost impacts.
- Network Latency (RTT in ms): adds a layer of observability for network impact going out to the Internet.

CloudWatch Dashboard

This is what I built in the lab. Everything is visualized in a single CloudWatch dashboard, broken into logical sections:

- Header Text: highlights the Region, VPC ID, ASG usage, and environment label.
- Text Dividers: split the dashboard by function (Health, CPU, Scaling, etc.).
- Metric Widgets: visualize each of the key metrics above using gauges, graphs, and single values.

Tip: I used Average over 5-minute periods for most non-network metrics, and Sum for all throughput-related data.

As this is my lab environment, I need to be transparent and mention that this single dashboard covers a single location (aka a set of Cloud Connectors) behind the same GWLB. We don’t want metrics aggregated across different sets of Connectors in different Regions, as that wouldn’t be useful from an operational perspective. However, that can be a useful executive view if you think it’s important! For large deployments, would it make sense to create a separate dashboard for each set of Cloud Connectors behind the same GWLB? Maybe. This is where tinkering comes in handy!
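If you like to see things in code before Part 2 lands, here’s a minimal sketch of how a dashboard along these lines could be published with boto3. Everything in it is a placeholder for illustration – the region, dashboard name, ASG name, GWLB and target group identifiers, and especially the “ZscalerCC”/“DataPlaneCPU” custom metric namespace and name are hypothetical stand-ins, not the exact names from my lab.

```python
# Minimal sketch: publish a CloudWatch dashboard with a header text widget,
# a Data Plane CPU gauge, and a GWLB healthy-targets graph.
# All names, IDs, and the custom metric namespace below are placeholders.
import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

dashboard_body = {
    "widgets": [
        {   # Header text widget: region, VPC, environment label
            "type": "text", "x": 0, "y": 0, "width": 24, "height": 2,
            "properties": {"markdown": "# Cloud Connectors | us-east-1 | vpc-0123456789abcdef0 | Lab"},
        },
        {   # Gauge: average Data Plane CPU over 5-minute periods (hypothetical custom metric)
            "type": "metric", "x": 0, "y": 2, "width": 8, "height": 6,
            "properties": {
                "title": "Data Plane CPU (%)",
                "view": "gauge",
                "yAxis": {"left": {"min": 0, "max": 100}},
                "metrics": [["ZscalerCC", "DataPlaneCPU", "AutoScalingGroupName", "cc-asg"]],
                "stat": "Average", "period": 300, "region": "us-east-1",
            },
        },
        {   # Graph: healthy targets behind the GWLB target group
            "type": "metric", "x": 8, "y": 2, "width": 8, "height": 6,
            "properties": {
                "title": "GWLB Healthy Targets",
                "view": "timeSeries",
                "metrics": [["AWS/GatewayELB", "HealthyHostCount",
                             "TargetGroup", "targetgroup/cc-tg/0123456789abcdef",
                             "LoadBalancer", "gwy/cc-gwlb/0123456789abcdef"]],
                "stat": "Average", "period": 300, "region": "us-east-1",
            },
        },
    ]
}

cloudwatch.put_dashboard(
    DashboardName="zscaler-cc-lab",
    DashboardBody=json.dumps(dashboard_body),
)
```

If you prefer infrastructure-as-code, the same DashboardBody JSON drops straight into an AWS::CloudWatch::Dashboard resource in a CloudFormation template.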
Alerting Strategy

To keep things manageable, I built four key CloudWatch alarms that notify via SNS (email):

- Healthy Targets < 25%: indicates tunnel or instance issues affecting the Data Plane.
- ASG Utilization > 90%: warns when scaling limits are near and capacity might be maxed out.
- Data Plane CPU > 80%: suggests traffic load may be too high – triggers before packet loss.
- Network Latency (Anomaly): detects spikes in RTT beyond a defined baseline – great for spotting Zscaler path issues.

You can customize these thresholds for your environment, but these gave me solid coverage of health, scale, and latency.
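For a rough idea of what one of these looks like in practice, here’s a hedged boto3 sketch mirroring the Data Plane CPU > 80% alarm, notifying an SNS topic. As before, the namespace, metric name, dimension, ASG name, and topic ARN are placeholders for illustration only.

```python
# Minimal sketch: a static-threshold alarm for Data Plane CPU > 80% that notifies SNS.
# Namespace, metric name, dimension value, and topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="zscaler-cc-dataplane-cpu-high",
    AlarmDescription="Data Plane CPU above 80% - traffic load may be approaching limits",
    Namespace="ZscalerCC",                      # hypothetical custom metric namespace
    MetricName="DataPlaneCPU",                  # hypothetical custom metric name
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "cc-asg"}],
    Statistic="Average",
    Period=300,                                 # evaluate 5-minute averages
    EvaluationPeriods=3,                        # require 3 consecutive breaches (15 minutes)
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",               # missing data plane metrics are themselves a red flag
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:cc-alerts"],
)
```

The latency alarm is the odd one out: instead of a static threshold, it uses CloudWatch anomaly detection, i.e. a metric math ANOMALY_DETECTION_BAND expression as the alarm’s threshold.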
What’s Next

Everything you’ve seen here – from visualizing Data Plane health to tracking processed traffic and scaling activity – was built entirely from native AWS metrics (plus one Zscaler custom metric) and a pinch of CloudWatch magic.

In Part 2, I’ll break down exactly how I built this: the metrics I used, the math behind the expressions, the alarm logic, and of course, a reusable CloudFormation template to get you started.

And for those of you running Azure and/or GCP, Parts 3 and 4 are coming soon. Let’s nerd out together!