Early last year, Zscaler announced cloud resilience capabilities that ensure business continuity for customers during blackouts, brownouts, and catastrophic events. They encompass resilient infrastructure, software, and operational processes that help organizations build business continuity plans. These capabilities have been met with enthusiasm and have since been adopted by many Zscaler customers.

How to forward traffic with resilience

Zscaler recommends establishing redundant GRE/IPSec VPN tunnels to the Zscaler cloud to ensure high availability of the service. These redundant tunnels will need to be established to two different Zscaler data centers. Additionally, Zscaler recommends establishing Layer 7 service monitoring to the URL http://gateway.<Zscaler cloud>.net/vpntest on devices that support this ability. The Layer 7 monitoring URL is serviced within the Zscaler cloud and ensures that the path to the Service Edge instance is performing as expected.
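
As a rough illustration of the Layer 7 check described above, the Python sketch below builds the monitoring URL from a cloud name supplied by the caller and switches to the secondary tunnel after a run of failed probes. The names vpntest_url, probe, and TunnelSelector, the threshold of three failures, and the five-second timeout are assumptions made for this example; real endpoint devices implement the check in their own firmware and configuration.

```python
import urllib.request


def vpntest_url(cloud: str) -> str:
    """Build the Layer 7 monitoring URL for a cloud name supplied by the caller."""
    return f"http://gateway.{cloud}.net/vpntest"


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


class TunnelSelector:
    """Hypothetical selector: use the secondary tunnel after N straight failures."""

    def __init__(self, fail_threshold: int = 3) -> None:
        self.fail_threshold = fail_threshold
        self.failures = 0

    def record(self, probe_ok: bool) -> str:
        """Record one probe result and return the tunnel that should carry traffic."""
        # A single successful probe resets the counter and returns to primary.
        self.failures = 0 if probe_ok else self.failures + 1
        return "secondary" if self.failures >= self.fail_threshold else "primary"
```

In use, a device-side loop would call probe(vpntest_url(cloud)) at a fixed interval and pass each result to TunnelSelector.record. Requiring several consecutive failures is a design choice that avoids flapping on a single dropped request.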

The Zscaler cloud architecture enables high availability and redundancy. Services are decentralized: the VPN tunnels are terminated on instances that are physically and logically separated from the Service Edge instances. This allows Zscaler operations to quickly isolate incidents in individual components and provides a high degree of service availability, allowing adherence to the industry-leading SLA.

Black swan events with VPN tunnels

Even with redundant tunnels established, the VPN tunnels sometimes do not fail over to the backup/secondary tunnel that is active and servicing traffic. This could be for one of two reasons:

First, although the primary Zscaler DC is undergoing scheduled maintenance, the instances terminating the VPN tunnels continue to be active. The keepalives and Dead Peer Detection (DPD) of the IPSec VPN continue to meet the configured parameters, because the maintenance is being carried out on other instances in the Zero Trust Exchange. In the unlikely event that this scenario plays out, customer traffic is impacted because the VPN tunnels do not fail over to the backup/secondary tunnels. Configuring Layer 7 monitoring can potentially avoid this situation, but there is a nuance here too: the endpoint device may not support Layer 7 monitoring. And even when Layer 7 monitoring has been configured, the service monitoring gateway can still return a ‘200 OK’ response.
Second, the Zscaler Service Edge servicing the Layer 7 monitoring request may still be operational and online while other instances in the Zscaler cloud are undergoing maintenance. The organization's user traffic is being serviced by Service Edge instances that are not the same as the instance servicing the Layer 7 requests. This is what I call a black swan event: you think all the boxes are ticked, only to find an unexpected situation. The VPN test passes, yet user traffic is impacted.

In this last scenario, which is admittedly rare, the administrator has the task of manually triggering a failover from the primary to the secondary VPN tunnel at their endpoint devices. If the failover needs to be triggered from only one location, the solution is straightforward and easy to implement. Now scale this situation out to a few hundred endpoints spread across the globe, and add service providers who manage those endpoints to the mix. Ensuring that all the locations trigger failovers to secondary VPN tunnels from their endpoints can be a logistical and operational hassle. This situation is common in SD-WAN deployments that rely on IPSec VPN tunnels to establish traffic forwarding to the Zscaler cloud.
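
The gap between what the endpoint can observe and what users experience can be made concrete with a toy model. The Python sketch below is only an illustration of the reasoning: the PrimaryDcState fields and both functions are hypothetical names, and the model assumes the endpoint's only failover signals are DPD/keepalives and, where configured, the Layer 7 probe.

```python
from dataclasses import dataclass


@dataclass
class PrimaryDcState:
    """Hypothetical health snapshot of the primary data center."""
    tunnel_terminator_up: bool  # instance answering keepalives and DPD
    monitor_edge_up: bool       # Service Edge answering the Layer 7 vpntest URL
    user_edge_up: bool          # Service Edge instances carrying user traffic


def endpoint_fails_over(dc: PrimaryDcState, layer7_configured: bool) -> bool:
    """The endpoint reacts only to signals it can actually observe."""
    if not dc.tunnel_terminator_up:
        return True  # DPD/keepalives fail, so the tunnel is declared dead
    if layer7_configured and not dc.monitor_edge_up:
        return True  # the Layer 7 probe fails
    return False     # both checks pass, so traffic stays on the primary tunnel


def user_traffic_impacted(dc: PrimaryDcState, layer7_configured: bool) -> bool:
    """Traffic is impacted when it stays on a primary whose user edges are down."""
    stays_on_primary = not endpoint_fails_over(dc, layer7_configured)
    return stays_on_primary and not dc.user_edge_up


# Maintenance touches only the instances carrying user traffic.
maintenance = PrimaryDcState(tunnel_terminator_up=True,
                             monitor_edge_up=True,
                             user_edge_up=False)
print(endpoint_fails_over(maintenance, layer7_configured=True))
print(user_traffic_impacted(maintenance, layer7_configured=True))
```

With this snapshot the endpoint does not fail over even with Layer 7 monitoring configured, and user traffic is impacted, which is the black swan case described above.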

Taking control: Data center exclusion from traffic forwarding mechanisms

Last year, Zscaler introduced the ability to temporarily exclude data centers experiencing connectivity issues from a subcloud. This is especially useful in situations when the customer needs to exclude a data center from receiving traffic from the Zscaler Client Connector. Earlier this year, we introduced the capability for customers to exclude a specific Zscaler data center from traffic forwarding using IPSec VPNs.

Once configured, the specific Zscaler data center:

Terminates all existing IPSec VPN tunnels from the specific tenant
Does not accept new IPSec tunnel requests from that tenant

This ensures that the IPSec tunnel endpoint at the customer premises fails over to the pre-configured secondary tunnel based on the configuration at the endpoint device.
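
The two behaviors of an excluded data center, and the resulting failover, can be sketched as a small simulation. This is a local model only, not the Zscaler API: the DataCenter class, its methods, the connect helper, and the names dc-a, dc-b, and tenant-1 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    """Hypothetical model of a data center that supports per-tenant exclusion."""
    name: str
    excluded_tenants: set[str] = field(default_factory=set)
    tunnels: set[str] = field(default_factory=set)  # tenants with a live tunnel

    def accept_tunnel(self, tenant: str) -> bool:
        """Reject new tunnel requests from an excluded tenant."""
        if tenant in self.excluded_tenants:
            return False
        self.tunnels.add(tenant)
        return True

    def exclude(self, tenant: str) -> None:
        """Exclude the tenant and terminate its existing tunnel, if any."""
        self.excluded_tenants.add(tenant)
        self.tunnels.discard(tenant)


def connect(tenant: str, primary: DataCenter, secondary: DataCenter) -> str:
    """Endpoint behavior: try the primary, then the pre-configured secondary."""
    for dc in (primary, secondary):
        if dc.accept_tunnel(tenant):
            return dc.name
    raise ConnectionError("no data center accepted the tunnel")


primary, secondary = DataCenter("dc-a"), DataCenter("dc-b")
print(connect("tenant-1", primary, secondary))  # tunnel lands on the primary
primary.exclude("tenant-1")                     # existing tunnel is terminated
print(connect("tenant-1", primary, secondary))  # reconnect lands on the secondary
```

Because the exclusion is enforced on the data center side, the endpoint needs no change beyond its existing secondary tunnel configuration.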

Figure 1. DC exclusion for traffic forwarding method

The help documentation has more information on configuring the DC exclusion based on the traffic forwarding method.

Implementation scenarios for data center exclusion

Integrating multi-vendor environments can be hard. Configuring, maintaining, and managing numerous IPSec VPN endpoints can be challenging. Here are some ways to take advantage of this new functionality:

Subscribe to the Zscaler trust portal to get notifications of upcoming maintenance at a DC that is servicing most of your traffic. You’ll be able to see whether it conflicts with an event at your company that requires continuous connectivity. This feature allows you to exclude the DC from the VPN for the duration of the maintenance activity being conducted by Zscaler. You can schedule the exclusion in advance and be assured that the VPN tunnels are serviced on the secondary/backup link for that duration and switch back to the primary tunnel when the scheduled exclusion window expires.
When you are notified of a time window in which there will be suboptimal routing on your primary link to Zscaler, you can exclude the specific data center that will be impacted for that duration, ensuring business continuity.
Let’s say that you have hundreds of sites that are serviced by IPSec VPN tunnels established to the nearest Zscaler PoP. In a black swan event like the one explained earlier in this post, you have the flexibility of excluding the DC from IPSec VPN tunnel establishment, either from the Zscaler admin portal or programmatically using the APIs to exclude DCs from traffic forwarding.
You have automated and deployed multiple IPSec VPN tunnel endpoints across multiple locations. You now need to test the failover that has been automated from these endpoints. There is no need to bring down the primary VPN tunnels from all these deployed endpoints, which can be tedious and time-consuming. Instead, you can leverage Zscaler Internet Access to trigger the failovers from a single console.
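
The scheduled exclusion in the first scenario amounts to a time window: tunnels are expected on the secondary link while the window is active and back on the primary once it expires. The Python sketch below shows that logic under stated assumptions; ExclusionWindow and expected_tunnel are hypothetical names, the dates are made up, and the actual scheduling is done in the Zscaler admin portal or through its APIs, not with this code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ExclusionWindow:
    """Hypothetical scheduled exclusion of the primary data center."""
    start: datetime
    end: datetime

    def is_active(self, now: datetime) -> bool:
        """The window covers start (inclusive) up to end (exclusive)."""
        return self.start <= now < self.end


def expected_tunnel(window: ExclusionWindow, now: datetime) -> str:
    """Which tunnel should be carrying traffic at the given moment."""
    return "secondary" if window.is_active(now) else "primary"


# Illustrative four-hour maintenance window, in UTC.
start = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
window = ExclusionWindow(start, start + timedelta(hours=4))

print(expected_tunnel(window, start - timedelta(minutes=1)))  # before the window
print(expected_tunnel(window, start + timedelta(hours=1)))    # during the window
print(expected_tunnel(window, window.end))                    # after expiry
```

A check like this could be compared against what each site reports during a planned failover test, so that sites still on the primary tunnel inside the window stand out.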

Try DC Exclusion today

You can try this feature in your tenant today! Contact Zscaler Support to have this feature enabled on your tenant and test your business continuity readiness. Ensure that your organization has one less item to worry about. This would be one occasion when you know it works when no one complains!
