Article #9 from 2024
During my second year working as a Sovereign Google Cloud Engineer on a major defense company's public Cloud, I had the opportunity to use Google Cloud networking technologies and pass the Professional Cloud Network Engineer certification. Google Cloud gives you fine-grained control over network topology and security policies, and its tooling helps prevent the misconfigurations behind roughly 75% of network outages.
This article provides a comprehensive overview of Google Cloud Network Engineering; read my Architect, Data, Security, and Hybrid articles for more.
Reading Time: 10 minutes
See the Google Cloud article.
A Virtual Private Cloud (VPC) network is a virtual version of a physical network. It provides connectivity to your Compute Engine VM instances, offers native internal TCP/UDP LB and proxy systems for internal HTTP(S) LB, and distributes traffic from Google Cloud external LBs to backends.
VPC supports IPv6, and VMs get an internal or external IP based on the subnet's access type. Bring Your Own IP address (BYOIP) is available for IPv4 addresses, except for Classic VPN gateways, GKE nodes, GKE pods and autoscaling MIGs. A VM can communicate with several isolated VPCs through up to 8 network interfaces (NICs); they must be configured at instance creation, and each NIC can only be associated with 1 VPC.
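The sketches in this article use the Google API Python Client (pip install google-api-python-client) with Application Default Credentials; every project, zone and resource name is made up. Here is a minimal, hypothetical example of attaching two NICs, each in a different VPC, at instance creation:

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')  # uses Application Default Credentials

# Hypothetical names: the NIC count is fixed at creation time,
# and each NIC must point to a different VPC network.
instance_body = {
    'name': 'multi-nic-vm',
    'machineType': 'zones/europe-west1-b/machineTypes/e2-standard-4',
    'disks': [{
        'boot': True,
        'initializeParams': {'sourceImage': 'projects/debian-cloud/global/images/family/debian-12'},
    }],
    'networkInterfaces': [
        {'network': 'global/networks/vpc-frontend',
         'subnetwork': 'regions/europe-west1/subnetworks/subnet-frontend'},
        {'network': 'global/networks/vpc-backend',
         'subnetwork': 'regions/europe-west1/subnetworks/subnet-backend'},
    ],
}
compute.instances().insert(project='my-project', zone='europe-west1-b',
                           body=instance_body).execute()
```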
VPC Network Peering allows private IP connectivity across 2 VPC networks no matter where they are. It is a decentralized / distributed approach to multi-project networking, as each VPC network keeps its own firewall rules, routing tables and admin groups. To connect, both sides need the other's VPC network name as well as its project ID, and if one side is misconfigured or shut down, connectivity is lost. Custom routes can be shared between peered VPC networks.
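A sketch of one side of a peering, with hypothetical project and network names; the other side must run the mirror-image call for the peering to become active:

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# Each side names the other's project ID and VPC network.
# Export/import flags control custom route sharing across the peering.
peering_body = {
    'networkPeering': {
        'name': 'peer-to-team-b',
        'network': 'projects/team-b-project/global/networks/team-b-vpc',
        'exchangeSubnetRoutes': True,
        'exportCustomRoutes': True,   # share our custom routes with the peer
        'importCustomRoutes': True,   # accept the peer's custom routes
    }
}
compute.networks().addPeering(project='team-a-project', network='team-a-vpc',
                              body=peering_body).execute()
```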
Routes define the paths that network traffic takes from a VM instance to other destinations, but no traffic flows without matching firewall rules. Routes apply to a VM's egress traffic and can be fine-tuned using network tags applied to VMs. They can be defined at the network or subnet level, and the most specific destination is preferred, even over priorities, where a lower number means a higher priority.
Default network routes, 0.0.0.0/0 for IPv4 and ::/0 for IPv6, are generated at a VPC network's creation and must be replaced with another next hop if deleted, or traffic will be dropped. Custom static routes are more secure (no advertising) and performant (less overhead) but should only be used for small, stable topologies, while dynamic routes are managed by Cloud Routers and used by Interconnects as well as VPN tunnels. Use private DNS zones to resolve domain names inside a VPC network and Cloud DNS policies to steer traffic, for example based on location or round robin.
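A hypothetical custom static route, scoped with a network tag so only tagged VMs use it; destination range, priority and next hop are illustrative:

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# Send traffic for 10.10.0.0/16 through a NAT instance, but only for VMs
# carrying the 'nat-egress' tag. Among equally specific destinations,
# the lowest priority number wins.
route_body = {
    'name': 'route-through-nat',
    'network': 'global/networks/my-vpc',
    'destRange': '10.10.0.0/16',
    'priority': 500,
    'tags': ['nat-egress'],
    'nextHopInstance': 'zones/europe-west1-b/instances/nat-gateway-vm',
}
compute.routes().insert(project='my-project', body=route_body).execute()
```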
IAM policies can be set at every level (Org > Folders > Projects > Resources) and are inherited top-down, so moving a Project to another Folder changes the IAM associated with it. A less restrictive parent policy overrides a more restrictive child policy, hence follow the least-privilege philosophy.
Google Cloud's predefined roles are lists of permissions required to perform an activity, but custom roles can also be created. A permission is written as service.resource.verb, like compute.addresses.get. The predefined network IAM roles are Network Viewer, Network Admin and Security Admin. While IAM focuses on who can do what on which resources, Org Policy focuses on the what by setting restrictions on specific resources.
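A sketch of creating a least-privilege custom role; the role ID and permission list are illustrative, and each permission follows the service.resource.verb pattern:

```python
from googleapiclient import discovery

iam = discovery.build('iam', 'v1')

# Hypothetical read-only role for auditing addresses and routes.
role_body = {
    'roleId': 'networkAuditor',
    'role': {
        'title': 'Network Auditor',
        'description': 'Read-only view of addresses and routes.',
        'includedPermissions': [
            'compute.addresses.get',
            'compute.addresses.list',
            'compute.routes.get',
            'compute.routes.list',
        ],
        'stage': 'GA',
    },
}
iam.projects().roles().create(parent='projects/my-project', body=role_body).execute()
```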
Firewall rules protect your VM instances from unapproved inbound and outbound connections. Every VPC network functions as a distributed firewall: firewall rules apply to the whole network, while connections are allowed or denied at the instance level. A firewall rule's parameters are the target (all instances, network tags, service accounts) and the source of ingress traffic (IP ranges, subnets, network tags, service accounts).
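A hypothetical ingress rule combining both parameters, a source range and a target tag:

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# Allow HTTPS from a corporate range only to VMs tagged 'web'. The rule is
# defined at the network level but enforced per instance.
firewall_body = {
    'name': 'allow-https-from-corp',
    'network': 'global/networks/my-vpc',
    'direction': 'INGRESS',
    'priority': 1000,
    'sourceRanges': ['203.0.113.0/24'],
    'targetTags': ['web'],
    'allowed': [{'IPProtocol': 'tcp', 'ports': ['443']}],
}
compute.firewalls().insert(project='my-project', body=firewall_body).execute()
```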
Use Global LBs for globally distributed workloads with a single anycast IP address and Regional Internal LBs for the rest. TCP/UDP Internal LBs are fast thanks to reduced overhead coupled with an intermediary-free connection between the client and the backend, and they can be used as the next hop to a NAT gateway, to load balance to several NICs, and in a hub-and-spoke topology. These Cloud LBs act as proxies and define how traffic is distributed, which health check to use, and whether session affinity is used, among other settings.
Cloud LBs can route traffic to Managed Instance Groups (MIG, a VM pool), Network Endpoint Groups (NEG, a backend collection) and buckets. Use Hybrid LB for workloads running outside Google Cloud, either in another Cloud or on-prem, but make sure each service is reachable with an IP:port combination and use hybrid connectivity NEGs.
Traffic Management (TM) provides enhanced features to route LB traffic based on specific criteria like HTTP parameters (path, header, etc.) or the traffic load itself, and to perform request / response-based actions. URL maps contain rules defining the criteria used to route incoming traffic to a backend service; a URL map consists of a name, a default service, hosts and one or more path matchers stating path(s) and a backend service, with path rules evaluated on a longest-path-matches-first basis.
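A minimal URL map sketch with hypothetical backend service names, showing the name / default service / host rule / path matcher structure just described:

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# /video/* goes to a dedicated backend service; everything else falls through
# to the default. The longest matching path wins.
url_map_body = {
    'name': 'web-map',
    'defaultService': 'global/backendServices/web-backend',
    'hostRules': [{'hosts': ['example.com'], 'pathMatcher': 'main'}],
    'pathMatchers': [{
        'name': 'main',
        'defaultService': 'global/backendServices/web-backend',
        'pathRules': [
            {'paths': ['/video/*'], 'service': 'global/backendServices/video-backend'},
        ],
    }],
}
compute.urlMaps().insert(project='my-project', body=url_map_body).execute()
```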
Cloud Content Delivery Network (CDN) caches content near users at the edge of Google's network in 3 modes: use origin headers, cache all static, force cache all. CDN Interconnect directs traffic from VPC networks to third-party providers' networks to help optimize Cloud CDN cache population costs, especially for high-volume egress traffic and frequent content updates, because egress traffic isn't free like ingress. CDN Interconnect doesn't require the use of Cloud Load Balancing.
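A sketch of enabling Cloud CDN on a hypothetical backend service, using the "cache all static" mode from the list above:

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# Hypothetical backend service fronted by an external LB, with CDN enabled.
backend_body = {
    'name': 'cdn-backend',
    'protocol': 'HTTPS',
    'loadBalancingScheme': 'EXTERNAL',
    'backends': [{'group': 'zones/europe-west1-b/instanceGroups/web-mig'}],
    'healthChecks': ['global/healthChecks/https-basic-check'],
    'enableCDN': True,
    'cdnPolicy': {'cacheMode': 'CACHE_ALL_STATIC'},
}
compute.backendServices().insert(project='my-project', body=backend_body).execute()
```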
Cloud Router implements dynamic routing, which allows the topology to be discovered and shared automatically and reduces manual static route maintenance, with path selection prioritizing the lowest-valued route. Cloud Interconnect (CI) provides a fast physical connection to Google's network, either Dedicated (DI) at a Google colocation facility (or two in the same metropolitan area for redundancy) or Partner (PI) through a supported service provider. A DI connection consists of one or more circuits: either 1 to 8 10-Gbps circuits or 1 to 2 100-Gbps circuits. PI has the same features as DI but the connection is made through a supported service provider; it is recommended if you need a connection quickly, want a single port for multi-Cloud use or have a low bandwidth need (50 Mbps to 50 Gbps). For L2 connections, BGP is configured between the on-prem router and the Cloud Router, while for L3 connections, BGP is configured between the Cloud Router and the service provider's router, and the provider handles everything. For 99.99% availability, configure at least four Interconnect connections, and remember that you can have 2 peering edge placements in 2 different cities.
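A minimal Cloud Router sketch; the ASN is a placeholder private ASN, and BGP sessions over Interconnect or VPN would then be added to this router:

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# Hypothetical regional Cloud Router that will speak BGP on behalf of the VPC.
router_body = {
    'name': 'edge-router',
    'network': 'global/networks/my-vpc',
    'bgp': {
        'asn': 64514,                # placeholder private ASN
        'advertiseMode': 'DEFAULT',  # advertise all subnet routes
    },
}
compute.routers().insert(project='my-project', region='europe-west1',
                         body=router_body).execute()
```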
Cloud VPN provides lower bandwidth than Interconnect and is cheaper, but it offers data encryption in transit and allows you to selectively advertise routes between VPC networks. Use Classic VPN (99.9% SLA) for static routing and HA VPN (99.99% SLA) for BGP routing, with the latter supporting site-to-site VPN. Use an Active/Passive configuration on an HA VPN for a consistent bandwidth experience.
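A sketch of one HA VPN tunnel, assuming an HA VPN gateway, an external peer gateway resource and the Cloud Router above already exist (all names hypothetical); a second tunnel on interface 1 is required to reach the 99.99% SLA:

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# Tunnel on interface 0 of the HA VPN gateway, bound to a Cloud Router for BGP.
tunnel_body = {
    'name': 'ha-tunnel-if0',
    'vpnGateway': 'regions/europe-west1/vpnGateways/ha-gw',
    'vpnGatewayInterface': 0,
    'peerExternalGateway': 'global/externalVpnGateways/on-prem-gw',
    'peerExternalGatewayInterface': 0,
    'router': 'regions/europe-west1/routers/edge-router',
    'ikeVersion': 2,
    'sharedSecret': 'replace-with-a-strong-secret',
}
compute.vpnTunnels().insert(project='my-project', region='europe-west1',
                            body=tunnel_body).execute()
```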
Network Connectivity Center (NCC) provides multi-site connectivity management and supports router appliances installed on a VM, virtual router appliances and SD-WAN routers, allowing you to use Google's backbone network as a WAN to interconnect remote sites. This site-to-site data transfer approach results in lower latency as well as greater reliability than connecting over the public internet.
NCC uses a hub-and-spoke model to manage hybrid connections inside and outside Google Cloud: each connectivity resource is represented as a spoke (a collection of network resources with a VPC network and a connection type like Interconnect or VPN assigned), and spokes are centrally managed by a hub (1 per project) that contains routing table entries for a group of spokes, thus providing full mesh connectivity between them when site-to-site data transfer is enabled. Site-to-Cloud connectivity is automatically enabled when creating a spoke. However, NCC doesn't support IPv6, BGP communities, legacy non-VPC networks or Classic VPN tunnels.
Private Accesses (PA) let consumers connect to Google Cloud APIs and services using internal IP addresses, without going through the public internet, making connections quicker and more secure, and allowing VM instances that only have internal IP addresses to reach those APIs and services. Several PA types can be configured at the same time; they are the following:
Private Google Access (PGA) lets your hosts, even on-prem, connect to Google on a subnet-by-subnet basis. A VM with an external IP inside a subnet with PGA off can still access Google APIs and services. Don't forget to enable the APIs you want to use, define appropriate VPC network routes and firewall rules, and create DNS records for the private.googleapis.com and restricted.googleapis.com domain names if you use them.
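Because PGA is a subnet-level switch, enabling it is a one-call sketch (subnet names hypothetical):

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# Turn on Private Google Access for a single subnet.
compute.subnetworks().setPrivateIpGoogleAccess(
    project='my-project',
    region='europe-west1',
    subnetwork='subnet-frontend',
    body={'privateIpGoogleAccess': True},
).execute()
```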
Private Service Connect (PSC) lets you connect to a Google VPC network through a service attachment with PSC endpoints attached to it, making PSC fast, scalable and simple. A forwarding rule can be attached to the service attachment in the same network. It is beneficial for consumers (PSC endpoint side): they control the internal IP address used to connect to a managed service, they don't need to reserve internal IP address ranges for backend services consumed in their VPC network, and they must initiate traffic to the service producer, which improves security. It is beneficial for producers (service attachment side): they can deploy a multi-tenant model serving multiple consumer VPC networks, they can scale services to as many VM instances as required without asking consumers for more IP addresses, and they don't need to change firewall rules based on the subnet ranges in the consumer VPC networks. However, you can't create a PSC endpoint in the same VPC network where the published service is, the IP addresses used for PSC endpoints count against the project quota, and PSC endpoints are not accessible from peered VPC networks. PSC is the go-to solution if you want to provide access to services that you created in a VPC network to other specified VPC networks through endpoints with internal IP addresses, even if some of these VPC networks have subnets with overlapping IP addresses.
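A sketch of the consumer side: the PSC endpoint is a forwarding rule whose target is the producer's service attachment (all names and the reserved address are hypothetical):

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# Consumer-side PSC endpoint: the consumer picks the internal IP it will use
# to reach the producer's service attachment.
endpoint_body = {
    'name': 'psc-endpoint-sql',
    'network': 'global/networks/consumer-vpc',
    'IPAddress': 'regions/europe-west1/addresses/psc-ip',  # pre-reserved internal IP
    'target': ('projects/producer-project/regions/europe-west1/'
               'serviceAttachments/sql-attachment'),
    'loadBalancingScheme': '',  # left empty for PSC consumer endpoints
}
compute.forwardingRules().insert(project='consumer-project',
                                 region='europe-west1',
                                 body=endpoint_body).execute()
```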
Serverless VPC Access (SVA) lets you connect serverless products such as Cloud Run, Cloud Functions and App Engine to your VPC network so they can reach its resources through internal IP addresses.
Private Services Access (PSA) lets you connect your VPC network to a service producer's by automating VPC Network Peering. It is a producer-consumer model like PSC but doesn't require importing and exporting routes. However, it is only available for some producer services like Apigee, Cloud SQL and Cloud TPU. Both sides must activate the Service Networking API in their project, and the consumer must allocate an IPv4 address range in their VPC network as well as create a private connection to the service producer. If consumers disable their PSA connection, they still have to delete the VPC Network Peering connection and release the IPv4 address range.
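A sketch of the consumer-side PSA steps, with hypothetical names and an illustrative /16 range: reserve an internal range with the VPC_PEERING purpose, then open the private connection through the Service Networking API:

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')
servicenetworking = discovery.build('servicenetworking', 'v1')

# 1) Allocate an internal IPv4 range dedicated to the producer's services.
compute.globalAddresses().insert(project='my-project', body={
    'name': 'psa-range',
    'purpose': 'VPC_PEERING',
    'addressType': 'INTERNAL',
    'prefixLength': 16,
    'network': 'global/networks/my-vpc',
}).execute()

# 2) Create the private connection (the automated VPC Network Peering).
servicenetworking.services().connections().create(
    parent='services/servicenetworking.googleapis.com',
    body={
        'network': 'projects/my-project/global/networks/my-vpc',
        'reservedPeeringRanges': ['psa-range'],
    },
).execute()
```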
Cloud Network Address Translation (NAT) lets you provision app instances without public IP addresses while letting them access the internet in a controlled manner. Cloud NAT has no middle proxy between instances and their destinations; each instance is allocated a NAT IP address along with a slice of the associated port range. It offers outbound NAT but not inbound NAT, meaning hosts outside your VPC network can't directly access any of the private instances behind the Cloud NAT gateway, while these private instances can access the internet, thus keeping VPC networks isolated and secure. Only response packets are allowed back in. Cloud NAT and PGA work well together.
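A Cloud NAT gateway is configured on a Cloud Router; this sketch patches the hypothetical router from earlier with an outbound-only NAT covering every subnet:

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# Outbound NAT for all subnets, with automatically allocated NAT IPs.
nat_patch = {
    'nats': [{
        'name': 'outbound-nat',
        'natIpAllocateOption': 'AUTO_ONLY',
        'sourceSubnetworkIpRangesToNat': 'ALL_SUBNETWORKS_ALL_IP_RANGES',
    }]
}
compute.routers().patch(project='my-project', region='europe-west1',
                        router='edge-router', body=nat_patch).execute()
```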
Ingress is free, as is egress to the same zone, to Google products and to a different Google Cloud service in the same region. Egress between zones in the same region incurs a charge. Network Service Tiers (NST) let you optimize your Cloud network for performance by choosing the Premium Tier or for cost by choosing the Standard Tier. Premium Tier is Google Cloud's current network service: it provides high-performance routing, is unique to Google Cloud as it uses Google's own network, and has a global SLA as well as global LB and CDN. Standard Tier is comparable to other public Cloud offerings: it uses public ISPs' networks, has no global SLA and only offers regional LB.
While IAM is inherited top-down, billing is accumulated bottom-up: a project accumulates the consumption of all its resources, a folder of all its projects, an org of all its folders. 1 project = 1 billing account. Budgets and alerts can be set at the project / billing account level, and you can use labels as well. Each Google Cloud service has its own pricing model, and the go-to page is the pricing calculator.
Use Google Cloud Monitoring to create custom dashboards with various metrics like CPU utilization, packets or bytes sent and received by an instance, as well as packets or bytes dropped by a firewall. Because charts only provide insights when someone is looking at them, you must create alerting policies that notify you when specific conditions are met, like when the network egress of your VM instance goes above a certain threshold for a specific timeframe. Alerting policies and uptime checks can notify you through mail, SMS and other channels.
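A sketch of exactly that egress alerting policy via the Monitoring API; the threshold, duration and alignment values are illustrative:

```python
from googleapiclient import discovery

monitoring = discovery.build('monitoring', 'v3')

# Fire when a VM's egress rate exceeds ~10 MB/s sustained for 5 minutes.
policy_body = {
    'displayName': 'High VM egress',
    'combiner': 'OR',
    'conditions': [{
        'displayName': 'Egress bytes above threshold',
        'conditionThreshold': {
            'filter': ('metric.type="compute.googleapis.com/instance/network/sent_bytes_count" '
                       'AND resource.type="gce_instance"'),
            'comparison': 'COMPARISON_GT',
            'thresholdValue': 10 * 1024 * 1024,
            'duration': '300s',
            'aggregations': [{'alignmentPeriod': '60s',
                              'perSeriesAligner': 'ALIGN_RATE'}],
        },
    }],
}
monitoring.projects().alertPolicies().create(name='projects/my-project',
                                             body=policy_body).execute()
```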
75% of network outages happen due to misconfiguration, and they are discovered in production most of the time because it is difficult to know the impact of a configuration change to firewall or routing rules. Network Intelligence Center (NIC) enables teams to prevent networking outages and performance issues before they happen. NIC offers topology visualization, connectivity tests, a performance dashboard and firewall insights. This centralized monitoring cuts down troubleshooting time and effort, increases network security and improves the overall UX. If a VM is unreachable, perform a NIC Connectivity Test, which will diagnose connectivity issues whether they are in Google Cloud or not. Once created, NIC Connectivity Tests can be saved to be replayed later in order to verify the impact of configuration changes, ensure compliance and proactively prevent network outages.
To diagnose whether the network is the root cause of an issue, use the NIC Performance Dashboard for real-time latency and packet loss metrics between zones where you have VMs. Use NIC Firewall Insights to better understand and optimize firewall configurations (usage, misconfiguration, strictness) and get reports about firewall usage as well as the impact of various rules on a VPC network. NIC Network Analyzer automatically monitors your VPC network configs and detects misconfigurations as well as suboptimal configs. It also provides insights on network topology, firewall rules, routers, config dependencies, and connectivity to services and apps; it identifies network failures, provides root cause information and suggests possible resolutions.
VPC Flow Logs record a sample of the network flows sent from and received by VM instances, which can then be used for network monitoring, forensics, real-time security analysis (5 s refresh rate) and expense optimization. Once enabled on a subnet, VPC Flow Logs record the 5-tuple: source IP address and port, destination IP address and port, and protocol number. On top of these, you also get the start and end times of the observed packets, bytes and packets sent, instance details, VPC details and geographic details. Use Packet Mirroring on a VM to clone all its traffic. There is no delay or performance penalty when logging IP packets.
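Flow logging is also a subnet-level setting; this sketch enables it on a hypothetical subnet, with illustrative sampling and aggregation values (the patch must carry the subnet's current fingerprint):

```python
from googleapiclient import discovery

compute = discovery.build('compute', 'v1')

# Fetch the subnet first to obtain its fingerprint for the patch.
subnet = compute.subnetworks().get(
    project='my-project', region='europe-west1',
    subnetwork='subnet-frontend').execute()

compute.subnetworks().patch(
    project='my-project', region='europe-west1', subnetwork='subnet-frontend',
    body={
        'fingerprint': subnet['fingerprint'],
        'logConfig': {
            'enable': True,
            'aggregationInterval': 'INTERVAL_5_SEC',
            'flowSampling': 0.5,               # sample half of the flows
            'metadata': 'INCLUDE_ALL_METADATA',
        },
    },
).execute()
```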