Article #7 from 2023
During my first year working as a Sovereign Google Cloud Engineer on a major defense company's public Cloud, I had the opportunity to use Google Cloud technologies and pass the Professional Cloud Architect certification. Google Cloud reduces operational overhead as well as capital expenses and lets engineers focus on building applications rather than managing infrastructure.
This article provides a comprehensive overview of Google Cloud; read my Data, Network, and Security articles for more.
Reading Time: 10 minutes
Cloud Computing is about providing elastic self-service resources to customers over the internet and only charging them for what they use.
The network is global and security is implemented in several layers, such as encryption at rest and Denial of Service (DoS) protection. Use a Virtual Private Cloud (VPC) to get better data isolation on Google Cloud and access to its built-in global network. Resources hosted in different regions and zones can then share the same private IP address range, organized with regional subnets, firewall rules, network tags and load balancers.
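As a minimal Terraform sketch of what such a network could look like (all names and CIDR ranges below are illustrative assumptions):

```hcl
# Illustrative VPC with one regional subnet and a firewall rule
# scoped by a network tag.
resource "google_compute_network" "vpc" {
  name                    = "demo-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "subnet_eu" {
  name          = "demo-subnet-eu"
  region        = "europe-west1"
  network       = google_compute_network.vpc.id
  ip_cidr_range = "10.0.1.0/24"
}

resource "google_compute_firewall" "allow_https" {
  name    = "allow-https"
  network = google_compute_network.vpc.id

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }

  source_ranges = ["0.0.0.0/0"]
  target_tags   = ["web"] # only VMs carrying this network tag are reachable
}
```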
Google Cloud's hierarchy has 4 levels: Organization > Folder > Project > Resource, and policies are inherited top-down. Identity & Access Management (IAM) provides 4 basic roles: Owner, Editor, Viewer and Billing Admin. Use Service Accounts (SA) to grant permissions to applications rather than to people.
Use buckets from Google Cloud Storage (GCS) for object storage, Cloud SQL for web apps or Spanner for bigger global ones, and Bigtable for big data. Use Google Compute Engine (GCE, an IaaS) for VMs with per-second billing, workload-specific and shielding options, cost reductions with spot instances, persistent or local HDD/SSD disks, snapshots and backups. Use Google Kubernetes Engine (GKE, a hybrid between IaaS and PaaS) for container cluster management. Use Google App Engine (GAE, a PaaS) for serverless autoscaling web apps, Cloud Run for running containers that handle stateless web requests, Pub/Sub for asynchronous messaging between services, Cloud Endpoints and Apigee to expose APIs, and Cloud Functions (a FaaS) for serverless event-driven apps. Use Terraform for Infrastructure as Code (IaC) and host it on Cloud Source Repositories (CSR) to combine git functionalities with IAM policies and diagnostics tools.
Monitoring is key to achieving high performance and reliability. The 4 golden signals to measure are latency, traffic, saturation and errors. You commit to a Service Level Agreement (SLA) with your client that is looser than your internal Service Level Objectives (SLO), each SLO consisting of a Service Level Indicator (SLI) and a target of 99.X%. Use Cloud Logging to collect, store, search, analyze, monitor and alert.
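As an illustrative calculation (the numbers are assumptions, not from this article), a 99.9% availability SLO over a 30-day month leaves an error budget of roughly

$$(1 - 0.999) \times 30 \times 24 \times 60 \approx 43.2 \text{ minutes}$$

of allowed downtime per month.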
Google Cloud offers IaaS, PaaS and SaaS solutions that one can interact with using the web UI, the Cloud Shell, the SDK, REST-based APIs and the Google Cloud mobile app. Google Cloud is part of the open source ecosystem and comes with a marketplace.
IAM is a way of identifying who (person, group, app) can do what (privileges, actions) on which resource (any Google Cloud service). IAM policies are inherited top-down and should follow the principle of least privilege. Roles can be basic (owner, editor, viewer, billing admin), predefined (service-specific collections of permissions) or custom (finer precision). Use Service Accounts (SA) for server-to-server interactions. Authorization is the process of determining what permissions an authenticated identity has on a set of specified resources.
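A minimal least-privilege sketch in Terraform (the project ID, account name and role are assumptions):

```hcl
# Hypothetical service account granted a narrow, predefined role
# on a single project.
resource "google_service_account" "app" {
  account_id   = "my-app"
  display_name = "My application service account"
}

resource "google_project_iam_member" "app_storage_viewer" {
  project = "my-project-id"
  role    = "roles/storage.objectViewer" # predefined, least-privilege role
  member  = "serviceAccount:${google_service_account.app.email}"
}
```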
GCS is used for storing objects in a uniquely-named bucket using one of 4 storage classes depending on access frequency: Standard, Nearline, Coldline and Archive, with minimum storage durations of 0, 30, 90 and 365 days respectively. GCS handles encryption at rest (with optional customer-supplied keys), object versioning and strong consistency.
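A hedged Terraform sketch of such a bucket (name, location and the 90-day threshold are illustrative):

```hcl
# Bucket that starts in Standard storage and moves objects to
# Coldline 90 days after creation, with versioning enabled.
resource "google_storage_bucket" "logs" {
  name          = "my-unique-logs-bucket" # bucket names are globally unique
  location      = "EU"
  storage_class = "STANDARD"

  versioning {
    enabled = true
  }

  lifecycle_rule {
    condition {
      age = 90 # days since object creation
    }
    action {
      type          = "SetStorageClass"
      storage_class = "COLDLINE"
    }
  }
}
```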
Filestore provides a native file storage experience as network-attached storage for GCE and GKE.
Instead of installing a SQL server on a VM, use Cloud SQL to benefit from High Availability (HA), scaling and automatically applied updates.
Cloud Spanner has the SQL syntax and consistency of a relational database (DB) and the availability and scalability of a non-relational DB meaning it offers the best of both worlds.
Firestore is the go-to solution for highly scalable NoSQL DB apps thanks to live sync, ACID transactions (Atomicity, Consistency, Isolation, Durability) and multi-region replication.
Consider using Bigtable for structured data of over 1TB with numerous low-latency reads and writes or if you don't need Firestore's transactional consistency.
When it comes to data transfer, use the Storage Transfer Service.
Each Google Cloud service has its own pricing model and the go-to page is the pricing calculator. Resources are associated with a project linked to a sub-billing account, and the organization accumulates every project's bill into the global billing account. Resources are either global (image, snapshot, network), regional (external IP) or zonal (VM instance, disk). Quotas can be set on 3 main categories: how many resources per project, how many API requests per second and how many resources per region. They don't guarantee availability but prevent uncontrolled consumption. Use labels, key-value pairs attached to resources like env:prod and team:ops, to analyze costs and filter resources when looking at bills; labels shouldn't be confused with the tags used for networking. Budgets can be set to send an alert when spending gets close to them.
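For instance, labels can be attached directly in Terraform; the sketch below is illustrative (machine type, zone, image and label values are assumptions):

```hcl
# Minimal VM sketch showing cost-tracking labels.
resource "google_compute_instance" "worker" {
  name         = "worker-1"
  machine_type = "e2-medium"
  zone         = "europe-west1-b"

  # Labels show up in billing exports and can be used to filter costs.
  labels = {
    env  = "prod"
    team = "ops"
  }

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }
}
```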
Cost planning should be done as a continuous iterative cycle: capacity forecast > resource allocation > approve estimations > monitor accuracy. Cost savings are possible with spot VMs, following rightsizing recommendations, keeping machines close to data and choosing the storage class that matches the usage. GKE cost-optimization best practices are to run multi-tenant clusters, filter out unwanted logs prior to storage, bin-pack VMs (optimize their size while hosting different-sized workloads), minimize image size as well as the time between startup and readiness, and differentiate batch (regular) from serving (spiky) workloads.
Monitoring is the base of Site Reliability Engineering (SRE) and Google Cloud's operations suite provides Monitoring, Logging, Error Reporting, Trace and Debugger. With your metrics you can create dashboards and set alerts, as well as export them to BigQuery for analysis and Looker Studio for visualization.
Connecting to Google Cloud can be done via several options: an IPsec tunnel with Cloud VPN, a dedicated layer 3 link with Direct Peering at 10 Gbps to any Point of Presence (PoP) or a shared layer 3 link with Carrier Peering, a dedicated layer 2 link with Dedicated Interconnect to transfer large amounts of data from on-prem with a 99.X% SLA, or a shared layer 2 link with Partner Interconnect if you're further away from a PoP. It is recommended to start with Cloud VPN, which also offers an HA option to reach a 99.99% SLA instead of 99.9%. To connect projects inside Google Cloud, you can use a Shared VPC, where one host project centralizes administration for the service projects attached to it, or VPC Network Peering with decentralized administration.
Scaling is closely tied to Load-Balancing (LB), automation and managed services:
Use Cloud Load Balancing (CLB) to put resources behind a single anycast IP address, either global or regional, and Managed Instance Groups (MIG) to control identical VM instances as a single entity using templates. A MIG can work with CLB to distribute network traffic to all of the instances in the group and recreate an instance if it stops, crashes, becomes unhealthy or is deleted without the MIG asking for it. A MIG also allows autoscaling based on load, with policies on CPU usage, LB capacity or any other metric. HTTP(S) LB operates at layer 7 of the OSI model, the application level, allowing routing based on the URL using URL maps to determine the appropriate backend service for each request. Cloud Content Delivery Network (CDN) caches content at the edge of Google's network, SSL Proxy is used for encrypted non-HTTP traffic, Network LB balances regional non-proxied TCP, UDP and SSL traffic, and internal TCP/UDP and internal HTTP(S) LB cover regional internal needs.
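As a hedged Terraform sketch of an autoscaled MIG (machine type, region, replica counts and the CPU threshold are assumptions, not values from the original):

```hcl
# Regional MIG built from a template, autoscaled on CPU usage.
resource "google_compute_instance_template" "web" {
  name_prefix  = "web-"
  machine_type = "e2-small"
  tags         = ["web"]

  disk {
    source_image = "debian-cloud/debian-12"
  }

  network_interface {
    network = "default"
  }
}

resource "google_compute_region_instance_group_manager" "web" {
  name               = "web-mig"
  region             = "europe-west1"
  base_instance_name = "web"

  version {
    instance_template = google_compute_instance_template.web.id
  }
}

resource "google_compute_region_autoscaler" "web" {
  name   = "web-autoscaler"
  region = "europe-west1"
  target = google_compute_region_instance_group_manager.web.id

  autoscaling_policy {
    min_replicas = 2
    max_replicas = 10
    cpu_utilization {
      target = 0.6 # scale out above 60% average CPU
    }
  }
}
```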
Infrastructure automation is done with Terraform to quickly provision and remove resources using IaC with declarative configuration files. Google Cloud Marketplace offers production-ready solutions from third-party vendors who have already created their own deployment configurations based on Terraform. Create .tf files describing your desired infrastructure in HCL, a language similar to JSON, group them into modules if needed and deploy them with "terraform apply". Automation also comes from 4-step Continuous Integration (CI) pipelines achievable with CSR, Cloud Build, Build Triggers and Container Registry (CR): developers check in code > run unit tests > build the deployment package > deploy.
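A minimal root configuration could look like the following sketch, assuming a hypothetical local module at ./modules/network with a subnet_cidr input:

```hcl
# main.tf — provider plus a local module grouping the network resources.
# Deploy with `terraform init && terraform apply`.
provider "google" {
  project = "my-project-id"   # assumed project ID
  region  = "europe-west1"
}

module "network" {
  source      = "./modules/network" # assumed local module path
  subnet_cidr = "10.0.1.0/24"
}
```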
Managed services are an alternative to creating and managing your own solutions, especially when it comes to data, without the burden of handling infrastructure or scaling. BigQuery lets you access meaningful insights using familiar SQL syntax without the need for a DB admin, Dataflow processes data prior to analysis, and Dataprep offers cleaning and preparation of data.
Most companies migrate to the Cloud to reduce responsibilities and CapEx as well as to access managed services. Other benefits are flexibility, elasticity and economies of scale. Customers can follow 1 of 4 strategies: lift & shift, rebuild, remain on-prem, improve & move.
Migrating is a 4-step process: assess the current environment > prepare the new platform > migrate workloads > optimize ops and costs with rightsizing. Assessing requires several inputs such as business objectives and technical assumptions, activities such as automated discovery with StratoZone, and outputs such as workload grouping and high-level effort estimations. The migration relies on a 5-step installation (set up VPN and network tags > create GCP roles and service accounts > deploy the Migrate for Compute Engine (MCE) Manager > configure the source environment > deploy a Cloud Extension) followed by the migration itself (shut down the source VM > create a snapshot > boot over the WAN in minutes > stream data into the Cloud). Don't forget to migrate storage, then detach and clean up.
System requirements must be gathered in 2 ways:
Qualitatively, from the user's Point of View (PoV): who, what, why, when and how. Roles aren't job titles but describe a user objective, much like personas describing user stories (as a..., I want to..., so that...).
Quantitatively, for measurable things like time, finance and people. Use Key Performance Indicators (KPI) to measure the success of business or technical elements, like Return on Investment (RoI) and page views. Use SLIs, SLOs and SLAs: an SLI is a measurable attribute of a service, an SLO is the goal you want to achieve for a given SLI, and an SLA is a binding contract over that SLO. For a fast-response-time example, the SLI would be the latency of successful HTTP 200 responses, the SLO that 99% of responses must come back in under 200 ms, and the SLA that the user is compensated if the 99th percentile latency exceeds 300 ms.
Over time, monolithic programs turned into multiple microservices that are easier to develop, reduce deployment risks, scale independently and accelerate innovation. However, they increase complexity and require maintaining backward compatibility as well as multiple deployments. It is key to recognize service boundaries to minimize dependencies and isolate services properly, and to use HTTPS LB for the frontends and internal LB between them and the backends.
Stateless microservices are easier to scale, migrate and administer than stateful ones, which need to store data over time. To communicate over HTTPS without exposing too many internal details, microservices expose REST API endpoints that handle GET, PUT, POST and DELETE requests. Test new API versions before going live and roll them out gradually with strategies like rolling updates, blue-green deployments and canary releases.
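As one concrete (and assumed) way to run a canary release on Google Cloud, a Cloud Run service can split traffic between a stable revision and the latest one; the service name, image and percentages below are illustrative:

```hcl
# Cloud Run service sending 10% of traffic to the newest revision.
resource "google_cloud_run_service" "api" {
  name     = "orders-api"
  location = "europe-west1"

  template {
    spec {
      containers {
        image = "gcr.io/my-project/orders-api:v2"
      }
    }
  }

  traffic {
    revision_name = "orders-api-v1" # current stable revision
    percent       = 90
  }

  traffic {
    latest_revision = true # the canary
    percent         = 10
  }
}
```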
The key reliability metrics are availability (% of time a system is running), durability (odds of losing data) and scalability (ability to keep working as load grows). Single Points Of Failure (SPOF) must be avoided by replicating resources following the N+2 principle, which survives a unit failure and a unit upgrade at the same time. Beware of cascading and correlated failures as well as queries of death. HA is achieved by deploying to multiple zones or regions even though it costs more, enabling MIG auto-healing and setting up a Disaster Recovery Plan (DRP) addressing the Recovery Point and Recovery Time Objectives (RPO & RTO).
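As an idealized illustration (assuming independent failures and made-up numbers), two replicas that are each 99.9% available yield

$$1 - (1 - 0.999)^2 \approx 99.9999\,\%$$

availability when either one can serve, versus

$$0.999 \times 0.999 \approx 99.8\,\%$$

when both are needed in series, which is why SPOFs and long dependency chains hurt so much.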
Security is a shared responsibility between the client and Google: the former is responsible for following best practices like the principle of least privilege, separation of duties, regular audits and the use of Cloud IAP and service accounts, while the latter secures the lower layers. If a partner is involved, they share responsibility for security too, especially if they provide an External Key Management (EKM) service that encrypts Google's Data Encryption Keys (DEK). Network-wise, prefer bastion hosts over exposing external IP addresses, deny all traffic on the firewall before writing explicit allow rules, and leverage global services like DDoS protection with Cloud Armor.
Containers are isolated user spaces with their dependencies for running app code; they are lightweight as they don't carry a full OS and can be created or stopped quickly, making them an app-centric way to deliver high-performance scalable apps using the microservice design pattern. An app and its dependencies are called an image, and a container is a running instance of an image. Docker is a software tool that both builds and runs container images, while Kubernetes (Kube or k8s) offers a way to orchestrate those apps at scale. You can use publicly available open-source container image registries as the base for your own images, like Google's gcr.io and Docker Hub.
Kube is a container management and orchestration solution that automates everything related to containerized apps, supports declarative configurations and different workload types. GKE is Google's managed service for deploying, managing and scaling Kube environments that offers auto-upgrade, auto-repair and auto-scaling.
A Kube cluster runs on several VMs: one, called the control plane (managed by GKE), coordinates the entire cluster through the kube-apiserver, while the others, called nodes, run the kubelet to communicate with it and host Pods. Pods are the smallest objects, each with a unique IP address, where one or more containers live and thus share networking and storage resources while communicating through localhost (127.0.0.1). Kube uses a declarative object management logic, continuously comparing desired and current states. Creating a GKE cluster includes the creation of 3 VMs through GCE that will be used as nodes, either zonal or regional, and PersistentVolumes can be added using network-based storage to provide durable storage remote to the Pods and avoid losing data when one of them fails, unlike regular Volumes which terminate with their Pods.
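A minimal Terraform sketch of such a cluster might look like this (cluster name, zone and machine type are assumptions):

```hcl
# Zonal GKE cluster: GKE manages the control plane, and the node
# pool below provides three worker VMs.
resource "google_container_cluster" "demo" {
  name                     = "demo-cluster"
  location                 = "europe-west1-b"
  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "default" {
  name       = "default-pool"
  location   = "europe-west1-b"
  cluster    = google_container_cluster.demo.name
  node_count = 3

  node_config {
    machine_type = "e2-standard-2"
  }
}
```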
To manage Pods more efficiently, use controller objects like Deployments, where you specify the number of Pod replicas you want using ReplicaSets (populations of identical Pods), and Services to load-balance access to specified Pods, all of that in a .yml file that you apply with "kubectl apply". Dividing one physical cluster into several logical clusters for multi-tenancy, for example to have development and production environments, is possible with namespaces, on which you can set resource usage quotas.
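The same Deployment-plus-Service pattern can also be declared with Terraform's Kubernetes provider instead of a raw .yml file, staying consistent with the IaC approach above; everything below (names, namespace, image, ports) is illustrative:

```hcl
# Deployment of three identical Pods plus a Service load-balancing them.
resource "kubernetes_deployment" "web" {
  metadata {
    name      = "web"
    namespace = "dev" # assumed to exist already
  }
  spec {
    replicas = 3 # desired ReplicaSet size
    selector {
      match_labels = { app = "web" }
    }
    template {
      metadata {
        labels = { app = "web" }
      }
      spec {
        container {
          name  = "web"
          image = "gcr.io/my-project/web:1.0"
        }
      }
    }
  }
}

resource "kubernetes_service" "web" {
  metadata {
    name      = "web"
    namespace = "dev"
  }
  spec {
    selector = { app = "web" }
    port {
      port        = 80
      target_port = 8080
    }
    type = "LoadBalancer" # exposes the Pods behind a single address
  }
}
```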
Time-based objects are Jobs, which create Pods for a specific task then terminate them once it's over, and CronJobs, which run Pods following a schedule. Migrating containerized workloads to GKE is done with Migrate for Anthos and requires less than 10 minutes. Scaling can be done in 4 ways: Horizontal Pod Autoscaling (HPA), Vertical Pod Autoscaling (VPA), Cluster Autoscaling (CA) or Node Auto-Provisioning (NAP). A good practice may be to over-provision a little, like 1 pod/node, using the following formula: (1 - buffer%)/(1 + growth%) = over-provisioning%.
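Plugging assumed numbers into that formula, a 15% buffer with 30% expected traffic growth gives

$$\frac{1 - 0.15}{1 + 0.30} \approx 0.65,$$

i.e. on that reading you would provision so that only about 65% of capacity is consumed in steady state, leaving roughly 35% of headroom.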