Article #12 from 2025
During my 3rd year working as a Sovereign Google Cloud Engineer on a major defense company's public Cloud, I had the opportunity to use Google Cloud data technologies and pass the Professional Data Engineer certification. Google Cloud provides an integrated ecosystem for big data, BI, and AI.
This article provides a comprehensive overview of Google Cloud Data Engineering; read my Architect, Network, and Security articles for more.
Reading Time: 10 minutes
See the Google Cloud article.
Data and AI have a powerful partnership: AI learns nothing without data, and large amounts of data can't be managed without AI. Tensor Processing Units (TPUs) are Google's domain-specific hardware used to accelerate ML workloads with high efficiency. The data-to-AI workflow is the following: Ingest & Process (Pub/Sub, Dataflow) > Storage (GCS) > Analytics (BigQuery, Looker) > ML (Vertex).
Use Notebook environments that integrate commentary, plots, and code to make collaboration easy for Big Data projects. JupyterLab is the industry standard, and Notebook instances are standard GCE instances living in your project. You can pull data from BigQuery straight into a Notebook using the %%bigquery magic, but make sure that you only pull data subsets to avoid reaching your Notebook's capacity limits.
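As a minimal notebook sketch against a public sample dataset (run the %load_ext line once in its own cell; the variable name top_names is arbitrary):

%load_ext google.cloud.bigquery

%%bigquery top_names
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 10

The LIMIT keeps only a small subset in the notebook's memory, as recommended above.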
The 4 major Big Data challenges are the 4 Vs: variety, volume, velocity, veracity.
BigQuery combines storage and analytics as it is both a fully managed storage facility to land and store datasets as well as a fast SQL-based analytic engine. BQ can query TBs in seconds and PBs in minutes.
A data lake brings together data from across the enterprise into a single location, usually by replicating it in its native raw format, to make data accessible for analytics. Google Cloud Storage (GCS) is the go-to data lake solution, especially for unstructured data, as it uses objects' metadata for access and handles both encryption and lifecycle management. Common challenges are accessing the source data (silos), its accuracy and quality (un-queryable data), the availability of computing resources (under-provisioning), and query performance (engine & quotas).
A data warehouse is the consolidated place to store queryable data once ETL-ed (Extract, Transform, Load). Ask yourself the following questions: Can it serve as a sink for batch and / or streaming pipelines? Does it need to be accurate to the minute, day, or week? Will it scale to meet your needs? Use Cloud Data Loss Prevention (DLP) to protect Personally Identifiable Information (PII) and Cloud Composer to orchestrate production workflows.
Use a data lake to store data and a data warehouse to query it.
BigQuery (BQ) is Google's petabyte-scale serverless data warehouse, allowing users to focus on insights by serving as a collective home for all analytical data in an Organisation. You can query any data table source, whether located in BigQuery, elsewhere in Google Cloud, or even in Google Drive, using a SQL-compatible dialect.
BigQuery can also be used as a query engine by connecting it to Cloud SQL and GCS. Cloud SQL is record-based and optimized for writes while BQ is column-based and optimized for reads. Transactional operational systems are 80% writes and 20% reads (OLTP) while analytical systems are 20% writes and 80% reads (OLAP).
When using BigQuery, have multiple datasets to separate tables related to different analytical domains, and use project-level scoping to isolate datasets from each other according to business needs. Use Policy Tags to enable column-level security. A BQ view is a virtual table defined by a SQL query that can itself be queried. Queries served from cache don't incur any charges. Queries are charged to the project from which they're run, meaning you can separate storage and query costs by sharing a project's data without granting Run Job permissions inside of it.
Normalizing data means splitting it into related tables (a relational system). Denormalizing data means flattening it back into a single table. A data warehouse denormalizes data, and BigQuery supports columns with nested and repeated fields. This enables parallelized processing and the use of complex data.
Reduce a query's cost and improve its performance on date, time, and integer values by partitioning (dividing) and clustering (sorting) the related column, thus allowing BQ to optimize its internal storage shards.
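As a minimal sketch, assuming a hypothetical orders table with a DATE column order_date and a customer_id column, the Python client can run the DDL:

from google.cloud import bigquery

client = bigquery.Client()
ddl = """
CREATE TABLE `my_project.analytics.orders_partitioned`
PARTITION BY order_date          -- partition pruning on the date column
CLUSTER BY customer_id AS        -- blocks sorted by customer for cheaper filters
SELECT * FROM `my_project.analytics.orders_raw`
"""
client.query(ddl).result()

Queries filtering on order_date or customer_id then scan (and bill) fewer bytes.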
EL = load as-is when the data is already clean. ELT = transform when needed for experiments. ETL = transform prior to loading for quality control. For ETL, use Dataflow to BigQuery by default. Use Cloud Data Fusion for building data pipelines visually, exploring datasets using Wrangler, and other features (pipeline tagging, metadata aggregation, event-condition-action, rules engine). Use Cloud Composer to orchestrate work across multiple Google Cloud services by creating a serverless environment that runs Apache Airflow, an open-source tool for writing workflows in Python.
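A minimal Airflow DAG sketch of the kind Cloud Composer runs (task names and schedule are illustrative; operator imports follow Airflow 2.x):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="daily_etl", start_date=datetime(2025, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")   # placeholder task
    load = BashOperator(task_id="load", bash_command="echo load")            # placeholder task
    extract >> load                                                          # dependency: extract runs before load

In a real Composer environment these tasks would typically call Dataflow, BigQuery, or GCS operators instead of echo commands.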
Data quality criteria are validity (remove quantities less than zero with WHERE), accuracy (COUNT the number of unique and non-null values), completeness (handle missing values with NULLIF), consistency (handle duplicates with GROUP BY), and uniformity (handle data types with CAST); each criterion is independent of the others.
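A hedged sketch of such checks as a single BigQuery query (table and column names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()
report = client.query("""
SELECT
  COUNTIF(quantity < 0)                        AS invalid_rows,      -- validity
  COUNT(DISTINCT order_id)                     AS distinct_orders,   -- accuracy
  COUNTIF(NULLIF(email, '') IS NULL)           AS missing_emails,    -- completeness
  COUNT(*) - COUNT(DISTINCT order_id)          AS duplicate_rows,    -- consistency
  COUNTIF(SAFE_CAST(price AS NUMERIC) IS NULL) AS uncastable_prices  -- uniformity
FROM `my_project.analytics.orders_raw`
""").to_dataframe()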
Data discovery means finding the data you need while data lineage is metadata about the data (origin, processes it went through, location, format). Use Dataplex for discoverability and Data Catalog for metadata management. A label is a key:value pair that can be attached to datasets, tables, and views.
Use Dataflow for building serverless batch and streaming pipelines using Apache Beam, or Dataproc if you prefer Hadoop/Spark. Dataflow is the glue tying other services together, like writing from Pub/Sub to BigQuery, and it continuously optimizes graphs, fuses operations, and autoscales. Use Dataflow templates to avoid coding.
Flexible Resource Scheduling (FlexRS) reduces the cost of non-time-critical batch processing jobs by scheduling and mixing both preemptible and standard VMs to execute jobs within 6 hours of their creation. Since Apache Beam code is uploaded to GCS by a user and then sent to Dataflow, which runs it on VMs, make sure to set up the right IAM Roles in order to launch jobs. Dataflow impacts the GCE CPU quota shared with other services, as well as IP addresses and persistent disks. Security-wise, you can enforce a specific region (--region), combine private IPs with Private Google Access (--no_use_public_ips), and use encryption keys which must be in the same region as the workers (--dataflow_kms_key).
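A hedged launch sketch wiring those flags into a Beam pipeline (project, bucket, and key names are hypothetical):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=europe-west1",                 # enforce a specific region
    "--temp_location=gs://my-bucket/tmp",
    "--no_use_public_ips",                   # private IPs + Private Google Access
    "--dataflow_kms_key=projects/my-project/locations/europe-west1/keyRings/my-ring/cryptoKeys/my-key",
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(["hello"]) | beam.Map(print)   # trivial placeholder pipeline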
The 2 main concepts to know are PTransform (action) and PCollection (immutable data abstraction). A pipeline is the combination of the data to be processed and the actions to be taken on that data, while a pipeline runner is the host that executes a given pipeline: a Runner like Dataflow runs your pipeline on a backend of your choice. Apache Beam allows you to write data processing pipelines in the language of your choice and run them on the backend of your choice. A side input is an additional input that your ParDo can access each time it processes an element of the input PCollection. For streaming data, use an unbounded PCollection and time-based windows (e.g. beam.WindowInto(beam.window.SlidingWindows(60, 5)) for 60-second windows starting every 5 seconds). A minimal pipeline sketch follows the list below.
PCollection:
Setup - with beam.Pipeline() as p
Create - local_pcollection = p | 'pipeline step' >> beam.Create(local_list)
Read from GCS - lines = p | beam.io.ReadFromText()
Write - p | beam.io.WriteToBigQuery()
PTransform:
Map - beam.Map() for 1:1 and beam.FlatMap() for 1:many
Group - beam.GroupByKey() on key-value pairs after Map
Regroup - Combine a PCollection of values with CombineGlobally() or a key-value pair group with CombinePerKey()
Flat - beam.Flatten() to merge multiple PCollections into one (like SQL UNION), beam.Partition() to split them into smaller ones
ParDo - beam.ParDo(Function()) used for intermediate steps and parallel processing
DoFn - class Function(beam.DoFn) used on each element of a ParDo
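Putting the building blocks together, a minimal runnable sketch on the local DirectRunner with in-memory data (keys and values are made up):

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([("sku-1", 2), ("sku-2", 5), ("sku-1", 3)])   # local PCollection of key-value pairs
        | "SumPerKey" >> beam.CombinePerKey(sum)                                # combine values per key
        | "Format" >> beam.Map(lambda kv: f"{kv[0]}: {kv[1]}")                  # 1:1 transform
        | "Print" >> beam.Map(print)                                            # swap for beam.io.WriteToText(...) or WriteToBigQuery(...) in practice
    )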
Streaming enables you to get real-time information in a dashboard by processing unbounded data (data not at rest). The 3 key products are Pub/Sub to handle variable volumes of data, Dataflow to assist in processing data using aggregation and transformation without delay, and BigQuery for ad-hoc reporting. Another way to explain it is the ingest-process-visualize workflow, in which Pub/Sub is the distributed messaging service, Dataflow executes the processing pipelines written with Apache Beam, and either BigQuery or Looker is the visualization tool used to uncover insights.
Pub/Sub messages can weigh up to 10MB and are retained for 7 days by default, while being encrypted both in transit and at rest. It has 2 data structures, Topics & Subscriptions, where a Topic can have multiple Subscriptions but a Subscription only belongs to a single Topic. 3 common Pub/Sub patterns are straight-through, fan-in, and fan-out. Pub/Sub allows for both push and pull delivery models but push is best for real-time. Messages are resent if the acknowledgment (ACK) response takes longer than the ackDeadline, which can be customized. Messages can be delivered multiple times unless 'exactly-once' delivery is configured, and ordering can be enforced by using the same region and ordering key for your messages.
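A hedged publish-and-pull sketch with the Python client (project, topic, and subscription names are hypothetical and must already exist):

from google.cloud import pubsub_v1

project_id = "my-project"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "orders-topic")
future = publisher.publish(topic_path, b"order received", origin="web")   # attributes are optional string key/values
print(future.result())                                                    # message ID once the publish succeeds

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "orders-sub")
response = subscriber.pull(request={"subscription": subscription_path, "max_messages": 10})
ack_ids = [msg.ack_id for msg in response.received_messages]
if ack_ids:
    subscriber.acknowledge(request={"subscription": subscription_path, "ack_ids": ack_ids})   # ACK before the ackDeadline or the messages are resent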
For Dataflow, measuring an average value over a stream requires windowing capabilities, turning infinite time into finite windows, which come in 3 kinds: fixed, sliding, and session. A Watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived; it is an event-time trigger.
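A sketch of such an average, assuming an upstream unbounded PCollection of numeric values (sources and sinks omitted):

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.combiners import MeanCombineFn

def windowed_mean(values):
    # values: an unbounded PCollection of numbers, e.g. parsed from Pub/Sub messages upstream
    return (
        values
        | "Window" >> beam.WindowInto(window.FixedWindows(60))                # 1-minute fixed windows over infinite time
        | "Mean" >> beam.CombineGlobally(MeanCombineFn()).without_defaults()  # one average per window, emitted at the watermark
    )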
Streaming data is added to BQ using Streaming Inserts, which allow using a template table whose schema is copied to the tables being created. Only use Streaming Inserts when the immediate availability of the data is a solution requirement, as they are costly. For high-throughput non-structured data requiring millisecond latency, choose Bigtable (cluster-based) as long as you have more than 1TB of NoSQL data. Bigtable's motto is 'simplify for speed': it is just a table with rows and columns where you only get 1 index via the Row Key column, so combine several column values to create a structured Row Key.
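A hedged streaming-insert sketch with the BigQuery client (table and row contents are hypothetical, and the table must already exist):

from google.cloud import bigquery

client = bigquery.Client()
rows = [{"event_id": "e-1", "value": 42}]
errors = client.insert_rows_json("my_project.analytics.events", rows)   # goes through the streaming buffer, queryable almost immediately
if errors:
    print("Insert errors:", errors)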
BigQuery best practices include using Dataflow to process and transform data, creating views for common queries, revisiting the schema and data regularly, solving problems while they're small, optimizing performance (read, pass, and write as few bytes as possible), optimizing queries (SELECT, APPROX, WHERE, ORDER, JOIN), and breaking queries into stages using intermediate tables.
Artificial Intelligence (AI) is a discipline while Machine Learning (ML) is a toolset. ML is a type of AI and Deep Learning (DL) is a type of ML. ML is about scaling up Business Intelligence (BI) and decision making and DL is ML with unstructured data. Machine Learning is a way to use standard algorithms to derive predictive insights from data and make repeated decisions.
Most business data is unstructured text located outside of a database (docs, sheets, emails, code, PDFs, chat, etc.), which the Cloud Natural Language API (CNL) is made to label using various analyses: syntactic, entity, and sentiment.
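A minimal sentiment-analysis sketch with the Natural Language client (the text is just a sample):

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The delivery was late but support was excellent.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
print(sentiment.score, sentiment.magnitude)   # polarity in [-1, 1] and overall strength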
ML Pipelines are multi-step workflows designed to build ML models, mainly consisting of training and serving models. You can use Kubeflow Pipelines to run Kubernetes ML Pipelines, but prefer Vertex Pipelines to free yourself from managing GKE clusters and config files. To share or use shared packaged ML components like pre-trained models and Jupyter Notebooks, use TensorFlow Hub.
Vertex = your model, your data. AutoML = Google's models, your data, without coding. Use AutoML to create several smaller custom models to break your complicated problem apart. Maintain a CSV file in GCS, in the same project as AutoML, which contains labels and source file paths for AutoML to work properly. To improve models, increase the amount, diversity, and complexity of the data you provide.
BigQuery evolved from being solely a data warehouse to supporting the data-to-AI lifecycle: you can now create and execute ML models on structured BQ datasets in a few minutes using SQL queries (ML.PREDICT). BQ ML is compatible with several models including linear classification to predict discrete values, linear regression to predict continuous values, matrix factorization for recommendation, and k-means clustering for pattern recognition. Supervised models are task-driven and identify a goal, which can be to classify data (logistic regression) or predict a number (linear regression); unsupervised models are data-driven and identify a pattern (cluster analysis); BQ supports both of them.
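A hedged BQ ML sketch run through the Python client (dataset, table, and column names are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

client.query("""
CREATE OR REPLACE MODEL `my_project.analytics.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT plan, tenure_months, monthly_spend, churned
FROM `my_project.analytics.customers`
""").result()   # trains a logistic regression classifier on historical rows

predictions = client.query("""
SELECT *
FROM ML.PREDICT(MODEL `my_project.analytics.churn_model`,
                (SELECT plan, tenure_months, monthly_spend
                 FROM `my_project.analytics.new_customers`))
""").to_dataframe()   # one prediction row per new customer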
Prior to feeding data to the model, columns must be defined in the language used by data scientists and ML professionals: examples / observations / instances = rows; features = columns with known values; label = the column you want predictions on based on the features, which the model must first be trained on using historical data. One-hot encoding is a pre-processing method that converts categorical data to numeric data to prepare it for model training. How well a model trains depends on the quantity of data available to train it. Google Cloud's 4 options for building ML models are BQ ML, Pre-built APIs, AutoML, and Custom training:
AutoML is no-code and helps you build quick prototypes before investing more into development. Its 2 vital information-gathering technologies are Transfer Learning, for building a knowledge base from a field, and Neural Architecture Search, for finding the optimal model for the relevant project. For each data type, AutoML solves different types of problems called objectives, like classifying and detecting objects in images, returning a value based on others using regression, analyzing sentiment in text, and more.
Model deployment is when the model is put to use, and it can be done through an endpoint for immediate, low-latency results (real-time predictions), through batch prediction with a best-effort response for data accumulated in one request, or through offline prediction for specific off-the-Cloud environments.
In traditional programming, 1+1=2 means data + rules (human-made algorithms) = answers. ML emerged because algorithms are numerous and complex; it solves the puzzles (data, expectations, examples) humans give it. Instead of telling a machine how to do addition, ask it to learn addition on its own by giving it pairs of numbers and their answers. Successful ML requires lots of data and storage as well as fast calculations. The 3 key stages of the learning process / cycle are data preparation (data uploading and feature engineering) > model training (train-and-evaluate cycle on the model data) > model serving (a model is useless if unused).
When uploading data, it must be given meaningful names, types, objectives, and labels. A feature is a factor that contributes to the prediction (an independent variable in statistics, a column in a table). Feature engineering is processing data before the model starts training and requires preparing features. Vertex FS (Feature Store) is a centralized repository to organize, store, and serve ML features.
An ML model's performance evaluation metrics come in 2 sets of measurements: the confusion matrix (classification performance) and feature importance. The confusion matrix combines predicted and actual values, positive and negative: True Positive, False Negative, False Positive, True Negative.
Quantity - Recall looks for how many positive cases were predicted correctly: R = TP / (TP + FN)
Quality - Precision looks for how many cases predicted as positive actually are: P = TP / (TP + FP)
Recall and precision are often a trade-off that you must optimize depending on your use case. When classifying emails as either spam or not-spam, prioritize recall if you want to catch as many spams as possible, or prioritize precision if you want to catch only true spams without blocking other emails. Vertex helps visualize the precision-recall curve to help with adjustments, as well as the feature importance bar chart illustrating how each feature contributes to a prediction. For feature importance, the larger the bar / value, the higher the importance, helping you decide which features to include in the ML model for prediction.
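A tiny worked example of the two formulas above with made-up confusion-matrix counts:

tp, fp, fn = 80, 10, 20            # hypothetical true positives, false positives, false negatives

recall = tp / (tp + fn)            # quantity: 80 / 100 = 0.80 of actual positives were caught
precision = tp / (tp + fp)         # quality:  80 / 90  ≈ 0.89 of predicted positives were right
print(f"recall={recall:.2f}, precision={precision:.2f}")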