GOOGLE CLOUD PROFESSIONAL MACHINE LEARNING ENGINEER EXAM OBJECTIVES COVERED IN THIS CHAPTER:

2.1 Exploring and preprocessing organization‐wide data (e.g., Cloud Storage, BigQuery, Cloud Spanner, Cloud SQL, Apache Spark, Apache Hadoop). Considerations include:

- Organizing different types of data (e.g., tabular, text, speech, images, videos) for efficient training
- Data preprocessing (e.g., Dataflow, TensorFlow Extended [TFX], BigQuery)

2.2 Model prototyping using Jupyter notebooks. Considerations include:

- Choosing the appropriate Jupyter backend on Google Cloud (e.g., Vertex AI Workbench notebooks, notebooks on Dataproc)
- Using Spark kernels
- Integration with code source repositories
- Developing models in Vertex AI Workbench by using common frameworks (e.g., TensorFlow, PyTorch, sklearn, Spark, JAX)

3.2 Training models. Considerations include:

- Organizing training data (e.g., tabular, text, speech, images, videos) on Google Cloud (e.g., Cloud Storage, BigQuery)
- Ingestion of various file types (e.g., CSV, JSON, images, Hadoop, databases) into training
- Training using different SDKs (e.g., Vertex AI custom training, Kubeflow on Google Kubernetes Engine, AutoML, tabular workflows)
- Using distributed training to organize reliable pipelines
- Hyperparameter tuning
- Troubleshooting ML model training failures

3.3 Choosing appropriate hardware for training. Considerations include:

Distributed training with TPUs and GPUs (e.g., Reduction Server on Vertex AI, Horovod)
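The core idea behind tools such as Horovod and Reduction Server on Vertex AI is all-reduce: each worker computes gradients on its shard of data, and those gradients are averaged across workers before the shared model is updated. The following is a minimal conceptual sketch of that averaging step in plain NumPy; the worker names and gradient values are hypothetical, and real distributed training would use a framework's collective-communication primitives rather than this in-process loop.

```python
import numpy as np

def allreduce_mean(worker_grads):
    """Conceptual all-reduce: average each parameter's gradient across workers.

    worker_grads is a list (one entry per worker) of lists of gradient arrays,
    one array per model parameter.
    """
    return [np.mean(np.stack(grads_for_param), axis=0)
            for grads_for_param in zip(*worker_grads)]

# Two hypothetical workers, each holding gradients for two parameters.
worker_0 = [np.array([1.0, 2.0]), np.array([0.5])]
worker_1 = [np.array([3.0, 4.0]), np.array([1.5])]

averaged = allreduce_mean([worker_0, worker_1])
# Each worker would then apply `averaged` to its copy of the model,
# keeping all replicas in sync.
```

In practice, Reduction Server and Horovod optimize exactly this exchange (bandwidth, overlap with backpropagation) across GPU or TPU workers, which is why they matter for reliable large-scale pipelines.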

Google Cloud Data and Analytics Overview

[Figure: Google Cloud data and analytics pipeline stages, including Collect and Process]