Jul 26, 2025

MUHAMMAD GHIFARY

TPUs (Tensor Processing Units) are purpose-built accelerators developed by Google to speed up machine learning / deep learning workloads. One way to access TPUs is through Google Cloud TPU VMs: https://cloud.google.com/blog/products/compute/introducing-cloud-tpu-vms. They provide direct access to the TPU host machines, letting us run workloads built with TensorFlow, PyTorch, or JAX on the same virtual instance the TPU hardware is attached to. This design eliminates the network latency of the older TPU Node setup and results in significantly better performance and usability.

Some key benefits of using Cloud TPU VMs:

- Direct SSH access to the TPU host machine, so code runs right where the accelerator is attached.
- No separate user VM in front of the TPU, removing the network hop that the older TPU Node architecture required.
- Root access on the host, so custom libraries and tools can be installed alongside TensorFlow, PyTorch, or JAX.

Previously, I wrote an article about Neural Radiance Fields (NeRF) trained with Keras on the TensorFlow backend. In this post, we'll see how to train it using a Cloud TPU VM.

Step 1: Enable Cloud TPU API

Before provisioning any TPU resources, enable the Cloud TPU API using gcloud:

gcloud services enable tpu.googleapis.com
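
To double-check that the API is now active, you can filter the list of enabled services; the filter expression below is just one way to match the service name used above:

gcloud services list --enabled --filter="name:tpu.googleapis.com"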

Step 2: Create a Cloud TPU VM
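
TPU availability varies by zone. If you are unsure what a given zone offers, gcloud can list the accelerator types available there:

gcloud compute tpus accelerator-types list --zone=europe-west4-a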

Launch a TPU VM with a v3-8 accelerator (a single TPU v3 board with 8 cores) and the tpu-vm-base software version, in the europe-west4-a zone.

gcloud compute tpus tpu-vm create tpu-base-eu \
    --zone=europe-west4-a \
    --accelerator-type=v3-8 \
    --version=tpu-vm-base

This command will provision a VM with TPU access in the specified zone.
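
Once the VM is ready, you can SSH directly into the TPU host with the same tpu-vm command group; this drops you into a shell on the machine the TPU is attached to:

gcloud compute tpus tpu-vm ssh tpu-base-eu --zone=europe-west4-a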

Step 3: Verify and Start the TPU VM

List the TPU VMs to confirm that the instance was created successfully.

gcloud compute tpus tpu-vm list --zone=europe-west4-a 

Expected output:

[Screenshot: output of the list command showing the tpu-base-eu TPU VM in zone europe-west4-a]
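
A freshly created TPU VM normally comes up in the READY state. If it has been stopped at some point, it can be brought back with the matching start subcommand:

gcloud compute tpus tpu-vm start tpu-base-eu --zone=europe-west4-a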