Setup Kubeflow Cluster in a Shared VPC on Google Cloud Platform

Lei Feng Blog
4 min read · May 12, 2020

Kubeflow is an open-source project that aims to make running ML workloads on Kubernetes simple, portable, and scalable. However, setting up a Kubeflow cluster in a shared VPC on Google Cloud Platform cannot yet be done through the web console. This post describes the steps you need to follow to set up Kubeflow in a shared VPC from the command line.

Prepare the Environment

Step 1: Install the Google Cloud SDK. If you use Cloud Shell, enable boost mode. The following steps have been tested in Cloud Shell.

Step 2: Run the following command in the service project to check if any subnet and secondary ranges are usable:

gcloud container subnets list-usable \
--project <Service project ID> \
--network-project <Host project ID>

Step 3: Follow this guide to set up OAuth for Cloud IAP.

Step 4: Download and install kfctl from kfctl releases page through the following commands:

# make a bin folder to contain the kfctl
mkdir ~/bin
# download kfctl from the releases page
wget -P /tmp <URL of the kfctl release tarball>
# unzip to the bin folder
tar -xvf /tmp/kfctl_v1.0-0-g94c35cf_linux.tar.gz -C ~/bin
# add the kfctl binary to the PATH
export PATH=$PATH:~/bin

Step 5: Configure gcloud default values for zone and project:

# Set your GCP project ID and the zone where you want to create 
# the Kubeflow deployment:
export PROJECT=<your GCP project ID>
export ZONE=<your GCP zone>
gcloud config set project ${PROJECT}
gcloud config set compute/zone ${ZONE}

Step 6: Select the KFDef spec to use as the basis for your deployment:

export CONFIG_URI="<URL of the KFDef spec>"

Step 7: Create environment variables containing the OAuth client ID and secret that you created earlier:

export CLIENT_ID=<CLIENT_ID from OAuth page>
export CLIENT_SECRET=<CLIENT_SECRET from OAuth page>
  • The CLIENT_ID and CLIENT_SECRET can be obtained from the Cloud Console by selecting APIs & Services -> Credentials

Step 8: Pick names for your Kubeflow deployment and directory for your configuration:

export KF_NAME=<your choice of name for the Kubeflow deployment>
export BASE_DIR=<path to a base directory>
export KF_DIR=${BASE_DIR}/${KF_NAME}
  • For example, your kubeflow deployment name can be ‘my-kubeflow’ or ‘kf-test’.
  • Set base directory where you want to store one or more Kubeflow deployments. For example, ${HOME}/kf_deployments.
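Putting the suggestions above together, a concrete setup might look like this (the names `kf-test` and `${HOME}/kf_deployments` are the hypothetical examples from the bullets, not required values):

```shell
# Hypothetical example values, following the suggestions above
export KF_NAME=kf-test
export BASE_DIR=${HOME}/kf_deployments
export KF_DIR=${BASE_DIR}/${KF_NAME}
# The deployment directory resolves to e.g. /home/you/kf_deployments/kf-test
echo ${KF_DIR}
```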

Deploy Kubeflow with Customization

Step 1: Download the KFDef file to your local directory to allow modification:

export CONFIG_FILE="kfdef.yaml"
mkdir -p ${KF_DIR}
cd ${KF_DIR}
curl -L -o ${CONFIG_FILE} ${CONFIG_URI}

Step 2: Edit the KFDef spec in the YAML file. The following commands show you how to set values in the configuration file using yq:

yq w -i ${CONFIG_FILE} 'spec.plugins[0].spec.project' ${PROJECT}
yq w -i ${CONFIG_FILE} 'spec.plugins[0].spec.zone' ${ZONE}
yq w -i ${CONFIG_FILE} 'metadata.name' ${KF_NAME}

Step 3: Run the kfctl build command to generate kustomize and GCP deployment manager configuration files for your deployment:

cd ${KF_DIR}
kfctl build -V -f ${CONFIG_FILE}

Step 4: Update the ${KF_DIR}/gcp_config/cluster.jinja file created in Step 3 to specify the network and subnetwork:

name: {{ CLUSTER_NAME }}
network: projects/<host project ID>/global/networks/<network name>
subnetwork: projects/<host project ID>/regions/<region>/subnetworks/<subnet name>
initialClusterVersion: "{{ properties['cluster-version'] }}"

Step 5: In ${KF_DIR}/gcp_config/cluster.jinja, disable subnetwork creation and specify the secondary IP ranges by name (the ipAllocationPolicy section may have to be moved out of the if block if you are not setting privatecluster to true):

{% if properties['securityConfig']['privatecluster'] %}
ipAllocationPolicy:
  createSubnetwork: false
  useIpAliases: true
  clusterSecondaryRangeName: <name of secondary IP range for pods>
  servicesSecondaryRangeName: <name of secondary IP range for services>

Step 6: Enable private clusters by editing ${KF_DIR}/gcp_config/cluster-kubeflow.yaml and updating the following two parameters:

privatecluster: true
gkeApiVersion: v1beta1

Step 7: With the above changes in place, run the kfctl apply command to deploy Kubeflow:

cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_FILE}

Cluster creation will take 3 to 5 minutes to complete. Do not proceed until the command prompt is returned in the console.

Note: You may notice warnings in the console such as "Encountered error applying application cert-manager" and "Default user namespace pending creation"; these are advisory and will not affect completion of the cluster.
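Rather than watching the prompt, you can poll the cluster status from another shell. The helper below is a sketch: it assumes KF_NAME, ZONE, and PROJECT are still set from the earlier steps, and uses the status field reported by gcloud container clusters describe:

```shell
# Define a helper that waits until the GKE cluster reports RUNNING.
# Assumes KF_NAME, ZONE, and PROJECT are exported as in the steps above.
wait_for_cluster() {
  until [ "$(gcloud container clusters describe "${KF_NAME}" \
      --zone "${ZONE}" --project "${PROJECT}" \
      --format='value(status)')" = "RUNNING" ]; do
    echo "Cluster not ready yet; waiting 30s..."
    sleep 30
  done
  echo "Cluster ${KF_NAME} is RUNNING."
}
```

Call `wait_for_cluster` after `kfctl apply` returns to confirm the cluster itself is up before inspecting workloads.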

While it is running, you can view instantiation of the following objects in the GCP Console:

In Deployment Manager, two deployment objects will appear:

  • {KF_NAME}-storage
  • {KF_NAME}

In Kubernetes Engine, a cluster named {KF_NAME} will appear:

  • In the Workloads section, a number of Kubeflow components
  • In the Services section, a number of Kubeflow services
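The same objects can be inspected from the command line instead of the Console. A sketch, assuming PROJECT is still set from the earlier steps (the guard simply skips the calls if gcloud is not on the PATH):

```shell
# CLI equivalents of the Console views above
if command -v gcloud >/dev/null 2>&1; then
  # Lists the two Deployment Manager deployments
  gcloud deployment-manager deployments list --project "${PROJECT}"
  # Lists the newly created GKE cluster
  gcloud container clusters list --project "${PROJECT}"
else
  echo "gcloud not found in PATH"
fi
```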

When the deployment finishes, check the resources installed in the namespace kubeflow in your new cluster. To do this from the command line, first set your kubectl credentials to point to the new cluster:

gcloud container clusters get-credentials ${KF_NAME} --zone ${ZONE} --project ${PROJECT}

Then see what’s installed in the kubeflow namespace of your GKE cluster:

kubectl -n kubeflow get all

Delete Kubeflow Deployment

Once you are done, or if something goes wrong, you can use the following commands to delete the deployment.

If you want to delete all the resources, including storage:

kfctl delete -f ${CONFIG_FILE} --delete_storage

If you want to preserve storage, which contains metadata and other information:

kfctl delete -f ${CONFIG_FILE}

If you want to do it from the web console, go to the Deployment Manager page and delete the deployments named ${KF_NAME} and ${KF_NAME}-storage. You might also need to check the instance groups under the Compute Engine page; before you can delete those instance groups, you may first need to delete the backend service in the Backends tab under Load balancing on the Network Services page.
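The Deployment Manager part of that cleanup can also be scripted. A sketch, assuming KF_NAME and PROJECT are still set and that the deployment names follow the defaults shown earlier (${KF_NAME} and ${KF_NAME}-storage):

```shell
# Delete both Deployment Manager deployments from the CLI.
# The guard skips the call if gcloud is not on the PATH.
if command -v gcloud >/dev/null 2>&1; then
  gcloud deployment-manager deployments delete \
      "${KF_NAME}" "${KF_NAME}-storage" \
      --project "${PROJECT}" --quiet
else
  echo "gcloud not found in PATH"
fi
```

Orphaned instance groups and backend services still have to be removed separately, as described above.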


Originally published at Lei Feng Blog.