Run Dataflow Jobs in a Shared VPC without Regional Endpoints on GCP

Organization and Shared VPC

Settings in Host Project

  • Add the user account or user group that runs Dataflow in the Service Project to the shared VPC with the Compute Network User role; otherwise, the user/group cannot see the Network shared to my project tab on the VPC network page in the Service Project.
  • Add the Dataflow service account from the Service Project, in the format service-<SERVICE_PROJECT_NUMBER>@dataflow-service-producer-prod.iam.gserviceaccount.com, to the shared subnet with the Compute Network User role. This Dataflow service account is created once you enable the Dataflow API, which is one of the settings in the Service Project below. Both grants can also be made with gcloud, as sketched below.
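
A minimal gcloud sketch of the two grants above, run by a Shared VPC admin in the Host Project. It assumes the shared subnet is named subnet-us-west-2 in region us-west2 (the same names used in the job submission examples later); on older gcloud releases the subnets add-iam-policy-binding command may live under gcloud beta.

# grant the user/group that runs Dataflow the Compute Network User role on the shared subnet
gcloud compute networks subnets add-iam-policy-binding subnet-us-west-2 \
--project [HOST_PROJECT_ID] \
--region us-west2 \
--member "user:[USER_EMAIL]" \
--role roles/compute.networkUser

# grant the Dataflow service account from the Service Project the same role on the shared subnet
gcloud compute networks subnets add-iam-policy-binding subnet-us-west-2 \
--project [HOST_PROJECT_ID] \
--region us-west2 \
--member "serviceAccount:service-[SERVICE_PROJECT_NUMBER]@dataflow-service-producer-prod.iam.gserviceaccount.com" \
--role roles/compute.networkUser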

Settings in Service Project

  • Enable the Cloud Dataflow, Compute Engine, and Cloud Storage APIs (see the gcloud sketch after this list). After the Cloud Dataflow API is enabled, the Cloud Dataflow service account, in the format service-<PROJECT_NUMBER>@dataflow-service-producer-prod.iam.gserviceaccount.com, is created. You can then ask the Shared VPC admin or Host Project Owner to add this service account to the shared VPC with the Compute Network User role.
  • In the IAM & admin page, make sure that the following role bindings exist in the Service Project. The equivalent gcloud commands are shown below; note that each add-iam-policy-binding call accepts a single --role, so the two roles for the Compute Engine service agent are granted in separate commands:
gcloud projects add-iam-policy-binding [PROJECT-ID] \
--member "serviceAccount:service-[PROJECT-NUMBER]@dataflow-service-producer-prod.iam.gserviceaccount.com" \
--role roles/dataflow.serviceAgent
gcloud projects add-iam-policy-binding [PROJECT-ID] \
--member "serviceAccount:service-[PROJECT-NUMBER]@compute-system.iam.gserviceaccount.com" \
--role roles/compute.networkUser
gcloud projects add-iam-policy-binding [PROJECT-ID] \
--member "serviceAccount:service-[PROJECT-NUMBER]@compute-system.iam.gserviceaccount.com" \
--role roles/compute.serviceAgent
  • Create a service account and a JSON key for authenticating job submission (see the gcloud sketch after this list for a command-line equivalent):
  1. From Cloud Console, go to the IAM & Admin page, and then Service Accounts.
  2. From the page, select New service account.
  3. Enter a Service account name and select Project > Owner as the Role.
  4. Click Create, and save the JSON file that contains your key to a local folder.
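
The Service Project setup above can also be scripted with gcloud. This is only a sketch: the service account name dataflow-runner and the key file key.json are example placeholders, and it assumes gcloud is already authenticated against the Service Project.

# enable the required APIs
gcloud services enable dataflow.googleapis.com compute.googleapis.com storage.googleapis.com \
--project [SERVICE_PROJECT_ID]

# create a service account for job submission and grant it Project > Owner, matching the console steps above
gcloud iam service-accounts create dataflow-runner --project [SERVICE_PROJECT_ID]
gcloud projects add-iam-policy-binding [SERVICE_PROJECT_ID] \
--member "serviceAccount:dataflow-runner@[SERVICE_PROJECT_ID].iam.gserviceaccount.com" \
--role roles/owner

# download a JSON key to use as GOOGLE_APPLICATION_CREDENTIALS later
gcloud iam service-accounts keys create key.json \
--iam-account dataflow-runner@[SERVICE_PROJECT_ID].iam.gserviceaccount.com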

Specifying Execution Parameters

Java and Maven

# generate the Beam WordCount example project from the Maven archetype
# (the proxy properties are only needed if you are behind a corporate proxy)
mvn archetype:generate \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=2.19.0 \
-DgroupId=org.example \
-DartifactId=word-count-beam \
-Dversion="0.1" \
-Dpackage=org.apache.beam.examples \
-DinteractiveMode=false \
-Dhttp.proxyHost=[YOUR-PROXY-HOST] \
-Dhttp.proxyPort=[YOUR-PROXY-PORT] \
-Dhttps.proxyHost=[YOUR-PROXY-HOST] \
-Dhttps.proxyPort=[YOUR-PROXY-PORT]
# point GOOGLE_APPLICATION_CREDENTIALS to the service account key file downloaded above
export GOOGLE_APPLICATION_CREDENTIALS="[path-to-authentication-key-file]"
echo $GOOGLE_APPLICATION_CREDENTIALS

# submit job
mvn -Pdataflow-runner compile exec:java \
-Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dhttp.proxyHost=[YOUR-PROXY-HOST] \
-Dhttp.proxyPort=[YOUR-PROXY-PORT] \
-Dhttps.proxyHost=[YOUR-PROXY-HOST] \
-Dhttps.proxyPort=[YOUR-PROXY-PORT] \
-Dexec.args="--project=[SERVICE_PROJECT_ID] \
--stagingLocation=gs://[CLOUD_STORAGE_BUCKET]/staging/ \
--output=gs://[CLOUD_STORAGE_BUCKET]/output/ \
--gcpTempLocation=gs://[CLOUD_STORAGE_BUCKET]/temp/ \
--runner=DataflowRunner \
--usePublicIps=false \
--region=us-west1 \
--zone=us-west2-a \
--subnetwork=https://www.googleapis.com/compute/v1/projects/[HOST_PROJECT_ID]/regions/us-west2/subnetworks/subnet-us-west-2"
  • For [SERVICE_PROJECT_ID] and [HOST_PROJECT_ID], make sure to use the project IDs, rather than the project names.
  • The subnetwork is the shared subnet that comes from the Host Project. Because there is no Dataflow regional endpoint in us-west2, you need to set --region to us-west1 (or any other region that has a Dataflow regional endpoint), while still setting --zone to one of the zones in us-west2 so the workers run in the same region as the shared subnet.
  • Don’t forget the parameter --usePublicIps=false if your organization doesn’t allow the use of external IP addresses.

Java and Eclipse

Python

export GOOGLE_APPLICATION_CREDENTIALS="[path-to-authentication-key-file]"
echo $GOOGLE_APPLICATION_CREDENTIALS

PROJECT=[SERVICE_PROJECT_ID]
BUCKET=gs://[CLOUD_STORAGE_BUCKET]

# submit the WordCount example job to Dataflow
python -m apache_beam.examples.wordcount \
--input gs://dataflow-samples/shakespeare/kinglear.txt \
--output $BUCKET/outputs \
--temp_location $BUCKET/tmp \
--runner DataflowRunner \
--project $PROJECT \
--region 'us-west1' \
--no_use_public_ips \
--subnetwork 'https://www.googleapis.com/compute/v1/projects/[HOST_PROJECT_ID]/regions/us-west2/subnetworks/subnet-us-west-2'
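
The Python example assumes the Apache Beam SDK with the GCP extras is already installed in your environment; the version pin below simply mirrors the 2.19.0 used by the Maven archetype above.

pip install 'apache-beam[gcp]==2.19.0'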
