Run Dataflow Jobs in a Shared VPC without Regional Endpoints on GCP

Google Cloud Dataflow is a fully managed service for running Apache Beam pipelines for unified stream and batch data processing. With its fully managed approach to resource provisioning and horizontal autoscaling, you have access to virtually unlimited capacity and optimized price-to-performance to solve your data pipeline processing challenges.

A Shared Virtual Private Cloud (VPC) allows an organization to connect resources from multiple projects to a common VPC network, so that cloud resources can communicate with each other securely and efficiently using internal IP addresses from that network. When you use Shared VPC, you designate one project as a Host Project and attach one or more Service Projects to it. The common VPC network shared across the Host and Service Projects is called the Shared VPC network. Using Shared VPC is a good practice for Organization Admins to keep centralized control over network resources while delegating administrative responsibilities to Service Project Admins.
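As a quick sketch, provisioning a Shared VPC with gcloud looks roughly like this, assuming the caller has the Shared VPC Admin role at the organization level (the project IDs are placeholders):

# Enable the Host Project to host a Shared VPC
gcloud compute shared-vpc enable [HOST_PROJECT_ID]

# Attach a Service Project to the Host Project
gcloud compute shared-vpc associated-projects add [SERVICE_PROJECT_ID] \
--host-project [HOST_PROJECT_ID]

# Confirm the association
gcloud compute shared-vpc associated-projects list --host-project [HOST_PROJECT_ID]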

Dataflow regional endpoints store and handle metadata about Dataflow jobs, and deploy and control Dataflow workers. However, Dataflow regional endpoints are not yet available in all Google Cloud regions.

If the Shared VPC does not have a subnet in a region that provides a Dataflow regional endpoint, you need a workaround to run your Dataflow jobs. In addition, because the Dataflow service needs to use the network shared by the Host Project, some necessary configurations need to be set up on both the Host and Service Projects. This post describes the steps and settings to run a Dataflow sample job in a Shared VPC that does not have any Dataflow regional endpoints.

Organization and Shared VPC

The organization structure used in this experiment is shown in the following table. The Organization node has two projects: one is the Service Project where the Dataflow job runs, and the other is the Host Project that hosts the Shared VPC.

As a best practice, the Organization Admin has set up the following constraints on the Organization Policies page under the organization's IAM & Admin menu.

The following table shows the Shared VPC from the Host Project. There is only one subnet, subnet-us-west-2 in region us-west2, with Private Google Access turned on. Unfortunately, there is no Dataflow regional endpoint available in this region.
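For reference, a subnet like this could be created with a command along the following lines; the network name and IP range are placeholders, while the subnet name, region, and Private Google Access setting match the setup used in this post:

# Create the shared subnet in us-west2 with Private Google Access turned on
gcloud compute networks subnets create subnet-us-west-2 \
--project [HOST_PROJECT_ID] \
--network [SHARED_VPC_NETWORK] \
--region us-west2 \
--range 10.0.0.0/24 \
--enable-private-ip-google-access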

In addition, the following firewall rules are applied to all instances in the Shared VPC network.
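Based on the Dataflow routes and firewall guide listed in the references, the Host Project typically needs a rule like the following so that Dataflow worker VMs on the shared network can communicate with each other; the rule and network names here are placeholders:

# Allow Dataflow workers (tagged "dataflow") to reach each other on TCP 12345-12346
gcloud compute firewall-rules create allow-dataflow-workers \
--project [HOST_PROJECT_ID] \
--network [SHARED_VPC_NETWORK] \
--direction INGRESS \
--action ALLOW \
--rules tcp:12345-12346 \
--source-tags dataflow \
--target-tags dataflow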

Settings in Host Project

The following settings might need the Shared VPC Admin or Owner role from the Organization or Host Project:

  • Add the user account or user group that runs Dataflow in the Service Project to the Shared VPC with the Compute Network User role; otherwise, the user/group cannot see the shared network and its subnets in the Service Project.
  • Add the Dataflow service account from the Service Project, in the format of service-[PROJECT-NUMBER]@dataflow-service-producer-prod.iam.gserviceaccount.com, to the shared subnet with the Compute Network User role. The Dataflow service account is created once you enable the Dataflow API, which is one of the settings in the Service Project below.

Note: Depending on which mode (project-level permission or individual subnet permission) you use to share the VPC, make sure the Dataflow service account from the Service Project has the Compute Network User role on the shared subnet; see the example command below.
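For example, if individual subnets are shared, a command along the following lines (with the subnet and region taken from the job examples later in this post) grants that role from the Host Project side:

# Grant the Service Project's Dataflow service account the Compute Network User
# role on the shared subnet (run with Shared VPC Admin permissions)
gcloud compute networks subnets add-iam-policy-binding subnet-us-west-2 \
--project [HOST_PROJECT_ID] \
--region us-west2 \
--member serviceAccount:service-[PROJECT-NUMBER]@dataflow-service-producer-prod.iam.gserviceaccount.com \
--role roles/compute.networkUser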

Settings in Service Project

The following settings might need the Owner or Editor role from the Service Project:

  • Enable the Cloud Dataflow, Compute Engine, and Cloud Storage APIs. After the Cloud Dataflow API is enabled, the Cloud Dataflow Service Account, in the format of service-[PROJECT-NUMBER]@dataflow-service-producer-prod.iam.gserviceaccount.com, should be created. Now you can contact the Shared VPC Admin or Host Project Owner to add the Dataflow service account to the shared VPC with the Compute Network User role. (The gcloud sketch after this list shows an equivalent way to enable the APIs.)
  • On the IAM & Admin page, make sure that:

1. the Cloud Dataflow Service Account has the Dataflow Service Agent role (roles/dataflow.serviceAgent), or you can run the following command to add it:

gcloud projects add-iam-policy-binding [PROJECT-ID] \
--member serviceAccount:service-[PROJECT-NUMBER]@dataflow-service-producer-prod.iam.gserviceaccount.com \
--role roles/dataflow.serviceAgent

2. the Compute Engine Service Agent has the roles/compute.networkUser and roles/compute.serviceAgent roles. Otherwise, use the following commands to add them (gcloud accepts one --role per binding call, so the roles are added one at a time):

gcloud projects add-iam-policy-binding [PROJECT-ID] \
--member serviceAccount:service-[PROJECT-NUMBER]@compute-system.iam.gserviceaccount.com \
--role roles/compute.networkUser

gcloud projects add-iam-policy-binding [PROJECT-ID] \
--member serviceAccount:service-[PROJECT-NUMBER]@compute-system.iam.gserviceaccount.com \
--role roles/compute.serviceAgent
  • Create a service account key (JSON) to authenticate your pipeline code; a gcloud sketch of these steps is shown below:
  1. From the Cloud Console, go to the IAM & Admin page, and then Service Accounts.
  2. On that page, select New service account.
  3. Enter a name for the Service account name and select Project > Owner as the Role.
  4. Click Create. Save the JSON file that contains your key to your local folder.

You will need the path of this JSON file to set the GOOGLE_APPLICATION_CREDENTIALS environment variable in the following scripts.
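If you prefer the command line, a rough gcloud equivalent of the API and service-account steps above might look like the following; the service account name dataflow-runner and the key path are placeholders, so adjust them to your environment:

# Enable the Cloud Dataflow, Compute Engine, and Cloud Storage APIs in the Service Project
gcloud services enable dataflow.googleapis.com compute.googleapis.com \
storage-component.googleapis.com --project [SERVICE_PROJECT_ID]

# Create a service account (the name is a placeholder) and grant it Project > Owner,
# matching the console steps above
gcloud iam service-accounts create dataflow-runner --project [SERVICE_PROJECT_ID]

gcloud projects add-iam-policy-binding [SERVICE_PROJECT_ID] \
--member serviceAccount:dataflow-runner@[SERVICE_PROJECT_ID].iam.gserviceaccount.com \
--role roles/owner

# Download a JSON key to use as GOOGLE_APPLICATION_CREDENTIALS (the path is a placeholder)
gcloud iam service-accounts keys create ~/dataflow-key.json \
--iam-account dataflow-runner@[SERVICE_PROJECT_ID].iam.gserviceaccount.com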

Specifying Execution Parameters

This section describes the settings for Java SDK 2.x and Python 3.x. Check the quickstarts for more information.

Java and Maven

If you use Java and Apache Maven, you need to run the following command to create the Maven project containing the Apache Beam WordCount source code.

$ mvn archetype:generate \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=2.19.0 \
-DgroupId=org.example \
-DartifactId=word-count-beam \
-Dversion="0.1" \
-Dpackage=org.apache.beam.examples \
-DinteractiveMode=false \
-Dhttp.proxyHost=[YOUR-PROXY-HOST] \
-Dhttp.proxyPort=[YOUR-PROXY-PORT] \
-Dhttps.proxyHost=[YOUR-PROXY-HOST] \
-Dhttps.proxyPort=[YOUR-PROXY-PORT]

Note: You need the last four lines only if you are behind a proxy; otherwise, feel free to skip them.

Once the Maven project is created, you can use the following scripts to run the Apache Beam job on Dataflow:

# load the application credential file
export GOOGLE_APPLICATION_CREDENTIALS="[path-to-authentication-key-file]"
echo $GOOGLE_APPLICATION_CREDENTIALS

# submit job
mvn -Pdataflow-runner compile exec:java \
-Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dhttp.proxyHost=[YOUR-PROXY-HOST] \
-Dhttp.proxyPort=[YOUR-PROXY-PORT] \
-Dhttps.proxyHost=[YOUR-PROXY-HOST] \
-Dhttps.proxyPort=[YOUR-PROXY-PORT] \
-Dexec.args="--project=[SERVICE_PROJECT_ID] \
--stagingLocation=gs://[CLOUD_STORAGE_BUCKET]/staging/ \
--output=gs://[CLOUD_STORAGE_BUCKET]/output/ \
--gcpTempLocation=gs://[CLOUD_STORAGE_BUCKET]/temp/ \
--runner=DataflowRunner \
--usePublicIps=false \
--region=us-west1 \
--zone=us-west2-a \
--subnetwork=https://www.googleapis.com/compute/v1/projects/[HOST_PROJECT_ID]/regions/us-west2/subnetworks/subnet-us-west-2"

Please refer to the Specifying execution parameters documentation for a detailed explanation of each parameter. A few points are worth highlighting here:

  • For [SERVICE_PROJECT_ID] and [HOST_PROJECT_ID], make sure to use the project IDs, rather than the project names.
  • The --subnetwork is the shared subnet that comes from the Host Project. Because there is no Dataflow regional endpoint in region us-west2, you need to set --region to us-west1 or any other region that has a Dataflow regional endpoint, but you still need to set --zone to one of the zones in us-west2. You can confirm which regional endpoint handles the job with the command shown after this list.
  • Don’t forget the --usePublicIps=false parameter if your organization doesn’t allow the use of external IP addresses.
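As a quick sanity check, the job should show up under the us-west1 regional endpoint even though its workers run in the us-west2 subnet:

# List Dataflow jobs handled by the us-west1 regional endpoint
gcloud dataflow jobs list --project [SERVICE_PROJECT_ID] --region us-west1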

Java and Eclipse

If you use Eclipse, you can set the above parameters in the Arguments tab of Run Configurations; see below for an example:
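For example, the Program arguments field would contain the same options that are passed through -Dexec.args in the Maven command above:

--project=[SERVICE_PROJECT_ID]
--stagingLocation=gs://[CLOUD_STORAGE_BUCKET]/staging/
--output=gs://[CLOUD_STORAGE_BUCKET]/output/
--gcpTempLocation=gs://[CLOUD_STORAGE_BUCKET]/temp/
--runner=DataflowRunner
--usePublicIps=false
--region=us-west1
--zone=us-west2-a
--subnetwork=https://www.googleapis.com/compute/v1/projects/[HOST_PROJECT_ID]/regions/us-west2/subnetworks/subnet-us-west-2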

Python

If you use Python, use the following settings to run the Dataflow job in the Service Project described in this post.

export GOOGLE_APPLICATION_CREDENTIALS="[path-to-authentication-key-file]"
echo $GOOGLE_APPLICATION_CREDENTIALS

PROJECT=[SERVICE_PROJECT_ID]
BUCKET=gs://[CLOUD_STORAGE_BUCKET]

python -m apache_beam.examples.wordcount \
--input gs://dataflow-samples/shakespeare/kinglear.txt \
--output $BUCKET/outputs \
--temp_location $BUCKET/tmp \
--runner DataflowRunner \
--project $PROJECT \
--region 'us-west1' \
--no_use_public_ips \
--subnetwork 'https://www.googleapis.com/compute/v1/projects/[HOST_PROJECT_ID]/regions/us-west2/subnetworks/subnet-us-west-2'
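Once the job finishes, a quick way to inspect the word-count output, assuming the same BUCKET variable defined above:

# The output is written in shards; list them and peek at the first few lines
gsutil ls $BUCKET/outputs*
gsutil cat $BUCKET/outputs* | head -n 20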

I hope this post is helpful. If I missed anything, please let me know in the comments.

References

https://cloud.google.com/dataflow/docs/guides/specifying-networks

https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#cloud-dataflow-service-account

https://cloud.google.com/dataflow/docs/guides/routes-firewall

https://cloud.google.com/dataflow/docs/concepts/regional-endpoints

https://cloud.google.com/dataflow/docs/guides/specifying-exec-params

Originally published at http://leifengblog.net.

