Running a job on a Dataproc cluster from a Compute Engine VM

Is it possible to trigger a job on a Dataproc cluster from a Compute Engine VM?
The sample 'hello world' job file resides in a GCS bucket.
Is there any documentation or sample code available for this?

ACCEPTED SOLUTION

Yes, it is possible to trigger a job on a Dataproc cluster from a Compute Engine VM.

Key Methods for Triggering Dataproc Jobs from a Google Compute Engine VM

  • Google Cloud SDK (gcloud): The easiest method for users familiar with the command line. The gcloud dataproc jobs submit command lets you directly submit various job types (PySpark, Hadoop, Hive, etc.) to your Dataproc cluster.

  • Dataproc REST API: Provides the most flexibility. You can make HTTP requests from practically any programming language (Python, Java, etc.), giving you precise control over job submission.

  • Apache Airflow: Well suited to complex workflows where the Dataproc job is one step in a larger sequence of actions. Airflow offers dedicated operators that streamline Dataproc job management (a minimal operator sketch appears below, after the REST API example).

Example: Using the Google Cloud SDK (gcloud)

  1. Install/Update the SDK: Check https://cloud.google.com/sdk/docs/ for instructions.

  2. Submit the Job:

     
    gcloud dataproc jobs submit pyspark \
        gs://<your-gcs-bucket>/hello_world.py \
        --cluster=<your-dataproc-cluster-name> \
        --region=<your-region> \
        --jars=gs://<path-to-jar-dependencies-if-any>
    
    • Replace <...> with your cluster name, region, and the paths to your PySpark script and any dependencies. The --jars flag is optional and only needed if your job depends on external JARs.
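
  3. Check the Job (optional): You can confirm the job's status from the same VM. A quick check, using the same placeholders as above:

     gcloud dataproc jobs list --cluster=<your-dataproc-cluster-name> --region=<your-region>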

Example: Dataproc REST API (Python)

 
import requests
from google.auth import default
from google.auth.transport.requests import Request

# Application Default Credentials: on a Compute Engine VM this is the VM's
# service account (its access scopes must allow the Dataproc API).
credentials, project_id = default()
credentials.refresh(Request())  # fetch an access token; .token is empty until refreshed

cluster_name = "your-dataproc-cluster-name"
region = "your-region"

# jobs:submit request body -- the project and region go in the URL, not the body.
job_details = {
    "job": {
        "placement": {"clusterName": cluster_name},
        "pysparkJob": {"mainPythonFileUri": "gs://your-gcs-bucket/hello_world.py"},
    }
}

endpoint = f"https://dataproc.googleapis.com/v1/projects/{project_id}/regions/{region}/jobs:submit"
headers = {
    "Authorization": f"Bearer {credentials.token}",
    "Content-Type": "application/json",
}

response = requests.post(endpoint, headers=headers, json=job_details)
if response.status_code == 200:
    job_id = response.json().get("reference", {}).get("jobId")
    print(f"Job submitted successfully: {job_id}")
else:
    print(f"Job submission failed: {response.status_code} {response.text}")

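Example: Apache Airflow (sketch)

If the job is one step in a larger pipeline, the Google provider package (apache-airflow-providers-google) includes a DataprocSubmitJobOperator. A minimal DAG sketch, assuming the same placeholder project, region, cluster, and bucket names; adjust these for your environment:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

PYSPARK_JOB = {
    "reference": {"project_id": "your-project-id"},
    "placement": {"cluster_name": "your-dataproc-cluster-name"},
    "pyspark_job": {"main_python_file_uri": "gs://your-gcs-bucket/hello_world.py"},
}

with DAG(
    dag_id="dataproc_hello_world",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually; older Airflow versions use schedule_interval=None
    catchup=False,
) as dag:
    submit_pyspark = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id="your-project-id",
        region="your-region",
        job=PYSPARK_JOB,
    )
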
Important Considerations

  • Permissions: Ensure your Compute Engine VM's service account has the necessary Dataproc and GCS permissions (one possible IAM binding is sketched after this list).
  • Networking: If your Dataproc cluster is in a private network, make sure your Compute Engine VM can communicate with it (consider Private Google Access).
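
For the permissions point, a sketch of one way to grant them, assuming <vm-service-account-email> is the VM's service account. The exact roles depend on your setup, but roles/dataproc.editor plus read access to the job bucket is a common starting point:

gcloud projects add-iam-policy-binding <your-project-id> \
    --member="serviceAccount:<vm-service-account-email>" \
    --role="roles/dataproc.editor"

gcloud storage buckets add-iam-policy-binding gs://<your-gcs-bucket> \
    --member="serviceAccount:<vm-service-account-email>" \
    --role="roles/storage.objectViewer"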
