Cannot use Apache Hudi on Dataproc

I am trying to use the Apache Hudi component on a Dataproc cluster.

I ran the example code provided by Google (https://cloud.google.com/dataproc/docs/concepts/components/hudi), but it doesn't work.

When I run the Spark query with Hudi, I get the following error:

java.lang.ClassNotFoundException:
  Failed to find data source: hudi. Please find packages at
  https://spark.apache.org/third-party-projects.html

Also, according to the documentation, the executable script should be located in the path below.

/usr/lib/hudi/cli

But that path doesn't exist on the cluster.

Below is the command I used to create the cluster with the Hudi component:

gcloud dataproc clusters create hudi-poc \
  --enable-component-gateway --master-machine-type n2-standard-2 \
  --master-boot-disk-size 200 --num-workers 2 \
  --worker-machine-type e2-standard-2 --worker-boot-disk-size 100 \
  --image-version 2.1.2-ubuntu20 --region us-central1 \
  --scopes 'https://www.googleapis.com/auth/cloud-platform' \
  --optional-components HUDI

Has anyone had success using the Hudi component on a Dataproc cluster?

1 Reply

Hi,

Try adding this property to your cluster creation command:

--properties spark:spark.jars.packages="org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0" \
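
For reference, here is a rough sketch of how the full command from the question might look with that property appended. I haven't run it against this exact image, so treat it as a starting point. The bundle coordinates are the ones given above; the HUDI_BUNDLE, PROPERTIES, RUNNER and RUN names are just helpers I made up for this example and are not part of gcloud. By default it only prints the command so you can review it first.

```shell
# Maven coordinates from the reply above (Spark 3.3 / Scala 2.12 bundle, Hudi 0.12.0).
# Assumption: this matches the Spark version on your image; adjust if it does not.
HUDI_BUNDLE="org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0"
PROPERTIES="spark:spark.jars.packages=${HUDI_BUNDLE}"

# Hypothetical dry-run switch: prints the command unless RUN=1 is set.
# With RUN=1 an installed and authenticated gcloud is required.
if [ "${RUN:-0}" = "1" ]; then RUNNER=""; else RUNNER="echo"; fi

# Same flags as the original command, plus --properties at the end.
$RUNNER gcloud dataproc clusters create hudi-poc \
  --enable-component-gateway --master-machine-type n2-standard-2 \
  --master-boot-disk-size 200 --num-workers 2 \
  --worker-machine-type e2-standard-2 --worker-boot-disk-size 100 \
  --image-version 2.1.2-ubuntu20 --region us-central1 \
  --scopes 'https://www.googleapis.com/auth/cloud-platform' \
  --optional-components HUDI \
  --properties "$PROPERTIES"
```

Once the printed command looks right, rerun with RUN=1 to actually create the cluster. Since the property is set at cluster creation time, you would need to create a new cluster (or pass the package when submitting the job) rather than expect an existing cluster to pick it up.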