Can't connect to Dataproc Metastore from Dataproc Batch Jobs

Hi,
While trying to connect to the Hive metastore (Dataproc Metastore) using its thrift URL in my Spark configuration, I'm getting the Metastore exceptions included below.

Spark config:
spark = (
    SparkSession.builder
    .appName("IcebergSparkSrvrlesss")
    .config("spark.sql.catalog.iceberg_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.iceberg_catalog.type", "hive")
    .config("spark.sql.catalog.spark_catalog.uri", "thrift://*.*.*.*:9083")
    .config("spark.sql.catalog.iceberg_catalog.warehouse", "gs://usmedp-devstg-icebergpoc/iceberg-catalog")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the data and create the table
df_account_partition.writeTo(f"{iceberg_catalog}.{iceberg_warehouse}.icbg_account_tbl") \
    .tableProperty("format-version", "2") \
    .createOrReplace()

Exception:
Query for candidates of org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted in no possible candidates
Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required table missing : "VERSION" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"
at org.datanucleus.store.rdbms.table.AbstractTable.exists(AbstractTable.java:606)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)


Traceback (most recent call last):
File "/tmp/srvls-batch-461fc9f5-cf14-4b80-b879-a2f8f369d268/iceberg_hive.py", line 21, in <module>
spark.sql("CREATE NAMESPACE IF NOT EXISTS iceberg_catalog.test;")
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o83.sql.
: org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore
 org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:97)
at org.apache.spark.sql.execution.datasources.v2.CreateNamespaceExec.run(CreateNamespaceExec.scala:43)

... 51 more
Caused by: java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

... 63 more
Caused by: MetaException(message:Version information not found in metastore. )
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:83)
... 68 more
Caused by: MetaException(message:Version information not found in metastore. )


The exception you're encountering seems to be related to the Hive Metastore and its version table. The error message suggests that the required table "VERSION" is missing in the specified catalog and schema.

Here are a few suggestions to troubleshoot and resolve the issue:

Ensure that the Hive Metastore schema is properly initialized. The "VERSION" table is part of the Hive Metastore schema, so check whether the necessary tables, including "VERSION", exist in the metastore's backing database (a quick check from within the batch job is sketched below).

Also verify that the Hive Metastore version you are using is compatible with your Spark and Iceberg versions; compatibility issues can sometimes surface as missing-table errors like this one.
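To confirm the first point from within the batch job itself, a minimal connectivity check (a sketch, assuming the session was built with enableHiveSupport() and the Dataproc Metastore endpoint is reachable from the batch's network) is to list databases through the Hive-backed catalog:

# If the metastore schema is initialized and reachable, this returns at least
# the "default" database instead of raising a MetaException.
spark.sql("SHOW DATABASES").show()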

The error suggests enabling "datanucleus.schema.autoCreateTables." You can configure this property in your Spark configuration to allow DataNucleus to automatically create missing tables.

Add the following configuration to enable auto-creation:

.config("spark.hadoop.datanucleus.schema.autoCreateAll", "true")

Make sure you are using compatible versions of Spark, Hive, and Iceberg. Check for any updates or patches that might address the issue.
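One way to see which versions the batch job is actually running with (a quick sketch; the config key below is a standard Spark setting and only reflects the Hive client used by the built-in session catalog, not the Iceberg catalog's bundled client):

# Print the Spark version and the Hive metastore client version Spark is
# configured to use; compare these with your Dataproc Metastore's Hive version.
print("Spark version:", spark.version)
print("Hive metastore client version:",
      spark.conf.get("spark.sql.hive.metastore.version", "<runtime default>"))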

Here's an example of how you can update your Spark configuration with the suggested changes:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("IcebergSparkSrvrlesss")
    .config("spark.sql.catalog.iceberg_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.iceberg_catalog.type", "hive")
    # thrift endpoint of the Dataproc Metastore service
    .config("spark.sql.catalog.spark_catalog.uri", "thrift://*.*.*.*:9083")
    .config("spark.sql.catalog.iceberg_catalog.warehouse", "gs://usmedp-devstg-icebergpoc/iceberg-catalog")
    # Allow DataNucleus to create any missing metastore tables, such as "VERSION"
    .config("spark.hadoop.datanucleus.schema.autoCreateAll", "true")
    .enableHiveSupport()
    .getOrCreate()
)
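If the metastore connection is healthy after these changes, the statement that failed in the original traceback should now succeed. A short follow-up check, reusing the namespace name from the post, could look like:

# Re-run the statement that previously raised RuntimeMetaException, then list
# the namespaces visible through the Iceberg catalog.
spark.sql("CREATE NAMESPACE IF NOT EXISTS iceberg_catalog.test")
spark.sql("SHOW NAMESPACES IN iceberg_catalog").show()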