SQL Server table gets deleted when trying to overwrite the data in it from Databricks Spark

Asked by zxlwwiss on 2023-06-21 in Spark

I am trying to write DataFrame data into a table in Azure SQL from Databricks using PySpark. The table dbo.test already exists in the database, and I am able to read it before I execute the write operation below.

testDf.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
        .option("truncate", "false")\
        .option("url", azure_sql_url).option("dbtable", 'dbo.test')\
        .option("databaseName", database_name)\
        .option("user", username) \
        .option("password", password) \
        .option("encrypt", "true")\
        .option("hostNameInCertificate", "*.database.windows.net")\
        .option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
        .option("bulkCopyTimeout", "6000000")\
        .save()

After executing this command, the following error is returned:

java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;Lscala/Option;)Ljava/lang/String;

Surprisingly, the dbo.test table gets deleted.

Can someone help me understand why this is happening? The same code works fine in another environment.

Answer by 0qx6xfy6:

java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.schemaString(Lorg/apache/spark/sql/Dataset;Ljava/lang/String;Lscala/Option;)Ljava/lang/String;

The cause of the error is a version mismatch between Spark SQL and the Spark connector for SQL Server (spark-mssql-connector). Also, mode("overwrite") by default drops the table if it already exists and re-creates it; here the table was dropped first, and the connector then failed with the NoSuchMethodError before it could re-create the table, which is why dbo.test disappeared.
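
If you want overwrite to keep the existing table instead of dropping it, the usual Spark JDBC behaviour is to truncate when option("truncate", "true") is combined with overwrite mode. A minimal sketch, assuming the same testDf, azure_sql_url, username and password as in the question:

# With "truncate" = "true", overwrite truncates dbo.test in place instead of
# dropping and re-creating it, so a later failure cannot leave the table gone.
testDf.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
        .option("truncate", "true")\
        .option("url", azure_sql_url).option("dbtable", 'dbo.test')\
        .option("user", username) \
        .option("password", password) \
        .save()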

Here are the supported versions of the Spark connector for SQL Server with their respective Spark versions:

  • Spark 2.4.x : com.microsoft.azure:spark-mssql-connector:1.0.2
  • Spark 3.0.x : com.microsoft.azure:spark-mssql-connector_2.12:1.1.0
  • Spark 3.1.x : com.microsoft.azure:spark-mssql-connector_2.12:1.2.0

Resolution:

If you are using a Databricks Runtime version greater than 10, you need to downgrade to Databricks Runtime 9.1 LTS or lower and use the connector version that matches its Spark version.
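
Before installing, you can confirm which Spark version the cluster runs so you pick the matching connector from the list above; a quick check in a notebook cell, assuming the notebook-attached SparkSession is named spark:

# Print the cluster's Spark version, e.g. "3.0.1" maps to
# com.microsoft.azure:spark-mssql-connector_2.12:1.1.0 above.
print(spark.version)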

  • I downgraded the Databricks runtime version to 7.3 LTS.

  • Then I installed the appropriate library for Spark 3.0.x, which is com.microsoft.azure:spark-mssql-connector_2.12:1.1.0.

  • And I tried your code; it works fine:

df_name.write.format("com.microsoft.sqlserver.jdbc.spark").mode("overwrite")\
        .option("truncate", "false")\
        .option("url", "Azure_sql_url").option("dbtable", 'dbo.test')\
        .option("databaseName", "databasename")\
        .option("user", "username") \
        .option("password", "password") \
        .option("encrypt", "true")\
        .option("hostNameInCertificate", "*.database.windows.net")\
        .option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
        .option("bulkCopyTimeout", "6000000")\
        .save()

(Screenshots of my execution and its output omitted.)

Alternatively, you can write directly with the built-in JDBC source, df_name.write.format("jdbc").

Sample code:

df_name.write.format("jdbc").mode("overwrite")\
        .option("truncate", "false")\
        .option("url", "Azure_sql_url").option("dbtable", 'dbo.test')\
        .option("databaseName", "databasename")\
        .option("user", "username") \
        .option("password", "password") \
        .option("encrypt", "true")\
        .option("hostNameInCertificate", "*.database.windows.net")\
        .option("bulkCopyBatchSize", 10000).option("bulkCopyTableLock", "true")\
        .option("bulkCopyTimeout", "6000000")\
        .save()
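
With the generic "jdbc" source, the SQL Server specific settings (database, encryption, certificate host name) can also be carried in the connection URL itself. A minimal sketch with placeholder server, database and credential values:

# Placeholder Azure SQL JDBC URL; the semicolon-separated properties are
# standard Microsoft JDBC driver connection properties.
jdbc_url = ("jdbc:sqlserver://<servername>.database.windows.net:1433;"
            "database=<databasename>;encrypt=true;"
            "hostNameInCertificate=*.database.windows.net;loginTimeout=30")

df_name.write.format("jdbc").mode("overwrite")\
        .option("truncate", "false")\
        .option("url", jdbc_url).option("dbtable", 'dbo.test')\
        .option("user", "username") \
        .option("password", "password") \
        .save()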

You can also refer to a similar issue on GitHub or a similar Stack Overflow thread.

Reference: https://learn.microsoft.com/en-us/sql/connect/spark/connector?view=sql-server-ver16
