INSERT fails through PySpark SQL but works directly in SQL Server

g52tjvyc  posted on 2021-05-27  in Spark

I created a table in SQL Server with the following statement:

CREATE TABLE [dbo].[Validation](
    [RuleId] [int] IDENTITY(1,1) NOT NULL,
    [AppId] [varchar](255) NOT NULL,
    [Date] [date] NOT NULL,
    [RuleName] [varchar](255) NOT NULL,
    [Value] [nvarchar](4000) NOT NULL
)

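Note the identity column (RuleId).
Inserting values into the table directly in SQL looks like this:
Note: the primary key is not supplied in the INSERT; it is filled in automatically, starting at 1 when the table is empty and incrementing from there.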

INSERT INTO dbo.Validation VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')

However, when I create a temp view on Databricks and run the same statement through PySpark:

%python

driver = "<Driver>"
url = "jdbc:sqlserver:<URL>"
database = "<db>"
table = "dbo.Validation"
user = "<user>"
password = "<pass>"

# Read the remote table over JDBC
remote_table = (
    spark.read.format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("database", database)
    .option("dbtable", table)
    .option("user", user)
    .option("password", password)
    .load()
)

remote_table.createOrReplaceTempView("YOUR_TEMP_VIEW_NAMES")

# This is the statement that raises the AnalysisException shown below
sqlContext.sql("INSERT INTO YOUR_TEMP_VIEW_NAMES VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')")

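I get the following error:
AnalysisException: 'unknown' requires that the data to be inserted have the same number of columns as the target table: target table has 5 column(s) but the inserted data has 4 column(s), including 0 partition column(s) having constant value(s).;
Why does this work in SQL Server but not when the query is passed through Databricks? How can I insert through PySpark without hitting this error?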

iaqfqrcu1#

The simplest solution here is to use JDBC directly from a Scala cell. For example:

%scala

import java.util.Properties
import java.sql.DriverManager

val jdbcUsername = dbutils.secrets.get(scope = "kv", key = "sqluser")
val jdbcPassword = dbutils.secrets.get(scope = "kv", key = "sqlpassword")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://xxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"

// Create a Properties() object to hold the parameters.

val connectionProperties = new Properties()

connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)

val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "INSERT INTO dbo.Validation VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')"

stmt.execute(sql)
connection.close()

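You can also use pyodbc, but the SQL Server ODBC driver is not installed by default, whereas the JDBC driver is.
A Spark-native solution is to create a view in SQL Server that leaves out the identity column, and insert into that view. For example: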

create view Validation2 as
select AppId,Date,RuleName,Value
from Validation

Then:

tableName = "Validation2"
df = spark.read.jdbc(url=jdbcUrl, table=tableName, properties=connectionProperties)
df.createOrReplaceTempView(tableName)
sqlContext.sql("INSERT INTO Validation2 VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')")
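
The snippet above is Python, but `jdbcUrl` and `connectionProperties` were defined in the Scala cell, so in a Python cell they would have to be defined again, with the properties passed as a plain dict. A minimal sketch of how that could look, assuming the same view `Validation2`; the helper names `jdbc_properties` and `insert_through_view` are hypothetical and not part of any library:

```python
def jdbc_properties(user, password,
                    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"):
    """Build the properties dict that spark.read.jdbc accepts in Python."""
    return {"user": user, "password": password, "driver": driver}


def insert_through_view(spark, url, view_name, properties, values_sql):
    """Expose the SQL Server view as a Spark temp view and insert into it.

    `spark` is assumed to be the active SparkSession. `values_sql` is the
    parenthesised VALUES tuple; it must have one value per view column,
    which is why the view leaves out the identity column.
    """
    df = spark.read.jdbc(url=url, table=view_name, properties=properties)
    df.createOrReplaceTempView(view_name)
    spark.sql(f"INSERT INTO {view_name} VALUES {values_sql}")


# Example call in a Databricks Python cell (not executed here):
# props = jdbc_properties(jdbcUsername, jdbcPassword)
# insert_through_view(
#     spark, jdbcUrl, "Validation2", props,
#     "('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')",
# )
```

Since the view exposes only four columns, the four supplied values match what Spark expects, and SQL Server generates RuleId itself.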

If you want to encapsulate the Scala code and call it from another language (such as Python), you can use a Scala package cell.

%scala

package example

import java.util.Properties
import java.sql.DriverManager

object JDBCFacade 
{
  def runStatement(url : String, sql : String, userName : String, password: String): Unit = 
  {
    val connection = DriverManager.getConnection(url, userName, password)
    val stmt = connection.createStatement()
    try
    {
      stmt.execute(sql)  
    }
    finally
    {
      connection.close()  
    }
  }
}

Then you can call it like this:

jdbcUsername = dbutils.secrets.get(scope = "kv", key = "sqluser")
jdbcPassword = dbutils.secrets.get(scope = "kv", key = "sqlpassword")

jdbcUrl = "jdbc:sqlserver://xxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"

sql = "select 1 a into #foo from sys.objects"

sc._jvm.example.JDBCFacade.runStatement(jdbcUrl,sql, jdbcUsername, jdbcPassword)
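
Because `runStatement` takes a raw SQL string, any values have to be embedded as literals. Below is a rough sketch of building the INSERT from the question with an explicit column list, so that RuleId is left for SQL Server to generate. The helpers `sql_literal` and `build_insert` are hypothetical names introduced for illustration; the escaping only doubles single quotes, which is the T-SQL rule for string literals, so for untrusted input a parameterized statement would be the safer choice.

```python
def sql_literal(value):
    """Render a Python value as a T-SQL literal (illustrative, not exhaustive)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        # bool is a subclass of int, so handle it before the numeric branch
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return str(value)
    # Strings (and anything else) are quoted, with embedded quotes doubled
    return "'" + str(value).replace("'", "''") + "'"


def build_insert(table, columns, values):
    """Build an INSERT that names its columns, skipping the identity column."""
    if len(columns) != len(values):
        raise ValueError("columns and values must have the same length")
    column_list = ", ".join(f"[{name}]" for name in columns)
    value_list = ", ".join(sql_literal(v) for v in values)
    return f"INSERT INTO {table} ({column_list}) VALUES ({value_list})"


sql = build_insert(
    "dbo.Validation",
    ["AppId", "Date", "RuleName", "Value"],
    ["TestApp", "2020-05-15", "MemoryUsageAnomaly", "2300MB"],
)
# In the notebook this string would then be passed to the Scala facade:
# sc._jvm.example.JDBCFacade.runStatement(jdbcUrl, sql, jdbcUsername, jdbcPassword)
```

Naming the columns also keeps the statement valid if the table's column order changes later.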
