Has anyone else run into this problem, and does anyone have an idea how to fix it?
I have been trying to update my code to use Spark 2.0.1 and Scala 2.11. On Spark 1.6.0 with Scala 2.10 everything worked fine. I have a straightforward DataFrame-to-DataFrame inner join that now returns an error. The data comes from AWS RDS Aurora. Note that the foo DataFrame below actually has 92 columns, not the two I show; the problem persists even with only two columns.
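For reference, a minimal sketch of roughly how the two DataFrames are produced and joined. The JDBC URL, credentials and SELECT statement are placeholders rather than my real 92-column query, and bar is built from a local Seq here only to stand in for the cached, RDD-backed DataFrame that shows up in the plan further down:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-repro").getOrCreate()
import spark.implicits._

// foo comes from Aurora over JDBC; url, user, password and query are placeholders.
val foo = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://my-aurora-host:3306/mydb")
  .option("dbtable", "(SELECT `Transaction ID`, BIN FROM transactions) AS x")
  .option("user", "user")
  .option("password", "password")
  .load()

// bar stands in for the cached DataFrame built from an existing RDD in the real code.
val bar = Seq(("bbBW0", "10.99", "USD"), ("CyX50", "438.53", "USD"))
  .toDF("TranId", "Amount_USD", "Currency_Alpha")
  .cache()

val asdf = foo.join(bar, foo("Transaction ID") === bar("TranId"))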
Relevant information:
DataFrame 1 with schema
foo.show()
+--------------------+------+
| Transaction ID| BIN|
+--------------------+------+
| bbBW0|134769|
| CyX50|173622|
+--------------------+------+
println(foo.printSchema())
root
|-- Transaction ID: string (nullable = true)
|-- BIN: string (nullable = true)
DataFrame 2 with schema
bar.show()
+--------------------+-----------------+-------------------+
| TranId| Amount_USD| Currency_Alpha|
+--------------------+-----------------+-------------------+
| bbBW0| 10.99| USD|
| CyX50| 438.53| USD|
+--------------------+-----------------+-------------------+
println(bar.printSchema())
root
|-- TranId: string (nullable = true)
|-- Amount_USD: string (nullable = true)
|-- Currency_Alpha: string (nullable = true)
Joining the DataFrames, with explain
val asdf = foo.join(bar, foo("Transaction ID") === bar("TranId"))
println(foo.join(bar, foo("Transaction ID") === bar("TranId")).explain())
== Physical Plan ==
* BroadcastHashJoin [Transaction ID#0], [TranId#202], Inner, BuildRight
:- *Scan JDBCRelation((SELECT
...
I REMOVED A BUNCH OF LINES FROM THIS PRINT OUT
...
) as x) [Transaction ID#0,BIN#8] PushedFilters: [IsNotNull(Transaction ID)], ReadSchema: struct<Transaction ID:string,BIN:string>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]))
+- *Filter isnotnull(TranId#202)
+- InMemoryTableScan [TranId#202, Amount_USD#203, Currency_Alpha#204], [isnotnull(TranId#202)]
: +- InMemoryRelation [TranId#202, Amount_USD#203, Currency_Alpha#204], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
: : +- Scan ExistingRDD[TranId#202,Amount_USD#203,Currency_Alpha#204]
The error I get is:
16/10/18 11:36:50 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 6)
java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'ID IS NOT NULL)' at line 54
The full stack trace can be seen here (http://pastebin.com/c9bg2hft).
Nowhere in my code, or in the JDBC query that pulls the data from the database, do I have "ID IS NOT NULL)". I have spent a lot of time googling and found a commit for Spark that adds a null filter to the join's query plan. Here is the commit (https://git1-us-west.apache.org/repos/asf?p=spark.git;a=commit;h=ef770031).
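My best guess at what is going on (an assumption on my part, not something I have verified in the Spark source): the IsNotNull filter that this commit adds gets pushed down to the JDBC source and compiled into a WHERE clause without quoting the column name, so MySQL receives something like the fragment sketched below and stops parsing at the space in Transaction ID:

// Illustration only (assumption): roughly how the pushed-down filter would be
// rendered if the column name is not quoted for MySQL.
val pushedFilter = "Transaction ID IS NOT NULL"
val generatedWhere = s"WHERE ($pushedFilter)"
// MySQL reads "Transaction" as the identifier and then fails right at
// "ID IS NOT NULL)", which matches the error above. A backtick-quoted form,
// "`Transaction ID` IS NOT NULL", would be valid MySQL.
println(generatedWhere)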
1 Answer
Curious whether you have tried something like the following:
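One thing worth trying, assuming the space in the Transaction ID column name is what breaks the pushed-down filter: alias the column to a name without a space inside the JDBC subquery itself, so Spark (and any filter it pushes down) only ever sees the safe name. The URL, credentials and query below are placeholders, and bar is the second DataFrame from above:

// Placeholder connection details and query; the important part is the
// "AS Transaction_ID" alias inside the subquery.
val fooSafe = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://my-aurora-host:3306/mydb")
  .option("dbtable", "(SELECT `Transaction ID` AS Transaction_ID, BIN FROM transactions) AS x")
  .option("user", "user")
  .option("password", "password")
  .load()

val joined = fooSafe.join(bar, fooSafe("Transaction_ID") === bar("TranId"))

Renaming after the read (for example with withColumnRenamed) might not be enough, since the filter could still be pushed down against the original column name, so doing the rename inside the query itself seems safer.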