在sparkDataframe中将字符串数据类型列转换为maptype

myzjeezk  于 2021-05-24  发布在  Spark
关注(0)|答案(1)|浏览(740)

我有一个如下所示的Dataframe。我要将最后一列trandata从string类型转换为maptype。输出应该与我在第二个表中所示的类似。
我已经编写了udf,但它需要字符串并转换为maptype,我很难用sql.row作为输入获得类似的输出:(

def stringToMap(value: String): Map[String, String] = {
  var valMap = collection.mutable.Map[String, String]()
  val values = value.split(",")
  for (i <- values) {
    valMap = valMap + (i.split("=")(0) -> i.split("=")(1))
  }
  return valMap
}

+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|MESSAGEID     |CATEGORY|TRANDATA                                                                                                                                                                                                                                                                                       |
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|03010         |A       |threadID=123sada,ProcType=InfraLogging,TxnID=4mjx8wfogf
|03011         |A       |threadID=xmjxe2j0jz,ProcType=InfraLogging,TxnID=4mjxe2j0tf
|09941         |D       |compTxnID=xmawdew0tf,to=ABCD,threadID=4mjxe2j0jz,ProcType=InfraLogging
|00994         |D       |compTxnID=xmjxe2j0tf,to=XYZA,threadID=34jxasde0jz,ProcType=InfraLogging
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

表2:导出输出-第三列为maptype

+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|MESSAGEID     |CATEGORY|TRANDATA                                                                                                                                                                                                                                                                                       |
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|03010         |A       |Map(threadID -> 123sada,ProcType -> InfraLogging,TxnID -> 4mjx8wfogf)
vaqhlq81

vaqhlq811#

对于spark 2.4+,您可以将字符串拆分为键值对,然后使用transform将键和值分隔为两个数组列,然后使用map\ from\ arrays创建最终的Map。

df.withColumn("entry", split('TRANDATA, ","))
  .withColumn("key", expr("transform(entry, x -> split(x, '=')[0])"))
  .withColumn("value", expr("transform(entry, x -> split(x, '=')[1])"))
  .withColumn("map", map_from_arrays('key, 'value))
  .drop("entry", "key", "value", "TRANDATA")
  .show(false)

输出:

+---------+--------+----------------------------------------------------------------------------------------+
|MESSAGEID|CATEGORY|map                                                                                     |
+---------+--------+----------------------------------------------------------------------------------------+
|03010    |A       |[threadID -> 123sada, ProcType -> InfraLogging, TxnID -> 4mjx8wfogf]                    |
|03011    |A       |[threadID -> xmjxe2j0jz, ProcType -> InfraLogging, TxnID -> 4mjxe2j0tf]                 |
|09941    |D       |[compTxnID -> xmawdew0tf, to -> ABCD, threadID -> 4mjxe2j0jz, ProcType -> InfraLogging] |
|00994    |D       |[compTxnID -> xmjxe2j0tf, to -> XYZA, threadID -> 34jxasde0jz, ProcType -> InfraLogging]|
+---------+--------+----------------------------------------------------------------------------------------+

相关问题