Selecting a value from a MapType column inside a PySpark UDF

Asked by b1payxdu on 2023-11-16 · Spark

I am trying to extract a value from a MapType column of a PySpark DataFrame inside a UDF.
Here is the DataFrame:

+-----------+------------+-------------+
|CUSTOMER_ID|col_a       |col_b        |
+-----------+------------+-------------+
|    100    |{0.0 -> 1.0}| {0.2 -> 1.0}|
|    101    |{0.0 -> 1.0}| {0.2 -> 1.0}|
|    102    |{0.0 -> 1.0}| {0.2 -> 1.0}|
|    103    |{0.0 -> 1.0}| {0.2 -> 1.0}|
|    104    |{0.0 -> 1.0}| {0.2 -> 1.0}|
|    105    |{0.0 -> 1.0}| {0.2 -> 1.0}|
+-----------+------------+-------------+

Here is the UDF:

from pyspark.sql import functions as F, types as T

@F.udf(T.FloatType())
def test(col):
    return col[1]


And here is how it is called:

df_temp = df_temp.withColumn('test', test(F.col('col_a')))


When I pass the col_a column to the UDF, I don't get the value back. Can someone explain why?

093gszye #1

The expression col[1] will successfully extract a value from a map-type column when:

  • col is a column expression
  • 1 is a key that exists in the map

In your case the map has no key equal to 1, which is why it does not work.

from pyspark.sql import functions as F
df = spark.createDataFrame([(100, {0.0: 1.0},)], ['CUSTOMER_ID', 'col_a'])
df.show()
# +-----------+------------+
# |CUSTOMER_ID|       col_a|
# +-----------+------------+
# |        100|{0.0 -> 1.0}|
# +-----------+------------+

df = df.withColumn('col_a_0', F.col('col_a')[0])
df = df.withColumn('col_a_1', F.col('col_a')[1])

df.show()
# +-----------+------------+-------+-------+
# |CUSTOMER_ID|       col_a|col_a_0|col_a_1|
# +-----------+------------+-------+-------+
# |        100|{0.0 -> 1.0}|    1.0|   null|
# +-----------+------------+-------+-------+
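As a follow-up sketch (not part of the original answer), assuming the goal is simply to read out the single value each map holds regardless of its key, the UDF could return the first value of the dict it receives, or the UDF can be skipped entirely with map_values. The names first_map_value, col_a_value and col_a_value_builtin below are illustrative only:

from pyspark.sql import functions as F, types as T

# Sketch only: assumes each map contains exactly one entry.
@F.udf(T.FloatType())
def first_map_value(m):
    # A MapType column arrives inside the UDF as a plain Python dict.
    return next(iter(m.values())) if m else None

df = df.withColumn('col_a_value', first_map_value(F.col('col_a')))

# Built-in alternative without a UDF: map_values returns an array of the
# map's values, and [0] takes the first element.
df = df.withColumn('col_a_value_builtin', F.map_values('col_a')[0])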


c9x0cxw0 #2

To extract the values of a MapType column, use map_values():

from pyspark.sql.functions import array_join, map_values

df_temp.withColumn('col_a_1', array_join(map_values('col_a'), ',')).show()

+-----------+------------+-------+
|CUSTOMER_ID|       col_a|col_a_1|
+-----------+------------+-------+
|        100|{0.0 -> 1.0}|    1.0|
|        101|{0.0 -> 1.0}|    1.0|
|        102|{0.0 -> 1.0}|    1.0|
+-----------+------------+-------+
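Note that array_join produces a string column. If the value is needed with its original numeric type (matching the FloatType the question's UDF declares), one alternative, sketched here under the assumption that each map holds a single entry, is to take the first element of map_values with element_at:

from pyspark.sql import functions as F

# element_at uses 1-based indexing for arrays; this keeps the value's
# original numeric type instead of joining it into a string.
df_temp.withColumn('col_a_1', F.element_at(F.map_values('col_a'), 1)).show()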

