Returning a list of dictionaries from a PySpark UDF

pieyvz9o  posted on 2021-07-14  in Spark

I have a list of dictionaries like this:

department_amount_pairs = [{"department_1": 100},{"department_2": 200},{"department_1": 300}]

This is what I am doing now:

import json

def department_udf(department_amount_pairs):
    # Serialize each dictionary to a JSON string
    pair = []
    for d in department_amount_pairs:
        pair.append(json.dumps(d))
    return pair

This is my UDF definition:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
extractor = udf(department_udf, ArrayType(StringType()))
spark.udf.register("extractor_udf", extractor)

This is how I call the function:

data = data.withColumn('pairs', extractor('department_amount'))

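It returns JSON strings, '[{"department_1": 100}, {"department_2": 200}, {"department_1": 300}]', so I have to call json.loads() to extract the array. What I want is for my UDF to return an array of dictionaries.
I tried dropping json.dumps and appending the dictionaries to the list directly, but then I get no values. I also tried changing the return type to ArrayType(ArrayType()), but that did not work either.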

mrfwxfqh  #1

You can do this by declaring the UDF return type as array<map<string,int>>.
For example:

from pyspark.sql.functions import udf

def department_udf():
    return [{"department_1": 100},{"department_2": 200},{"department_1": 300}]

extractor = udf(department_udf, 'array<map<string,int>>')

df = spark.range(1)

df.withColumn('pairs', extractor()).show(truncate=False)
+---+---------------------------------------------------------------------+
|id |pairs                                                                |
+---+---------------------------------------------------------------------+
|0  |[[department_1 -> 100], [department_2 -> 200], [department_1 -> 300]]|
+---+---------------------------------------------------------------------+
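
To apply the same idea to an existing column, as in the question, here is a minimal sketch that spells out the return type with type objects instead of the DDL string. It assumes the `department_amount` column already holds an array of maps (the question does not state its type), and `build_extractor` is a hypothetical helper name, not part of PySpark. The explicit `int()` cast is there because, as far as I know, a Python UDF yields null when the returned values do not match the declared type, which may be why the attempt without json.dumps produced no values.

```python
from typing import Dict, List, Optional


def department_udf(
    department_amount_pairs: Optional[List[Dict[str, int]]]
) -> List[Dict[str, int]]:
    """Return the pairs as plain dicts with str keys and int values."""
    if department_amount_pairs is None:
        return []
    # Cast keys and values so each dict matches map<string,int>.
    return [
        {str(key): int(value) for key, value in pair.items()}
        for pair in department_amount_pairs
    ]


def build_extractor():
    """Wrap department_udf as a Spark UDF (hypothetical helper; needs pyspark)."""
    # Imported lazily so department_udf can be tested without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType, MapType, StringType

    # Same type as the DDL string 'array<map<string,int>>'.
    return udf(department_udf, ArrayType(MapType(StringType(), IntegerType())))


# Usage, assuming a DataFrame `data` with a `department_amount` column:
#   extractor = build_extractor()
#   data = data.withColumn('pairs', extractor('department_amount'))
```

The type-object form and the DDL string should be interchangeable here; which one to use is mostly a matter of taste.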
