Python PySpark - Insert a list into a DataFrame cell

irlmq6kh  posted on 2022-12-10  in Python

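I have a dictionary, and I want to add the list of its keys to every cell of one column in a DataFrame. My attempts so far have not worked, and I don't know why.
The dictionary looks like this: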
my_dict = {"A":"1","B":"2","C":"3","D":"4"}
I want to add the keys of my_dict to a DataFrame column, so that the end result looks like this:

+------------+------------+------------+
|       comb1|       comb2|        colA|
+------------+------------+------------+
|          YY|         XX |[A, B, C, D]|
+------------+------------+------------+

The next step is to explode that column, so that the DataFrame looks like this:

+------------+------------+------------+
|       comb1|       comb2|        colA|
+------------+------------+------------+
|          YY|         XX |           A|
+------------+------------+------------+
|          YY|         XX |           B|
+------------+------------+------------+
|          YY|         XX |           C|
+------------+------------+------------+
|          YY|         XX |           D|
+------------+------------+------------+

How can I insert the dictionary keys into every row of a column and then explode it?

yi0zb3m4 #1

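You can create one temporary constant column per dictionary key, collect those columns into an array, and finally explode the array column.
The code is easier to follow than the explanation: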

from pyspark.sql import functions as F

# create temporary constant columns with the keys of the dictionary
for k in my_dict.keys():
    df = df.withColumn(f'_temp_{k}', F.lit(k))

df = (
    df
    # add a column with an array collecting all the keys
    .withColumn('colA', F.array(*[f'_temp_{k}' for k in my_dict.keys()]))
    # drop the temporary columns
    .drop(*[f'_temp_{k}' for k in my_dict.keys()])
    # explode the column with the array
    .withColumn('colA', F.explode(F.col('colA')))
)

The resulting df is:

+-----+-----+----+
|comb1|comb2|colA|
+-----+-----+----+
|   YY|   XX|   A|
|   YY|   XX|   B|
|   YY|   XX|   C|
|   YY|   XX|   D|
+-----+-----+----+
