Pyspark - insert a list into a dataframe cell

irlmq6kh · posted 2022-12-10 in Python

I have a dictionary, and I want to add the list of its keys to every cell of one column in a Dataframe. So far my attempts have not worked, and I don't understand why.
The dictionary looks like this:
my_dict = {"A":"1","B":"2","C":"3","D":"4"}
I want to add the keys of my_dict as a Dataframe column, so the end result looks like this:

+------------+------------+------------+
|       comb1|       comb2|        colA|
+------------+------------+------------+
|          YY|          XX|[A, B, C, D]|
+------------+------------+------------+

The next step is to explode that column, so the Dataframe looks like this:

+------------+------------+------------+
|       comb1|       comb2|        colA|
+------------+------------+------------+
|          YY|          XX|           A|
|          YY|          XX|           B|
|          YY|          XX|           C|
|          YY|          XX|           D|
+------------+------------+------------+

How can I insert the dictionary keys into every row of the column and then explode it?

yi0zb3m4

You can create some additional constant columns from the dictionary's keys, build an array out of them, and finally explode that column.
The code is easier than the explanation:

from pyspark.sql import functions as F

# create temporary constant columns with the keys of the dictionary
for k in my_dict.keys():
    df = df.withColumn(f'_temp_{k}', F.lit(k))

df = (
    df
    # add a column with an array collecting all the keys
    .withColumn('colA', F.array(*[f'_temp_{k}' for k in my_dict.keys()]))
    # drop the temporary columns
    .drop(*[f'_temp_{k}' for k in my_dict.keys()])
    # explode the column with the array
    .withColumn('colA', F.explode(F.col('colA')))
)

The resulting df is:

+-----+-----+----+
|comb1|comb2|colA|
+-----+-----+----+
|   YY|   XX|   A|
|   YY|   XX|   B|
|   YY|   XX|   C|
|   YY|   XX|   D|
+-----+-----+----+
