Extract key-value pairs from a DataFrame in PySpark

f5emj3cl  posted on 2022-11-01  in Spark

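I have the DataFrame below, which I read from a JSON file.

| 1 | 2 | 3 | 4 |
| --- | --- | --- | --- |
| {"todo":["wakeup","shower"]} | {"todo":["brush","eat"]} | {"todo":["read","write"]} | {"todo":["sleep","snooze"]} |

I need my output as keys and values, as shown below. How can I do this? Do I need to create a schema?

| ID | todo |
| --- | --- |
| 1 | wakeup, shower |
| 2 | brush, eat |
| 3 | read, write |
| 4 | sleep, snooze |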

flvlnr44  1#

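What you refer to as key-value is a struct. The "keys" are the struct field names, and the "values" are the field values.
What you want to do is called unpivoting. One way to do it in PySpark is with `stack`. The approach below is dynamic: you don't need to type out the existing column names.
Input DataFrame: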

df = spark.createDataFrame(
    [((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
    '`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')

Script:

# Build "'<col>', `<col>`.todo" pairs so that stack() emits one (ID, todo) row per input column
to_melt = [f"'{c}', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")

df.show()

# +---+----------------+
# | ID|            todo|
# +---+----------------+
# |  1|[wakeup, shower]|
# |  2|    [brush, eat]|
# |  3|   [read, write]|
# |  4| [sleep, snooze]|
# +---+----------------+
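
To make the dynamically generated expression easier to inspect, here is a minimal sketch that only assembles the `stack` expression string, without touching Spark. `build_stack_expr` and its `join_with` parameter are hypothetical names introduced for illustration; they are not part of PySpark. The optional `join_with` wraps each array in Spark SQL's `array_join`, on the assumption that you want a comma-separated string (as in the question's desired output) rather than an array.

```python
def build_stack_expr(columns, field="todo", join_with=None):
    """Assemble a Spark SQL stack() expression that unpivots struct columns.

    Hypothetical helper (not a PySpark API): it only builds a string
    that can then be passed to df.selectExpr(...).
    """
    pairs = []
    for c in columns:
        # Backticks quote column names such as `1` that are not plain identifiers
        value = f"`{c}`.{field}"
        if join_with is not None:
            # array_join is a Spark SQL built-in that concatenates array elements
            value = f"array_join({value}, '{join_with}')"
        # Each pair contributes one output row: (column name as literal, field value)
        pairs.append(f"'{c}', {value}")
    return f"stack({len(pairs)}, {', '.join(pairs)}) as (ID, {field})"


print(build_stack_expr(["1", "2"]))
# stack(2, '1', `1`.todo, '2', `2`.todo) as (ID, todo)
```

With the input DataFrame above, something like `df.selectExpr(build_stack_expr(df.columns, join_with=", "))` should then give `todo` as a single string per row instead of an array.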

de90aj5v  2#

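Use `from_json` to parse each JSON string into a map, collect the map values into a single array, then `explode` it so that each element gets its own row.
Data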

df = spark.createDataFrame(
    [(('{"todo":"[wakeup, shower]"}'),('{"todo":"[brush, eat]"}'),('{"todo":"[read, write]"}'),('{"todo":"[sleep, snooze]"}'))],
    ('value1','values2','value3','value4'))

Code

from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate

new = (df.withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))  # from string to map values to individual rows
   .withColumn('todo', translate('todo', "[]", ''))  # remove the square brackets
)
new.show(truncate=False)

Result

+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1                     |values2                |value3                  |value4                    |todo          |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat    |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write   |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
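
To see what each step of that chain contributes, the same transformation can be traced on a single row in plain Python. This is only an illustrative sketch of the logic, not how Spark executes it; the variable names are made up for the example.

```python
import json

# One row of the sample data: each cell is a JSON string
row = [
    '{"todo":"[wakeup, shower]"}',
    '{"todo":"[brush, eat]"}',
    '{"todo":"[read, write]"}',
    '{"todo":"[sleep, snooze]"}',
]

# from_json(..., "MAP<STRING,STRING>") + map_values: parse each cell, keep only the values
values_per_column = [list(json.loads(cell).values()) for cell in row]

# array + flatten: merge the per-column value lists into one list
flat = [v for values in values_per_column for v in values]

# explode + translate(..., "[]", ""): one output row per element, brackets deleted
todos = [v.translate(str.maketrans("", "", "[]")) for v in flat]

print(todos)
# ['wakeup, shower', 'brush, eat', 'read, write', 'sleep, snooze']
```

Note that this approach does not carry over which source column each value came from, so the result has no ID column.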
