连接两个Dataframea和b,并将非键列从Dataframeb转换为json字符串

eanckbw9  于 2021-07-14  发布在  Spark
关注(0)|答案(1)|浏览(336)

我有两个Dataframe,如下所示,
Dataframea:

DEPT_ID DEPT_NAME
10      Finance
20      Marketing

Dataframeb:

EMP_ID DEPT_ID EMP_NAME EMP_SALARY
101    10      AAAA     1000
102    20      BBBB     2000
103    10      CCCC     1500
104    20      DDDD     3000

预期结果:在pyspark中,我需要在 DEPT_ID 并将DataFrameB中的非键列转换为\uJSON字符串并存储在 Json_Data Dataframec中的列
Dataframec:

DEPT_ID DEPT_NAME Json_Data
10      Finance   [{"_status": "normal","EmpDetails":{"EMP_ID":"101","EMP_NAME":"AAAA","EMP_SALARY":"1000"},{"EMP_ID":"103","EMP_NAME":"CCCC","EMP_SALARY":"1500" }]
20      Marketing [{"_status": "normal","EmpDetails":{"EMP_ID":"102","EMP_NAME":"BBBB","EMP_SALARY":"2000"},{"EMP_ID":"104","EMP_NAME":"DDDD","EMP_SALARY":"3000" }]
smdncfj3

smdncfj31#

您可以加入并分组 DEPT_ID 以及 DEPT_NAME 将员工详细信息列表收集到结构中。使用 to_json 要获取json字符串:

from pyspark.sql import functions as F

df_c = df_a.join(df_b, ["DEPT_ID"]).groupBy("DEPT_ID", "DEPT_NAME").agg(
    F.to_json(
        F.struct(
            F.lit("normal").alias("_status"),
            F.collect_list(
                F.struct(
                    F.col("EMP_ID"),
                    F.col("EMP_NAME"),
                    F.col("EMP_SALARY")
                )
            ).alias("EmpDetails")
        )
    ).alias("Json_Data")
)

df_c.show(truncate=False)

# +-------+---------+-----------------------------------------------------------------------------------------------------------------------------------------+

# |DEPT_ID|DEPT_NAME|Json_Data                                                                                                                                |

# +-------+---------+-----------------------------------------------------------------------------------------------------------------------------------------+

# |10     |Finance  |{"_status":"normal","EmpDetails":[{"EMP_ID":101,"EMP_NAME":"AAAA","EMP_SALARY":1000},{"EMP_ID":103,"EMP_NAME":"CCCC","EMP_SALARY":1500}]}|

# |20     |Marketing|{"_status":"normal","EmpDetails":[{"EMP_ID":102,"EMP_NAME":"BBBB","EMP_SALARY":2000},{"EMP_ID":104,"EMP_NAME":"DDDD","EMP_SALARY":3000}]}|

# +-------+---------+-----------------------------------------------------------------------------------------------------------------------------------------+

相关问题