PySpark: "You're trying to access a column, but multiple columns have that name"

h7appiyu  posted on 2023-03-01  in Spark

I am trying to join two DataFrames that both contain the following named columns. What is the best way to perform a left outer join?

df = df.join(df_forecast, ["D_ACCOUNTS_ID", "D_APPS_ID", "D_CONTENT_PAGE_ID"], 'left')

Currently, I get this error:

You're trying to access a column, but multiple columns have that name.

What am I missing?

e4eetjau  (answer #1)

import pyspark.sql.functions as f

join_keys = ["D_ACCOUNTS_ID", "D_APPS_ID", "D_CONTENT_PAGE_ID"]

df = (
    df
    .join(df_forecast, join_keys, 'left')
    .select(
        *join_keys,
        # selecting columns from left side of the join that are not in the join keys.
        *[df[element].alias('df_'+element) for element in df.columns if element not in join_keys],
        # selecting columns from right side of the join that are not in the join keys.
        *[df_forecast[element].alias('df_forecast_'+element) for element in df_forecast.columns if element not in join_keys]
    )
)
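The error most likely means that, besides the three join keys, the two DataFrames share at least one other column name, so the joined result contains two columns with that name. The select above avoids this by giving every non-key column a side-specific alias. As a rough sketch of the same idea, the pure-Python helpers below find the clashing names and build the alias plan without needing a Spark session. The helper names (`overlapping_columns`, `plan_prefixed_columns`) and the sample columns `VIEWS`, `DATE`, and `FORECAST` are hypothetical and not taken from the question.

```python
from typing import List, Sequence, Tuple


def overlapping_columns(
    left_cols: Sequence[str], right_cols: Sequence[str], join_keys: Sequence[str]
) -> List[str]:
    """Non-key column names that appear on both sides of the join."""
    keys = set(join_keys)
    right = set(right_cols)
    return [c for c in left_cols if c in right and c not in keys]


def plan_prefixed_columns(
    left_cols: Sequence[str],
    right_cols: Sequence[str],
    join_keys: Sequence[str],
    left_prefix: str = "df_",
    right_prefix: str = "df_forecast_",
) -> List[Tuple[str, str, str]]:
    """Return (side, original_name, output_name) triples for the final select."""
    keys = set(join_keys)
    plan = [("key", k, k) for k in join_keys]
    plan += [("left", c, left_prefix + c) for c in left_cols if c not in keys]
    plan += [("right", c, right_prefix + c) for c in right_cols if c not in keys]
    return plan


join_keys = ["D_ACCOUNTS_ID", "D_APPS_ID", "D_CONTENT_PAGE_ID"]
# Hypothetical column lists standing in for df.columns and df_forecast.columns.
left_cols = join_keys + ["VIEWS", "DATE"]
right_cols = join_keys + ["VIEWS", "FORECAST"]

clashes = overlapping_columns(left_cols, right_cols, join_keys)
plan = plan_prefixed_columns(left_cols, right_cols, join_keys)
output_names = [out for _, _, out in plan]

# Assumed PySpark usage (not executed here):
# sides = {"left": df, "right": df_forecast}
# joined = df.join(df_forecast, join_keys, "left").select(
#     *[name if side == "key" else sides[side][name].alias(out)
#       for side, name, out in plan]
# )
```

With these sample lists, `clashes` contains only `VIEWS`, and every name in `output_names` is unique, which is the property the aliased select relies on. If only a few columns clash, renaming just those on one side before the join may be a lighter alternative to prefixing everything.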
