在Pyspark时嵌套

anauzrmj 于 2022-12-28 发布在 Spark

关注(0)|答案(1)|浏览(125)

我需要应用大量的when条件，这些条件通过索引从列表中获取输入。我想问一下，是否有一种方法可以高效地编写代码，在不影响运行时效率的情况下生成相同的结果。
下面是我正在使用的代码

df=df.withColumn('date_match_label', F.when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[0]} matches with {date_cols[3]}")
                                  .when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[1]} matches with {date_cols[3]}")
                                  .when(F.col(date_cols[0])==F.col(date_cols[3]), f"{date_cols[1]} matches with {date_cols[3]}")
                                  .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
                                  .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
                                  .when(F.col(date_cols[1])==F.col(date_cols[4]), f"{date_cols[1]} matches with {date_cols[4]}")
                                  .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
                                  .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
                                  .when(F.col(date_cols[2])==F.col(date_cols[5]), f"{date_cols[1]} matches with {date_cols[5]}")
                                  .otherwise('No Match'))

这里date_cols包含六个列名，我需要检查前三列和后三列，如果匹配，就返回注解。
当前方法的问题是随着列表大小的增加，我不得不添加越来越多的行，这使得我的代码容易出错，看起来很难看。我想知道是否有一种方法，我只需要指定列表索引，需要与其他列表元素进行比较。

pyspark

来源：https://stackoverflow.com/questions/74866567/nested-when-in-pyspark

1条答案

按热度按时间

svgewumm1#

考虑到您希望将列表的前半部分（包含列名）与列表的后半部分进行比较，您可以动态构建代码表达式，这样就不需要在每次列表扩展时编写更容易出错的代码。
您可以通过以下方式在索引的帮助下动态构建代码：

from itertools import product
from pyspark.sql.functions import when,col

n=len(cols)
req=list(range(0,n))

res = list(product(req[:n//2], req[n//2:]))

start = '''df.withColumn('date_match_label','''

whens =[]
for i,j in res:
    whens.append(f'''when(col(cols[{i}])==col(cols[{j}]), f"cols[{i}] matches with cols[{j}]")''')
    
final_exp = start + '.'.join(whens) + '''.otherwise('No Match'))'''

这将生成如下所示的最终表达式，考虑到有4列（比较上半部分和下半部分）：

上面是一个字符串表达式，所以，要执行它，可以使用eval函数，得到的结果如下：

df = eval(final_exp)
df.show(truncate=False)

赞(0）回复(0）举报 2022-12-28

我来回答

在Pyspark时嵌套

1条答案

相关问题

热门标签

最新问答