python在pysparkDataframe中执行动态条件

z4bn682m 于 2021-05-18 发布在 Spark

关注(0)|答案(1)|浏览(479)

我有一个Dataframe，目标有20列。在这20列中，有4列是必需的，也就是说，这4列应该有值，并且不应该为null，并且这些列中没有空格。我想为这4列筛选包含null和空格的行。

This can be done with below condition.
filtered_df=target_df.filter((trim(target_df['Col 1'])==' ') | (target_df['Col 1'].isNull()) |
(trim(target_df['Col 2'])==' ') | (target_df['Col 2'].isNull()) |
(trim(target_df['Col 3'])==' ') | (target_df['Col 3'].isNull()) |
(trim(target_df['Col 4'])==' ') | (target_df['Col 4'].isNull()))
I want to make this dynamic and based on list of columns, i want to generate the condition.
mandatory_col=['col 1', 'col 2', 'col 3', 'col 4']
ln=[]
for ele in mandatory_col:
    str1="(trim(target_df['{}'])==' ') | (target_df['{}'].isNull())".format(ele, ele)
    ln.append(str1)
condition=' | '.join(ln)
print(condition):
(trim(target_df['Col 1'])==' ') | (target_df['Col 1'].isNull()) |
(trim(target_df['Col 2'])==' ') | (target_df['Col 2'].isNull()) |
(trim(target_df['Col 3'])==' ') | (target_df['Col 3'].isNull()) |
(trim(target_df['Col 4'])==' ') | (target_df['Col 4'].isNull())
filtered_df=target_df.filter(condition)  
when I try to execute above condition,  it throws error
ParseException: 
mismatched input ')' expecting {'COLLECT', 'CONVERT', 'DELTA', 'HISTORY', 'MATCHED', 'MERGE', 'OPTIMIZE', 'SAMPLE', 'TIMESTAMP', 'UPDATE', 'VERSION',....., IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 77)
The reason being condition is string and df filter takes pyspark.sql.column.Column.
please suggest how can I achieve executing string expression.

python DataFrame apache-spark pyspark

来源：https://stackoverflow.com/questions/64558182/executing-dynamic-condition-in-pyspark-data-frame

1条答案

按热度按时间

iklwldmw1#

请使用expr使用此字符串操作。我希望你的列名中没有空格。请注意，您所要求的是基于用例的。阅读关于stackoverflow的提问指南。

from pyspark.sql.functions import expr
mandatory_col = ['col1', 'col2', 'col3', 'col4']
str1 = ''
for ele in mandatory_col:
    str1 = str1 + '''trim('{}')'''.format(ele) + ''' is not null or ''' + '''trim('{}')'''.format(ele) + ''' = ' '  or '''
print(str1[:-5])
trim('col1') is not null or trim('col1') = ' '  or trim('col2') is not null or trim('col2') = ' '  or trim('col3') is not null or trim('col3') = ' '  or trim('col4') is not null or trim('col4') = ' '
filtered_df = target_df.filter(expr(str1[:-5]))

赞(0）回复(0）举报 2021-05-19

我来回答

python在pysparkDataframe中执行动态条件

1条答案

相关问题

热门标签

最新问答