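I have a DataFrame, target_df, with 20 columns. Four of these 20 columns are mandatory: they must have a value, must not be null, and must not be blank (whitespace only). I want to filter the rows in which any of these 4 columns is null or blank.
This can be done with the condition below.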
filtered_df=target_df.filter((trim(target_df['Col 1'])==' ') | (target_df['Col 1'].isNull()) |
(trim(target_df['Col 2'])==' ') | (target_df['Col 2'].isNull()) |
(trim(target_df['Col 3'])==' ') | (target_df['Col 3'].isNull()) |
(trim(target_df['Col 4'])==' ') | (target_df['Col 4'].isNull()))
I want to make this dynamic and generate the condition from a list of columns.
mandatory_col = ['Col 1', 'Col 2', 'Col 3', 'Col 4']
ln = []
for ele in mandatory_col:
    # build the per-column condition as text
    str1 = "(trim(target_df['{}'])==' ') | (target_df['{}'].isNull())".format(ele, ele)
    ln.append(str1)
condition = ' | '.join(ln)
print(condition)
Output:
(trim(target_df['Col 1'])==' ') | (target_df['Col 1'].isNull()) |
(trim(target_df['Col 2'])==' ') | (target_df['Col 2'].isNull()) |
(trim(target_df['Col 3'])==' ') | (target_df['Col 3'].isNull()) |
(trim(target_df['Col 4'])==' ') | (target_df['Col 4'].isNull())
filtered_df=target_df.filter(condition)
When I try to execute the above condition, it throws an error:
ParseException:
mismatched input ')' expecting {'COLLECT', 'CONVERT', 'DELTA', 'HISTORY', 'MATCHED', 'MERGE', 'OPTIMIZE', 'SAMPLE', 'TIMESTAMP', 'UPDATE', 'VERSION',....., IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 77)
I believe the reason is that condition is a string, while DataFrame.filter expects a pyspark.sql.column.Column.
Please suggest how I can execute a string expression.
1 Answer
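Use expr with this string expression. I hope your column names don't contain spaces. Note that what you are asking for depends on your use case. Please read Stack Overflow's guidelines on asking questions.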
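A minimal sketch of the expr approach, under a few assumptions. expr parses Spark SQL, not Python, so the string has to use SQL syntax (OR, IS NULL) rather than target_df['...'] and |. Column names that contain spaces, like the ones in the question, are wrapped in backticks. The helper names quote_identifier, build_missing_condition and filter_missing are made up for illustration. The comparison is against '' rather than ' ', on the assumption that trim reduces a whitespace-only value to an empty string.

```python
def quote_identifier(name):
    """Backtick-quote a column name for Spark SQL, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_missing_condition(columns):
    """Return a SQL predicate that is true when any listed column is NULL or blank."""
    if not columns:
        raise ValueError("at least one column name is required")
    parts = []
    for name in columns:
        col = quote_identifier(name)
        # trim() turns a whitespace-only string into '', so compare against ''.
        parts.append("({c} IS NULL OR trim({c}) = '')".format(c=col))
    return " OR ".join(parts)


def filter_missing(df, columns):
    """Keep rows where any of the given columns is NULL or blank (requires pyspark)."""
    # Imported lazily so the string builder above can be used without Spark installed.
    from pyspark.sql.functions import expr
    return df.filter(expr(build_missing_condition(columns)))


mandatory_col = ["Col 1", "Col 2", "Col 3", "Col 4"]
condition = build_missing_condition(mandatory_col)
print(condition)
```

With a DataFrame such as the question's target_df, the call would be filtered_df = filter_missing(target_df, mandatory_col). DataFrame.filter also accepts a SQL expression string directly, so target_df.filter(condition) should behave the same way. An alternative that avoids strings entirely is to build one Column condition per name and combine them with functools.reduce and the | operator.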