python在pysparkDataframe中执行动态条件

z4bn682m  于 2021-05-18  发布在  Spark
关注(0)|答案(1)|浏览(479)

我有一个Dataframe,目标有20列。在这20列中,有4列是必需的,也就是说,这4列应该有值,并且不应该为null,并且这些列中没有空格。我想为这4列筛选包含null和空格的行。

  1. This can be done with below condition.
  2. filtered_df=target_df.filter((trim(target_df['Col 1'])==' ') | (target_df['Col 1'].isNull()) |
  3. (trim(target_df['Col 2'])==' ') | (target_df['Col 2'].isNull()) |
  4. (trim(target_df['Col 3'])==' ') | (target_df['Col 3'].isNull()) |
  5. (trim(target_df['Col 4'])==' ') | (target_df['Col 4'].isNull()))
  6. I want to make this dynamic and based on list of columns, i want to generate the condition.
  7. mandatory_col=['col 1', 'col 2', 'col 3', 'col 4']
  8. ln=[]
  9. for ele in mandatory_col:
  10. str1="(trim(target_df['{}'])==' ') | (target_df['{}'].isNull())".format(ele, ele)
  11. ln.append(str1)
  12. condition=' | '.join(ln)
  13. print(condition):
  14. (trim(target_df['Col 1'])==' ') | (target_df['Col 1'].isNull()) |
  15. (trim(target_df['Col 2'])==' ') | (target_df['Col 2'].isNull()) |
  16. (trim(target_df['Col 3'])==' ') | (target_df['Col 3'].isNull()) |
  17. (trim(target_df['Col 4'])==' ') | (target_df['Col 4'].isNull())
  18. filtered_df=target_df.filter(condition)
  19. when I try to execute above condition, it throws error
  20. ParseException:
  21. mismatched input ')' expecting {'COLLECT', 'CONVERT', 'DELTA', 'HISTORY', 'MATCHED', 'MERGE', 'OPTIMIZE', 'SAMPLE', 'TIMESTAMP', 'UPDATE', 'VERSION',....., IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 77)
  22. The reason being condition is string and df filter takes pyspark.sql.column.Column.
  23. please suggest how can I achieve executing string expression.
iklwldmw

iklwldmw1#

请使用expr使用此字符串操作。我希望你的列名中没有空格。请注意,您所要求的是基于用例的。阅读关于stackoverflow的提问指南。

  1. from pyspark.sql.functions import expr
  2. mandatory_col = ['col1', 'col2', 'col3', 'col4']
  3. str1 = ''
  4. for ele in mandatory_col:
  5. str1 = str1 + '''trim('{}')'''.format(ele) + ''' is not null or ''' + '''trim('{}')'''.format(ele) + ''' = ' ' or '''
  6. print(str1[:-5])
  7. trim('col1') is not null or trim('col1') = ' ' or trim('col2') is not null or trim('col2') = ' ' or trim('col3') is not null or trim('col3') = ' ' or trim('col4') is not null or trim('col4') = ' '
  8. filtered_df = target_df.filter(expr(str1[:-5]))

相关问题