我必须对pysparkDataframe中的多列应用某些函数。下面是我的代码:
finaldf=df.withColumn('phone_number',regexp_replace("phone_number","[^0-9]",""))\
.withColumn('account_id',regexp_replace("account_id","[^0-9]",""))\
.withColumn('credit_card_limit',regexp_replace("credit_card_limit","[^0-9]",""))\
.withColumn('credit_card_number',regexp_replace("credit_card_number","[^0-9]",""))\
.withColumn('full_name',regexp_replace("full_name","[^a-zA-Z ]",""))\
.withColumn('transaction_code',regexp_replace("transaction_code","[^a-zA-Z]",""))\
.withColumn('shop',regexp_replace("shop","[^a-zA-Z ]",""))
finaldf=finaldf.filter(finaldf.account_id.isNotNull())\
.filter(finaldf.phone_number.isNotNull())\
.filter(finaldf.credit_card_number.isNotNull())\
.filter(finaldf.credit_card_limit.isNotNull())\
.filter(finaldf.transaction_code.isNotNull())\
.filter(finaldf.amount.isNotNull())
从代码中你可以看到有多余的代码我已经写了扩展程序的长度也。我还了解到spark-udf效率不高。
有没有办法优化这个代码?请告诉我。谢谢!
1条答案
按热度按时间ergxz8rk1#
为了
multiple filters
,你应该这样做。