Spark: dropping multiple columns using a regex

wbrvyc0a  posted on 2021-05-27 in Spark

In a Spark (2.3.0) project using Scala, I want to drop multiple columns using a regular expression. I tried using colRegex, but without success:

val df = Seq(("id","a_in","a_out","b_in","b_out"))
  .toDF("id","a_in","a_out","b_in","b_out")

val df_in = df
  .withColumnRenamed("a_in","a")
  .withColumnRenamed("b_in","b")
  .drop(df.colRegex("`.*_(in|out)`"))

// Hoping to get columns Array(id, a, b)
df_in.columns
// Getting Array(id, a, a_out, b, b_out)

On the other hand, the mechanism seems to work with select:

df.select(df.colRegex("`.*_(in|out)`")).columns
// Getting Array(a_in, a_out, b_in, b_out)

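A few things are unclear to me:
What is the backtick syntax in the regex? colRegex returns a Column: how can it actually represent multiple columns in the second example?
Can I combine drop and colRegex, or do I need some workaround?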


mkh04yzy1#

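If you check the Spark source code for the colRegex method, you will see that it expects the regex to be passed in one of the following formats: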

/** the column name pattern in quoted regex without qualifier */
val escapedIdentifier = "`(.+)`".r
/** the column name pattern in quoted regex with qualifier */
val qualifiedEscapedIdentifier = ("(.+)" + """.""" + "`(.+)`").r

Backticks (`) are required to enclose the regex; otherwise the patterns above will not recognize your input as a regex pattern.
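To make the role of the backticks concrete, here is a minimal plain-Scala sketch that reuses the unqualified pattern quoted above. It needs no Spark dependency. The object name BacktickPatternSketch and the helper innerRegex are hypothetical names introduced only for this illustration; they are not part of Spark.

```scala
object BacktickPatternSketch {
  // Same pattern as the unqualified one quoted from the Spark source above.
  val escapedIdentifier = "`(.+)`".r

  // Hypothetical helper: returns the regex found between the backticks,
  // or None when the input is not fully wrapped in backticks.
  def innerRegex(colName: String): Option[String] = colName match {
    case escapedIdentifier(regex) => Some(regex) // extractor requires a full match
    case _                        => None
  }
}
```

With this sketch, BacktickPatternSketch.innerRegex("`.*_(in|out)`") yields the inner pattern, while the same string without backticks yields None, which illustrates why an unquoted pattern is not recognized as a regex.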
As a workaround, you can filter the column names yourself and drop the ones that match, as shown below:

val df = Seq(("id","a_in","a_out","b_in","b_out"))
  .toDF("id","a_in","a_out","b_in","b_out")

// The drop(df.colRegex(...)) call from the question is omitted here: it does not remove anything.
val df_in = df
  .withColumnRenamed("a_in","a")
  .withColumnRenamed("b_in","b")
// collect the unwanted columns: names ending in _in or _out
val junkColumns = df_in.columns.filter(p => p.matches(".*_(in|out)$")).toSeq
// drop every column whose name matched the pattern
val final_df_in = df_in.drop(junkColumns:_*)
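The filtering step can be pulled out into a small reusable function. The sketch below works on plain column names, so it can be checked without a Spark session. The names DropByRegexSketch and matchingColumns are hypothetical, introduced only for this example.

```scala
object DropByRegexSketch {
  // Hypothetical helper: returns the column names that match the regex.
  // String.matches requires the whole name to match, so no anchors are needed.
  def matchingColumns(columns: Seq[String], regex: String): Seq[String] =
    columns.filter(_.matches(regex))
}
```

Assuming a DataFrame such as df_in from the answer above, the result can be passed to the varargs drop overload: df_in.drop(DropByRegexSketch.matchingColumns(df_in.columns.toSeq, ".*_(in|out)"): _*).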

mkshixfv2#

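In addition to the solutions proposed by waqar ahmed and kavetiraviteja (the accepted answer), there is another approach based on select with some negative-regex magic. It is more concise, but harder to read for those who are not regex gurus...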

val df_in = df
  .withColumnRenamed("a_in","a")
  .withColumnRenamed("b_in","b")
  .select(df.colRegex("`^(?!.*_(in|out)$).*$`")) // negative lookahead: keep only names that do not end in _in or _out
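A negative lookahead is easy to get subtly wrong, so it can help to test the pattern on plain strings first. The sketch below does that without Spark. It assumes that Spark compares the pattern against the whole column name and, by default, case-insensitively, which is my reading of the Spark source rather than something stated in this thread. The pattern used here anchors the suffix with $; a variant that ends the group with _ instead would only exclude names containing _in_ or _out_. The names NegativeLookaheadSketch, keepPattern and kept are hypothetical.

```scala
object NegativeLookaheadSketch {
  // Matches any name that does NOT end in _in or _out.
  // The $ is literal here because this is not an interpolated string.
  val keepPattern = "^(?!.*_(in|out)$).*$"

  // Hypothetical helper: the names that a whole-string, case-insensitive
  // match would keep (an assumption about how Spark applies the pattern).
  def kept(columns: Seq[String]): Seq[String] =
    columns.filter(_.matches("(?i)" + keepPattern))
}
```

For the renamed columns in this example, NegativeLookaheadSketch.kept(Seq("id", "a", "a_out", "b", "b_out")) leaves id, a and b.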
