Scala: run a regex on a DataFrame and store the results in a new DataFrame

0x6upsns  posted on 2021-07-14  in Spark

I have the following DataFrame:

+--------------------------------+
|value                           |
+--------------------------------+
|I am going to school ?          |
|why are you crying ? ?          |
|You are not very good my friend |
+--------------------------------+

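I want to filter the rows that contain emojis and put them into a new DataFrame. I wrote the code below, which converts the DataFrame to a list and then iterates over the list to identify the sentences that contain emojis. However, I don't know how to apply these regular expressions to the DataFrame itself.
Existing code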

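import org.apache.spark.sql.DataFrame

// Collect the "value" column to the driver as a list of strings.
// Returning List[String] (not List[Any]) lets the regex methods accept each element.
def convertDataFrameToList(combinedDataFrame: DataFrame): List[String] =
  combinedDataFrame.select("value").rdd.map(r => r.getString(0)).collect.toList

val listOutput = convertDataFrameToList(myDaframe)
for (element <- listOutput) {
  // Characters from the Emoticons block
  val emojiValues = raw"\p{block=Emoticons}".r.findAllIn(element).toSeq
  // Characters from the Miscellaneous Symbols and Pictographs block
  val y = raw"\p{block=Miscellaneous Symbols and Pictographs}".r.findAllIn(element).toSeq
  val p = emojiValues ++ y

  // process further
}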

Update
I tried the following regular expression:

val emoticonResult = myKafkaDataFrame.filter(regexp_extract(col("value"), raw"([\p{block=Emoticons},\p{block=Miscellaneous Symbols and Pictographs},\uuD83E\uDD00-\uD83E\uDDFF])", 1) =!= "")

As a result, rows that do not contain any emoji are returned along with the rows that do. What is wrong with my code?
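
One possible cause, assuming some rows in the real data contain commas: inside a character class `[...]`, a comma is a literal member rather than a separator, so the pattern above also matches any row that contains `,`. The sketch below illustrates this with plain `java.util.regex`, outside Spark. The object name `CharClassCommaDemo`, the helper `hasMatch`, and the sample strings are made up for illustration.

```scala
import java.util.regex.Pattern

object CharClassCommaDemo {
  // Same two blocks as in the question, with commas between the members.
  // The commas are literal characters that the class will match.
  val withCommas: Pattern = Pattern.compile(
    raw"[\p{block=Emoticons},\p{block=Miscellaneous Symbols and Pictographs}]")

  // Members of a character class are simply written next to each other.
  val withoutCommas: Pattern = Pattern.compile(
    raw"[\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}]")

  // True if the pattern matches anywhere in the string.
  def hasMatch(p: Pattern, s: String): Boolean = p.matcher(s).find()

  def main(args: Array[String]): Unit = {
    val plain = "Hello, world"                  // no emoji, but contains a comma
    val smiling = "I am happy \uD83D\uDE00"     // U+1F600, in the Emoticons block

    println(hasMatch(withCommas, plain))        // matches because of the comma
    println(hasMatch(withoutCommas, plain))     // no match
    println(hasMatch(withoutCommas, smiling))   // matches the emoji
  }
}
```

If the rows that were returned unexpectedly contain no commas, the cause lies elsewhere and this sketch does not apply.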

wnvonmuf #1

You can apply the regular expression with regexp_extract:

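import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._  // enables the $"value" syntax; assumes the SparkSession is named `spark`

// Rows where the capture group matches a character from the Emoticons block
val emojis = df.filter(regexp_extract($"value", raw"(\p{block=Emoticons})", 1) =!= "")
// Rows with no match: regexp_extract returns an empty string
val no_emojis = df.filter(regexp_extract($"value", raw"(\p{block=Emoticons})", 1) === "")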

emojis.show(false)
+--------------------------+
|value                     |
+--------------------------+
|I am going to school ?   |
|why are you crying ? ?  |
+--------------------------+

no_emojis.show(false)
+-------------------------------+
|value                          |
+-------------------------------+
|You are not very good my friend|
+-------------------------------+
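
To cover the additional ranges from the question's update, the same filter can be sketched with `Column.rlike`, which keeps a row when the pattern is found anywhere in the string. This is a minimal sketch, not the answer's original code: the object name `EmojiFilterSketch`, the local SparkSession, and the sample emojis (stand-ins for the characters lost from the tables above) are assumptions. The range U+1F900 to U+1F9FF is written with `\x{...}` code point escapes so that it does not depend on how `\u` escapes are handled in raw strings.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EmojiFilterSketch {
  // One character class with three members and no commas between them:
  // two named Unicode blocks plus an explicit code point range.
  val emojiPattern: String =
    raw"[\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\x{1F900}-\x{1F9FF}]"

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("emoji-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample rows; the emojis are placeholders written as surrogate pairs.
    val df = Seq(
      "I am going to school \uD83D\uDE00",
      "why are you crying \uD83D\uDE22 \uD83D\uDE2D",
      "You are not very good my friend"
    ).toDF("value")

    // rlike uses Java regex "find" semantics, so no capture group is needed.
    val emojis = df.filter(col("value").rlike(emojiPattern))
    val noEmojis = df.filter(!col("value").rlike(emojiPattern))

    emojis.show(false)
    noEmojis.show(false)

    spark.stop()
  }
}
```

Compared with `regexp_extract(...) =!= ""`, `rlike` expresses the yes/no test directly; `regexp_extract` remains the better fit when the matched character itself is needed.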
