I have the following DataFrame:
+----------------------------------+
| value                            |
+----------------------------------+
| I am going to school ?           |
| why are you crying ? ?           |
| You are not very good my friend  |
+----------------------------------+
I want to extract the emoji from each row and put them into a new column of the same DataFrame, like this:
+----------------------------------+-------+
| value                            | emoji |
+----------------------------------+-------+
| I am going to school ?           | ?     |
| why are you crying ? ?           | ? ?   |
+----------------------------------+-------+
I have the code below, which keeps only the rows whose value column contains an emoji.
kafkaTopicDataFrame.filter(regexp_extract(col("value"), raw"([\p{block=Emoticons},\p{block=Miscellaneous Symbols and Pictographs},\uD83E\uDD00-\uD83E\uDDFF])", 1) =!= "")
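But I don't know how to add a new column holding the emoji of each row using Spark with Scala.
Edit 2
To make the emoji column hold an array of the distinct emoji instead, I wrote the following code: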
df.filter(
  regexp_extract(col("value"), raw"(\p{block=Emoticons})", 1) =!= ""
).withColumn(
  "emoji", array(regexp_replace(
    col("value"),
    raw"([^\p{block=Emoticons}|\p{block=Miscellaneous Symbols and Pictographs}|\uD83E\uDD00-\uD83E\uDDFF])",
    ""
  ))
)
Actual output
+----------------------------------+-------+
| value                            | emoji |
+----------------------------------+-------+
| I am going to school ??          | [??]  |
| why are you crying ? ?           | [??]  |
+----------------------------------+-------+
Expected output
+----------------------------------+-------+
| value                            | emoji |
+----------------------------------+-------+
| I am going to school ? ?         | [?]   |
| why are you crying ? ?           | [?,?] |
+----------------------------------+-------+
1 Answer
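You can replace every non-emoji character with an empty string. Note the ^ at the start of the character class in the regex pattern: it makes the class match any character that is not one of those listed.
Edit: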
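A minimal sketch of how both variants could look, assuming Spark 3.x and a local SparkSession. The names EmojiColumns, emojiClass, extractEmoji and emoji_array, as well as the sample rows, are illustrative and not from the original post. The string column uses the negated character class described above; the array column uses a small UDF, which is one possible approach rather than the only one, and it assumes that "distinct" means each emoji should appear once per row. The range \x{1F900}-\x{1F9FF} covers the same code points as the surrogate-pair range in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace, udf}

object EmojiColumns {
  // Body of the character class, shared by both patterns below.
  // Items inside [...] are listed back to back: a ',' or '|' placed
  // between them would be treated as a literal character.
  val emojiClass: String =
    raw"\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\x{1F900}-\x{1F9FF}"

  // Matches exactly one emoji code point from the class above.
  private val emojiPattern = s"[$emojiClass]".r

  // Hypothetical helper: distinct emoji of a string, in order of first appearance.
  def extractEmoji(s: String): Seq[String] =
    if (s == null) Seq.empty
    else emojiPattern.findAllIn(s).toList.distinct

  val extractEmojiUdf = udf(extractEmoji _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("emoji-columns")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows; \uD83D\uDE00 and \uD83D\uDE22 are two emoji
    // from the Emoticons block, written as escapes to avoid encoding issues.
    val df = Seq(
      "I am going to school \uD83D\uDE00",
      "why are you crying \uD83D\uDE22 \uD83D\uDE22",
      "You are not very good my friend"
    ).toDF("value")

    val withEmoji = df
      // String column: delete every character that is NOT in the emoji class.
      .withColumn("emoji", regexp_replace(col("value"), s"[^$emojiClass]", ""))
      // Array column: one element per distinct emoji.
      .withColumn("emoji_array", extractEmojiUdf(col("value")))
      // Keep only the rows that contain at least one emoji.
      .filter(col("emoji") =!= "")

    withEmoji.show(truncate = false)
    spark.stop()
  }
}
```

Wrapping the regexp_replace result in array(...), as in the question, yields a single-element array containing the whole emoji string, which is why the actual output shows all emoji in one element. Collecting the matches one by one avoids that.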