Extract emojis from a DataFrame column and add them to a different column of the same DataFrame (Scala Spark)

0s0u357o posted on 2021-07-09 in Spark

I have the following DataFrame:

+-------------------------------+
|value                          |
+-------------------------------+
|I am going to school ?         |
|why are you crying ? ?         |
|You are not very good my friend|
+-------------------------------+

I want to extract the emojis from each row and insert them into a new column of the same DataFrame, like this:

+----------------------+-----+
|value                 |emoji|
+----------------------+-----+
|I am going to school ?|?    |
|why are you crying ? ?|? ?  |
+----------------------+-----+

I have the following code, which filters the rows whose value column contains emoticons:

kafkaTopicDataFrame.filter(regexp_extract(col("value"), raw"([\p{block=Emoticons},\p{block=Miscellaneous Symbols and Pictographs},\uD83E\uDD00-\uD83E\uDDFF])", 1) =!= "")

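But I don't know how to add a new column containing the emoticons for the corresponding rows using Spark with Scala.
Edit 2
I want the emoji column to contain an array of the distinct emojis, so I wrote the following code: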

df.filter(
      regexp_extract(col("value"), raw"(\p{block=Emoticons})", 1) =!= ""
    ).withColumn(
      "emoji", array(regexp_replace(
        col("value"),raw"([^\p{block=Emoticons}|\p{block=Miscellaneous Symbols and Pictographs}|\uD83E\uDD00-\uD83E\uDDFF])",
        ""
      ))

    )

Actual output

+-----------------------+-----+
|value                  |emoji|
+-----------------------+-----+
|I am going to school ??|[??] |
|why are you crying ? ? |[??] |
+-----------------------+-----+

Expected output

+------------------------+-----+
|value                   |emoji|
+------------------------+-----+
|I am going to school ? ?|[?]  |
|why are you crying ? ?  |[?,?]|
+------------------------+-----+

xkftehaa1#

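You can replace the non-emoji characters with an empty string. Note the ^ at the start of the character class in the regex pattern: it negates the class, so the pattern matches every character that is not one of those specified.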

val df2 = df.filter(
    regexp_extract($"value", raw"(\p{block=Emoticons})", 1) =!= ""
).withColumn(
    "emoji", 
    regexp_replace(
        col("value"), 
        raw"([^\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\uD83E\uDD00-\uD83E\uDDFF])", 
        ""
    )
)

df2.show(false)
+-------------------------------+-----+
|value                          |emoji|
+-------------------------------+-----+
|I am going to school ?        |?   |
|why are you crying ? ?       |?? |
+-------------------------------+-----+
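
The effect of the negated character class can be checked outside Spark with plain string replacement. The following is a minimal sketch, not part of the original answer: `EmojiRegexDemo` and `onlyEmoji` are hypothetical names, and the range `\x{1F900}-\x{1F9FF}` is the code point form of the surrogate-pair range `\uD83E\uDD00-\uD83E\uDDFF` used above.

```scala
object EmojiRegexDemo {
  // Negated character class: matches any code point that is NOT in one of the
  // listed emoji ranges. The name and the \x{...} notation are assumptions
  // made for this sketch; the blocks themselves come from the answer.
  val nonEmoji: String =
    """[^\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\x{1F900}-\x{1F9FF}]"""

  // Deleting everything matched by the negated class leaves only the emojis.
  def onlyEmoji(s: String): String = s.replaceAll(nonEmoji, "")
}
```

Calling `EmojiRegexDemo.onlyEmoji` on a sentence with no emoji returns an empty string, which is why the Spark version filters such rows out first.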

Edit:

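val df2 = df.filter(
    // keep only rows that contain at least one character from the Emoticons block
    regexp_extract(col("value"), raw"(\p{block=Emoticons})", 1) =!= ""
).withColumn(
    "emoji", 
    // step 1: delete every character that is not an emoji
    regexp_replace(
        col("value"),
        raw"([^\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\uD83E\uDD00-\uD83E\uDDFF])",
        ""
    )
).withColumn(
    "emoji", 
    // step 2: append a space after each emoji so they can be split apart
    regexp_replace(
        col("emoji"),
        raw"([\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\uD83E\uDD00-\uD83E\uDDFF])", 
        "$1 "
    )
).withColumn(
    "emoji", 
    // step 3: drop the trailing space and split on spaces into an array
    split(trim(col("emoji")), " ")
)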

df2.show(false)
+------------------------+--------+
|value                   |emoji   |
+------------------------+--------+
|I am going to school ? |[?]    |
|why are you crying ? ?|[?, ?]|
+------------------------+--------+
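
The replace-and-split chain keeps repeated emojis, whereas the question asks for distinct ones. As a possible alternative, the emojis could be collected with a plain `java.util.regex` matcher and de-duplicated. This is a minimal sketch, not part of the original answer: `EmojiExtractor`, `extract` and `extractDistinct` are hypothetical names, and `\x{1F900}-\x{1F9FF}` is the code point form of the surrogate-pair range `\uD83E\uDD00-\uD83E\uDDFF` used above.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ArrayBuffer

object EmojiExtractor {
  // Positive character class: one match per emoji code point.
  private val emoji: Pattern = Pattern.compile(
    """[\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\x{1F900}-\x{1F9FF}]""")

  // All emojis in order of appearance; null input yields an empty list.
  def extract(s: String): Seq[String] = {
    if (s == null) Nil
    else {
      val m = emoji.matcher(s)
      val found = ArrayBuffer.empty[String]
      while (m.find()) found += m.group()
      found.toList
    }
  }

  // Same as extract, but each emoji is kept only once (first occurrence wins).
  def extractDistinct(s: String): Seq[String] = extract(s).distinct
}
```

In Spark, a function like this could presumably be wrapped with `udf` from `org.apache.spark.sql.functions` and applied through `withColumn`. Alternatively, on Spark 2.4 or later, `array_distinct` can be applied to the array column produced by `split` above.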
