PySpark DataFrame: remove some whole words from a column, case-insensitively

juud5qan, posted 2021-05-27 in Spark


I am trying to remove some whole words from a column of a PySpark DataFrame, ignoring case.

import re
s = "I like the book. i'v seen it. Iv've" # add a new phrase
exclude_words = ["I", "I\'v", "I\'ve"]

exclude_words_re = re.compile(r"\b(" + r"|".join(exclude_words) +r")\b|\s", re.I|re.M)
exclude_words_re.sub("" , s)

I added the phrase "Iv've" to the test string, but I got:

'like the book. seen it.'

"Iv've" should not be removed, because it does not match any of the whole words in the exclusion list.

python DataFrame apache-spark pyspark regex

Source: https://stackoverflow.com/questions/63887272/pyspark-dataframe-remove-some-whole-words-but-case-insensitive-in-a-column

1 Answer


l2osamch1#

Two changes are needed in your code:
Use the appropriate regex flag to ignore case.
Add \b so that only whole words are matched.

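import re

s = "I like the book. i'v seen it. Iv've I've"
exclude_words = ["I", "I'v", "I've"]

# (^|\b) anchors the start of a word; (\s|$) requires whitespace or end of line
# right after it, so "Iv've" is left alone. re.I makes the match case-insensitive.
exclude_words_re = re.compile(r"(^|\b)((" + r"|".join(exclude_words) + r"))(\s|$)", re.I | re.M)
exclude_words_re.sub("", s)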

"like the book. seen it. Iv've "

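One limitation of the pattern above is that an excluded word followed by punctuation (for example "I," or "I.") is not matched, and a trailing space can be left behind. A possible variant, sketched here under assumptions, uses lookarounds that treat apostrophes as part of a word and then collapses leftover whitespace. The helper names `build_exclude_pattern` and `remove_words` are hypothetical and not part of the original answer.

```python
import re


def build_exclude_pattern(words):
    """Return a regex source string matching any of `words` as a whole token."""
    # Try longer alternatives first so "I've" wins over "I'v" and "I".
    alternatives = "|".join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)
    )
    # (?i) turns on case-insensitive matching. The lookarounds reject a match
    # that is directly preceded or followed by a word character or an
    # apostrophe, so "Iv've" and the "i" in "it" are not touched.
    return r"(?i)(?<![\w'])(?:" + alternatives + r")(?![\w'])"


def remove_words(text, words):
    """Remove the excluded words, then tidy up the whitespace they leave."""
    cleaned = re.sub(build_exclude_pattern(words), "", text)
    return re.sub(r"\s+", " ", cleaned).strip()


s = "I like the book. i'v seen it. Iv've I've"
print(remove_words(s, ["I", "I'v", "I've"]))
# prints: like the book. seen it. Iv've
```

Because the case flag is embedded in the pattern string as `(?i)`, the same string could probably be handed to `pyspark.sql.functions.regexp_replace`, for example `df.withColumn("text", F.regexp_replace("text", pattern, ""))`, where `text` is an assumed column name and `F` is the usual alias for `pyspark.sql.functions`. Spark evaluates the pattern with Java regular expressions, which also support `(?i)` and lookarounds, but treat that part as an untested assumption.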
2021-05-27
