How do I convert Unicode text to text that Python can read, so that I can find a specific word in web scraping results?

Posted by yjghlzjz on 2023-01-22 in Python


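I'm trying to scrape text from Instagram and check whether certain keywords appear in a user's bio. Because users write their bios in special fonts, I can't match specific words. How can I remove the font or formatting from the text so that I can search for the word?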

import re
test="𝙄𝙣𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙛𝙪𝙩𝙪𝙧𝙚 𝙩𝙝𝙚𝙣 𝙚𝙭𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙥𝙖𝙨𝙩. "

x = re.findall(re.compile('past'), test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

This prints TEXT NOT FOUND.
Another example:

import re
test="ғʀᴇᴇʟᴀɴᴄᴇ ɢʀᴀᴘʜɪᴄ ᴅᴇsɪɢɴᴇʀ"
test=test.lower()

x = re.findall(re.compile('graphic'), test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

This also prints TEXT NOT FOUND.

python-3.x

Source: https://stackoverflow.com/questions/71163967/how-do-i-convert-a-unicode-text-to-a-text-that-python-can-read-so-that-i-could-f

2 Answers


jecbmhm3 (answer #1)

You can use unicodedata.normalize to return the normal form of a Unicode string. See the following snippet for an example:

import re
import unicodedata

test="𝙄𝙣𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙛𝙪𝙩𝙪𝙧𝙚 𝙩𝙝𝙚𝙣 𝙚𝙭𝙝𝙖𝙡𝙚 𝙩𝙝𝙚 𝙥𝙖𝙨𝙩. "
 
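# NFKD replaces compatibility characters (such as these styled letters) with
# their plain equivalents; encoding to ASCII with 'ignore' drops the rest.
formatted_test = unicodedata.normalize('NFKD', test).encode('ascii', 'ignore').decode('ascii')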

x = re.findall(re.compile('past'), formatted_test)
if x:    
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")

The output will be:
TEXT FOUND
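
Note that the second string from the question is probably not fixed by normalization alone. Its letters are small capitals from the IPA and phonetic extension blocks, plus a Cyrillic ғ used as a look-alike, and as far as I know these have no compatibility decomposition, so NFKD leaves them unchanged and the ASCII encode then drops them. One possible workaround is to translate such characters by hand first. The sketch below assumes a hand-written table covering only the letters in that example; SMALL_CAPS and to_plain are illustrative names, not part of unicodedata or any other library.

```python
import unicodedata

# Hypothetical, deliberately incomplete table: look-alike character -> ASCII.
# Extend it with whatever characters show up in your own data.
SMALL_CAPS = str.maketrans({
    "\u1d00": "a",  # LATIN LETTER SMALL CAPITAL A
    "\u1d04": "c",  # LATIN LETTER SMALL CAPITAL C
    "\u1d05": "d",  # LATIN LETTER SMALL CAPITAL D
    "\u1d07": "e",  # LATIN LETTER SMALL CAPITAL E
    "\u0493": "f",  # CYRILLIC SMALL LETTER GHE WITH STROKE (look-alike)
    "\u0262": "g",  # LATIN LETTER SMALL CAPITAL G
    "\u029c": "h",  # LATIN LETTER SMALL CAPITAL H
    "\u026a": "i",  # LATIN LETTER SMALL CAPITAL I
    "\u029f": "l",  # LATIN LETTER SMALL CAPITAL L
    "\u0274": "n",  # LATIN LETTER SMALL CAPITAL N
    "\u1d18": "p",  # LATIN LETTER SMALL CAPITAL P
    "\u0280": "r",  # LATIN LETTER SMALL CAPITAL R
})


def to_plain(text):
    """Map known look-alikes to ASCII, then apply the NFKD + ASCII step."""
    text = text.translate(SMALL_CAPS)
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii").lower()


# "graphic designer" written in small capitals, with a plain ASCII "s".
bio = ("\u0262\u0280\u1d00\u1d18\u029c\u026a\u1d04 "
       "\u1d05\u1d07s\u026a\u0262\u0274\u1d07\u0280")

if "graphic" in to_plain(bio):
    print("TEXT FOUND")
else:
    print("TEXT NOT FOUND")
```

A plain `in` test is used here instead of re.findall because the keyword is a fixed string. Any character missing from the table is still silently dropped by the ASCII step.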

Answered 2023-01-22

guykilcj (answer #2)

Be careful if you are working with Portuguese text. If you have:

string = """𝓿𝓲𝓫𝓻𝓪𝓷𝓽𝓮𝓼 orçamento"""

and you use:

unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8')

you will lose the cedilla (ç): orçamento becomes orcamento.
If you instead use:

unicodedata.normalize('NFKC', string)

the cedilla is kept.
Note that I changed NFKD to NFKC and removed the encode and decode steps.
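
To see the difference side by side, here is a minimal sketch that runs both variants on the same input. The helper names (fancy, strip_to_ascii, keep_accents) are my own and only serve the illustration; fancy builds the styled test word from code points so the example does not depend on how the special characters survive copy and paste.

```python
import unicodedata


def fancy(word, base=0x1D4EA):
    """Shift a-z into a styled alphabet; 0x1D4EA is MATHEMATICAL BOLD SCRIPT SMALL A."""
    return "".join(chr(base + ord(c) - ord("a")) for c in word)


def strip_to_ascii(text):
    """NFKD, then drop everything that is not ASCII (accents are lost)."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def keep_accents(text):
    """NFKC only: styled letters become plain, accented letters stay composed."""
    return unicodedata.normalize("NFKC", text)


bio = fancy("vibrantes") + " orçamento"

print(strip_to_ascii(bio))  # vibrantes orcamento
print(keep_accents(bio))    # vibrantes orçamento
```

Which variant fits depends on the keywords: if they are stored without accents, the ASCII version makes matching simpler; if they contain accented letters, NFKC keeps text and keywords comparable.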

Answered 2023-01-22
