Scrapy: remove white spaces and line breaks from extracted text (Python scraping)

Posted by 68bkxrlz on 2023-04-12 in Python

I am having trouble extracting text from a web page. I am using XPath selectors with Scrapy.
The page contains markup like this:

<div class="snippet-content">
    <h2>First Child</h2>
    <p>Hello</p>
    This is large text ..........
</div>

I basically need the text that comes after the 2 direct children. The selector I am using is:

text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get()

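This extracts the right text, but the result contains white spaces, non-breaking spaces (NBSP, written as NBPS in the sample below), and \r\n line-break characters.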

For example, the extracted text looks like this:

"         \r\nRemarks byNBPS Deputy Prime Minister andNBPS Coordinating Minister for Economic Policies Heng Swee Keat at the Opening of the Bilingualism Carnival on 8 April 2023.                                "

Is there a way to get clean text without all the leading and trailing white spaces, line-break characters, and NBSP characters?

scrapy

Source: https://stackoverflow.com/questions/75979362/remove-white-spaces-line-breaks-from-the-extracted-text-python-scraping

1 Answer

xxb16uws1#

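You can use the XPath function normalize-space, but note that it does more than trim the start and end of the string: any run of whitespace inside the string is also collapsed to a single space, wherever it occurs.
Alternatively, you can use Python's str.strip method, which by default (with no arguments) removes only the whitespace characters at the start and end of the string.
Examples: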

text = response.xpath('normalize-space(//div[contains(@class, "snippet-content")]/text()[last()])').get()

text = response.xpath('//div[contains(@class, "snippet-content")]/text()[last()]').get().strip()
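Neither line above is guaranteed to deal with the NBSP characters inside the text: XPath 1.0's normalize-space treats only space, tab, CR and LF as whitespace, and str.strip never touches the middle of a string. A minimal sketch of a post-processing step in plain Python follows. The helper name `clean_text` and the sample string are hypothetical, and it assumes the "NBPS" markers in the question stand for real U+00A0 characters. It relies on str.split() with no arguments splitting on any Unicode whitespace, including U+00A0.

```python
def clean_text(raw):
    """Collapse every run of whitespace (spaces, NBSP, \\r, \\n) to one space and trim both ends."""
    if raw is None:
        # Selector.get() returns None when the XPath matches nothing
        return ""
    # str.split() without arguments splits on any Unicode whitespace,
    # which includes the non-breaking space U+00A0
    return " ".join(raw.split())


# Hypothetical sample modelled on the text in the question,
# assuming the NBPS markers are real U+00A0 characters
raw = "         \r\nRemarks by\xa0Deputy Prime Minister and\xa0Coordinating Minister   \r\n   "

print(repr(raw.strip()))      # ends trimmed, inner \xa0 still present
print(repr(clean_text(raw)))  # inner \xa0 replaced by plain spaces as well
```

In a spider this could be applied as `clean_text(response.xpath(...).get())`, which also avoids the AttributeError that `.get().strip()` raises when the selector matches nothing.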

Answered 2023-04-12
