Scrapy: joining XPath text nodes separated by <br> tags

1u4esq0p, posted on 2022-11-09 in Other

I am learning web scraping with Python, XPath, and Scrapy, and I am stuck on the following problem. I would be grateful for any help.
This is the HTML:

<div class="discussionpost">
“This is paragraph one.”
<br>
<br>
“This is paragraph two."'
<br>
<br>
"This is paragraph three.”
</div>

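This is the output I want: "This is paragraph one. This is paragraph two. This is paragraph three." I want to merge all the paragraphs separated by <br>. There are no <p> tags.
However, the output I get is: "This is paragraph one.", "This is paragraph two.", "This is paragraph three."
This is the code I am using: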

sentences = response.xpath('//div[@class="discussionpost"]/text()').extract()

I understand why the code above behaves this way, but I have not been able to change it to do what I need. Any help is much appreciated.

Answer #1 (djp7away)

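To get the value of every text node under the element, including text nested in child elements, use //text() instead of /text(), then join the results into a single string: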

sentences = ' '.join(response.xpath('//div[@class="discussionpost"]//text()').extract()).strip()

Demonstrated in the Scrapy shell with the //text() approach:

>>> from scrapy import Selector
>>> html_doc = '''
... <html>
...  <body>
...   <div class="discussionpost">
...    “This is paragraph one.”
...    <br/>
...    <br/>
...    “This is paragraph two."'
...    <br/>
...    <br/>
...    "This is paragraph three.”
...   </div>
...  </body>
... </html>
...
... '''
>>> res = Selector(text=html_doc)
>>> res
<Selector xpath=None data='<html>\n <body>\n  <div class="discussi...'>
>>> sentences = ''.join(res.xpath('//div[@class="discussionpost"]//text()').extract())
>>> sentences
'\n   “This is paragraph one.”\n   \n   \n   “This is paragraph two."\'\n   \n   \n   "This is paragraph three.”\n  '
>>> txt = sentences
>>> txt
'\n   “This is paragraph one.”\n   \n   \n   “This is paragraph two."\'\n   \n   \n   "This is paragraph three.”\n  '
>>> txt = sentences.replace('\n','').replace("\'",'').replace('    ','').replace("“",'').replace('”','').replace('"','').strip()
>>> txt
'This is paragraph one. This is paragraph two. This is paragraph three.'
>>>
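The chain of replace() calls in the session works for this input because the indentation happens to collapse to single spaces. A minimal sketch of a less input-dependent way to collapse whitespace follows. The helper name `join_text_nodes` and the sample list are hypothetical, and the sample only imitates the shape of what .extract() returns (the quote characters are left out for brevity).

```python
def join_text_nodes(fragments):
    """Join text fragments into one string, collapsing all whitespace runs."""
    # str.split() with no arguments splits on any run of whitespace and drops
    # empty strings, so the newlines and indentation around <br> tags vanish.
    words = " ".join(fragments).split()
    return " ".join(words)


# Hypothetical sample shaped like the list returned by .extract() above.
fragments = [
    "\n   This is paragraph one.\n   ",
    "\n   ",
    "\n   This is paragraph two.\n   ",
    "\n   ",
    "\n   This is paragraph three.\n  ",
]

text = join_text_nodes(fragments)
print(text)
```

This does not remove quotation marks; if those need to go as well, a separate replace() or str.translate() step would still be required.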

Update:

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ibsgroup.org/threads/hemorrhoids-as-cause-of-pain.363290/']

    def parse(self, response):
        # Each element with class "bbWrapper" is treated as one comment.
        for p in response.xpath('//*[@class="bbWrapper"]'):
            yield {
                # Join every descendant text node of the element into one string.
                'comment': ''.join(p.xpath('.//text()').getall()).strip(),
            }
