Scrapy: how to extract plain text and tags that are on the same level?

Posted by ev7lccsx on 2022-11-09 in Other


<p>
    A
    <br>
    <br>
    B
    <a ...>
        <span >C</span>
    </a>
    D
    <a ...>
        <span >E</span>
    </a>
    F
</p>

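The result I want is "ABCDEF".
I know that xpath('text()').getall() gives me "A", "B", "D" and "F",
and that xpath('./*') gives me the elements containing "C" and "E",
but that way I lose the correct order of the pieces. What should I do?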

scrapy

Source: https://stackoverflow.com/questions/72881096/how-to-extract-pure-texts-and-a-tags-which-are-on-the-same-level

2 answers


nmpmafwu1#

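The XPath //p//text() selects every text node under <p>, at any depth and in document order, so txt = ''.join([x.get().strip() for x in response.xpath('//p//text()')]) yields "ABCDEF".
Verified in the Scrapy shell: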

In [1]: from scrapy.selector import Selector

In [2]: %paste
html = '''
<p>
    A
    <br>
    <br>
    B
    <a ...>
        <span >C</span>
    </a>
    D
    <a ...>
        <span >E</span>
    </a>
    F
</p>
'''

## -- End pasted text --

In [3]: res= Selector(text=html)

In [4]: res.xpath('//p//text()').getall()
Out[4]: 
['\n    A\n    ',
 '\n    ',       
 '\n    B\n    ',
 '\n        ',   
 'C',
 '\n    ',       
 '\n    D\n    ',
 '\n        ',   
 'E',
 '\n    ',
 '\n    F\n']

In [5]: txt = [ x.get().strip() for x in res.xpath('//p//text()')]

In [6]: txt
Out[6]: ['A', '', 'B', '', 'C', '', 'D', '', 'E', '', 'F']

In [7]: txt = ''.join([ x.get().strip() for x in res.xpath('//p//text()')])

In [8]: txt
Out[8]: 'ABCDEF'

Posted 2022-11-09

yebdmbv42#

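xpath('//p//text()')

or

css('p ::text')

Both should work: the double slash before text() in the XPath, and the space before ::text in the CSS selector, match text nodes at any depth under <p>, not only its direct children. Check this answer for a clearer explanation. Also, if you are using Python, collect the extracted pieces in a list so that their order is preserved.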

Posted 2022-11-09
