Scrapy: select all text nodes inside an element, excluding text inside its child elements

weylhg0b  posted on 2022-11-09  in  Other

While scraping a site, I have HTML like the following:

<div class="classA classB classC">
  <div class="classD classE">
    <h1 class="classF classD">Text I don't want</h1>
    <ul>....</ul> <!-- containing more text in nested children, don't want -->
  </div>
  Text I want to grab.
  <br>
  More text I want to grab
</div>

How can I select only the text I want to grab, e.g. ["Text I want to grab", "More text I want to grab"], and avoid selecting "Text I don't want"? I tried a CSS selector like this:

text = response.css('.classA:not(.classD) *::text').getall()

Does anyone know how to do this in this case? I am not familiar with XPath, but if there is a solution that uses it, please suggest it.

gjmwrych #1

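You are close to your goal. Using :not to keep out <h1 class="classF classD">Text I don't want</h1> is the right idea, but you first have to select the whole part of the HTML that contains the output you want, which means selecting <div class="classA classB classC"> first and then excluding whatever you don't want. Note that ::text applied directly to an element returns only that element's own direct text nodes, whereas *::text also descends into its children, which is why the nested text shows up in your attempt. So the CSS expression should look like this: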

response.css('div.classA.classB.classC:not(.classF)::text').getall()

' '.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])

Verified in the scrapy shell:

In [1]: from scrapy.selector import Selector

In [2]: %paste

html='''
<div class="classA classB classC">
  <div class="classD classE">
    <h1 class="classF classD">Text I don't want</h1>
    <ul>....</ul> <!-- containing more text in nested children, don't want -->
  </div>
  Text I want to grab.
  <br>
  More text I want to grab
</div>
'''

## -- End pasted text --

In [3]: resp=Selector(text=html)

In [4]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).strip()
Out[4]: 'Text I want to grab.\n  \n  More text I want to grab'

In [5]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).replace('\n','').strip()
Out[5]: 'Text I want to grab.    More text I want to grab'

In [6]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).strip().replace('\n','').strip()
Out[6]: 'Text I want to grab.    More text I want to grab'

In [7]: [x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()]
Out[7]: ['', 'Text I want to grab.', 'More text I want to grab']

In [8]: ''.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[8]: 'Text I want to grab.More text I want to grab'

In [9]: ''.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[9]: 'Text I want to grab.More text I want to grab'

In [10]: ' '.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])        
Out[10]: ' Text I want to grab. More text I want to grab'
