Scrapy: how to use XPath to select text from multiple elements with different HTML structures

Posted by vojdkbi0 on 2022-11-23 in Other

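I have this div, and I would like to know whether all of the "TEXT_I_NEED_X" values can be selected with a single XPath expression.
The closest I have come is the following, but it selects more than I need: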
//div[@class="article-text-with-img"]/p//text()

<div class="article-text-with-img">
    
  <p>
    <a href="#"> Text1 </a>
  </p>
  
  <p>&nbsp;</p>
  
  <p>
    TEXT_I_NEED_A
    <a href="#"> Text2 </a>
  </p>
  
  <p>
    <span>
      TEXT_I_NEED_B
      <a href="#"> Text3 </a>
    </span>
  </p>
  
  <p> 
    <span>
        <span>
            TEXT_I_NEED_C
            <a href="#"> Text4 </a>
        </span>
    </span>
  </p>
  
  <p>
    <span> 
        TEXT_I_NEED_D
    </span>
    <a href="#"> Text5 </a>
  </p>

  <p>
    <span> 
        <span>
           TEXT_I_NEED_D
        </span>
        <a href="#"> Text5 </a>
    </span>
  </p>
  
</div>

Answer #1 (moiiocjp)

Using a single XPath expression:
//div[@class="article-text-with-img"]//a/parent::*/text() | //div[@class="article-text-with-img"]//a/preceding-sibling::span/text()
Tested on the command line with xmllint (newlines and spaces are included in the text() results):

xmllint --html --xpath '//div[@class="article-text-with-img"]//a/parent::*/text() | //div[@class="article-text-with-img"]//a/preceding-sibling::span/text()' test.html 


TEXT_I_NEED_A

  TEXT_I_NEED_B
  

        TEXT_I_NEED_C
        
    
 
    TEXT_I_NEED_D


    
       TEXT_I_NEED_E
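
The same expression can also be run from Python, where the surrounding whitespace is easy to strip. The following is a minimal sketch, assuming lxml is installed (the answer itself only uses xmllint); `HTML_DOC`, `EXPR` and `wanted_texts` are hypothetical names, and the markup is a shortened copy of the question's snippet.

```python
from lxml import html

# Shortened copy of the markup from the question (assumption: same structure).
HTML_DOC = """
<div class="article-text-with-img">
  <p><a href="#"> Text1 </a></p>
  <p>&nbsp;</p>
  <p>TEXT_I_NEED_A <a href="#"> Text2 </a></p>
  <p><span>TEXT_I_NEED_B <a href="#"> Text3 </a></span></p>
  <p><span><span>TEXT_I_NEED_C <a href="#"> Text4 </a></span></span></p>
  <p><span> TEXT_I_NEED_D </span><a href="#"> Text5 </a></p>
</div>
"""

# The XPath expression from this answer, unchanged.
EXPR = (
    '//div[@class="article-text-with-img"]//a/parent::*/text()'
    ' | //div[@class="article-text-with-img"]//a/preceding-sibling::span/text()'
)


def wanted_texts(doc):
    """Return the selected text nodes, stripped, with empty ones dropped."""
    tree = html.fromstring(doc)
    stripped = (node.strip() for node in tree.xpath(EXPR))
    return [text for text in stripped if text]


print(wanted_texts(HTML_DOC))
```

In a Scrapy callback the same expression would typically be passed to `response.xpath(EXPR).getall()`, followed by the same strip-and-filter step.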

Answer #2 (yshpjwxd)

A BeautifulSoup example:

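from bs4 import BeautifulSoup

# Paste the HTML snippet from the question between the triple quotes.
html_doc = """<YOUR HTML SNIPPET FROM THE QUESTION>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Remove every <a> element so that the link text is not collected.
article = soup.select_one(".article-text-with-img")
for a in article.select("a"):
    a.extract()

# Keep the text nodes that are non-empty after stripping (":=" needs Python 3.8+).
text = [t for s in article.find_all(string=True) if (t := s.strip())]
print(text)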

Prints:

['TEXT_I_NEED_A', 'TEXT_I_NEED_B', 'TEXT_I_NEED_C', 'TEXT_I_NEED_D', 'TEXT_I_NEED_D']
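
If adding a dependency is not an option, the same idea (ignore everything inside `<a>`, keep the remaining non-empty text) can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not part of the answer above; the class name `TextOutsideLinks` and the `sample` markup are hypothetical, and the sketch assumes the input contains only the article div.

```python
from html.parser import HTMLParser


class TextOutsideLinks(HTMLParser):
    """Collect non-empty text that is not inside an <a> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.link_depth = 0  # greater than 0 while inside an <a> element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        # str.strip() also removes the non-breaking space produced by &nbsp;.
        stripped = data.strip()
        if self.link_depth == 0 and stripped:
            self.texts.append(stripped)


# Shortened copy of the markup from the question (assumption: same structure).
sample = """
<div class="article-text-with-img">
  <p><a href="#"> Text1 </a></p>
  <p>&nbsp;</p>
  <p>TEXT_I_NEED_A <a href="#"> Text2 </a></p>
  <p><span>TEXT_I_NEED_B <a href="#"> Text3 </a></span></p>
</div>
"""

parser = TextOutsideLinks()
parser.feed(sample)
parser.close()
print(parser.texts)
```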
