Removing elements from a response using XPath in Scrapy

nsc4cvqm  posted on 2023-02-22  in Other

I want to remove specific elements from a Scrapy response. Here are my steps:

scrapy shell example.com
matches = response.xpath(xpath)  # len(matches) == 220, so there are many target elements
for selector in matches:
    selector.remove()  # or selector.drop(); I don't know the difference
matches = response.xpath(xpath)  # len(matches) == 0, so the removal succeeded

But when I look at response.text, the target elements are still there!
How can I get the correct response?

gr8qqesn  #1

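You can do this by first getting the root element of the HTML with a root XPath expression.
Then perform whatever drop operations you need, using paths relative to the root element.
When you are done, call root.get() to obtain the resulting HTML text.
For example, here is some sample HTML:
index.html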

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
</head>
<body>
  <div class="section">
    <ul>
      <li><a href="link1">...</a></li>
      <li><a href="link2">...</a></li>
      <li><a href="link3">...</a></li>
    </ul>
  </div>
  <div>
    <table>
      <thead>
        <th>
          <td><a href="link4">...</a></td>
          <td><a href="link5">...</a></td>
        </th>
      </thead>
    <tbody>
      <tr>
        <td><a href="link6">...</a></td>
        <td><a href="link7">...</a></td>
      </tr>
    </tbody>
    </table>
  </div>
</body>
</html>

So I run scrapy shell ./index.html:

...
>>> root = response.xpath('/*')
>>> root
[<Selector xpath='/*' data='<html lang="en">\n<head>\n  <meta chars...'>]
>>> root.get()
'<html lang="en">\n<head>\n  <meta charset="UTF-8">\n  <meta http-equiv="X-UA-Compatible" content="IE=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  <title>Document</title>\n</head>\n<body>\n  <div class="section">\n    <ul>\n      <li><a href="link1">...</a></li>\n      <li><a href="link2">...</a></li>\n      <li><a href="link3">...</a></li>\n    </ul>\n  </div>\n  <div>\n    <table>\n      <thead>\n        <th>\n          </th><td><a href="link4">...</a></td>\n          <td><a href="link5">...</a></td>\n        \n      </thead>\n    <tbody>\n      <tr>\n        <td><a href="link6">...</a></td>\n        <td><a href="link7">...</a></td>\n      </tr>\n    </tbody>\n    </table>\n  </div>\n</body>\n</html>'
>>> a_elems = root.xpath('.//a')
>>> len(a_elems)
7
>>> a_elems.drop()
>>> root.xpath('.//a')
[]
>>> root.get()
'<html lang="en">\n<head>\n  <meta charset="UTF-8">\n  <meta http-equiv="X-UA-Compatible" content="IE=edge">\n  <meta name="viewport" content="width=device-width, initial-scale=1.0">\n  <title>Document</title>\n</head>\n<body>\n  <div class="section">\n    <ul>\n      <li>\n      <li>\n      <li>\n    </ul>\n  </div>\n  <div>\n    <table>\n      <thead>\n        <th>\n          </th><td></td>\n          <td></td>\n        \n</thead>\n    <tbody>\n      <tr>\n        <td></td>\n        <td></td>\n      </tr>\n    </tbody>\n    </table>\n  </div>\n</body>\n</html>'

As you can see, there are no more <a> elements in the sample HTML.
P.S. The difference between remove and drop is that remove is deprecated, while drop is not.
