How can I debug Scrapy?

Posted by zsbz8rwp 12 months ago in Other

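I'm 99% sure something is going wrong with my hxs.select on this website. I can't extract anything. When I run the code below, I get no error feedback, but neither title nor link is populated. Any help?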

def parse(self, response):
    self.log("\n\n\n We got data! \n\n\n")
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//div[@class=\'footer\']')
    items = []
    for site in sites:
        item = CarrierItem()
        item['title'] = site.select('.//a/text()').extract()
        item['link'] = site.select('.//a/@href').extract()
        items.append(item)
    return items

I have also tried the scrapy shell command with a URL, but when I type view(response) in the shell it simply returns True and opens a text file instead of my web browser.

>>> response.url
'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'

>>> hxs.select('//div')
Traceback (most recent call last):
    File "", line 1, in 
AttributeError: 'NoneType' object has no attribute 'select'

>>> view(response)
True

>>> hxs.select('//body')
Traceback (most recent call last):
    File "", line 1, in 
AttributeError: 'NoneType' object has no attribute 'select'

Source: https://stackoverflow.com/questions/17413398/how-can-i-debug-scrapy

4 Answers

wqsoz72f #1

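You can use pdb from the command line and add a breakpoint in your file, although it takes a few steps.
(It may be slightly different when debugging on Windows.)
1. Locate your scrapy executable:

$ whereis scrapy
/usr/local/bin/scrapy

2. Invoke it as a Python script and start PDB:

$ python -m pdb /usr/local/bin/scrapy crawl quotes

3. Once in the debugger shell, open another shell instance and find the path to your spider script (which lives in your spider project):

$ realpath path/to/your/spider.py
/absolute/spider/file/path.py

This prints the absolute path. Copy it to your clipboard.
4. In the PDB shell, type:

b /absolute/spider/file/path.py:line_number

...where line_number is the line in that file at which you want execution to break.
5. Press c in the debugger to continue.
Now go do some Python-fu :)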
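
The steps above can also be assembled from Python. The following is only a minimal sketch and not part of the original answer: the helper names `pdb_launch_command` and `breakpoint_command` are hypothetical, and it assumes that `scrapy` is on your PATH and is installed as a Python script. It uses the standard-library `shutil.which` and `os.path.realpath` in place of `whereis` and `realpath`.

```python
import os
import shutil
import sys


def pdb_launch_command(spider_name, executable="scrapy"):
    """Return the argv that runs `scrapy crawl <spider_name>` under pdb.

    Returns None when the executable cannot be found on PATH.
    """
    scrapy_path = shutil.which(executable)  # same role as `whereis scrapy`
    if scrapy_path is None:
        return None
    return [sys.executable, "-m", "pdb", scrapy_path, "crawl", spider_name]


def breakpoint_command(spider_file, line_number):
    """Return the pdb command that sets a breakpoint in the spider file."""
    absolute_path = os.path.realpath(spider_file)  # same role as `realpath`
    return f"b {absolute_path}:{line_number}"


if __name__ == "__main__":
    # "quotes" is the spider name used in the answer above; the path and
    # line number are placeholders to replace with your own.
    print(pdb_launch_command("quotes"))
    print(breakpoint_command("path/to/your/spider.py", 10))
```

The first helper gives the command to run in step 2, and the second gives the line to type at the pdb prompt in step 4.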

rnmwe5a2 #2

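The Scrapy shell is indeed a great tool. If your document has an XML stylesheet, it is probably an XML document, so you can use the scrapy shell with xxs instead of hxs, as in this example from the Scrapy documentation on removing namespaces: http://doc.scrapy.org/en/latest/topics/selectors.html#removing-namespaces
If that doesn't work, I tend to fall back to plain lxml.etree and dump all of the document's elements: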

import lxml.etree
import lxml.html
import scrapy

# scrapy.Spider is the current name of what older releases called BaseSpider
class myspider(scrapy.Spider):
    ...
    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        # fromstring() already returns the root element, so no getroot() call is needed
        root = lxml.etree.fromstring(response.body)
        # or for broken XML docs:
        # root = lxml.etree.fromstring(response.body, parser=lxml.etree.XMLParser(recover=True))
        # or for HTML:
        # root = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser())

        # then look up which elements can actually be selected
        # (this could be very big, but at least you see everything inside: the element tags and namespaces)
        print(list(root.iter()))
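
To illustrate why dumping the elements helps, here is a small sketch that uses the standard-library `xml.etree.ElementTree` rather than lxml (a substitution made only so the example runs without extra packages). The sample document and the `list_tags` helper are made up for illustration. It shows that elements inside a default namespace carry that namespace in their tag, so a lookup by the bare tag name matches nothing.

```python
import xml.etree.ElementTree as ET

# Made-up sample: an Atom-like feed that declares a default namespace.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First</title></entry>
</feed>"""


def list_tags(body):
    """Return the tag of every element in document order."""
    root = ET.fromstring(body)
    return [element.tag for element in root.iter()]


root = ET.fromstring(SAMPLE)
tags = list_tags(SAMPLE)

# Looking up the bare name ignores the namespace; the qualified name does not.
unqualified = root.findall("entry")
qualified = root.findall("{http://www.w3.org/2005/Atom}entry")

for tag in tags:
    print(tag)
print(len(unqualified), len(qualified))
```

Every printed tag is prefixed with the namespace URI in braces. The bare `entry` lookup returns an empty list while the qualified one finds the element, which is the kind of mismatch the element dump makes visible.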

o2rvlv0m #3

Using VSCode:

1. Locate the scrapy executable:

$ which scrapy
/Users/whatever/tutorial/tutorial/env/bin/scrapy

For me it was at /Users/whatever/tutorial/tutorial/env/bin/scrapy; copy that path.

2. Create a launch.json file

Go to the debug tab in VSCode and click "Add Configuration".

3. Paste the following template into launch.json

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "args": ["crawl", "NAME_OF_SPIDER"],
            "type": "python",
            "request": "launch",
            "program": "PATH_TO_SCRAPY_FILE",
            "console": "integratedTerminal",
            "justMyCode": false
        }
    ]
}

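In that template, replace NAME_OF_SPIDER with the name of your spider (datasets in my case) and PATH_TO_SCRAPY_FILE with the output from step 1 (/Users/whatever/tutorial/tutorial/env/bin/scrapy in my case).

4. Make sure VSCode is opened at the root of the scrapy project

5. Set a breakpoint and hit debug!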
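
If you would rather not edit the template by hand, the substitution in step 3 can be scripted. This is only a sketch: `build_launch_config` is a hypothetical helper, the spider name `datasets` is the one mentioned above, and the configuration keys are copied from the template rather than taken from any VSCode API.

```python
import json
import shutil


def build_launch_config(spider_name, scrapy_path):
    """Fill the launch.json template with a spider name and scrapy path."""
    return {
        "version": "0.2.0",
        "configurations": [
            {
                "name": "Python: Current File",
                "args": ["crawl", spider_name],
                "type": "python",
                "request": "launch",
                "program": scrapy_path,
                "console": "integratedTerminal",
                "justMyCode": False,
            }
        ],
    }


if __name__ == "__main__":
    # shutil.which plays the role of `which scrapy`; it returns None when
    # scrapy is not on PATH, so fall back to the placeholder in that case.
    path = shutil.which("scrapy") or "PATH_TO_SCRAPY_FILE"
    print(json.dumps(build_launch_config("datasets", path), indent=4))
```

The printed JSON can be pasted into launch.json in place of the template.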

7vux5j2d #4

To debug a Scrapy spider with PDB, you need to insert a debug point and include some code to switch it on and off.
Take a very simple spider. To make it debuggable with pdb, you can add the following code:

# -*- coding: utf-8 -*-
import scrapy

import os
import pdb

class QuotesSpiderSpider(scrapy.Spider):
    name = 'simple'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

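    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup (and accept -a arguments) first
        super().__init__(*args, **kwargs)
        scrapyDebug = os.getenv("SCRAPY_DEBUG")
        if scrapyDebug and int(scrapyDebug):
            pdb.set_trace()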

    def parse(self, response):
        quotes = response.xpath("//div[@class='quote']//span[@class='text']/text()").extract()
        yield {'quotes': quotes}

So running the spider normally will not invoke the debugger. If you have a bug to track down, you can invoke the debugger like this:

SCRAPY_DEBUG=1 scrapy crawl simple

The debugger is started in the spider's __init__() method. From there you can set breakpoints wherever the problem occurs in your code.
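
One caveat with the check in the spider above: `int(scrapyDebug)` raises `ValueError` for values such as `true`. A slightly more forgiving variant might look like the sketch below. The `debug_enabled` and `maybe_break` helpers and the set of accepted values are assumptions made for illustration, not part of the original answer.

```python
import os
import pdb

# Values treated as "on"; this set is an arbitrary choice for the sketch.
TRUTHY = {"1", "true", "yes", "on"}


def debug_enabled(environ=None, var="SCRAPY_DEBUG"):
    """Return True when the environment variable holds a truthy value."""
    env = os.environ if environ is None else environ
    return env.get(var, "").strip().lower() in TRUTHY


def maybe_break(environ=None):
    """Drop into pdb only when debugging was requested."""
    if debug_enabled(environ):
        pdb.set_trace()
```

Calling `maybe_break()` from the spider's `__init__()` or `parse()` would then behave like the original check for `SCRAPY_DEBUG=1` while treating unset, empty, or unrecognized values as off.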
