Scrapy Shell and Scrapy Splash

Posted by ubof19bj on 2022-11-09 in Other


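We have been using the scrapy-splash middleware to pass scraped HTML source through the Splash JavaScript engine running in a Docker container.
To use Splash in a spider, we configure several required project settings and yield a Request that specifies particular meta arguments: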

from scrapy import Request
from scrapy_splash import SlotPolicy  # the package was formerly named scrapyjs

# Inside a spider callback:
yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': SlotPolicy.PER_DOMAIN,
    }
})
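For reference, the "required project settings" mentioned above usually look something like the fragment below. This is a sketch based on the scrapy-splash README as commonly documented; the SPLASH_URL value is an assumption (a local Docker container on port 8050), and the middleware names and priorities should be checked against the installed version.

```python
# settings.py (sketch; verify against the scrapy-splash README for your version)

# Assumption: Splash is reachable locally on its default port.
SPLASH_URL = 'http://localhost:8050'

# Downloader middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Spider middleware that deduplicates repeated Splash arguments.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

Because scrapy shell loads the project settings when started inside a project, these settings also apply to requests fetched from the shell.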

But how can we use scrapy-splash in the Scrapy shell?


Source: https://stackoverflow.com/questions/35352423/scrapy-shell-and-scrapy-splash

3 Answers


wqnecbli1#

Just wrap the URL you want the shell to fetch in a call to the Splash HTTP API.
So you would want something like this:

scrapy shell 'http://localhost:8050/render.html?url=http://example.com/page-with-javascript.html&timeout=10&wait=0.5'

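where:

localhost:port is where the Splash service is running
url is the URL you want to scrape; don't forget to URL-quote it!
render.html is one of the available HTTP API endpoints; in this case it returns the rendered HTML page
timeout is the timeout, in seconds
wait is how long, in seconds, to wait for JavaScript to execute before reading/saving the HTML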
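Since the target URL has to be URL-quoted, it can be easier to build the Splash URL programmatically. The helper below is a minimal sketch using only the standard library; the function name splash_render_url and its default values are illustrative assumptions, not part of Splash or scrapy-splash.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      timeout=10, wait=0.5):
    """Return a render.html URL for Splash with the target URL quoted.

    splash_base is assumed to point at a running Splash instance.
    """
    # urlencode percent-encodes the target URL, so any '&' or '?' inside it
    # cannot be mistaken for Splash's own query parameters.
    query = urlencode({"url": target_url, "timeout": timeout, "wait": wait})
    return f"{splash_base}/render.html?{query}"


if __name__ == "__main__":
    print(splash_render_url("http://example.com/page-with-javascript.html"))
```

The printed URL can then be passed, in quotes, to scrapy shell.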


xesrikrc2#

You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).
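A sketch of what that might look like is below. The helper name splash_request is hypothetical, and the wait value is an arbitrary example; the code assumes scrapy-splash is installed and the project settings already enable its middlewares. The import is done inside the function so the helper can be defined even where scrapy-splash is not yet importable.

```python
def splash_request(url, wait=0.5, endpoint="render.html"):
    """Build a SplashRequest for use with fetch() in scrapy shell.

    Hypothetical convenience helper; requires scrapy-splash at call time.
    """
    from scrapy_splash import SplashRequest

    # 'wait' gives the page's JavaScript time to run before Splash
    # returns the rendered result.
    return SplashRequest(url, endpoint=endpoint, args={"wait": wait})


# At the scrapy shell prompt (fetch is provided by the shell itself):
#   req = splash_request("http://example.com/page-with-javascript.html")
#   fetch(req)
#   response.css("title::text").get()
```

After fetch(req) returns, the shell's response object should hold the HTML as rendered by Splash rather than the raw page source.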


y3bcpkx13#

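For Windows users running Docker Toolbox:
1. Change the single quotes to double quotes to avoid the invalid hostname:http error.
2. Change localhost to the Docker machine's IP address, which is shown below the whale logo. For me it was 192.168.99.100.
In the end I used this:
scrapy shell "http://192.168.99.100:8050/render.html?url=https://example.com/category/banking-insurance-financial-services/"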

