Scrapy Splash: taking a screenshot of the entire page

yyhrrdl8  posted on 2022-11-29  in  Other

How do I modify the arguments so that the whole page is captured?

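import base64

from scrapy_splash import SplashRequest

# The two methods below belong to the spider class.
def start_requests(self):
    url = ...  # some url
    splash_args = {
        'html': 1,
        'png': 1,
        'width': 600,
    }
    yield SplashRequest(url=url, callback=self.parse,
                        endpoint="render.json",
                        args=splash_args)

def parse(self, response):
    # render.json returns the screenshot as a base64-encoded string
    imgdata = base64.b64decode(response.data['png'])
    filename = 'image.png'
    with open(filename, 'wb') as f:
        f.write(imgdata)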

I tried adding "height" to splash_args, and the image does come out at width * height, but the extra height is blank. Is there any way to fix this?


sqserrrh (answer #1)

You can capture the whole page by adding the following line to your Lua script, after the page has loaded:

splash:set_viewport_full()
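
If you would rather stay on the render.json endpoint from the question and not write a Lua script, Splash's HTTP API has, as far as I know, a `render_all` argument that extends the viewport to the whole page before rendering, and it requires a non-zero `wait`. A minimal sketch of the arguments, where the helper name `full_page_args` and the default `wait` of 0.5 seconds are my own assumptions rather than anything from the question:

```python
def full_page_args(width=600, wait=0.5):
    """Build Splash HTTP API arguments for a full-page PNG plus HTML.

    `full_page_args` is a hypothetical helper; the keys are Splash
    render.json arguments.
    """
    if wait <= 0:
        # Splash documents that render_all=1 needs a non-zero wait.
        raise ValueError("render_all=1 needs a non-zero wait")
    return {
        "html": 1,
        "png": 1,
        "width": width,
        "render_all": 1,  # extend the viewport to the full page
        "wait": wait,     # seconds to wait after the page loads
    }


# Possible use inside start_requests (not run here):
#     yield SplashRequest(url=url, callback=self.parse,
#                         endpoint="render.json", args=full_page_args())
```

With this approach there is no need to pass "height" at all, since the height is taken from the rendered page.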

dced5bon (answer #2)

You can add this to your main spider:

from scrapy.spiders import Spider
from scrapy_splash import SplashRequest


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["www.dmoz.org"]

    # Lua script run by Splash's "execute" endpoint
    script = '''
        function main(splash, args)
            local headers = {
                ["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
            }
            splash:set_custom_headers(headers)
            splash.private_mode_enabled = false

            local url = args.url
            assert(splash:go(url))
            assert(splash:wait(5))
            -- resize the viewport to the full page once it has loaded
            splash:set_viewport_full()

            return {
                image = splash:png(),
                html = splash:html()
            }
        end
    '''

    def start_requests(self):
        yield SplashRequest(url="http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
                            callback=self.parse,
                            endpoint="execute",
                            args={"lua_source": self.script})
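
The spider passes callback=self.parse but does not show that method. A minimal sketch of the decoding step, assuming (as with the question's render.json code) that scrapy-splash exposes the returned table as `response.data` and that the PNG under the `image` key arrives base64-encoded. The helper name `save_screenshot` and the output filename are hypothetical:

```python
import base64
from pathlib import Path


def save_screenshot(data, path="image.png", key="image"):
    """Decode the base64 PNG stored under `key` and write it to `path`.

    Returns the number of bytes written.
    """
    png_bytes = base64.b64decode(data[key])
    Path(path).write_bytes(png_bytes)
    return len(png_bytes)


# Possible parse method on the spider (not run here):
#     def parse(self, response):
#         save_screenshot(response.data)
```

The `key` argument matches the field name used in the Lua script's return table, so if you rename `image` there, change it here too.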
