Scrapy with pdfkit, WinError 206

ufj5ltwl · asked on 2022-11-09

Below is the program I am trying to run, but it fails with WinError 206. Is this some kind of Windows bug or something else? From what I found, WinError 206 is related to subprocess. Can anyone help me fix this?

import pdfkit
import scrapy

path = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"
config = pdfkit.configuration(wkhtmltopdf = path)

class WorkRegister(scrapy.Spider):
    name = "WorkRegister_PartB"
    start_urls = ['http://127.0.0.1:5500/sripur.html']

    def parse(self, response):
        trees = response.xpath('//div[3]//div[2]/table//tr')
        all_links = []
        for tree in trees:
            if (tree.xpath('.//td[6]/text()').get()).strip()=="Gram Panchayat":
                print('true...............')
                front_url = 'https://mnregaweb2.nic.in/netnrega/'
                scraped_link = str(tree.xpath('.//td[2]/a/@href').get())
                new_url = f"{front_url}{scraped_link}"
                all_links.append(new_url)

        print(all_links)
        pdfkit.from_url(all_links, r"./register4_Part_B.pdf", configuration=config)
        # pdfkit.from_url(all_links, "D:\Documents\Register_4_Part_B\register4_Part_B.pdf", configuration=config)

The error is as follows:

2022-05-28 10:41:14 [scrapy.core.scraper] ERROR: Spider error processing <GET http://127.0.0.1:5500/sripur.html> (referer: None)
Traceback (most recent call last):
  File "C:\Users\Sumit\AppData\Local\Programs\Python\Python310\lib\site-packages\twisted\internet\defer.py", line 857, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "C:\Users\Sumit\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spiders\__init__.py", line 67, in _parse
    return self.parse(response,**kwargs)
  File "D:\Documents\Register_4_Part_B\pdfcreator.py", line 29, in parse
    pdfkit.from_url(all_links, r"./register4_Part_B.pdf", configuration=config)
  File "C:\Users\Sumit\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfkit\api.py", line 27, in from_url
    return r.to_pdf(output_path)
  File "C:\Users\Sumit\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfkit\pdfkit.py", line 169, in to_pdf
    result = subprocess.Popen(
  File "C:\Users\Sumit\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\Sumit\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1435, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 206] The filename or extension is too long
2022-05-28 10:41:14 [scrapy.core.engine] INFO: Closing spider (finished)
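For context (a sketch, not from the original post): `pdfkit.from_url` passes every URL in the list as a separate argument to a single `wkhtmltopdf` invocation, so the command line grows with the number and length of the links. Windows's `CreateProcess` rejects command lines longer than roughly 32,767 characters, which surfaces as WinError 206. A rough way to estimate the command length before calling pdfkit (the URLs below are synthetic placeholders, and the per-argument overhead is an assumption):

```python
# Rough estimate of the single command line pdfkit builds for from_url.
wkhtmltopdf = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"
# synthetic stand-ins for the scraped links (each ~235 chars, 200 of them)
fake_links = ["https://mnregaweb2.nic.in/netnrega/" + "a" * 200 for _ in range(200)]
cmd = [wkhtmltopdf] + fake_links + ["register4_Part_B.pdf"]
# each argument contributes its length plus some quoting/space overhead
estimated_len = sum(len(arg) + 3 for arg in cmd)
WINDOWS_CMDLINE_LIMIT = 32767  # CreateProcess command-line limit
print(estimated_len > WINDOWS_CMDLINE_LIMIT)  # True: this call would hit WinError 206
```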
Answer #1, by 5m1hhzi4:

Finally found a solution. The problem was the number of characters in the arguments I was passing: it exceeded the Windows command-line character limit (~32K). I used a URL shortener to cut down the length of the arguments, and it worked. Here is the new code. Thanks, everyone.

import pyshorteners
import pdfkit
import scrapy

path = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"
config = pdfkit.configuration(wkhtmltopdf = path)
options = {'page-size': 'A4'}

class WorkRegister(scrapy.Spider):
    name = "WorkRegister_PartB"
    start_urls = ['http://127.0.0.1:9112/sripur2021.html']

    def parse(self, response):
        trees = response.xpath('//div[3]//div[2]/table//tr')
        all_links = []
        for tree in trees:
            if (tree.xpath('.//td[6]/text()').get()).strip()=="Gram Panchayat":
                print('true...............')
                front_url = 'https://mnregaweb2.nic.in/netnrega/'
                scraped_link = str(tree.xpath('.//td[2]/a/@href').get())
                new_url = f"{front_url}{scraped_link}"
                shortener = pyshorteners.Shortener(timeout=120)
                shortened_url = shortener.tinyurl.short(new_url)
                all_links.append(shortened_url)

        print(all_links)
        pdfkit.from_url(all_links, r"./Sripur_20-21_register4_Part_B.pdf", configuration=config, options=options)
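As a side note (not part of the accepted answer), the URL shortener can be avoided by splitting the link list into batches small enough that each `wkhtmltopdf` command line stays under the Windows limit, then converting each batch to its own PDF. A minimal sketch, where the limit and per-argument overhead are assumptions:

```python
def batch_links(links, limit=30000, overhead=3):
    """Split links into batches whose combined command-line cost
    (length plus quoting/space overhead per argument) stays under limit."""
    batches, current, current_len = [], [], 0
    for link in links:
        cost = len(link) + overhead
        if current and current_len + cost > limit:
            batches.append(current)
            current, current_len = [], 0
        current.append(link)
        current_len += cost
    if current:
        batches.append(current)
    return batches

# Each batch could then be converted separately, e.g.:
#   for i, batch in enumerate(batch_links(all_links)):
#       pdfkit.from_url(batch, f"register4_Part_B_{i}.pdf", configuration=config)
```

This trades one output PDF for several, but it removes the dependency on an external shortening service (and the network round-trip per link that comes with it).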
