Can Scrapy download PDF files?

Posted by qlvxas9a on 2022-11-09 in Other

I want to download a PDF file from this website using the code below (written by F.Hoque).

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver    
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC    

class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.ons.gov.uk',
            callback=self.parse,
            wait_time = 3,
            screenshot = True
        )

    def parse(self, response):
        driver = response.meta['driver']
        driver.save_screenshot('screenshot.png')

        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys("Education and childcare")
        driver.save_screenshot('screenshot_1.png')
        # find_element(By.XPATH, ...) is used because newer Selenium releases no longer provide find_element_by_xpath
        driver.find_element(By.XPATH, '//*[@id="nav-search-submit"]').click()
        driver.save_screenshot('screenshot_2.png')
        # Open the first search result, then the article, then click the PDF link
        driver.find_element(By.XPATH, '//*[@id="results"]/div[1]/div[2]/div[1]/h3/a/span').click()
        driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div[1]/section/div/div[1]/div/div[2]/h3/a/span').click()
        driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div/div[1]/div[2]/p[2]/a').click()

Also, I'm not sure which settings.py file the following should be added to (it is needed to run the code):


# Middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

# Selenium

from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']

I'm using Spyder through Anaconda 3, and I have five different settings.py files. Here are their locations:

"C:\Users\David\anaconda3\Lib\site-packages\scrapy\commands\settings.py" 
"C:\Users\David\anaconda3\pkgs\bokeh-2.3.2-py38haa95532_0\Lib\site-packages\bokeh\settings.py" 
"C:\Users\David\anaconda3\Lib\site-packages\bokeh\settings.py" 
"C:\Users\David\anaconda3\pkgs\isort-5.8.0-pyhd3eb1b0_0\site-packages\isort\settings.py" 
"C:\Users\David\anaconda3\Lib\site-packages\isort\settings.py"

Which of these settings.py files should I save the second code snippet to?

bqjvbblv1#

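Scrapy can download PDF files and images using its media (files/images) pipelines. Look at the output below: it contains only a link to the PDF, not the file itself. Note that the URL ends in /pdf rather than in a .pdf extension. If it ended in .pdf it would point directly at a file, and only then could I download it from here with Scrapy's media pipeline. If you open the URL from the output manually, the download starts. I don't know whether the /pdf endpoint can be converted into a .pdf URL and then downloaded.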
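
The media pipeline and the Selenium middleware are both switched on through Scrapy settings. None of the five settings.py files listed in the question belongs to a Scrapy project (they are internal modules of Scrapy, bokeh, and isort). Project settings normally live in the settings.py that `scrapy startproject` generates, or they can be set per spider through the `custom_settings` class attribute. Here is a minimal sketch of the second option. The name `CUSTOM_SETTINGS`, the `downloads` folder, and the pipeline priority `1` are my own assumptions, not something from the question.

```python
from shutil import which

# Settings intended for a spider's `custom_settings` class attribute,
# e.g. `custom_settings = CUSTOM_SETTINGS` inside TestSpider.
CUSTOM_SETTINGS = {
    # Selenium middleware and driver options, copied from the question
    "DOWNLOADER_MIDDLEWARES": {"scrapy_selenium.SeleniumMiddleware": 800},
    "SELENIUM_DRIVER_NAME": "chrome",
    "SELENIUM_DRIVER_EXECUTABLE_PATH": which("chromedriver"),
    "SELENIUM_DRIVER_ARGUMENTS": ["--headless"],
    # Built-in files pipeline: it fetches every URL listed under an
    # item's "file_urls" key and stores the files below FILES_STORE.
    "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
    "FILES_STORE": "downloads",  # assumed output folder
}
```

With something like this in place, yielding `{'file_urls': [pdf_url]}` from the spider below should hand the link to the pipeline. As far as I know the files pipeline does not itself require a .pdf extension, though the stored file name may then lack one. I have not verified this against the /pdf endpoint, so treat it as something to try rather than a confirmed fix.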

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium import webdriver

from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.ons.gov.uk',
            callback=self.parse,
            wait_time = 3,
            screenshot = True
        )

    def parse(self, response):
        driver = response.meta['driver']
        #driver.save_screenshot('screenshot.png')

        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, "q"))).send_keys("Education and childcare")
        #driver.save_screenshot('screenshot_1.png')
        # find_element(By.XPATH, ...) is used because newer Selenium releases no longer provide find_element_by_xpath
        driver.find_element(By.XPATH, '//*[@id="nav-search-submit"]').click()
        #driver.save_screenshot('screenshot_2.png')
        driver.find_element(By.XPATH, '//*[@id="results"]/div[1]/div[2]/div[1]/h3/a/span').click()
        driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div[1]/section/div/div[1]/div/div[2]/h3/a/span').click()
        # No need to click the PDF link: click-and-download is not possible here
        #driver.find_element(By.XPATH, '//*[@id="main"]/div[2]/div/div[1]/div[2]/p[2]/a').click()
        #driver.save_screenshot('screenshot_pdf.png')

        # Read the PDF link's href instead of clicking it
        pdf_url = driver.find_element(By.XPATH, '//*[@class="link-complex js-pdf-dl-link"]').get_attribute('href')

        yield {'url': pdf_url}

Output:

{'url': 'https://www.ons.gov.uk/peoplepopulationandcommunity/educationandchildcare/articles/remoteschoolingthroughthecoronaviruscovid19pandemicengland/april2020tojune2021/pdf'}
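
Since the link ends in /pdf rather than .pdf, one option is to fetch it with a plain HTTP request and choose the file name yourself. Below is a minimal sketch using only the standard library. `pdf_filename_from_url` and `download_pdf` are hypothetical helper names, the `downloads` folder is an assumption, and I am assuming the endpoint returns the PDF bytes to an ordinary GET request, which I have not verified.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def pdf_filename_from_url(url: str) -> str:
    """Build a .pdf file name from a URL whose path may end in '/pdf'."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Drop the trailing 'pdf' endpoint segment, if present.
    if parts and parts[-1].lower() == "pdf":
        parts = parts[:-1]
    # Join the last two path segments, or fall back to a generic name.
    stem = "_".join(parts[-2:]) if parts else "download"
    return f"{stem}.pdf"


def download_pdf(url: str, dest_dir: str = "downloads") -> Path:
    """Fetch `url` and save the response body as a .pdf under `dest_dir`."""
    target = Path(dest_dir) / pdf_filename_from_url(url)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Some sites reject urllib's default User-Agent, so send a browser-like one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=60) as response:
        target.write_bytes(response.read())
    return target
```

In the spider, `download_pdf(pdf_url)` could be called next to the `yield`. It is a blocking call, which seems acceptable for a single file but would slow down a larger crawl.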
