Scrapy-Splash start_urls and other fields from a MySQL database (or CSV)

sg3maiej · posted 2022-11-09 · in Mysql

I'm trying to pull a URL field and two other fields from a MySQL database for use with Scrapy-Splash. I can get the URLs, and the crawl works fine, but I can't get the other two fields that should correspond to each URL being crawled. Here is my latest attempt.
dataReader() returns three items: the URLs sit at index [0], and iterating self.start_urls[0] works perfectly.
However, I can't pull the matching row data for itemid from index [1], or for location from index [2].
My guess is that the problem lies in the meta argument or the for loop. It would make more sense for the for statement not to use the [0] index, but with anything else I try, the spider won't crawl at all.
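
For reference, here is a minimal sketch of what a readdata.py built this way might look like, returning three parallel column lists with the URLs at index [0]. The connection settings, table name, and column names are assumptions for illustration, as is the dataReaderCSV variant covering the "(or CSV)" case from the title:

import csv
import mysql.connector

def dataReader():
    # hypothetical connection settings and table/column names
    conn = mysql.connector.connect(
        host="localhost", user="user", password="pass", database="scrapydb"
    )
    cursor = conn.cursor()
    cursor.execute("SELECT url, itemid, locationid FROM crawl_targets")
    rows = cursor.fetchall()
    conn.close()
    # transpose the rows into three parallel column lists:
    # [[url, ...], [itemid, ...], [locationid, ...]]
    return [
        [r[0] for r in rows],
        [r[1] for r in rows],
        [r[2] for r in rows],
    ]

def dataReaderCSV(path="targets.csv"):
    # same shape, read from a CSV file with a url,itemid,locationid header
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        rows = list(reader)
    return [
        [r[0] for r in rows],
        [r[1] for r in rows],
        [r[2] for r in rows],
    ]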

my_scrapy.py

import scrapy
from ..items import MyScrapyItem
from .readdata import dataReader
from scrapy_splash import SplashRequest

class my_scrapy(scrapy.Spider):
    name = "my_scrapy"
    allowed_domains = ['www.google.com']
    start_urls = dataReader()

    script = '''
        function main(splash, args)
            splash.private_mode_enabled = false
            assert(splash:go(splash.args.url))
            assert(splash:wait(15))
            return {
            splash:html()
            }
            end
    '''

    def start_requests(self):
        # create initial requests for urls in start_urls
        for url in self.start_urls[0]:
            yield SplashRequest(url=url, callback=self.parse, endpoint="execute", args={
                'lua_source': self.script, 'wait': 5
            }, meta={"itemid": self.start_urls[1], "locationid": self.start_urls[2]})

    def parse(self, response):
        items = MyScrapyItem()
        itemid = response.meta['itemid']
        locationid= response.meta['locationid']
        #print(response.data)

        chairs = response.xpath('//div[3]//div[1]//h4[1]//span[1]//span[1]/text()').extract()
        tables = response.xpath('//div[2]//div[1]//h4[1]//span[1]//span[1]/text()').extract()

        items['itemid'] = itemid
        items['locationid'] = locationid

        items['chairs'] = chairs
        items['table'] = tables

        yield items
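
For completeness, the MyScrapyItem imported above needs fields matching the keys assigned in parse(). A minimal sketch inferred from those keys (the real items.py may define more):

import scrapy

class MyScrapyItem(scrapy.Item):
    itemid = scrapy.Field()
    locationid = scrapy.Field()
    chairs = scrapy.Field()
    table = scrapy.Field()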

ecfsfe2w 1#

This works!!!
By defining each column produced by the readdata() function:

start_urls = dataReader()          # full result: [urls, itemids, locationids]
starturls = start_urls[0]          # column of URLs
itemids = start_urls[1]            # column of item ids
locationids = start_urls[2]        # column of location ids

and looping over range(len(self.starturls)), so the same row index picks out the matching itemid and locationid from the other two columns for each request's meta:

    def start_requests(self):
        # create one request per row, indexing all three columns in lockstep
        for i in range(len(self.starturls)):
            yield SplashRequest(url=self.starturls[i], callback=self.parse, endpoint="execute", args={
                'lua_source': self.script, 'wait': 5
            }, meta={"itemid": self.itemids[i], "locationid": self.locationids[i]})
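
The same pairing can also be written with zip(), which walks the three parallel columns together without index bookkeeping; a sketch of the equivalent loop, assuming the three lists have equal length since they come from the same rows:

    def start_requests(self):
        # zip yields one (url, itemid, locationid) triple per row
        for url, itemid, locationid in zip(self.starturls, self.itemids, self.locationids):
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint="execute",
                args={'lua_source': self.script, 'wait': 5},
                meta={"itemid": itemid, "locationid": locationid},
            )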
