Scrapy spider middleware

t0ybt7op  posted on 2022-11-09  in Other

I have a function in my spider, check_duplicates(), that checks whether a URL already exists in my database. If it does not, the URL is passed to the parse_product method:

import mysql.connector
import scrapy

def check_duplicates(url):
    connection = mysql.connector.connect(
        host='host_ip',
        port=3306,
        user='username',
        password='pass',
        database='base_name',
    )
    cursor = connection.cursor()
    # Parameterized query: the driver escapes the URL, which avoids SQL injection
    cursor.execute("SELECT url FROM my_table WHERE url = %s", (url,))
    results = cursor.fetchone()
    connection.close()
    return results

class CianSpider(scrapy.Spider):
    name = 'spider_name'

    def start_requests(self):
        url = 'https://some_site.ru'
        yield scrapy.Request(
            url=url,
            method='GET',
            callback=self.parse)

    def parse(self, response,**cb_kwargs):
        for item in response.css('a[href*=".item"]::attr("href")').extract():
            url = response.urljoin(item)
            if check_duplicates(url) is None:
                yield scrapy.Request(
                    url=url,
                    cookies=self.cookies,  # assumes self.cookies is set elsewhere in the spider
                    callback=self.parse_product,
                )

    def parse_product(self, response,**cb_kwargs):
        pass

How can I implement this mechanism with a Scrapy spider middleware? How and where should I register the URL check?

yrwegjxp1#

You can use a custom downloader middleware that inspects each request before it is downloaded and checks its URL.
In your middlewares.py file:

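from scrapy.exceptions import IgnoreRequest
import mysql.connector

class YourProjectNameDownloaderMiddleware:

    def process_request(self, request, spider):
        url = request.url
        connection = mysql.connector.connect(
            host='host_ip',
            port=3306,
            user='username',
            password='pass',
            database='base_name',
        )
        cursor = connection.cursor()
        # Parameterized query: the driver escapes the URL, which avoids SQL injection
        cursor.execute("SELECT url FROM my_table WHERE url = %s", (url,))
        known = cursor.fetchone() is not None
        connection.close()
        if known:
            # The URL is already in the database: drop the request
            raise IgnoreRequest(f"Already stored: {url}")
        # Returning None lets Scrapy continue processing this request;
        # returning the request itself would send it back to the scheduler
        return None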

Then, in your settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'YourProjectName.middlewares.YourProjectNameDownloaderMiddleware': 100,
}

Replace YourProjectName everywhere it appears with the actual name of your project.
