Scrapy crawl finishes, but does not crawl all start requests

Asked by 06odsfpq on 2023-08-05

I am trying to run a broad crawl with the Scrapy library, in which I parse millions of websites. The spider connects to a PostgreSQL database. This is how I load the unprocessed URLs before starting the spider:

def get_unprocessed_urls(self, suffix):
    """
    Fetch unprocessed urls.
    """

    print(f'Fetching unprocessed urls for suffix {suffix}...')

    # Server-side (named) cursor, so results are streamed in batches
    # instead of being fetched into memory all at once.
    cursor = self.connection.cursor('unprocessed_urls_cursor', withhold=True)
    cursor.itersize = 1000
    cursor.execute(f"""
        SELECT su.id, su.url FROM seed_url su
        LEFT JOIN footer_seed_url_status fsus ON su.id = fsus.seed_url_id
        WHERE su.url LIKE '%.{suffix}' AND fsus.seed_url_id IS NULL;
    """)

    ID = 0
    URL = 1

    urls = [Url(url_row[ID], self.validate_url(url_row[URL])) for url_row in cursor]

    print('len urls:', len(urls))
    return urls
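
The Url container and validate_url helper used in the list comprehension are not shown in the question; here is a minimal sketch of what they could look like, reconstructed from how they are used (the field order follows the Url(...) call above, and the validation logic mirrors the check discussed in the answer below, so treat both definitions as assumptions):

from typing import NamedTuple


class Url(NamedTuple):
    # Field order matches Url(url_row[ID], self.validate_url(url_row[URL])).
    id: int
    url: str


def validate_url(url):
    # Shown standalone here; in the handler it is presumably a method.
    # Hypothetical reconstruction of the original check (see the answer below
    # for why this naive version lets scheme-less URLs through).
    if not url.startswith("http"):
        url = 'http://' + url
    return url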

And this is my spider:

import scrapy
from bs4 import BeautifulSoup
# FooterItem is the project's Item subclass (its import path is not shown in the question)


class FooterSpider(scrapy.Spider):

    ...

    def start_requests(self):

        urls = self.handler.get_unprocessed_urls(self.suffix)

        for url in urls:

            yield scrapy.Request(
                url=url.url,
                callback=self.parse,
                errback=self.errback,
                meta={
                    'seed_url_id': url.id,
                }
            )

    def parse(self, response):

        try:

            seed_url_id = response.meta.get('seed_url_id')

            print(response.url)

            soup = BeautifulSoup(response.text, 'html.parser')

            footer = soup.find('footer')

            item = FooterItem(
                seed_url_id=seed_url_id,
                html=str(footer) if footer is not None else None,
                url=response.url
            )
            yield item
            print(f'Successfully processed url {response.url}')

        except Exception as e:
            print('Error while processing url', response.url)
            print(e)

            seed_url_id = response.meta.get('seed_url_id')

            cursor = self.handler.connection.cursor()
            cursor.execute(
                "INSERT INTO footer_seed_url_status(seed_url_id, status) VALUES(%s, %s)",
                (seed_url_id, str(e)))

            self.handler.connection.commit()

    def errback(self, failure):
        print(failure.value)

        try:

            error = repr(failure.value)
            request = failure.request

            seed_url_id = request.meta.get('seed_url_id')

            cursor = self.handler.connection.cursor()
            cursor.execute(
                "INSERT INTO footer_seed_url_status(seed_url_id, status) VALUES(%s, %s)",
                (seed_url_id, error))

            self.handler.connection.commit()

        except Exception as e:
            print(e)


Here are my custom settings for the crawl (taken from the broad-crawl documentation page mentioned above):

SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
CONCURRENT_REQUESTS = 100
CONCURRENT_ITEMS = 1000
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
REACTOR_THREADPOOL_MAXSIZE = 20
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 0.2


My problem: the spider does not crawl all of the URLs; instead it stops after crawling only a few hundred (or a few thousand, the number seems to vary). No warnings or errors appear in the log. Here is a sample log after a crawl has "finished":

{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 345073,
 'downloader/request_count': 1481,
 'downloader/request_method_count/GET': 1481,
 'downloader/response_bytes': 1977255,
 'downloader/response_count': 1479,
 'downloader/response_status_count/200': 46,
 'downloader/response_status_count/301': 791,
 'downloader/response_status_count/302': 512,
 'downloader/response_status_count/303': 104,
 'downloader/response_status_count/308': 2,
 'downloader/response_status_count/403': 2,
 'downloader/response_status_count/404': 22,
 'dupefilter/filtered': 64,
 'elapsed_time_seconds': 113.895788,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 8, 3, 11, 46, 31, 889491),
 'httpcompression/response_bytes': 136378,
 'httpcompression/response_count': 46,
 'log_count/ERROR': 3,
 'log_count/INFO': 11,
 'log_count/WARNING': 7,
 'response_received_count': 43,
 "robotstxt/exception_count/<class 'twisted.internet.error.DNSLookupError'>": 1,
 "robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
 'robotstxt/request_count': 105,
 'robotstxt/response_count': 43,
 'robotstxt/response_status_count/200': 21,
 'robotstxt/response_status_count/403': 2,
 'robotstxt/response_status_count/404': 20,
 'scheduler/dequeued': 151,
 'scheduler/dequeued/memory': 151,
 'scheduler/enqueued': 151,
 'scheduler/enqueued/memory': 151,
 'start_time': datetime.datetime(2023, 8, 3, 11, 44, 37, 993703)}
2023-08-03 11:46:31 [scrapy.core.engine] INFO: Spider closed (finished)


Strangely, this problem seems to occur on only one of the two machines I am trying to use for the crawl. When I run the crawl locally on my PC (Windows 11), it does not stop. However, when I run the code on our company server (a Microsoft Azure Windows 10 machine), the crawl stops prematurely, as described above.

Edit: the full log can be found here. In that case, the process stopped after only a few URLs.
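
One way to narrow this down could be to instrument start_requests and compare the yielded count against 'scheduler/enqueued' in the final stats. A minimal sketch, assuming the spider shown above (the extra counter and logging are added here purely for illustration):

    def start_requests(self):
        urls = self.handler.get_unprocessed_urls(self.suffix)
        self.logger.info('loaded %d unprocessed urls', len(urls))

        yielded = 0
        for url in urls:
            yield scrapy.Request(
                url=url.url,
                callback=self.parse,
                errback=self.errback,
                meta={'seed_url_id': url.id},
            )
            yielded += 1

        # If this line never appears, or the count is far below len(urls),
        # the generator died mid-iteration (e.g. a request could not be built).
        self.logger.info('start_requests yielded %d of %d requests', yielded, len(urls))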

Answer 1, by oo7oh9g9:

I finally found the problem. Scrapy requires all start URLs to have an HTTP scheme; for example, stackoverflow.com does not work, but https://stackoverflow.com does.
I was using the following code to check whether a URL contains a scheme:

if not url.startswith("http"):
    url = 'http://' + url

However, this validation is wrong. My data contains millions of URLs, and some of them are clearly degenerate or unconventional (http.gay apparently is a valid domain that redirects), for example:

httpsf52u5bids65u.xyz
httppollenmap.com
http.gay


These URLs pass my scheme check even though they do not actually contain a scheme, and they were breaking my crawl.
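
To make the failure mode concrete, here is a quick check (using the hosts above) showing why the naive test accepts them:

for url in ['httpsf52u5bids65u.xyz', 'httppollenmap.com', 'http.gay']:
    print(url, url.startswith('http'))   # prints True for all three, yet none has a scheme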
I changed the validation to this, and the problem went away:

if not (url.startswith("http://") or url.startswith('https://')):
    url = 'http://' + url
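
For anyone hitting the same issue, a slightly more robust variant could parse the URL instead of relying on string prefixes; this is an alternative sketch, not the code from the answer:

from urllib.parse import urlsplit


def ensure_scheme(url):
    # Prepend a default scheme only when the URL really has no http(s) scheme.
    if urlsplit(url).scheme not in ('http', 'https'):
        return 'http://' + url
    return url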
