Scrapy: ProcessPoolExecutor fails to pass start_requests from main.py to another spider

disho6za · posted 2022-11-09 in Other

I have a csv file containing about 900 TikTok user urls and would like to scrape their info, which I have achieved, but since Scrapy is single threaded I'm trying to divide the work into at least 20 concurrent processes using Scrapy and concurrent.futures, each process with different parameters (process no. 1 scrapes users at indices 1-20 of the csv file, process no. 2 scrapes users 20-40, etc.).
Here GetTikTokFrontPageHTMLSpider would be called from crawl(). This only works with the traditional call of crawl() as a normal function followed by reactor.run(), but not with ProcessPoolExecutor(). When I run with method 1 (please see def run_concurrency() in the code) I get the output:

Finished in 0.0 second(s)
[[0, 2], [2, 4], [4, 6], [6, 8], [8, 10], [10, 12], [12, 14], [14, 16], [16, 18], [18, 20], [20, 20]]
Finished in 0.0 second(s)
Finished in 0.0 second(s)

And when running with method 2 I get:

Finished in 0.0 second(s)
[[0, 2], [2, 4], [4, 6], [6, 8], [8, 10], [10, 12], [12, 14], [14, 16], [16, 18], [18, 20], [20, 20]]
[0, 2]
[2, 4]
[4, 6]
[6, 8]
[8, 10]
[10, 12]
[12, 14]
[14, 16]
[16, 18]
[18, 20]
[20, 20]
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)
Finished in 0.0 second(s)

The process seems to go through every executor.submit call (I'm not sure about the map method), but it doesn't seem to be working as intended, as the spider GetTikTokFrontPageHTMLSpider never gets executed even once. The full versions of def run_concurrency:
def run_concurrency (Method 1)

def run_concurrency():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        user_idx = get_user_idx(2)
        print(user_idx)
        executor.map(crawl, user_idx) #Method 1 (With map)

def run_concurrency (Method 2)

def run_concurrency():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        user_idx = get_user_idx(2)
        print(user_idx)
        for idx in user_idx: #Method 2 (With submit)
           print(idx)
           executor.submit(crawl, idx)

And here's the full code (I only include the part of GetTikTokFrontPageHTMLSpider needed for this particular question):
main.py

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
import concurrent.futures
import math
import time
from TikTokUser import get_users_count

from ScrapeTikTok.spiders.GetTikTokFrontPageHTMLSpider import GettiktokfrontpageHTMLSpider

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl(user_idx):

    yield runner.crawl(GettiktokfrontpageHTMLSpider, user_start=user_idx[0], user_end=user_idx[1])
    reactor.stop()

def get_user_idx(batch_count):
    user_idx = []
    users_count = 20  # get_users_count() #Total number of users to be scraped
    whole_batch_count = math.floor(users_count / batch_count)
    for i in range(0, batch_count * whole_batch_count, batch_count):
        user_idx.append([i, i + batch_count])
    user_idx.append([batch_count * whole_batch_count, users_count])  # last batch
    return user_idx

def run_concurrency():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        user_idx = get_user_idx(2)
        print(user_idx)
        executor.map(crawl, user_idx[0], user_idx[1]) #Method 1 (With map)
        # for idx in user_idx: #Method 2 (With submit)
        #    print(idx)
        #    executor.submit(crawl, idx)

# crawl(0,1) #If these two are un-commented it goes through to start_requests in GetTikTokFrontPageHTMLSpider.py

# reactor.run()

if __name__ == '__main__':
    run_concurrency()
reactor.run()

GetTikTokFrontPageHTMLSpider.py

import scrapy
import requests

from TikTokUser import get_user_urls

class GettiktokfrontpageHTMLSpider(scrapy.Spider):
    name = 'GetTikTokFrontPageHTMLSpider'
    allowed_domains = ['smartproxy.com']

    def __init__(self, user_start=None, user_end=None):
        self.user_start = user_start
        self.user_end = user_end

    def start_requests(self):
        print("START REQUEST")
        user_urls = get_user_urls()
        if self.user_end > len(user_urls):
            self.user_end = len(user_urls)
        for user_url in user_urls[self.user_start:self.user_end]:          
            yield self.parse(user_url)

    def parse(self, user_url):
        ...

How do I write main.py so that crawl() is called to create multiple processes (in this test case batch_count=2, i.e. 2 users processed per process, so 11 processes for 20 users), with different parameters passed to crawl() and hence to the spider for each process?


enxuqcxy1#

I would try reading in the urls prior to starting the concurrent processes and feeding a list of urls to each process, so that they are not all trying to extract data from the same file. I would also use scrapy.crawler.CrawlerProcess for each process, and leave as much of the scrapy logic as possible inside the GetTikTokFrontPageHTMLSpider.py module.
For example:
GetTikTokFrontPageHTMLSpider.py

import scrapy
import requests
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
import math

class GettiktokfrontpageHTMLSpider(scrapy.Spider):
    name = 'GetTikTokFrontPageHTMLSpider'
    allowed_domains = ['smartproxy.com']
    start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url)

    def parse(self, user_url):
        ...

def crawl(urls):
    GettiktokfrontpageHTMLSpider.start_urls = urls # assign the urls to the spider
    configure_logging()
    settings = get_project_settings()
    process = CrawlerProcess(settings)           # use a crawler process
    process.crawl(GettiktokfrontpageHTMLSpider)
    process.start()                              # start the crawl
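
A small variant of the same idea, in case you'd rather not mutate the class attribute: CrawlerProcess.crawl() forwards extra keyword arguments to the spider's __init__, which copies them onto the instance, so the url batch can be passed per call. A minimal sketch of crawl() written that way:

def crawl(urls):
    configure_logging()
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    # keyword arguments given to crawl() are set as attributes on the spider
    # instance, so start_requests will iterate exactly this batch of urls
    process.crawl(GettiktokfrontpageHTMLSpider, start_urls=urls)
    process.start()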

Then in your main.py:

from concurrent.futures import ProcessPoolExecutor
from GetTikTokFrontPageHTMLSpider import crawl   # import the crawl function

def get_user_urls(csvfile):  # gather urls in bunches of 20 and yield them
    """ This is just an example of a function that gathers urls
        from a csv file
    """
    with open(csvfile) as csvfile:
        url_list = []
        for line in csvfile:
            url_list.append(line.strip())  # strip the trailing newline
            if len(url_list) == 20:
                yield url_list
                url_list = []
        if len(url_list) > 0:
            yield url_list

def run_concurrency():
    urls = get_user_urls('users.csv')  # get 20 urls at a time; 'users.csv' is a placeholder for your csv path
    with ProcessPoolExecutor() as executor:
        executor.map(crawl, urls)   # map the crawl method to each bunch of 20

if __name__ == '__main__':
    run_concurrency()
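
If you also want to cap how many crawls run at the same time (the question mentions about 20 processes), ProcessPoolExecutor accepts a max_workers argument, and iterating over the map() results will re-raise any exception that happened inside a worker instead of letting it fail silently. A sketch, where the csv path and worker count are only placeholder values:

def run_concurrency(csv_path='users.csv', workers=20):
    # 'users.csv' and 20 are placeholders; point them at your real file and limit
    urls = get_user_urls(csv_path)               # yields batches of 20 urls
    with ProcessPoolExecutor(max_workers=workers) as executor:
        for _ in executor.map(crawl, urls):      # iterating surfaces worker errors
            pass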

