I am working on an application that lets a user enter a set of keywords and get web search results for them; the keywords are sent to Ask. For this I built an API with Flask and Scrapy, taking inspiration from an article I followed for the API. However, the API does not work, because I cannot pass the keyword data from my API to my scraper. Below is my Flask API file:
import crochet
crochet.setup()

from flask import Flask, render_template, jsonify, request, redirect, url_for
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher
import time
import os

# Importing our scraping spider from the askScraping file
from scrap.askScraping import AskScrapingSpider

# Creating the Flask app variable
app = Flask(__name__)

output_data = []
crawl_runner = CrawlerRunner()

# By default Flask will come into this when we run the file
@app.route('/')
def index():
    return render_template("index.html")  # Returns the index.html file in the templates folder.

# After clicking the Submit button Flask will come into this
@app.route('/', methods=['POST'])
def submit():
    if request.method == 'POST':
        s = request.form['url']  # Getting the input keywords
        global baseURL
        baseURL = s
        # Remove any existing file with the same name so that scrapy will not append the data to a previous file.
        if os.path.exists("<path_to_outputfile.json>"):
            os.remove("<path_to_outputfile.json>")
        return redirect(url_for('scrape'))  # Passing to the scrape function

@app.route("/scrape")
def scrape():
    scrape_with_crochet(baseURL="https://www.ask.com/web?q={baseURL}")  # Passing that URL to our scraping function
    time.sleep(20)  # Pause the function while the scrapy spider is running
    return jsonify(output_data)  # Returns the scraped data after running for 20 seconds.

@crochet.run_in_reactor
def scrape_with_crochet(baseURL):
    # This connects to the dispatcher, which loops the code between these two functions.
    dispatcher.connect(_crawler_result, signal=signals.item_scraped)
    # This runs the AskScrapingSpider from our scrapy file and, after each yield, passes the item to _crawler_result.
    eventual = crawl_runner.crawl(AskScrapingSpider, category=baseURL)
    return eventual

# This appends the data to the output data list.
def _crawler_result(item, response, spider):
    output_data.append(dict(item))

if __name__ == "__main__":
    app.run(debug=True)
And here is one of my scrapers:
import scrapy
import datetime


class AskScrapingSpider(scrapy.Spider):

    name = 'ask_scraping'

    def start_requests(self):
        myBaseUrl = ''
        start_urls = []

        def __init__(self, category='', **kwargs):  # The category variable will have the input URL.
            self.myBaseUrl = category
            self.start_urls.append(self.myBaseUrl)
            super().__init__(**kwargs)

        custom_settings = {'FEED_URI': 'scrap/outputfile.json', 'CLOSESPIDER_TIMEOUT': 15}  # This will tell scrapy to store the scraped data to outputfile.json and for how long the spider should run.
        yield scrapy.Request(start_urls, callback=self.parse, meta={'pos': 0})

    def parse(self, response):
        print('url:', response.url)
        start_pos = response.meta['pos']
        print('start pos:', start_pos)
        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        items = response.css('div.PartialSearchResults-item')
        for pos, result in enumerate(items, start_pos + 1):
            yield {
                'title': result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(),
                'snippet': result.css('p.PartialSearchResults-item-abstract::text').get().strip(),
                'link': result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'),
                'position': pos,
                'date': dt,
            }
        # --- after loop ---
        next_page = response.css('.PartialWebPagination-next a')
        if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to add `https://www.ask.com/` to the URL and create an absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos + 1})
When I run it there are absolutely no errors. After reading a user's answer to my question, I changed my scraper's code as follows, but without success: after passing the data to the scraper, the URL localhost:5000/scrape only shows empty brackets [] in my browser, whereas the brackets should normally contain the data returned by my scraper:
import scrapy
import datetime


class AskScrapingSpider(scrapy.Spider):

    name = 'ask_scraping'

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})

    custom_settings = {'FEED_URI': 'scrap/outputfile.json', 'CLOSESPIDER_TIMEOUT': 15}

    def __init__(self, category='', **kwargs):
        self.myBaseUrl = category
        self.start_urls.append(self.myBaseUrl)
        super().__init__(**kwargs)

    def parse(self, response):
        print('url:', response.url)
        start_pos = response.meta['pos']
        print('start pos:', start_pos)
        dt = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        items = response.css('div.PartialSearchResults-item')
        for pos, result in enumerate(items, start_pos + 1):
            yield {
                'title': result.css('a.PartialSearchResults-item-title-link.result-link::text').get().strip(),
                'snippet': result.css('p.PartialSearchResults-item-abstract::text').get().strip(),
                'link': result.css('a.PartialSearchResults-item-title-link.result-link').attrib.get('href'),
                'position': pos,
                'date': dt,
            }
        # --- after loop ---
        next_page = response.css('.PartialWebPagination-next a')
        if next_page:
            url = next_page.attrib.get('href')
            print('next_page:', url)  # relative URL
            # use `follow()` to add `https://www.ask.com/` to the URL and create an absolute URL
            yield response.follow(url, callback=self.parse, meta={'pos': pos + 1})
In my main.py file I also replaced

crawl_runner = CrawlerRunner()

with

project_settings = get_project_settings()
crawl_runner = CrawlerProcess(settings=project_settings)

and added the following imports:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

But when I reload the Flask server, I get the following error:
2022-06-21 11:44:55 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-06-21 11:44:57 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.10.4 (tags/v3.10.4:9d38120, Mar 23 2022, 23:13:41) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.3 3 May 2022), cryptography 37.0.2, Platform Windows-10-10.0.19044-SP0
2022-06-21 11:44:57 [werkzeug] WARNING: * Debugger is active!
2022-06-21 11:44:57 [werkzeug] INFO: * Debugger PIN: 107-226-838
2022-06-21 11:44:57 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_TIMEOUT': 15}
2022-06-21 11:44:57 [werkzeug] INFO: 127.0.0.1 - - [21/Jun/2022 11:44:57] "GET / HTTP/1.1" 200 -
2022-06-21 11:44:58 [twisted] CRITICAL: Unhandled error in EventualResult
Traceback (most recent call last):
File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1315, in run
self.mainLoop()
File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 1325, in mainLoop
reactorBaseSelf.runUntilCurrent()
File "C:\Python310\lib\site-packages\twisted\internet\base.py", line 964, in runUntilCurrent
f(*a,**kw)
File "C:\Python310\lib\site-packages\crochet\_eventloop.py", line 420, in runs_in_reactor
d = maybeDeferred(wrapped, *args,**kwargs)
--- <exception caught here> ---
File "C:\Python310\lib\site-packages\twisted\internet\defer.py", line 190, in maybeDeferred
result = f(*args,**kwargs)
File "C:\Users\user\Documents\AAprojects\Whelpsgroups1\API\main.py", line 62, in scrape_with_crochet
eventual = crawl_runner.crawl(AskScrapingSpider, category = baseURL)
File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 205, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 238, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 313, in _create_crawler
return Crawler(spidercls, self.settings, init_reactor=True)
File "C:\Python310\lib\site-packages\scrapy\crawler.py", line 82, in __init__
default.install()
File "C:\Python310\lib\site-packages\twisted\internet\selectreactor.py", line 194, in install
installReactor(reactor)
File "C:\Python310\lib\site-packages\twisted\internet\main.py", line 32, in installReactor
raise error.ReactorAlreadyInstalledError("reactor already installed")
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed
2022-06-21 11:44:58 [twisted] CRITICAL: Unhandled error in EventualResult
(same traceback as above)
2022-06-21 11:45:54 [werkzeug] INFO: 127.0.0.1 - - [21/Jun/2022 11:45:54] "POST / HTTP/1.1" 302 -
2022-06-21 11:45:54 [scrapy.crawler] INFO: Overridden settings:
{'CLOSESPIDER_TIMEOUT': 15}
2022-06-21 11:45:54 [twisted] CRITICAL: Unhandled error in EventualResult
(same traceback as above)
2022-06-21 11:45:54 [twisted] CRITICAL: Unhandled error in EventualResult
(same traceback as above)
I also looked at this StackOverflow question, but without success.
1 Answer
You should not yield a scrapy.Request in the __init__ method.
Delete this line:
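Presumably the line in question is the request yielded with the raw start_urls list inside start_requests in the spider above:

        # the yield inside start_requests that passes a list instead of a single URL
        yield scrapy.Request(start_urls, callback=self.parse, meta={'pos': 0})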
and change your __init__ method to:
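A plausible sketch of that change (the exact snippet is assumed): define the attributes and __init__ on the class body of AskScrapingSpider rather than inside start_requests:

    # inside AskScrapingSpider, at class level
    myBaseUrl = ''
    start_urls = []

    def __init__(self, category='', **kwargs):  # category carries the input URL
        self.myBaseUrl = category
        self.start_urls.append(self.myBaseUrl)
        super().__init__(**kwargs)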
That might work.
**Update:**
If you want to pass arguments with your request, then after changing those lines you can override the start_requests() method, as follows:
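Judging by the updated spider quoted in the question, the suggested start_requests() was along these lines:

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, meta={'pos': 0})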
**Update 2:** If your Scrapy spider runs in the background of your Flask app, try this: write these lines:
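(these are the two lines the question already quotes for main.py:)

# replace the CrawlerRunner with a CrawlerProcess that loads the project settings
project_settings = get_project_settings()
crawl_runner = CrawlerProcess(settings=project_settings)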
instead of:
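(the original runner from the question's main.py:)

crawl_runner = CrawlerRunner()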
Of course, you should import CrawlerProcess and get_project_settings, like this:
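(the same imports the question shows:)

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings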
**Update 3:** I have written a similar project and it works fine; you can check this repo.