I have a basic Scrapy project in which I hard-coded two variables, pProd and pReviews. I now want to either read these variables from a CSV file or pass them in when calling the spider. I have been trying for hours, but passing them with the -a attribute when invoking the spider seems to get me nowhere. For example:
scrapy crawl myspider -a Prod="P123" -a Revs="200" -o test.csv
Below is my code with the hard-coded variables:
import scrapy
from scrapy import Spider, Request
import re
import json

class myspider(Spider):
    name = 'myspider'
    allowed_domains = ['mydom.com']
    start_urls = ['https://api.mydom.com']

    def start_requests(self):
        urls = ["https://api.mydom.com"]
        pProd = "P123"
        pReviews = 200
        for url in urls:
            # Generate URL as API only brings back 100 at a time
            for i in range(0, pReviews, 100):
                links = 'https://api.mydom.com/data/reviews.json?Filter=ProductId%3A' + pProd + '&Offset=' + str(i) + '&passkey=123qwe'
                yield scrapy.Request(
                    url=str(links),
                    cb_kwargs={'ProductID': pProd},
                    callback=self.parse_reviews,
                )

    def parse_reviews(self, response, ProductID):
        data = json.loads(response.text)
        proddata = data['Includes']
        reviews = data['Results']
        p_prodid = ProductID
        try:
            p_prodcat = proddata['Products'][ProductID]['CategoryId']
        except:
            p_prodcat = None
        for review in reviews:
            try:
                r_reviewdate = review['SubmissionTime']
            except:
                r_reviewdate = None
            yield {
                'prodid': p_prodid,
                'prodcat': p_prodcat,
                'reviewdate': r_reviewdate,
            }
I have tried several different approaches, including adding the variable names to the def start_requests signature, like:
def start_requests(self, pProd='', pReviews='',**kwargs):
but I don't seem to be getting anywhere. I'd appreciate some guidance on where I'm going wrong.
1 Answer
You don't need to declare a constructor (__init__) in your Scrapy spider for this; just pass the arguments on the command line as you already were:
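For example, the same invocation you already tried works unchanged:

scrapy crawl myspider -a Prod="P123" -a Revs="200" -o test.csv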
In your spider code you can then use them as spider arguments; Scrapy sets each -a value as an attribute on the spider instance:
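Here is a minimal sketch of the spider using your argument names Prod and Revs (the getattr fallback values are just assumed defaults, not part of your original code):

import scrapy
from scrapy import Spider
import json

class myspider(Spider):
    name = 'myspider'
    allowed_domains = ['mydom.com']

    def start_requests(self):
        # Arguments passed with -a are set on the spider as string attributes
        # by Scrapy's default __init__, so no constructor is required.
        # The fallback values here are assumed defaults for illustration.
        pProd = getattr(self, 'Prod', 'P123')
        pReviews = int(getattr(self, 'Revs', 200))  # -a values arrive as strings, so cast for range()

        # Generate URLs as the API only brings back 100 reviews at a time
        for i in range(0, pReviews, 100):
            link = 'https://api.mydom.com/data/reviews.json?Filter=ProductId%3A' + pProd + '&Offset=' + str(i) + '&passkey=123qwe'
            yield scrapy.Request(
                url=link,
                cb_kwargs={'ProductID': pProd},
                callback=self.parse_reviews,
            )

Note that everything passed with -a comes in as a string, which is why pReviews is cast to int before it is used in range(). Your parse_reviews callback can stay exactly as it is.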