I have just started learning Scrapy and I have a question. For my spider, I need to take the list of URLs (start_urls) from a Google Sheets table, and I have this code:
import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Authorize against the Google Sheets and Drive APIs with a service account
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
creds = ServiceAccountCredentials.from_json_keyfile_name('token.json', scope)
client = gspread.authorize(creds)

# Open the spreadsheet and read the second column of the first worksheet
sheet = client.open('Sheet_1')
sheet_instance = sheet.get_worksheet(0)
records_data = sheet_instance.col_values(col=2)

for link in records_data:
    print(link)
........
How do I configure a middleware so that when the spider is started (scrapy crawl my_spider), the links from this code are automatically substituted into start_urls? Maybe I need to create a class in middlewares.py? Please help with an example. The rule must apply to all new spiders, and generating the list from a file inside start_requests (e.g. start_urls = [l.strip() for l in open('urls.txt').readlines()]) is not convenient...
1 Answer
Please read the Scrapy documentation on spider middleware, in particular process_start_requests.
spider.py:
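A minimal sketch of the spider (the name my_spider matches the question; start_urls is deliberately left empty because the middleware will supply the requests):

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    # Left empty on purpose: the middleware injects the start requests
    start_urls = []

    def parse(self, response):
        # Print the crawled URL so the middleware's effect is visible
        print(response.url)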
middlewares.py:
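A sketch of a spider middleware; the class name StartUrlsMiddleware is a placeholder, and it is assumed to be enabled in settings.py with SPIDER_MIDDLEWARES = {'myproject.middlewares.StartUrlsMiddleware': 100}. It reads urls.txt here for brevity; the gspread snippet from the question could be dropped into __init__ instead:

import scrapy

class StartUrlsMiddleware:
    def __init__(self):
        # Load the URL list once at startup; swap this file read for the
        # gspread code from the question to pull URLs from Google Sheets.
        with open('urls.txt') as f:
            self.urls = [line.strip() for line in f if line.strip()]

    def process_start_requests(self, start_requests, spider):
        # Discard whatever the spider generated and yield our own requests,
        # so the same rule applies to every spider in the project.
        for url in self.urls:
            yield scrapy.Request(url, callback=spider.parse)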
urls.txt:
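For example (placeholder URLs):

https://example.com/page1
https://example.com/page2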
Output:
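Assuming the urls.txt above, scrapy crawl my_spider would print the crawled URLs, interleaved with Scrapy's usual log output (order may vary, since requests are scheduled asynchronously):

https://example.com/page1
https://example.com/page2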