Deploying a Scrapy spider to Heroku with the Google Sheets API

Asked by zed5wv10 on 2022-11-09

I have a working Scrapy spider that scrapes data into Google Sheets through a Google Sheets pipeline. The script runs fine locally without any issues. However, I can't seem to get the code deployed on Heroku. After some googling I tried a scrapyd-based solution, but I'm still at a loss: I don't know what's wrong with my deployment on Heroku.
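For reference, a pipeline like the one described might look roughly like this. This is a minimal sketch using gspread (which is pinned in the requirements below); the class name, credentials path, sheet name, and field names are all hypothetical, not taken from the actual project:

```python
# Hypothetical sketch of a Google Sheets pipeline like the one described.
# GoogleSheetsPipeline, CREDS_FILE, SHEET_NAME and the field names are
# illustrative assumptions, not the project's actual code.

class GoogleSheetsPipeline:
    CREDS_FILE = "service_account.json"  # hypothetical service-account key path
    SHEET_NAME = "quotes"                # hypothetical spreadsheet name

    def open_spider(self, spider):
        # Imported lazily so the class can be loaded without gspread installed.
        import gspread
        client = gspread.service_account(filename=self.CREDS_FILE)
        self.worksheet = client.open(self.SHEET_NAME).sheet1

    @staticmethod
    def item_to_row(item):
        # Flatten an item dict into a row with a stable column order;
        # missing fields become empty cells.
        return [str(item.get(key, "")) for key in ("text", "author", "tags")]

    def process_item(self, item, spider):
        self.worksheet.append_row(self.item_to_row(dict(item)))
        return item
```

The pipeline would be enabled via ITEM_PIPELINES in settings.py as usual.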
Here is what requirements.txt looks like after running pip freeze > requirements.txt in the project root:

async-generator==1.10
attrs==22.1.0
Automat==20.2.0
beautifulsoup4==4.11.1
cachetools==5.2.0
certifi==2022.6.15
cffi==1.15.1
charset-normalizer==2.1.1
chromedriver-binary-auto==0.2.0
click==8.1.3
colorama==0.4.5
constantly==15.1.0
cryptography==38.0.1
cssselect==1.1.0
docker==6.0.0
et-xmlfile==1.1.0
filelock==3.8.0
google-api-core==2.10.2
google-api-python-client==2.64.0
google-auth==2.13.0
google-auth-httplib2==0.1.0
googleapis-common-protos==1.56.4
gspread==3.6.0
h11==0.13.0
herokuify-scrapyd==1.0
httplib2==0.20.4
hyperlink==21.0.0
idna==3.3
incremental==22.10.0
itemadapter==0.7.0
itemloaders==1.0.6
jmespath==1.0.1
lxml==4.9.1
numpy==1.23.2
openpyxl==3.0.10
outcome==1.2.0
packaging==21.3
pandas==1.4.4
parsel==1.6.0
Protego==0.2.1
protobuf==4.21.7
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
PyDispatcher==2.0.6
pyOpenSSL==22.1.0
pyparsing==3.0.9
PySocks==1.7.1
python-dateutil==2.8.2
python-dotenv==0.20.0
pytz==2022.2.1
PyYAML==6.0
queuelib==1.6.2
requests==2.28.1
requests-file==1.5.1
retrying==1.3.3
rsa==4.9
scrapinghub==2.4.0
Scrapy==2.7.0
scrapyd==1.3.0
scrapyd-client==1.2.2
selenium==4.4.3
service-identity==21.1.0
shub==2.14.2
six==1.16.0
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.3.2.post1
tldextract==3.4.0
toml==0.10.2
tqdm==4.55.1
trio==0.21.0
trio-websocket==0.9.2
Twisted==22.8.0
typing_extensions==4.4.0
uberegg==0.1.1
uritemplate==4.1.1
urllib3==1.26.12
w3lib==2.0.1
webdriver-manager==3.8.3
websocket-client==1.4.1
wsproto==1.2.0
zope.interface==5.5.0

My Procfile contains web: scrapyd
My scrapy.cfg looks like this:

[settings]
default = quotes.settings

[scrapyd]
application = herokuify_scrapyd.app.application

[deploy]
url = https://scrapy-test555.herokuapp.com/
project = quotes
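With that unnamed [deploy] target, scrapyd-client (pinned in requirements.txt) should be able to push an egg of the project to the scrapyd instance. A sketch of the command, run from the project root; this is a CLI fragment, not something from the original post:

```shell
# Deploy the "quotes" project to the default [deploy] target in scrapy.cfg.
# Under the hood this packages the project as an egg and POSTs it to the
# scrapyd addversion.json endpoint on the Heroku app.
scrapyd-deploy
```

Until this step succeeds, scrapyd has no project to run, which is one common reason a freshly deployed scrapyd instance sits idle.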

After running heroku logs --tail, all I see in the terminal is:

2022-10-19T19:50:07.862075+00:00 heroku[web.1]: Unidling
2022-10-19T19:50:07.863982+00:00 heroku[web.1]: State changed from down to starting
2022-10-19T19:50:15.760861+00:00 heroku[web.1]: Starting process with command `scrapyd`
2022-10-19T19:50:20.356776+00:00 app[web.1]: 2022-10-19T19:50:19+0000 [-] Loading /app/.heroku/python/lib/python3.10/site-packages/scrapyd/txapp.py...
2022-10-19T19:50:20.356920+00:00 app[web.1]: 2022-10-19T19:50:20+0000 [-] Scrapyd web console available at http://0.0.0.0:9555/
2022-10-19T19:50:20.356980+00:00 app[web.1]: 2022-10-19T19:50:20+0000 [-] Loaded.
2022-10-19T19:50:20.357186+00:00 app[web.1]: 2022-10-19T19:50:20+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 22.8.0 (/app/.heroku/python/bin/python 3.10.8) starting up.
2022-10-19T19:50:20.363591+00:00 app[web.1]: 2022-10-19T19:50:20+0000 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2022-10-19T19:50:20.364115+00:00 app[web.1]: 2022-10-19T19:50:20+0000 [-] Site starting on 9555
2022-10-19T19:50:20.364222+00:00 app[web.1]: 2022-10-19T19:50:20+0000 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site object at 0x7f21484c6ec0>
2022-10-19T19:50:20.366450+00:00 app[web.1]: 2022-10-19T19:50:20+0000 [Launcher] Scrapyd 1.3.0 started: max_proc=32, runner='scrapyd.runner'
2022-10-19T19:50:20.477015+00:00 heroku[web.1]: State changed from starting to up
2022-10-19T19:50:21.940532+00:00 app[web.1]: 2022-10-19T19:50:21+0000 [twisted.python.log#info] "10.1.22.15" - - [19/Oct/2022:19:50:21 +0000] "GET /logs/ HTTP/1.1" 404 145 "https://scrapy-test555.herokuapp.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
2022-10-19T19:50:21.942392+00:00 heroku[router]: at=info method=GET path="/logs/" host=scrapy-test555.herokuapp.com request_id=b7e2adb9-46d4-4c83-8786-3becd47266ac fwd="102.176.65.108" dyno=web.1 connect=0ms service=7ms status=404 bytes=315 protocol=https
2022-10-19T19:50:23.231947+00:00 app[web.1]: 2022-10-19T19:50:23+0000 [twisted.python.log#info] "10.1.22.15" - - [19/Oct/2022:19:50:22 +0000] "GET /favicon.ico HTTP/1.1" 404 153 "https://scrapy-test555.herokuapp.com/logs/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
2022-10-19T19:50:23.233637+00:00 heroku[router]: at=info method=GET path="/favicon.ico" host=scrapy-test555.herokuapp.com request_id=5f6a07c2-2ca5-4c3b-94f9-46a0a3dae313 fwd="102.176.65.108" dyno=web.1 connect=0ms service=14ms status=404 bytes=323 protocol=https

Nothing else happens, and no output reaches Google Sheets. Please help. This is my first attempt at deploying my code to the cloud, so I'm a bit lost.
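For what it's worth, the logs above suggest scrapyd itself started fine. scrapyd only hosts spiders and waits for jobs, so a deployed project does nothing until a run is scheduled through its schedule.json endpoint. A minimal sketch using only the standard library; the project and spider names ("quotes", "quote") are assumptions based on scrapy.cfg and the answer's crawl command:

```python
# Sketch: scheduling a crawl on the scrapyd instance from the logs above.
# scrapyd's schedule.json endpoint takes the project and spider names as
# form fields; until something calls it, the daemon just idles.
from urllib.parse import urlencode
from urllib.request import Request

def build_schedule_request(base_url, project, spider):
    """Build the POST request scrapyd expects on schedule.json."""
    data = urlencode({"project": project, "spider": spider}).encode()
    return Request(base_url.rstrip("/") + "/schedule.json", data=data)

req = build_schedule_request("https://scrapy-test555.herokuapp.com/",
                             "quotes", "quote")
# To actually fire it (requires the dyno to be up and the project deployed):
#   from urllib.request import urlopen
#   print(urlopen(req).read())
```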


Answer 1 (vbkedwbf):

I finally found a solution. All I needed was heroku run scrapy crawl quote, and that fixed it.
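For context, heroku run starts the command in a one-off dyno, so this runs the crawl once and exits. A hedged sketch of the commands involved; the Scheduler add-on step is an assumption about how one might repeat the crawl, not something the answer mentions:

```shell
# One-off crawl in a temporary dyno (the fix from the answer):
heroku run scrapy crawl quote

# Assumption: to repeat the crawl periodically without keeping a web dyno
# busy, the same command can be registered with the Heroku Scheduler add-on:
heroku addons:create scheduler:standard
# then add "scrapy crawl quote" as the job command in the Scheduler dashboard
```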
