This question already has an answer here:
Scrapy-Splash Session Handling (1 answer)
Closed 3 months ago.
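I really don't understand why a basic request made after login logs me out when using Scrapy. I have raised several issues on various Scrapy forums (issue links, Reddit, GitHub, Stack Overflow), but none of them produced an answer. I can do this easily with Selenium without any problem, but replicating the same thing with Scrapy seems to be a problem, and I have tried more than 50 different SO solutions. I just need a reason why I get logged out as soon as I spawn another request after login.
Below are the basic Selenium and Scrapy scripts, with dummy account details for the login.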
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# define our URL
url = 'https://www.oddsportal.com/login/'
username = 'chuky'
password = 'A151515a'
path = r'C:\Users\Glodaris\OneDrive\Desktop\Repo\Scraper\chromedriver.exe'
webdriver_service = Service(path)
options = ChromeOptions()
# options=options
browser = Chrome(service=webdriver_service, options=options)
browser.get(url)
browser.implicitly_wait(2)
browser.find_element(By.ID, 'onetrust-accept-btn-handler').click()
browser.find_element(By.ID,'login-username1').send_keys(username)
browser.find_element(By.ID,'login-password1').send_keys(password)
browser.implicitly_wait(10)
browser.find_element(By.XPATH, '//*[@id="col-content"]//button[@class="inline-btn-2"]').click()
print('successful login')
browser.implicitly_wait(10)
browser.get('https://www.oddsportal.com/results/')
Scrapy
import logging

import scrapy
from scrapy import FormRequest
from scrapy.spiders import CrawlSpider
from scrapy.utils.response import open_in_browser

logger = logging.getLogger(__name__)


class OddsportalSpider(CrawlSpider):
    name = 'oddsportal'
    allowed_domains = ['oddsportal.com']
    # start_urls = ['http://oddsportal.com/results/']
    login_page = 'https://www.oddsportal.com/login/'

    def start_requests(self):
        """Called before crawling starts. Try to log in."""
        yield scrapy.Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True
        )

    # Parse the login page response
    def login(self, response):
        """Generate a login request."""
        yield FormRequest.from_response(
            response=response,
            formdata={'login-username': 'chuky',
                      'login-password': 'A151515a',
                      'login-submit': '',
                      },
            callback=self.after_login,
            dont_filter=True
        )

    # Simply check if the login was successful, then spawn another request to /results/
    def after_login(self, response):
        if b"Wrong username or password" in response.body:
            logger.warning("LOGIN ATTEMPT FAILED")
            return
        else:
            logger.info("LOGIN ATTEMPT SUCCESSFUL")
            url = 'https://www.oddsportal.com/results/'
            return scrapy.Request(url=url, callback=self.parse_item, dont_filter=True)

    def parse_item(self, response):
        print('Thissssssssss----------------------', response.url)
        open_in_browser(response)
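I get logged out as soon as I spawn a request to /results/ after a successful login. Supposedly Scrapy handles cookies by default. I have also tried sending the cookies and headers along with every request, but that did not work. Please, I need someone to try this from their end and tell me the reason for it, because my response shows that I am logged in, but sending a request after that logs me out.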
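The cookie claim above can be checked directly rather than assumed. As a minimal sketch, assuming a standard Scrapy project with a settings.py: `COOKIES_ENABLED` is on by default, and `COOKIES_DEBUG` makes the cookies middleware log the cookies sent in requests and received in responses, so the log should show whether a session cookie set at login (an assumption about how the site tracks sessions) is sent with the /results/ request.

```python
# settings.py fragment (sketch): make Scrapy's cookie handling visible in the log.

# Default value in Scrapy; listed here only to make the assumption explicit.
COOKIES_ENABLED = True

# Log the Cookie headers sent with requests and the Set-Cookie headers
# received in responses.
COOKIES_DEBUG = True
```

With this in place, the cookie lines logged around the login POST and the following /results/ request can be compared.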
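Steps to reproduce the Scrapy response:
1. Start a Scrapy project (scrapy startproject)
2. Site: oddsportal.com
3. Set the user agent to the default Scrapy user agent: USER_AGENT = 'oddsportal_website (+http://www.yourdomain.com)'
4. Run the spider: scrapy crawl oddsportal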
Log
{'BOT_NAME': 'oddsportal_website',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'oddsportal_website.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['oddsportal_website.spiders']}
2022-08-15 09:47:48 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-08-15 09:47:48 [scrapy.extensions.telnet] INFO: Telnet Password: 66aa39ca3b133f3d
2022-08-15 09:47:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-08-15 09:47:48 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'oddsportal_website.middlewares.UserAgentRotatorMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-15 09:47:48 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'log_count/DEBUG': 9,
'log_count/INFO': 11,
'request_depth_max': 2,
'response_received_count': 4,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2022, 8, 15, 8, 47, 48, 449490)}
1 Answer
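You are logged in! The username is simply not part of the response: it is loaded through an API call or with JavaScript and cookies (view the page source of the results page and search for Chuky, and you will not find it). Since Scrapy only loads the response from the URL you request (no JS or other API calls), the username does not show up. A good way to confirm that you are logged in is to go to
https://www.oddsportal.com/settings/
which has the username in the HTML.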
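The check described above can be sketched as a small helper. This is only an illustration: `username_in_html` is a hypothetical name, and the two page bodies are made-up stand-ins for the real /settings/ and /results/ responses, not actual markup from the site.

```python
def username_in_html(body: bytes, username: str, encoding: str = "utf-8") -> bool:
    """Return True if `username` appears in the raw HTML, ignoring case."""
    text = body.decode(encoding, errors="replace")
    return username.casefold() in text.casefold()


# Hypothetical page bodies standing in for real responses.
settings_page = b"<html><body><input name='username' value='chuky'></body></html>"
results_page = b"<html><body><div id='user-header'></div><script src='app.js'></script></body></html>"

print(username_in_html(settings_page, "Chuky"))  # True: name is in the static HTML
print(username_in_html(results_page, "Chuky"))   # False: name would be filled in by JS
```

In the spider, `after_login` could request https://www.oddsportal.com/settings/ instead of /results/ and call `username_in_html(response.body, 'chuky')` in the callback to confirm the session survived.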