HTTP status code not handled when using Scrapy and Selenium together

Asked by ntjbwcob on 2022-11-29

I am facing the error HTTP status code is not handled or not allowed. How can I resolve it? I am using Selenium and Scrapy together, and I have also set a user agent in the settings, but the HTTP error does not go away. Please suggest a solution. This is the page link: https://www.askgamblers.com/online-casinos/countries/uk

import scrapy
from scrapy.http import Request
from selenium import webdriver
import time
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

class TestSpider(scrapy.Spider):
    name = 'test'
    

    def start_requests(self):
            options = webdriver.ChromeOptions()
            options.add_argument("--no-sandbox")
            options.add_argument("--disable-gpu")
            options.add_argument("--window-size=1920x1080")
            options.add_argument("--disable-extensions")
            driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
            
            URL = 'https://www.askgamblers.com/online-casinos/countries/uk'
            driver.get(URL)
            
            time.sleep(3)
            page_links =driver.find_elements(By.XPATH, "//div[@class='card__desc']//a[starts-with(@href, '/online')]")
            for link in page_links:
                    href=link.get_attribute("href")
                    yield scrapy.Request(href)
            driver.quit()

    def parse(self, response):
            title = response.css("h1.ch-title::text").get()
            yield {
                    'title': title
            }
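
For reference, the user agent mentioned above is normally applied through the spider's custom_settings (or settings.py), and Scrapy can be told to pass otherwise-dropped status codes such as 403 on to the parse callback via handle_httpstatus_list, so the offending response can be inspected. The snippet below is only an illustrative sketch; the actual values used in the project are not shown in the question.

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'

    # illustrative values only; the question does not show the real ones
    custom_settings = {
        'USER_AGENT': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/107.0 Safari/537.36'),
    }

    # let 403 responses reach parse() instead of being dropped,
    # so the "HTTP status code is not handled" responses can be examined
    handle_httpstatus_list = [403]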

xoefb8l8 (Answer #1)

You are getting this kind of error because the website is behind Cloudflare protection.

https://www.askgamblers.com/online-casinos/countries/uk is using Cloudflare CDN/Proxy!

https://www.askgamblers.com/online-casinos/countries/uk is NOT using Cloudflare SSL
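
(The checks above can also be reproduced quickly in Python; the following is a minimal sketch using the requests library, where the Server and cf-ray response headers are standard Cloudflare markers:)

import requests

resp = requests.get(
    'https://www.askgamblers.com/online-casinos/countries/uk',
    headers={'User-Agent': 'Mozilla/5.0'},
)
print(resp.status_code)               # often 403/503 when the Cloudflare challenge blocks the request
print(resp.headers.get('Server'))     # 'cloudflare' when the site sits behind the CDN/proxy
print(resp.headers.get('cf-ray'))     # present on Cloudflare-served responses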

Scrapy with Selenium cannot get past the Cloudflare protection (I tested it); only the Selenium engine on its own can do the job. In the end, I integrated bs4 with Selenium to parse the content in a more robust way.

Script:

from selenium import webdriver
import time
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920x1080")
options.add_argument("--disable-extensions")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
                    
URL = 'https://www.askgamblers.com/online-casinos/countries/uk'
driver.get(URL)
time.sleep(2)
# collect the detail-page links from the listing page
urls = []
page_links = driver.find_elements(By.XPATH, "//div[@class='card__desc']//a[starts-with(@href, '/online')]")
for link in page_links:
    href = link.get_attribute("href")
    urls.append(href)
    #print(href)

# visit each casino page and grab the title
for url in urls:
    driver.get(url)
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, "lxml")
    try:
        title = soup.select_one("h1.ch-title").get_text(strip=True)
        print(title)
    except AttributeError:
        # select_one() returned None, i.e. the title element was not found
        print('empty')

Output:

Mr.Play Casino
Bet365 Casino
Slotnite Casino
Trada Casino
PlayFrank Casino
Karamba Casino
Hello! Casino
21 Prive Casino
Casilando Casino
AHTI Games Casino
BacanaPlay Casino
Spinland Casino
Fun Casino
Slot Planet Casino
21 Casino
Conquer Casino
CasinoCasino
Barbados Casino
King Casino
Slots Magic Casino
Spin Station Casino
HeySpin Casino
CasinoLuck
Casino RedKings
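
As a side note, if the fixed time.sleep() pauses turn out to be flaky, an explicit wait on the title element can be used instead. This is only a sketch; it reuses driver, By and BeautifulSoup from the script above and assumes the same h1.ch-title selector:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the casino title to appear before grabbing the page source
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "h1.ch-title"))
)
soup = BeautifulSoup(driver.page_source, "lxml")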
