403 response when using Scrapy in Python

Posted by 4xrmg8kj on 2022-11-09 in Python

I'm trying to learn Scrapy by crawling a website, but I get a 403 response when I crawl it.
Here is my spider:

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
from w3lib.html import remove_tags

def remove_currency(value):
    return value.replace('£','').strip()

class WhiskyscraperItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field(input_processor = MapCompose(remove_tags), output_processor = TakeFirst())
    price = scrapy.Field(input_processor = MapCompose(remove_tags, remove_currency), output_processor = TakeFirst())
    link = scrapy.Field()

class WhiskeySpider(scrapy.Spider):
    name = 'whisky'
    start_urls = ['https://www.whiskyshop.com/scotch-whisky?item_availability=In+Stock']

    def parse(self, response):
        for products in response.css('div.product-item-info'):
            l = ItemLoader(item = WhiskyscraperItem(), selector=products)

            l.add_css('name', 'a.product-item-link')
            l.add_css('price', 'span.price')
            l.add_css('link', 'a.product-item-link::attr(href)')

            yield l.load_item()

        # .attrib is an empty dict when there is no "next" link, so use .get() to avoid a KeyError
        next_page = response.css('a.action.next').attrib.get('href')
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

I don't know whether I'm doing something wrong. The code itself runs fine, but the request is rejected with a 403 response. What can I do?

Answer #1, by 8e2ybdfx

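@Barry the Platipus has already pointed out that the website is behind Cloudflare protection, so sending plain requests will not work here. As a rule of thumb, you can try cloudscraper or Selenium in such cases. I tried cloudscraper with Scrapy, Selenium with Scrapy, and scrapy-SeleniumRequest; none of them worked. scrapy-SeleniumRequest returned a 200 response status but empty output, producing only some Cloudflare challenge text. Only the plain Selenium engine combined with BeautifulSoup worked like a charm!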
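To check whether a 403 really comes from Cloudflare rather than from the origin server, one option is to look at the response headers. The following is only a heuristic sketch: the function name `looks_like_cloudflare_block` is hypothetical, and it assumes that Cloudflare-served responses carry a `Server: cloudflare` or `CF-RAY` header and that challenge pages use status 403 or 503.

```python
from typing import Mapping


def looks_like_cloudflare_block(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic check (an assumption, not an official API):
    a 403/503 response served by Cloudflare is probably a challenge or block page."""
    # Header names are case-insensitive, so normalise names and values first.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    served_by_cloudflare = (
        "cloudflare" in lowered.get("server", "") or "cf-ray" in lowered
    )
    return status in (403, 503) and served_by_cloudflare


if __name__ == "__main__":
    print(looks_like_cloudflare_block(403, {"Server": "cloudflare"}))
    print(looks_like_cloudflare_block(403, {"Server": "nginx"}))
```

In Scrapy, non-2xx responses are normally filtered out before they reach `parse`; setting `handle_httpstatus_list = [403]` on the spider lets the callback see them, so a check like this could be applied there (Scrapy header values are bytes and would need decoding to `str` first).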

Working code example:

import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

webdriver_service = Service("./chromedriver")  # path to your chromedriver binary
driver = webdriver.Chrome(service=webdriver_service)

data = []
for page in range(0, 8):
    driver.get(f'https://www.whiskyshop.com/scotch-whisky?item_availability=In+Stock&p={page}')
    driver.maximize_window()
    time.sleep(8)

    soup = BeautifulSoup(driver.page_source,"html.parser")
    # Each product card sits inside the products grid list
    for card in soup.select('div[class="products wrapper grid products-grid"] > ol > li > div.product-item-info'):
        title = card.h3.get_text(strip=True)
        price_tag = card.select_one('span.price')
        price = price_tag.get_text(strip=True) if price_tag else None
        link = card.a.get('href')

        data.append({
            'title':title,
            'price':price,
            'link':link
            })

driver.quit()  # close the browser once all pages have been scraped

df = pd.DataFrame(data)
print(df)
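The fixed `time.sleep(8)` in the loop waits the full eight seconds on every page, even when the listing is ready sooner, and may still be too short on a slow load. A possible alternative is to poll for a condition with a timeout. The sketch below is standard-library only; `wait_until` is a hypothetical helper, not part of Selenium (Selenium ships its own explicit-wait mechanism, `WebDriverWait`, which serves the same purpose).

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 15.0,
               interval: float = 0.5) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the driver from the example, the call could look like `wait_until(lambda: "product-item-info" in driver.page_source)` in place of `time.sleep(8)`; checking for that class name is an assumption based on the selector used in the answer.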

Output:

title  ...                                               link
0          Bunnahabhain 12 Year Old Cask Strength 2022  ...  https://www.whiskyshop.com/bunnahabhain-12-yea...      
1           Lagavulin 12 Year Old Special Release 2022  ...  https://www.whiskyshop.com/lagavulin-12-year-o...      
2              Johnnie Walker Ghost & Rare Port Dundas  ...  https://www.whiskyshop.com/johnnie-walker-ghos...      
3              Cardhu 16 Year Old Special Release 2022  ...  https://www.whiskyshop.com/cardhu-16-year-old-...      
4    Speyside #1 50 Year Old Batch 5 That Boutique-...  ...  https://www.whiskyshop.com/speyside-1-50-year-...      
..                                                 ...  ...                                                ...      
795            Glenkeir Treasure Carribean Blended Rum  ...  https://www.whiskyshop.com/glenkeir-carribean-...      
796            The Loch Fyne Caol Ila 10 Year Old 2010  ...  https://www.whiskyshop.com/loch-fyne-caol-ila-...      
797          Glen Moray Warehouse 1 1998 Barolo Finish  ...  https://www.whiskyshop.com/glen-moray-warehous...      
798     Cardhu 14 Year Old Diageo Special Release 2021  ...  https://www.whiskyshop.com/cardhu-14-year-old-...      
799  Lagavulin 26 Year Old Diageo Special Release 2021  ...  https://www.whiskyshop.com/lagavulin-26yo-diag...      

[800 rows x 3 columns]
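The scraped prices are still strings with a currency symbol, which the spider in the question strips with `remove_currency`. A small post-processing sketch for the `data` list is shown here. Both helper names are hypothetical, and the deduplication step rests on an unverified assumption that some requested pages (for example `p=0` and `p=1`) might return the same listing.

```python
from typing import Dict, List, Optional


def parse_price(text: Optional[str]) -> Optional[float]:
    """Turn a string such as '£1,250.00' into 1250.0; return None if it cannot be parsed."""
    if not text:
        return None
    cleaned = text.replace("£", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def dedupe_by_link(rows: List[Dict]) -> List[Dict]:
    """Keep only the first row seen for each product link, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        link = row.get("link")
        if link in seen:
            continue
        seen.add(link)
        unique.append(row)
    return unique
```

These could be applied before building the DataFrame, for example `data = dedupe_by_link(data)` followed by converting each row's `price` with `parse_price`.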
