I am trying to learn Scrapy by crawling a website, but I keep getting a 403 response when crawling.
This is my spider:
import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose
from w3lib.html import remove_tags
def remove_currency(value):
    return value.replace('£', '').strip()

class WhiskyscraperItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field(input_processor=MapCompose(remove_tags), output_processor=TakeFirst())
    price = scrapy.Field(input_processor=MapCompose(remove_tags, remove_currency), output_processor=TakeFirst())
    link = scrapy.Field()

class WhiskeySpider(scrapy.Spider):
    name = 'whisky'
    start_urls = ['https://www.whiskyshop.com/scotch-whisky?item_availability=In+Stock']

    def parse(self, response):
        for products in response.css('div.product-item-info'):
            l = ItemLoader(item=WhiskyscraperItem(), selector=products)
            l.add_css('name', 'a.product-item-link')
            l.add_css('price', 'span.price')
            l.add_css('link', 'a.product-item-link::attr(href)')
            yield l.load_item()
        # use ::attr(href) with .get() so a missing "next" link yields None instead of raising
        next_page = response.css('a.action.next::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
I don't know whether I am doing something wrong; the code itself looks fine, but every request is rejected with a 403 response. What can I do?
1 Answer
@Barry the Platipus has already pointed out that the website is behind Cloudflare protection, so sending plain requests will not work here. The usual rule of thumb in that case is to use cloudscraper or Selenium. I tried both cloudscraper and Scrapy with scrapy-selenium's SeleniumRequest, and neither worked: SeleniumRequest returned a 200 status but an empty page containing only the Cloudflare challenge. Only the plain, raw Selenium engine combined with BeautifulSoup worked like a charm. Working code example:
Output:
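The answerer's working example and its output are not reproduced above. As a rough illustration of the raw-Selenium-plus-BeautifulSoup approach the answer describes, a minimal sketch might look like the following. It reuses the CSS selectors from the question; the Chrome driver setup, the explicit wait, and the choice to run with a visible window are assumptions rather than part of the original answer, and whether the Cloudflare check actually passes depends on your environment.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# A visible (non-headless) browser tends to get past the Cloudflare check
# more often than a headless one; adjust the driver options for your setup.
driver = webdriver.Chrome()
try:
    driver.get("https://www.whiskyshop.com/scotch-whisky?item_availability=In+Stock")

    # Wait until the product grid is present, i.e. the Cloudflare challenge
    # (if any) has resolved and the real listing page has loaded.
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.product-item-info"))
    )

    # Hand the rendered HTML to BeautifulSoup and reuse the question's selectors.
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for product in soup.select("div.product-item-info"):
        name_tag = product.select_one("a.product-item-link")
        price_tag = product.select_one("span.price")
        if name_tag is None or price_tag is None:
            continue
        print({
            "name": name_tag.get_text(strip=True),
            "price": price_tag.get_text(strip=True).replace("£", "").strip(),
            "link": name_tag.get("href"),
        })
finally:
    driver.quit()

Pagination could be handled the same way, by locating the a.action.next element from the question and loading its href in the driver before parsing the next page.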