Below are the code and the output. I tried looking this problem up, but most of what I could find were these links:
https://stackoverflow.com/questions/70444479/scrapy-not-able-to-scrape-ratings-data-on-amazon
https://www.quora.com/Why-cant-Amazon-prices-be-scraped
Any help is appreciated, thanks.
Spider code
import json
import scrapy
from ..items import AmazontutorialItem


class AmazonspiderSpider(scrapy.Spider):
    name = 'amazonSpider'
    allowed_domains = ['amazon.in']
    start_urls = \
        [
            'https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031'
        ]

    def parse(self, response):
        items = AmazontutorialItem()
        productName = response.xpath("//*[contains(concat( ' ', @class, ' ' ), concat( ' ', 'a-link-normal', ' ' ))]//span//div/text()").extract()
        productAuthor = response.css('.a-color-base ._cDEzb_p13n-sc-css-line-clamp-1_1Fn1y').css('::text').extract()
        productPrice = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "_cDEzb_p13n-sc-price_3mJ9Z", " " ))]/text()').extract()
        productImageLink = response.css('.p13n-product-image::attr(src)').extract()
        items['productName'] = productName
        items['productAuthor'] = productAuthor
        items['productPrice'] = productPrice
        items['productImageLink'] = productImageLink
        yield items
Terminal output
2022-08-31 14:02:16 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: amazonTutorial)
2022-08-31 14:02:16 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 2.0.1, Twisted 22.4.0, Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Windows-10-10.0.22000-SP0
2022-08-31 14:02:16 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'amazonTutorial',
'NEWSPIDER_MODULE': 'amazonTutorial.spiders',
'SPIDER_MODULES': ['amazonTutorial.spiders'],
'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; '
'+http://www.google.com/bot.html)'}
2022-08-31 14:02:16 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-08-31 14:02:16 [scrapy.extensions.telnet] INFO: Telnet Password: 6adf5086c628832a
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-08-31 14:02:17 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-08-31 14:02:17 [scrapy.core.engine] INFO: Spider opened
2022-08-31 14:02:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-08-31 14:02:17 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-08-31 14:02:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031> (referer: None)
2022-08-31 14:02:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.in/gp/bestsellers/videogames/1376528031/ref=zg_bs_nav_videogames_2_1376518031>
{'productAuthor': ['ACTIVISION',
'Electronic Arts',
'Generic',
'Electronic Arts',
'ACTIVISION',
'ROCKSTAR GAMES',
'REES52',
'ROCKSTAR GAMES',
'Microsoft',
'ACTIVISION',
'Generic',
'Generic',
'Square Enix',
'UBI Soft',
'Rockstar North',
'Paradox Interactive',
'Generic',
'Generic',
'ADGAMES',
'Blizzard Entertainment',
'UBI Soft',
'Generic',
'Valve',
'Generic',
'Excalibur by Unlimited',
'UBI Soft',
'SEGA',
'ADGAMES',
'ACTIVISION',
'Generic',
'AD Games',
'Bluehole Studio Inc., PUBG Corporation',
'UBI Soft',
'Bethesda',
'Eidos',
'AG Gaming',
'ACTIVISION',
'ACTIVISION',
'ROCKSTAR GAMES',
'Generic',
'Generic',
'ADGAMES',
'Warner Bros.',
'Generic',
'UBI Soft',
'2K GAMES',
'Bethesda',
'Codemasters'],
'productImageLink': ['https://images-eu.ssl-images-amazon.com/images/I/91Wjtmyrg9L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81IXtVuvlmL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/714zMHvejkL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81yegjdGUjL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51KuZ6TnmfL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/71mNkKmd3JL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/41SI-1pARKL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51BOMq+7w7L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51djUfKMJyL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81yhTa3zjlL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/819Nhgz+3NL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/61DSfTeIAdS._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81r70W2EVRL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/516v7vChU8L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81PR8qtHJJL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51VV5Z8M5KL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/91wL7h6OX6L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/31lL8a0n17L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51T43LR-tlL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81kTM28TXpL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/A1iEiu4PEJL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51ewfEKk2vL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/510G-36LZWL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81L8-mjNlrL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/61bDL5UUuFL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/71nrt+t8bAL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51eu5RaeIvL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81beRvbvv1L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51rlW7AK2xL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51kYpa4lksL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51tgtEXNi9L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/811qcvGij2L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/91vFfJh2IbL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/71H7c4DPQEL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81qTUih-eUL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/41afQxgahrL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81ViUDBvP+L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51gZP1Yh1nL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/61y3yx53X2L._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51GhmLgOLPL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/615H16JHVSL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51kXWEFYqPL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/41zuts4V8EL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81Xa2jR5ApL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/813H5aEHrdL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/31WujBtNSbL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/71LZ7amiLOL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/71S9QD541VL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/51SUf67G2QL._AC_UL300_SR300,200_.jpg',
'https://images-eu.ssl-images-amazon.com/images/I/81LXTC16qBL._AC_UL300_SR300,200_.jpg'],
'productName': ['ACTIVISION Call of Duty: WWII (PS4)',
'FIFA 23 for PC',
'GTA (Combo) PC Game - Digital Download (No Online '
'Multiplayer/No REDEEM* Code) - | NO DVD NO CD |',
'FIFA 22 (PC)',
'Spider-Man: Friend or Foe (PC DVD)',
'PC GTA 4 (PC)',
'Waveshare Handheld Game HAT for Raspberry Pi '
'4B/3B+/3B/2B/B+/A+/Zero/Zero W Portable Game Console Gameboy '
'3.5inch IPS Screen on Board Gamepad Joystick, Smoothly '
'Display',
'Grand Theft Auto: Episodes From Liberty City (PC)',
'Age of Empires IV: Standard - Windows 10 (Digital Code) (PC)',
'Call of Duty: WWII (Xbox One)',
'Project IGI (2000) Offline PC Game',
'G-T-A-San-Andrea - (Digital Download) Full PC Game - (NO DVD '
'NO CD) - (NO ONLINE MULTIPLAYER MODE) - PC.',
'Sleeping Dogs - Definitive Edition',
"Tom Clancy's Rainbow Six Siege (PC)",
'Grand Theft Auto V - PC - (ROCKSTAR SOCIAL CLUB DOWNLOAD '
'CODE-NO CD/DVD)',
'Take Command: Second Manassas',
'Trick (pc)',
'Hitman 2: Silent Assassin (2002) Offline PC Game',
'EPC Games: AOE (1,2 & 3) (Digital Download) No DVD/CD (No '
'Online Multiplayer) - Single Player Mode (PC Game)',
'Assassin Creed III Pc Game DVD For Windows Full Setup '
'Offline',
'World of Warcraft (PC/Mac)',
"Assassin's Creed IV: Black Flag (PS4)",
'GTA-San Andraes (PC GAME) - PC Download - [No Multiplayer/No '
'Redeem* Code] - | *NO DVD NO CD* | - WIN 10/11',
'Counter-strike: Global Offensive (PC)',
'Empire Earth (2001) Offline PC Game',
'Train Simulator 2015 (PC)',
"Tom Clancy's: Rainbow Six Siege (Free PS5 Upgrade)",
'Total War : Three Kingdoms Royal Edition (PC)',
'Total_OverDose Pc Game Dvd (Windows)',
'Call of Duty: Black Ops II (PS3)',
'Generic SpideMan 3 PC GAME- (Digital Download) - [ NO DVD NO '
'CD - NO ONLINE MULTIPLAYER/NO ACTIVATION CODE* ] - PC',
'JUST_CUSE_2 PC GAME DVD',
"Player Unknown's Battle Grounds -PUBG (Code in the Box)",
'Driver 3/Driver: Parallel Lines (PC)',
'Fallout 3 - Game of the Year Edition (PC Code)',
'SEKEIRO: SHADOWS DIE TWICE – GOTY EDITION (PC GAME) - '
'Digital Download (No Online Multiplayer/No REDEEM* Code) - | '
'NO DVD NO CD |',
'Kane and Lynch 2: Dog Days (PC DVD)',
'GA Retails - Battlefield Hardline Action Adventure Standard '
'Edition Offline PC Game (for PC)',
'ACTIVISION Call of Duty: WWII (PS4)+UBI Soft Far Cry 5 (PS4)',
'Activision Blizzard Inc Call of Duty WWII - PlayStation 4 '
'Standard Edition',
'Grand Theft Auto: Vice City (PC)',
'Commandos: Behind Enemy Lines (1998) Offline PC Game',
'Command & Conquer: Generals – Zero Hour (2003) Offline PC '
'Game',
'Assassin Creed 4 Black Flag Pc Game DVD For Windows',
'Fear 2: Project Origin (PC)',
'Delta Force 2 (1999) Offline PC Game',
"Tom Clancy's: The Division (PS4)",
'The Bureau Xcom Declassified (PC)',
'Prey - PlayStation 4',
'F1: 2013 (PC)'],
'productPrice': []}
2022-08-31 14:02:19 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-31 14:02:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 325,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 53976,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.822063,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 8, 31, 8, 32, 19, 119698),
'httpcompression/response_bytes': 315618,
'httpcompression/response_count': 1,
'item_scraped_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 8, 31, 8, 32, 17, 297635)}
2022-08-31 14:02:19 [scrapy.core.engine] INFO: Spider closed (finished)
1 answer
Your XPath expressions look correct; the obstacle comes from the cookies and the user-agent. If you send a user-agent and a cookie as request headers, it should work.
Example:
Output: