How can I extract the car links from this website using Python Scrapy?

kupeojn6  posted on 2022-11-09  in Python

I am trying to extract all the car links from this website: "https://www.euroncap.com/en/ratings-rewards/electric-vehicles/#?selectedMake=0&selectedMakeName=Select%20a%20make&selectedModel=0&selectedStar=&includeFullSafetyPackage=true&includeStandardSafetyPackage=true&selectedModelName=All&selectedProtocols=45155,41776&selectedClasses=1202,1199,1201,1196,1205,1203,1198,1179,40250,1197,1204,1180,34736,44997&allClasses=true&allProtocols=false&allDriverAssistanceTechnologies=false&selectedDriverAssistanceTechnologies=&thirdRowFitment=false". For example, I want the link for "Volvo C40 Recharge". With Scrapy I used response.css('div.rating-table-row-c.c9 a').xpath('@href').extract(), but the output is ['/en{{assessment.Url}}'], whereas the actual URL is "/en/results/volvo/c40-recharge/45878". How can I extract it?

a9wyjsp7  #1

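This data is rendered with JavaScript, so plain Scrapy cannot get it from the page HTML (unless you use something like scrapy-splash or scrapy-selenium). You can confirm this by disabling JavaScript in your browser and reloading the page: the links stay as the unrendered template '/en{{assessment.Url}}'.
If you open the "Network" tab in devtools, you can see that the page loads its data from a JSON endpoint, so you can request that endpoint directly and take the data you want from the response.
Example with the Scrapy shell: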

In [1]: headers = {
   ...: "Accept": "application/json, text/plain, */*",
   ...: "Accept-Encoding": "gzip, deflate, br",
   ...: "Accept-Language": "en-US,en;q=0.5",
   ...: "Cache-Control": "no-cache",
   ...: "Connection": "keep-alive",
   ...: "DNT": "1",
   ...: "Host": "www.euroncap.com",
   ...: "Pragma": "no-cache",
   ...: "Referer": "https://www.euroncap.com/en/ratings-rewards/electric-vehicles/",
   ...: "Sec-Fetch-Dest": "empty",
   ...: "Sec-Fetch-Mode": "cors",
   ...: "Sec-Fetch-Site": "same-origin",
   ...: "Sec-GPC": "1",
   ...: "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
   ...: }

In [2]: req = scrapy.Request(url='https://www.euroncap.com/Umbraco/EuroNCAP/SearchApi/GetAssessmentSearch?protocols=45155,41776&make=0&model=0&carClasses=1202,1199,1201,1196,1205,1203,1198,1179,40250,1197,1204,1180,34736,44997&driverAssistanceTechnologies=&allProtocols=false&allClasses=true&allDriverAssistanceTechnologies=false&includeFullSafetyPackage=true&includeStandardSafetyPackage=true&showOnlyHybrid=true&showOnlyFleet=false&starNumber=&thirdRowFitment=false', headers=headers)

In [3]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.euroncap.com/Umbraco/EuroNCAP/SearchApi/GetAssessmentSearch?protocols=45155,41776&make=0&model=0&carClasses=1202,1199,1201,1196,1205,1203,1198,1179,40250,1197,1204,1180,34736,44997&driverAssistanceTechnologies=&allProtocols=false&allClasses=true&allDriverAssistanceTechnologies=false&includeFullSafetyPackage=true&includeStandardSafetyPackage=true&showOnlyHybrid=true&showOnlyFleet=false&starNumber=&thirdRowFitment=false> (referer: https://www.euroncap.com/en/ratings-rewards/electric-vehicles/)

In [4]: jsonData = response.json()

# The specific URL you requested (check the JSON file and loop through the data however you want to).

In [5]: print(jsonData['AssessmentSearchResults'][0]['Assessments'][1]['Url'])
/results/volvo/c40-recharge/45878
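
To collect every car link rather than a single one, the lookup above can be turned into a loop. The following is only a minimal sketch: it assumes the JSON keeps the shape seen in the shell session (a list under 'AssessmentSearchResults', each entry holding an 'Assessments' list whose items have a 'Url' key), and that the '/en' prefix from the page template '/en{{assessment.Url}}' applies to every entry. The function name extract_car_links and the BASE_URL constant are made up for illustration.

```python
from urllib.parse import urljoin

# Assumed site root; the "/en" prefix mirrors the page template "/en{{assessment.Url}}".
BASE_URL = "https://www.euroncap.com"


def extract_car_links(json_data, lang_prefix="/en"):
    """Return absolute car-page URLs found in the search JSON (hypothetical helper)."""
    links = []
    # Walk every result group and every assessment inside it.
    for group in json_data.get("AssessmentSearchResults", []):
        for assessment in group.get("Assessments", []):
            url = assessment.get("Url")
            if url:  # skip entries that have no URL
                links.append(urljoin(BASE_URL, lang_prefix + url))
    return links
```

In the shell session above this would be called as extract_car_links(jsonData); inside a spider, the same loop could yield each link from the callback that receives the JSON response.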
