I want to scrape data such as the screen type and title from https://www.newegg.com/tools/laptop-finder, but I'm stuck: my script crawls the page but scrapes no items.
The site's HTML is:
<tr>
<td class="td-item">
<a class="goods-info" href="https://www.newegg.com/p/N82E16834156430?Item=N82E16834156430" data-toggle="modal" data-target="#modal-pc-builder-pdp">
<div class="goods-img">
<img src="https://c1.neweggimages.com/ProductImageCompressAll125/34-156-430-03.jpg" alt="MSI Katana 15 B12VGK-082US 15.6" Gaming Laptop">
</div>
<div class="goods-title">
<div class="goods-title-content">MSI Katana 15 B12VGK-082US 15.6" Gaming Laptop</div>
<div class="goods-rating">
<i class="rating rating-4" aria-label="rated 4 out of 5"></i>
<span class="goods-rating-num font-s text-gray">(31)</span>
</div>
</div>
</a>
</td>
<td class="td-spec"><div class="hid-text">Screen Size</div><span>15.6"</span></td>
<td class="td-spec"><div class="hid-text">CPU type</div><span>Intel Core i7 12th Gen</span></td>
<td class="td-spec"><div class="hid-text">Memory</div><span>16GB</span></td>
<td class="td-spec"><div class="hid-text">Storage</div><span>1 TB PCIe</span></td>
<td class="td-spec"><div class="hid-text">GPU</div><span>NVIDIA GeForce RTX 4070 Laptop GPU</span></td>
<td class="td-spec"><div class="hid-text">Resolution</div><span>1920 x 1080</span></td>
<td class="td-spec"><div class="hid-text">Weight</div><span>4 - 4.9 lbs.</span></td>
<td class="td-spec"><div class="hid-text">Backlit Keyboard</div><span>Backlit</span></td>
<td class="td-spec"><div class="hid-text">Touchscreen</div><span>No</span></td>
<td class="td-spec"><div class="hid-text">CPU Speed</div><span>12650H (2.30GHz)</span></td>
<td class="td-spec"><div class="hid-text">Number of Cores</div><span>10-core (6P+4E) Processor</span></td>
<td class="td-spec"><div class="hid-text">Color</div><span>Black</span></td>
<td class="td-spec"><div class="hid-text">Display Type</div><span>Full HD</span></td>
<td class="td-spec"><div class="hid-text">Graphic Type</div><span>Dedicated Card</span></td>
<td class="td-spec"><div class="hid-text">Operating System</div><span>Windows 11 Home</span></td>
<td class="td-spec"><div class="hid-text">Webcam</div><span>Yes</span></td>
<td class="td-action">
<div class="item-action grid col-w-3">
<div class="goods-price-current hide-click-for-details">
<div class="goods-price font-s">
<div class="goods-price-current">
<span class="goods-price-label"></span>
<span class="goods-price-symbol">$</span>
<span class="goods-price-value"><strong>1,159</strong><sup>.00</sup></span>
</div>
</div>
</div>
<div class="goods-operate xxs-hide">
<div class="goods-button-area">
<label class="input-check input-check-s compare-check">
<input type="checkbox" autocomplete="off" aria-label="checkbox">
<span class="input-check-mark text-hide">checkmark</span>
<div class="input-check-text">Compare</div>
</label>
<button title="Add MSI Katana 15 B12VGK-082US 15.6" 144 Hz IPS Intel Core i7 12th Gen 12650H (2.30GHz) NVIDIA GeForce RTX 4070 Laptop GPU 16GB Memory 1 TB NVMe SSD Windows 11 Home 64-bit Gaming Laptop to cart" class="button button-s bg-orange">Add to cart</button>
</div>
</div>
</div>
</td>
</tr>
Since I'm just learning to scrape, I'm only extracting the title and screen size for a single laptop. Here is my spider code:
import scrapy

class LaptopSpider(scrapy.Spider):
    name = "laptop"
    headers = {
        "authority": "ssl.doas.state.ga.us",
        "pragma": "no-cache",
        "cache-control": "no-cache",
        "sec-ch-ua": "\"Chromium\";v=\"92\", \" Not A;Brand\";v=\"99\", \"Google Chrome\";v=\"92\"",
        "accept": "application/json, text/javascript, */*; q=0.01",
        "x-requested-with": "XMLHttpRequest",
        "sec-ch-ua-mobile": "?0",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
        "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "origin": "https://ssl.doas.state.ga.us",
        "sec-fetch-site": "same-origin",
        "sec-fetch-mode": "cors",
        "sec-fetch-dest": "empty",
        "referer": "https://ssl.doas.state.ga.us/gpr/",
        "accept-language": "en-US,en;q=0.9"
    }
    start_urls = ['https://www.newegg.com/tools/laptop-finder']
    custom_settings = {'REDIRECT_ENABLED': False}
    handle_httpstatus_list = [302]

    def parse(self, response):
        product = response.css('tr td.td-item')
        for item in product:
            yield {
                'Title': item.css('.goods-title-content::text').get(),
                'Screen Size': item.xpath('.//div[text()="Screen Size"]/following-sibling::span/text()').get(),
            }
My log file is:
2023-09-10 10:12:28 [scrapy.core.engine] INFO: Spider opened
2023-09-10 10:12:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-09-10 10:12:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-09-10 10:12:29 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://www.newegg.com/tools/laptop-finder> (referer: None)
2023-09-10 10:12:29 [scrapy.core.engine] INFO: Closing spider (finished)
2023-09-10 10:12:29 [scrapy.extensions.feedexport] INFO: Stored json feed (0 items) in: j.json
2023-09-10 10:12:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
Please help me.
1 Answer
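It is not possible to give a definite, direct answer to this question.
Use the following as a starting point:
1. You can debug your code with the techniques described at https://docs.scrapy.org/en/latest/topics/debug.html
2. I suggest using the start_requests method to pass the headers to the first request, whose URL is https://www.newegg.com/tools/laptop-finder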