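I have been struggling to find a way forward with this problem. (The functions I show may not work or may be wrong, but it is more the overall process that I am confused about.)
I am trying to get my spider to collect the prices of all products on the "standard sheds" page. This is the link to the page that lists the products: https://www.charnleys.co.uk/product-category/gardening/garden-accessories/garden-furniture/sheds/standard-sheds/
However, if you click a product link, you will see that the path changes to "charnleys.co.uk/shop/shed-product-name", so my spider cannot follow it.
What I want to do is collect the URLs on the "standard-sheds" page, append them to an array and iterate over it, then have my spider visit those URLs and collect the prices. However, I am not sure how to make my spider iterate over the array of URLs. I will list the functions I have created so far.
Any help is much appreciated.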
from gc import callbacks
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

urls = []

class CharnleySpider(CrawlSpider):
    name = 'crawler'
    allowed_domains = ['charnleys.co.uk']
    start_urls = ['https://www.charnleys.co.uk']
    # https://www.charnleys.co.uk/product-category/gardening/garden-accessories/garden-furniture/sheds/standard-sheds/
    # https://www.charnleys.co.uk/shop/bentley-supreme-apex/
    rules = (
        Rule(LinkExtractor(allow='product-category/gardening/garden-accessories/garden-furniture/sheds', deny='sheds')),
        Rule(LinkExtractor(allow='standard-sheds'), callback='collect_urls')
    )

    def collect_urls(self, response):
        for elements in response.css('div.product-image'):
            urls.append(elements.css('div.product-image a::attr(href)').get())

    def html_return_price_strings(self, response):
        # Searches through the HTML of the webpage and returns all strings with "£" attached.
        all_html = response.css('html').get()
        for line in all_html.split('\n'):
            for word in line.split():
                if word.startswith('£'):
                    print(word)

    def parse_product(self, response, html_return_price_strings):
        yield {
            'name': response.css('h2.product_title::text').get(),
            'price': html_return_price_strings()
        }
1 Answer
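If you step through the listing pages to a product detail page and then turn off JavaScript in your browser, you will notice that the price section disappears from the page. That means the price is loaded dynamically by JavaScript. Scrapy cannot render JavaScript on its own, but you can scrape the dynamic content by using SeleniumRequest from the scrapy-selenium package. Here I use Scrapy's default Spider, which is more robust than CrawlSpider. Code: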
Output:
... and so on
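To check the point above about JavaScript-loaded prices, one option is to search the HTML that Scrapy actually receives for "£" amounts. The sketch below is only an illustration and is not the code from this answer: the helper name find_prices, the regular expression, and both HTML snippets are assumptions made up for the example, and it returns the matches instead of printing them as the function in the question does.

```python
import re

# Matches amounts such as "£299", "£1,299.00" or "£ 45.50".
# The pattern is an assumption about how prices are written on the page.
PRICE_RE = re.compile(r"£\s?\d+(?:,\d{3})*(?:\.\d{1,2})?")


def find_prices(html: str) -> list:
    """Return every £ amount found in the given HTML text, in document order."""
    return PRICE_RE.findall(html)


# Hypothetical snippets, not copied from the real site: the first mimics the
# HTML as downloaded without JavaScript, the second the page after rendering.
raw_html = '<h2 class="product_title">Bentley Supreme Apex</h2><div id="price"></div>'
rendered_html = raw_html.replace(
    '<div id="price"></div>', '<div id="price">£1,299.00</div>'
)

print(find_prices(raw_html))       # [] because the price is not in the raw HTML
print(find_prices(rendered_html))  # ['£1,299.00'] once the price is in the markup
```

Inside a Scrapy callback the same helper could be called as find_prices(response.text). If it returns an empty list for a product page, the price is probably being injected by JavaScript and the page needs a rendering step. For the link-following part of the question, yielding response.follow(href, callback=self.parse_product) for each collected href avoids keeping a global list of URLs.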