Scrapy: how do I navigate to each restaurant's page on zaubee.com and extract restaurant details when every restaurant link's href attribute is set to "#"?

Asked by ia2d9nvy on 2023-06-23

How can I scrape zaubee.com with Scrapy and extract business details from each restaurant's page when the href attribute of every restaurant link is set to "#"?
I am currently working on a web-scraping project that collects business information from zaubee.com. However, the href attribute of every restaurant link is set to #, which prevents me from reaching the individual restaurant pages and collecting the data I need.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    allowed_domains = ['www.zaubee.com']
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']

    def parse(self, response):
        restaurantlink = response.xpath("//div[@class='search-result__title-wrapper']/h2")
        for restaurant in restaurantlink:
            name = restaurant.xpath(".//text()").get()
            link = restaurant.xpath(".//@href").get()
            yield {
                'name': name,
                'link': link
            }
            yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath("//h1[@class='postcard__title postcard__title--claimed']/text()").get()
        website = response.xpath("(//a[@class='profile__website__link']/@href)[1]").get()
        address = response.xpath("(//address[@class='profile__address--compact']/text())[1]").get()

        yield {
            'name': name,
            "website": website,
            'address': address
        }

I have built scraping solutions with Scrapy before, but I need help getting past this obstacle. What approach or workaround can I use to reach each restaurant's page and fetch the necessary information?
Output for one of the entries:

2023-06-04 23:38:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': 'Restaurants in Fredonia New York', 'link': '#'}

And when it tries to follow the inner link, it yields the following:

2023-06-04 23:38:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': None, 'website': None, 'address': None}

I am trying to go into each restaurant's link and collect the restaurant's name, address, phone number, and opening hours from that specific page.
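
For anyone hitting the same thing: before rewriting the spider, it can help to open the listing page in a scrapy shell session and check where the real URL actually lives. The sketch below is only illustrative; the second selector (a div carrying a data-value attribute) is an assumption about the page structure, not something confirmed from the site.

# Start an interactive shell against the listing page:
#   scrapy shell "https://zaubee.com/category/restaurant-in-fredonia-hclq6jom"

# The h2 wrapper the spider targets only exposes '#' as its href:
response.xpath("//div[@class='search-result__title-wrapper']/h2//@href").getall()

# So look for the link on a surrounding element instead (illustrative guess):
response.xpath("//div[@data-value]//a/@href").getall()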

Answer from krugob8w:

Your XPath selectors are simply wrong.

import scrapy
import unicodedata
import re

class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']
    allowed_domains = ['zaubee.com']

    def parse(self, response):
        # Each restaurant card is a div carrying a data-value attribute;
        # the usable link is on an <a> inside that card, not on the h2 wrapper.
        restaurants = response.xpath('//div[@data-value]')
        for restaurant in restaurants:
            name = restaurant.xpath('.//h3/text()[not(span)]').getall()
            name = ''.join(name).strip()
            link = restaurant.xpath(".//a/@href").get(default='')
            yield {
                'name': name,
                'link': response.urljoin(link)
            }
            yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath('//h1/text()').get()
        # Convert a protocol-relative href (//example.com/...) into an absolute https URL.
        website = response.xpath('//a[@rel]/@href').get(default='')
        website = re.sub(r'//', r'https://', website)
        # Take the last span in the address block and normalize the whitespace.
        address = response.xpath('//div[contains(@class, "address")]/span[last()]/text()').get(default='')
        address = unicodedata.normalize("NFKD", address).replace('\n', ' ').strip()

        yield {
            'name': name,
            "website": website,
            'address': address
        }
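
If it helps, here is a minimal, self-contained way to run the spider and dump the items to a JSON feed without a full Scrapy project. This is just a sketch: it assumes the zaubeeSpider class above is defined in the same file, and the output filename restaurants.json is an arbitrary choice.

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    # Write the scraped items to restaurants.json via Scrapy's feed exports.
    process = CrawlerProcess(settings={
        'FEEDS': {'restaurants.json': {'format': 'json'}},
    })
    process.crawl(zaubeeSpider)
    process.start()  # blocks until the crawl finishes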
