Scrapy: XPath for an element's href attribute

col17t5w  posted on 2022-11-23  in  Other

I am working on pagination. How do I get the href value from the HTML element below? I can't use **//a[@data-page-number='2']/@href**, because the 2 becomes 3 on the next page, and so on.

<a data-page-number="2" data-offset="30" href="/Restaurants-g297633-oa30-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS" class="nav next rndBtn ui_button primary taLnk" onclick="      require('common/Radio')('restaurant-filters').emit('paginate', this.getAttribute('data-offset'));; ta.trackEventOnPage('STANDARD_PAGINATION', 'next', '2', 0); return false;
  ">
Next
</a>

zzoitvuj1#

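You want the href attribute of the Next button.

As you can see, its onclick attribute contains the value 'next', so we can use that to filter out all the other a tags.
Scrapy shell example: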

In [1]: url='https://www.tripadvisor.in/Restaurants-g297633-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CON
   ...: TENTS'

In [2]: req = scrapy.Request(url=url)

In [3]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.in/Restaurants-g297633-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS> (referer: None)

In [4]: response.xpath('//a[contains(@onclick, "next")]/@href').get()
Out[4]: '/Restaurants-g297633-oa30-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS'
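
The effect of the contains(@onclick, "next") predicate can also be checked offline, without fetching the site. The sketch below is only an illustration under stated assumptions: it uses the standard-library xml.etree.ElementTree instead of Scrapy's selectors, and because ElementTree has no contains() function, the same test is written as a Python substring check. The make_page helper, the track(...) handler and the shortened hrefs are hypothetical stand-ins for the real markup.

```python
import xml.etree.ElementTree as ET


def make_page(page_number: int) -> str:
    """Build a hypothetical, simplified pagination fragment for a given page."""
    offset = (page_number - 1) * 30  # 30 results per page, as in data-offset="30"
    return f"""<div>
  <a href="/about.html" onclick="track('click', 'about')">About</a>
  <a data-page-number="{page_number}" data-offset="{offset}"
     href="/Restaurants-oa{offset}.html"
     onclick="track('STANDARD_PAGINATION', 'next', '{page_number}')">Next</a>
</div>"""


def next_href(fragment: str):
    """Return the href of the first <a> whose onclick contains 'next', else None."""
    root = ET.fromstring(fragment)
    for anchor in root.iter("a"):
        # Python equivalent of the XPath predicate contains(@onclick, "next")
        if "next" in anchor.get("onclick", ""):
            return anchor.get("href")
    return None


print(next_href(make_page(2)))  # /Restaurants-oa30.html
print(next_href(make_page(3)))  # /Restaurants-oa60.html
```

The match does not depend on data-page-number, so the same expression keeps working as the page number changes.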

zpf6vheq2#

//*[@class="unified pagination js_pageLinks"]/a[2]/@href

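//*[@class="unified pagination js_pageLinks"]/a selects both the previous-page and the next-page links, so you have to index into the result (a[2]) to get the next-page URL.
Also, disable JavaScript in your browser when you pick the elements; otherwise the inspector mixes static elements with dynamically generated ones, and the expression may match markup that is not in the HTML Scrapy receives.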
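
To see what the positional index does, here is a small offline sketch. It is an illustration only: it uses the standard-library xml.etree.ElementTree rather than Scrapy, and both HTML fragments (including the shortened hrefs and the page where "Previous" is a span) are hypothetical, not copied from the site.

```python
import xml.etree.ElementTree as ET

# Same container test as in the answer, written for ElementTree's XPath subset.
CONTAINER = ".//*[@class='unified pagination js_pageLinks']"

# Hypothetical middle page: both "Previous" and "Next" are links.
MIDDLE_PAGE = """<body>
  <div class="unified pagination js_pageLinks">
    <a class="nav previous" href="/list-oa30.html">Previous</a>
    <a class="nav next" href="/list-oa90.html">Next</a>
  </div>
</body>"""

# Hypothetical page where "Previous" is not rendered as a link.
FIRST_PAGE = """<body>
  <div class="unified pagination js_pageLinks">
    <span class="nav previous disabled">Previous</span>
    <a class="nav next" href="/list-oa30.html">Next</a>
  </div>
</body>"""


def hrefs(page: str, path: str) -> list:
    """Return the href of every element matched by path, in document order."""
    root = ET.fromstring(page)
    return [a.get("href") for a in root.findall(path)]


print(hrefs(MIDDLE_PAGE, CONTAINER + "/a"))     # ['/list-oa30.html', '/list-oa90.html']
print(hrefs(MIDDLE_PAGE, CONTAINER + "/a[2]"))  # ['/list-oa90.html']
print(hrefs(FIRST_PAGE, CONTAINER + "/a[2]"))   # []
```

The last line shows the trade-off of a positional index: if a page has only one link in the container, a[2] matches nothing, so it is worth checking the first and last pages before relying on it.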

Complete working code for pagination:

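import scrapy


class TestSpider(scrapy.Spider):
    name = 'tes'
    start_urls = ['https://www.tripadvisor.in/Restaurants-g297633-oa60-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS']

    def parse(self, response):
        # One node per restaurant card; these class names are site-generated and may change.
        for card in response.xpath('//*[@class="zdCeB Vt o"]'):
            # The last text node of the first title link holds the restaurant name.
            texts = card.xpath('.//a[@class="Lwqic Cj b"][1]//text()').getall()
            if texts:  # skip cards without a title instead of raising IndexError
                yield {'Title': texts[-1]}

        # The second <a> in the pagination container is the next-page link.
        next_page = response.xpath('//*[@class="unified pagination js_pageLinks"]/a[2]/@href').get()
        if next_page:
            # Resolve the relative href against the current page URL.
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)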

Output:

{'Title': 'Vanitha Hotel'}
2022-09-25 22:39:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS> (referer: https://www.tripadvisor.in/Restaurants-g297633-oa1080-Kochi_Cochin_Ernakulam_District_Kerala.html)
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'The Muyal RESTAURANT'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'K K R Food Products'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Akathalam Homely Food'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Thanneer Mathan Restaurant'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Holly Hock'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Cochin Halwa Centre'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Canvas Restaurant Pizzeria'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Canvas Restaurant & Pizzeria'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Cafe Delaviz'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Cafe Sora'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Honey Dew Bakery'}
2022-09-25 22:39:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.in/Restaurants-g297633-oa1110-Kochi_Cochin_Ernakulam_District_Kerala.html>
{'Title': 'Food Barrel Restaurant'}
2022-09-25 22:39:07 [scrapy.core.engine] INFO: Closing spider (finished)
2022-09-25 22:39:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 152484,
 'downloader/request_count': 36,
 'downloader/request_method_count/GET': 36,
 'downloader/response_bytes': 4029630,
 'downloader/response_count': 36,
 'downloader/response_status_count/200': 36,
 'elapsed_time_seconds': 62.328141,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 9, 25, 16, 39, 7, 777225),
 'httpcompression/response_bytes': 22935503,
 'httpcompression/response_count': 36,
 'item_scraped_count': 1062,
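
The spider above turns the relative href into an absolute URL with response.urljoin before requesting it. For a page without a <base> tag this should behave like the standard library's urllib.parse.urljoin applied to the page URL, which makes the step easy to try outside Scrapy. The sketch below is a minimal illustration; build_next_url is a hypothetical helper name, and the two URLs are the ones quoted in the question and in the first answer.

```python
from urllib.parse import urljoin


def build_next_url(page_url, href):
    """Resolve a relative next-page href; return None when there is no next page."""
    if not href:
        return None
    return urljoin(page_url, href)


PAGE_URL = "https://www.tripadvisor.in/Restaurants-g297633-Kochi_Cochin_Ernakulam_District_Kerala.html"
NEXT_HREF = "/Restaurants-g297633-oa30-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS"

print(build_next_url(PAGE_URL, NEXT_HREF))
# https://www.tripadvisor.in/Restaurants-g297633-oa30-Kochi_Cochin_Ernakulam_District_Kerala.html#EATERY_LIST_CONTENTS
print(build_next_url(PAGE_URL, None))
# None
```

Returning None for a missing href mirrors the spider's "if next_page:" check, which is what ends the crawl on the last page.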

yzckvree3#

You can use

"//a[@data-page-number]/@href"

This locates a elements that have a data-page-number attribute, whatever its value. I guess this should be a unique locator.
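
Whether the locator is unique depends on the page, so it is worth counting the matches before relying on it. The sketch below is an illustration under assumptions: it uses the standard-library xml.etree.ElementTree (which supports the [@attribute] predicate) instead of Scrapy, and the fragment, in which both links carry data-page-number, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: both navigation links carry data-page-number.
FRAGMENT = """<div class="pagination">
  <a class="nav previous" data-page-number="1" href="/list-oa0.html">Previous</a>
  <a class="nav next" data-page-number="3" href="/list-oa60.html">Next</a>
</div>"""

root = ET.fromstring(FRAGMENT)

# Every <a> that has the attribute, regardless of its value.
with_attr = root.findall(".//a[@data-page-number]")
all_hrefs = [a.get("href") for a in with_attr]

# Narrow the matches down to the link whose class list includes "next".
next_hrefs = [a.get("href") for a in with_attr
              if "next" in a.get("class", "").split()]

print(all_hrefs)   # ['/list-oa0.html', '/list-oa60.html']
print(next_hrefs)  # ['/list-oa60.html']
```

If more than one link matches, as in this fragment, taking the first result would return the previous-page link, so an extra condition on the class or onclick attribute is needed.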

Update

You are using the wrong validation tool.
xpather.com is a better tool for validating XPath expressions.
