Scrapy: how do I build a full URL from a relative URL extracted with XPath?

omjgkv6w  posted on 2022-11-09  in Other
<td class="searchResultsLargeThumbnail" data-hj-suppress="">

            <a href="/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay" title="ATAŞEHİR AĞAOĞLU SOUTSİDE 2+1 FERAH CEPHE İYİ KONUM">
...
            <a href="/ilan/emlak-konut-satilik-atapark-konutlarinda-buyuk-tip-2-plus1-ebeveyn-banyolu-102-m-daire-1057086925/detay" title="Atapark Konutlarında Büyük Tip 2+1 Ebeveyn Banyolu 102 m² Daire">
...
            <a href="/ilan/emlak-konut-satilik-metropol-istanbul-yuksek-katli-cift-banyolu-satilik-2-plus1-daire-1049614464/detay" title="Metropol İstanbul Yüksek Katlı Çift Banyolu Satılık 2+1 Daire">
...

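A website has pages like this. I am trying to scrape the detail page of each ad. To iterate over them, I need the absolute links to those pages rather than the relative ones.
After running the following code: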

import scrapy

class AtasehirSpider(scrapy.Spider):
    name = 'atasehir'
    allowed_domains = ['www.sahibinden.com']
    start_urls = ['https://www.sahibinden.com/satilik/istanbul-atasehir?address_region=2']

    def parse(self, response):
        for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href"):
            print(ad.get())

I get the following output:

/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay
/ilan/emlak-konut-satilik-atapark-konutlarinda-buyuk-tip-2-plus1-ebeveyn-banyolu-102-m-daire-1057086925/detay
/ilan/emlak-konut-satilik-metropol-istanbul-yuksek-katli-cift-banyolu-satilik-2-plus1-daire-1049614464/detay
...
2022-10-14 03:37:23 [scrapy.core.engine] INFO: Closing spider (finished)

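I have already tried several solutions.
The first one:
I think "follow()" offers a simple way to solve this, but I could not get past the error because I do not have enough programming background.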


qv7cva1a1#

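Scrapy has a built-in method for this: response.urljoin(). You can call it on every link, whether or not it is a relative URL; Scrapy's implementation does the check for you. It takes only one argument because it automatically uses the URL that produced the response as the base.
For example: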

def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
        ad = response.urljoin(ad)
        print(ad)
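
To see what the join does without running a spider, here is a minimal sketch using the standard library's urllib.parse.urljoin, which follows the same URL-resolution rules. The helper name to_absolute and the example.com link are assumptions made up for illustration; the page URL and the relative href are taken from the question.

```python
from urllib.parse import urljoin

# Listing page URL from the question's start_urls, used here as the base.
page_url = "https://www.sahibinden.com/satilik/istanbul-atasehir?address_region=2"


def to_absolute(base_url, href):
    """Hypothetical helper: resolve href against base_url.

    A relative href is joined onto the base; an href that is already
    absolute is returned unchanged.
    """
    return urljoin(base_url, href)


# Relative href copied from the question's HTML.
relative = "/ilan/emlak-konut-satilik-atasehir-agaoglu-soutside-2-plus1-ferah-cephe-iyi-konum-1057265758/detay"
# Made-up absolute link, only to show that it passes through untouched.
absolute = "https://example.com/other/page"

print(to_absolute(page_url, relative))
print(to_absolute(page_url, absolute))
```

Because the relative href starts with "/", it replaces the whole path and query of the base URL, so only the scheme and host of the listing page are kept. Inside a spider, response.follow() also accepts relative URLs directly, which may be the simpler route if the goal is to request each ad page.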

jtoj6r0c2#

You can try the following:

def parse(self, response):
    for ad in response.xpath("//td[@class='searchResultsLargeThumbnail']/a/@href").getall():
        # Each href already starts with "/", so prepend only the scheme and host.
        ad_url = f"https://www.sahibinden.com{ad}"
        print(ad_url)
