python | How can I get the links of the products without an href, using Selenium?

wgxvkvu9 posted on 2023-02-20 in Other
Follow (0) | Answers (3) | Views (121)

I'm trying to scrape a website, but the products have no href.
link: https://es.wallapop.com/app/search?keywords=monitor&filters_source=search_box&latitude=39.46895&longitude=-0.37686
I'm using Selenium to browse the page and BeautifulSoup to parse the results, but I don't know what to do if I want to open each product to get more data.

kt06eoxx1#

You can use their Ajax API to download the information about the items (one piece of which is the item's URL):

import requests

api_url = "https://api.wallapop.com/api/v3/general/search"
params = {
    "keywords": "monitor",
    "filters_source": "search_box",
    "latitude": "39.46895",
    "longitude": "-0.37686",
}

data = requests.get(api_url, params=params).json()

# each result carries a web_slug from which the public item URL can be built
for o in data["search_objects"]:
    print(
        "{:<30} {:<5} {}".format(
            o["title"][:30], o["price"], "https://es.wallapop.com/item/" + o["web_slug"]
        )
    )

Prints:

Monitor Philips                50.0  https://es.wallapop.com/item/pantalla-pc-874564380
monitor de vigilancia de bebé  48.0  https://es.wallapop.com/item/monitor-de-vigilancia-de-bebe-874583928
MONITOR PC NUEVO               100.0 https://es.wallapop.com/item/monitor-pc-nuevo-874568539
Monitor LG 24"                 65.0  https://es.wallapop.com/item/monitor-lg-24-874529151
Monitor Samsung 24"            80.0  https://es.wallapop.com/item/monitor-samsung-24-874496925
monitor roto gaming msi 32     50.0  https://es.wallapop.com/item/monitor-roto-gaming-msi-32-874589740
Monitor 22 pulgadas Full HD    30.0  https://es.wallapop.com/item/monitor-22-pulgadas-full-hd-874571734
Monitor Keep Out 32" Full HD H 120.0 https://es.wallapop.com/item/monitor-keep-out-32-full-hd-hdmi-874555259
Monitor Asus 4k 60hz con caja  230.0 https://es.wallapop.com/item/monitor-asus-4k-60hz-con-caja-874575205
Monitor Keep Out XGM24P 24" Fu 85.0  https://es.wallapop.com/item/monitor-keep-out-xgm24p-24-full-hd-hdmi-874557938
MONITOR HP W2072a NEGRO        50.0  https://es.wallapop.com/item/monitor-hp-w2072a-negro-874446605
Monitor ASUS VZ249HE 23.8" Ful 119.0 https://es.wallapop.com/item/monitor-asus-vz249he-23-8-full-hd-874542630
Samsung Monitor Curvo Gaming   150.0 https://es.wallapop.com/item/samsung-monitor-curvo-gaming-874418062
Monitor Lenovo L22e-20 21,5" F 80.0  https://es.wallapop.com/item/monitor-lenovo-l22e-20-21-5-full-hd-hdmi-874560092
Monitor Acer 18.5''            20.0  https://es.wallapop.com/item/monitor-acer-18-5-874380113
se vende monitor asus          45.0  https://es.wallapop.com/item/se-vende-monitor-asus-874244944
HP 22W Monitor                 60.0  https://es.wallapop.com/item/hp-22w-monitor-874320493
MONITOR ASUS 19 NUEVO          60.0  https://es.wallapop.com/item/monitor-asus-19-nuevo-874239039
Monitor Hp                     150.0 https://es.wallapop.com/item/monitor-hp-874114954
Monitor pc                     10.0  https://es.wallapop.com/item/monitor-pc-874098257
Monitor HP negro 20 pulgadas   15.0  https://es.wallapop.com/item/monitor-hp-negro-20-pulgadas-874234251
Monitor 21.5" Full HD LED      80.0  https://es.wallapop.com/item/monitor-21-5-full-hd-led-874261942
Monitor MEDION MD 5043 OD, des 10.0  https://es.wallapop.com/item/monitor-medion-md-5043-od-despiece-874263768
Monitor LED de 24" - LG 24EA53 80.0  https://es.wallapop.com/item/monitor-led-de-24-lg-24ea53vq-p-874296411
Monitor para ordenador         10.0  https://es.wallapop.com/item/monitor-para-ordenador-873697820
Monitor HP 27o negociable      110.0 https://es.wallapop.com/item/monitor-hp-27o-negociable-874017054
Monitor Samsung                30.0  https://es.wallapop.com/item/monitor-samsung-873754031
Monitor philips                30.0  https://es.wallapop.com/item/monitor-philips-873752989
BenQ GL2460 - Monitor LED de 2 50.0  https://es.wallapop.com/item/benq-gl2460-monitor-led-de-24-full-hd-2ms-hdmi-874289910
Monitor 2 k                    100.0 https://es.wallapop.com/item/monitor-2-k-873847429
Monitores Tv Monitor Lg M197wd 29.9  https://es.wallapop.com/item/monitores-tv-monitor-lg-m197wdp-hdmi-873875385
Samsung Monitor Profesional de 150.0 https://es.wallapop.com/item/samsung-monitor-profesional-de-27-2k-874038879
Monitor hp 22xw Pavillon como  70.0  https://es.wallapop.com/item/monitor-hp-22xw-como-nuevo-874014297
Monitor 24" Benq XL2411T 144Hz 150.0 https://es.wallapop.com/item/monitor-24-benq-xl2411t-144hz-874159199
Monitor                        65.0  https://es.wallapop.com/item/monitor-873408045
Monitor benq 144hz             120.0 https://es.wallapop.com/item/monitor-benq-144hz-873858204
Monitor 21,5" Samsung          69.0  https://es.wallapop.com/item/monitor-21-5-samsung-873777001
2x monitores Asus 19.5" sin es 60.0  https://es.wallapop.com/item/2x-monitores-asus-19-5-sin-estrenar-874000367
Monitores baratos              7.0   https://es.wallapop.com/item/monitores-baratos-873394797
MONITOR HP 21'5 LCD            95.0  https://es.wallapop.com/item/monitor-hp-21-5-lcd-873879049

Edit: To get the products from the next pages, you can add the start= parameter to the URL:

import requests

api_url = "https://api.wallapop.com/api/v3/general/search"
params = {
    "keywords": "monitor",
    "filters_source": "search_box",
    "latitude": "39.46895",
    "longitude": "-0.37686",
    "start": 0
}

for page in range(0, 3):  # <-- increase number of pages here
    params['start'] = page * 40

    data = requests.get(api_url, params=params).json()
    for o in data["search_objects"]:
        print(
            "{:<30} {:<5} {}".format(
                o["title"][:30], o["price"], "https://es.wallapop.com/item/" + o["web_slug"]
            )
        )
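If you don't know how many pages there are in advance, you can also keep increasing start until the API stops returning results. A minimal sketch, under the assumption that an exhausted search simply returns an empty search_objects list:

import requests

api_url = "https://api.wallapop.com/api/v3/general/search"
params = {
    "keywords": "monitor",
    "filters_source": "search_box",
    "latitude": "39.46895",
    "longitude": "-0.37686",
    "start": 0,
}

while True:
    data = requests.get(api_url, params=params).json()
    items = data.get("search_objects", [])
    if not items:  # assumption: an exhausted search returns no items
        break
    for o in items:
        print(o["title"][:30], o["price"])
    params["start"] += len(items)  # advance past the items already seen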

cqoc49vn2#

You can use Wallapop's API to get this data. There are many ways to process it; I'd suggest using the Scrapy framework. Below is code that uses Scrapy to get the item URLs (and save each item's data to a JSON file):

import json
import scrapy

class WallapopSpider(scrapy.Spider):
    name = "wallapop"
    search_url = "https://api.wallapop.com/api/v3/general/search?keywords={keywords}&latitude={latitude}&longitude={longitude}&start={start}&items_count={items_count}&filters_source=search_box&order_by={order_by}"
    item_url = "https://api.wallapop.com/api/v3/items/"

    def start_requests(self):
        meta = {
            "keywords": "monitor",
            "latitude": 39.46895,
            "longitude": -0.37686,
            "start": 0,
            "items_count": 40,
            "order_by": "most_relevance",
        }
        yield scrapy.Request(
            url=self.search_url.format(**meta), callback=self.parse_search
        )

    def parse_search(self, response):
        # parsing the response into a dict
        dict_data = json.loads(response.text)
        # iterating over the search results
        for item in dict_data["search_objects"]:
            yield scrapy.Request(
                url=self.item_url + item["id"], callback=self.parse_item
            )

    def parse_item(self, response):
        # parsing the response into a dict
        item_data = json.loads(response.text)
        print(item_data["share_url"])
        print(item_data["description"])
        # saving the item data on a json file
        with open(f"{item_data['id']}.json", "w") as f:
            json.dump(item_data, f)

I found these two API endpoints by looking at the network traffic in the browser dev tools. I used Valencia's latitude and longitude values and "monitor" as the keyword. The parameters required to run the query are the keywords, latitude, and longitude.
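For example, fetching a single item's full details with plain requests (a minimal sketch; it assumes the same response fields the Scrapy callbacks above rely on):

import requests

# grab the first search result's id, then query the item endpoint with it
search = requests.get(
    "https://api.wallapop.com/api/v3/general/search",
    params={"keywords": "monitor", "latitude": "39.46895", "longitude": "-0.37686"},
).json()
first_id = search["search_objects"][0]["id"]

item = requests.get(f"https://api.wallapop.com/api/v3/items/{first_id}").json()
print(item["share_url"])
print(item["description"])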
I'd also suggest running the crawler on a spider-management solution such as estela.
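If you want to test the spider locally first, without a full Scrapy project, a minimal sketch using CrawlerProcess (the settings here are just an assumption):

from scrapy.crawler import CrawlerProcess

# assumes WallapopSpider from the snippet above is defined in the same file
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(WallapopSpider)
process.start()  # blocks until the crawl finishes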


e4yzc0pl3#

Although I think the API is more efficient, here's an example if you want a Selenium + bs4 solution.
Using these functions:

# from selenium import webdriver
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# from bs4 import BeautifulSoup
# from urllib.parse import urljoin

## scroll to an element and click [targetEl can be and element or selector] ##
def scrollClick(driverX, targetEl, maxWait=5, scroll2Top=False, printErr=True):
    try:
        xWait = WebDriverWait(driverX, maxWait)
        if isinstance(targetEl, str):
            xWait.until(
                EC.presence_of_element_located((By.CSS_SELECTOR,targetEl)))
            targetEl = driverX.find_element(By.CSS_SELECTOR, targetEl)
        xWait.until(EC.element_to_be_clickable(targetEl))
        driverX.execute_script('''
            arguments[0].scrollIntoView(arguments[1]);
        ''', targetEl, bool(scroll2Top)) ## execute js to scroll
        targetEl.click()
    except Exception as e: 
        if printErr: print(repr(e), '\nFailed to click', targetEl)

## find a nextSibling of refEl that matches selector [if specified by sel] ##
def selectNextSib(driverX, refEl, sel=False, printError=False):
    sel = sel.strip() if isinstance(sel, str) and sel.strip() else False
    try: ## execute js code to find next card
        return driverX.execute_script('''
            var sibling = arguments[0].nextElementSibling;
            while (sibling && arguments[1]) {
                if (sibling.matches(arguments[1])) break;
                sibling = sibling.nextElementSibling; } 
            return sibling;''', refEl, sel)
    except Exception as e: 
        if printError: print(f'Error finding next "{sel}":',repr(e))

## [bs4] extract text or attribute from a tag inside tagSoup ##
def selectGet(tagSoup, selector='', ta='', defaultVal=None):
    el = tagSoup.select_one(selector) if selector else tagSoup
    if el is None: return defaultVal
    return el.get(ta, defaultVal) if ta else el.get_text(' ', strip=True)

## parse product page html and extract product details ##
def getProductDetails(prodPgHtml:str, prodUrl=None):
    pSoup = BeautifulSoup(prodPgHtml.encode('utf-8'), 'lxml')
    detsDiv = pSoup.select_one('div.detail-item')
    detKeys = ['category_id', 'is_bulky', 'is_bumped',
               'is_free_shipping_allowed', 'item_id', 'item_uuid',
               'main_image_thumbnail', 'mine', 'sell_price',
               'seller_user_id', 'subcategory_id', 'title']
    pDets = {} if detsDiv is None else {
        ## removeprefix [Python 3.9+] rather than lstrip, which strips a
        ## character set and would mangle 'data-title' into 'itle'
        k.removeprefix('data-').replace('-', '_'): v
        for k, v in sorted(detsDiv.attrs.items(), key=lambda x: x[0])
        if k.removeprefix('data-').replace('-', '_') in detKeys
    }
    pDets['description'] = selectGet(pSoup, 'div.card-product-detail-top>p')
    pDets['date_posted'] = selectGet(pSoup, 'div[class$="published"]')
    pDets['views_count'] = selectGet(pSoup, 'i.ico-eye+span')
    pDets['likes_count'] = selectGet(pSoup, 'i.ico-coounter_favourites+span')
    pDets['seller_name'] = selectGet(pSoup, 'h2.card-user-detail-name')
    uLink = selectGet(pSoup, 'a.card-user-right[href]', 'href')
    if uLink: pDets['seller_link'] = urljoin(prodUrl, uLink)

    ### EXTRACT ANY OTHER DETAILS YOU WANT ###

    pDets['product_link'] = prodUrl
    return pDets

You can then loop through the cards on the results page, clicking each one to open it in a new tab and scrape the product details:

kSearch, maxItems = 'monitor', 1500 ## adjust as preferred
url = f'https://es.wallapop.com/app/search?keywords={"+".join(kSearch.split())}'
url = f'{url}&filters_source=search_box&latitude=39.46895&longitude=-0.37686'

browser = webdriver.Chrome() 
browser.get(url)
browser.maximize_window()

scrollClick(browser, 'button[id="onetrust-accept-btn-handler"]') ## accept cookies
scrollClick(browser, 'tsl-button[id="btn-load-more"]') ## load more [then ∞-scroll]

itemCt, scrapedLinks, products = 0, [], [] ## initiate
itemSel, nextItem = 'a.ItemCardList__item[title]', None
try: nextItem = browser.find_element(By.CSS_SELECTOR, itemSel) ## first card
except Exception as e: print('No items found:', repr(e))

while nextItem:
    itemCt += 1 # counter
    cpHtml, cpTxt = '', '' # clear/initiate
    resultsTab = browser.current_window_handle # to go back

    try: # click card -> open new tab -> scrape product details
        cpHtml, cpTxt = nextItem.get_attribute('outerHTML'), nextItem.text
        scrollClick(browser, nextItem) ## click current card
        # add wait ?
        browser.switch_to.window(browser.window_handles[1]) ## go to 2nd tab
        WebDriverWait(browser, 5).until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, 'div.detail-item'))) ## wait to load details
        pLink = browser.current_url ## product URL
        if pLink not in scrapedLinks: # skip duplicates [just in case]
            products.append(getProductDetails(browser.page_source, pLink))
        scrapedLinks.append(pLink)
    except Exception as e:
        print('!', [itemCt], ' '.join(cpTxt.split()), repr(e)) ## print error
        pSoup = BeautifulSoup(cpHtml.encode('utf-8'), 'lxml')
        products.append({
            'title': selectGet(pSoup, 'a', 'title'), ## select the card's <a> to read its title attribute
            'price': selectGet(pSoup, 'span.ItemCard__price'),
            'errorMsg': f'{type(e)} {e}'
        }) ## [ make do with info in card ]

    try: # close all tabs other than results tab
        for w in browser.window_handles:
            if w != resultsTab:
                browser.switch_to.window(w)
                browser.close()
            browser.switch_to.window(resultsTab)
    except Exception as e:
        print('Failed to restore results-tab-only window:', repr(e))
        break

    # print('', end=f"\r[{itemCt} of {maxItems}] {' '.join(cpTxt.split())} {repr(e)}")

    if isinstance(maxItems, int) and itemCt >= maxItems:
        break  ## stop once the requested number of items has been scraped

    nextItem = selectNextSib(browser, nextItem, itemSel) # get next result card

A few notes:

  • The second scrollClick call is for the "Load More" button - it only needs to be clicked once; after that, more results are loaded as you scroll down.
  • Maintaining and checking scrapedLinks doesn't seem necessary, since there don't appear to be any duplicates to filter out. [It's just a habit, because some sites do have duplicates, especially if there's pagination...]
  • If you set maxItems to None [or a very large number], it should keep scrolling and scraping until either
      • there are no more results to load, or
      • the page crashes [it crashed before reaching 800 results when I tested]. In that regard, the API has an advantage (you can watch the network logs while clicking "Load More" and copy the API request into curlconverter to generate code that replicates it, if you want to start exploring it...). It also helps to save partial results as you go; see the sketch after this list.
  • The resulting products should be a list of dictionaries, each containing the details of a different product.
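Because of that crash risk on long runs, it may be worth checkpointing products from inside the while-loop every so often; a minimal sketch (the filename and the 100-item interval are arbitrary choices):

# import pandas as pd
if itemCt % 100 == 0:  ## inside the while-loop: checkpoint every 100 items
    pd.DataFrame(products).to_csv('monitors_partial.csv', index=False)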

The results should look something like this: [screenshot of the scraped results omitted]

By the way, you can also save the results with something like pandas' .to_csv:

# import pandas as pd
pd.DataFrame(products).to_csv('monitors.csv', index=False)
