Scraping sold items on eBay with Selenium returns []

thtygnil · asked 2022-11-10 · in: Other
Follow (0) | Answers (2) | Views (118)

I have almost no web-scraping experience and couldn't solve this with BeautifulSoup, so I tried Selenium (installed it today). I'm trying to scrape sold items on eBay. The page I'm trying to scrape is:
https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720
Here is my code, where I load the page and open it in Selenium:

ebay_url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720'

html = requests.get(ebay_url)
#print(html.text)

driver = wd.Chrome(executable_path=r'/Users/mburley/Downloads/chromedriver')
driver.get(ebay_url)

It correctly opens a new Chrome session at the right URL. I'm trying to get the title, price, and date sold, and then load them into a CSV file. Here is my code for that part:


# Find all div tags and set equal to main_data
all_items = driver.find_elements_by_class_name("s-item__info clearfix")[1:]
#print(main_data)

# Loop over main_data to extract div classes for title, price, and date
for item in all_items:
    date = item.find_element_by_xpath("//span[contains(@class, 'POSITIVE']").text.strip()
    title = item.find_element_by_xpath("//h3[contains(@class, 's-item__title s-item__title--has-tags']").text.strip()
    price = item.find_element_by_xpath("//span[contains(@class, 's-item__price']").text.strip()

    print('title:', title)
    print('price:', price)
    print('date:', date)
    print('---')
    data.append( [title, price, date] )

It only returns []. I thought eBay might be blocking my IP, but the loaded HTML looks correct. Hope someone can help! Thanks!

gzjq41n41#

There's no need to use Selenium for eBay scraping, because the data isn't rendered by JavaScript and can therefore be extracted from the plain HTML. The BeautifulSoup web-scraping library is enough.
Keep in mind that parsing problems can appear when you request a site repeatedly: eBay may decide the requests come from a bot rather than a real user.
One way to avoid this is to send headers containing a user-agent with the request; the site will then assume you are a real user and show the content.
As an extra step, rotate those user-agents. The ideal scenario is to combine proxies with rotating user-agents (besides a CAPTCHA solver).
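As a minimal sketch of the user-agent rotation mentioned above (the `USER_AGENTS` pool and the helper names are illustrative, not part of any library API):

```python
import random
import requests

# Example pool of desktop User-Agent strings (illustrative; keep them
# current in real use, or use a library that maintains such a list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0",
]

def make_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def get_page(url, **kwargs):
    """Send a GET request using a rotated User-Agent header."""
    return requests.get(url, headers=make_headers(), timeout=30, **kwargs)
```

Each call to `get_page` then presents a different browser signature, which makes simple request-pattern detection less likely to trigger.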

from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
    }

params = {
    '_nkw': 'oakley sunglasses',      # search query (requests URL-encodes the space)
    'LH_Sold': '1',                   # shows sold items
    '_pgn': 1                         # page number
}

data = []

while True:
    page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")

    print("-" * 10)

    for products in soup.select(".s-item__info"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text
        link = products.select_one(".s-item__link")["href"]

        data.append({
          "title" : title,
          "price" : price,
          "link" : link
        })

    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

Extracting page: 1
----------
[
  {
    "title": "Shop on eBay",
    "price": "$20.00",
    "link": "https://ebay.com/itm/123456?hash=item28caef0a3a:g:E3kAAOSwlGJiMikD&amdata=enc%3AAQAHAAAAsJoWXGf0hxNZspTmhb8%2FTJCCurAWCHuXJ2Xi3S9cwXL6BX04zSEiVaDMCvsUbApftgXEAHGJU1ZGugZO%2FnW1U7Gb6vgoL%2BmXlqCbLkwoZfF3AUAK8YvJ5B4%2BnhFA7ID4dxpYs4jjExEnN5SR2g1mQe7QtLkmGt%2FZ%2FbH2W62cXPuKbf550ExbnBPO2QJyZTXYCuw5KVkMdFMDuoB4p3FwJKcSPzez5kyQyVjyiIq6PB2q%7Ctkp%3ABlBMULq7kqyXYA"
  },
  {
    "title": "Oakley X-metal Juliet  Men's Sunglasses",
    "price": "$280.00",
    "link": "https://www.ebay.com/itm/265930582326?hash=item3deab2a936:g:t8gAAOSwMNhjRUuB&amdata=enc%3AAQAHAAAAoH76tlPncyxembf4SBvTKma1pJ4vg6QbKr21OxkL7NXZ5kAr7UvYLl2VoCPRA8KTqOumC%2Bl5RsaIpJgN2o2OlI7vfEclGr5Jc2zyO0JkAZ2Gftd7a4s11rVSnktOieITkfiM3JLXJM6QNTvokLclO6jnS%2FectMhVc91CSgZQ7rc%2BFGDjXhGyqq8A%2FoEyw4x1Bwl2sP0viGyBAL81D2LfE8E%3D%7Ctkp%3ABk9SR8yw1LH9YA"
  },
  {
    "title": " Used Oakley PROBATION Sunglasses Polished Gold/Dark Grey  (OO4041-03)",
    "price": "$120.00",
    "link": "https://www.ebay.com/itm/334596701765?hash=item4de7847e45:g:d5UAAOSw4YtjTfEE&amdata=enc%3AAQAHAAAAoItMbbzfQ74gNUiinmOVnzKlPWE%2Fc54B%2BS1%2BrZpy6vm5lB%2Bhvm5H43UFR0zeCU0Up6sPU2Wl6O6WR0x9FPv5Y1wYKTeUbpct5vFKu8OKFBLRT7Umt0yxmtLLMWaVlgKf7StwtK6lQ961Y33rf3YuQyp7MG7H%2Fa9fwSflpbJnE4A9rLqvf3hccR9tlWzKLMj9ZKbGxWT17%2BjyUp19XIvX2ZI%3D%7Ctkp%3ABk9SR8yw1LH9YA"
  },
  # ...
]
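Since the question also wants the results in a CSV file, the `data` list built by the loop above can be written out with the standard `csv` module; a minimal sketch (the sample row and the `oakley_sold.csv` filename are illustrative):

```python
import csv

# `data` as built by the scraping loop above: a list of dicts with
# "title", "price" and "link" keys (sample row shown for illustration).
data = [
    {"title": "Oakley X-metal Juliet Men's Sunglasses",
     "price": "$280.00",
     "link": "https://www.ebay.com/itm/265930582326"},
]

with open("oakley_sold.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "link"])
    writer.writeheader()    # header row: title,price,link
    writer.writerows(data)  # one row per scraped listing
```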

As an alternative, you can use the eBay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on the backend.
Example code that paginates through all pages:

from serpapi import EbaySearch
from urllib.parse import (parse_qsl, urlsplit)
import os, json

params = {
    "api_key": os.getenv("API_KEY"),      # serpapi api key    
    "engine": "ebay",                     # search engine
    "ebay_domain": "ebay.com",            # ebay domain
    "_nkw": "oakley+sunglasses",          # search query
    "LH_Sold": "1"                        # shows sold items
}

search = EbaySearch(params)        # where data extraction happens

page_num = 0

data = []

while True:
    results = search.get_dict()     # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break

    for organic_result in results.get("organic_results", []):
        link = organic_result.get("link")
        price = organic_result.get("price")

        data.append({
          "price" : price,
          "link" : link
        })

    page_num += 1
    print(page_num)

    next_page_query_dict = dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)) 
    current_page = results["serpapi_pagination"]["current"] # 1,2,3...

    # looks for the next page data (_pgn):
    # {'_nkw': 'minecraft redstone', '_pgn': '19', 'engine': 'ebay'}
    if "next" in results.get("pagination", {}):

        # if current_page = 20 and next_page_query_dict["_pgn"] = 20: break
        if int(current_page) == int(next_page_query_dict["_pgn"]):
            break

        # update next page data
        search.params_dict.update(next_page_query_dict)
    else:
        break
print(json.dumps(data, indent=2))

Output:

[
   {
    "price": {
      "raw": "$68.96",
      "extracted": 68.96
    },
    "link": "https://www.ebay.com/itm/125360598217?epid=20030526224&hash=item1d3012ecc9:g:478AAOSwCt5iqgG5&amdata=enc%3AAQAHAAAA4Ls3N%2FEH5OR6w3uoTlsxUlEsl0J%2B1aYmOoV6qsUxRO1d1w3twg6LrBbUl%2FCrSTxNOjnDgIh8DSI67n%2BJe%2F8c3GMUrIFpJ5lofIRdEmchFDmsd2I3tnbJEqZjIkWX6wXMnNbPiBEM8%2FML4ljppkSl4yfUZSV%2BYXTffSlCItT%2B7ZhM1fDttRxq5MffSRBAhuaG0tA7Dh69ZPxV8%2Bu1HuM0jDQjjC4g17I3Bjg6J3daC4ZuK%2FNNFlCLHv97w2fW8tMaPl8vANMw8OUJa5z2Eclh99WUBvAyAuy10uEtB3NDwiMV%7Ctkp%3ABk9SR5DKgLD9YA"
  },
  {
    "price": {
      "raw": "$62.95",
      "extracted": 62.95
    },
    "link": "https://www.ebay.com/itm/125368283608?epid=1567457519&hash=item1d308831d8:g:rnsAAOSw7PJiqMQz&amdata=enc%3AAQAHAAAA4AwZhKJZfTqrG8VskZL8rtfsuNtZrMdWYpndpFs%2FhfrIOV%2FAjLuzNzaMNIvTa%2B6QUTdkOwTLRun8n43cZizqtOulsoBLQIwy3wf19N0sHxGF5HaIDOBeW%2B2sobRnzGdX%2Fsmgz1PRiKFZi%2BUxaLQpWCoGBf9n8mjcsFXi3esxbmAZ8kenO%2BARbRBzA2Honzaleb2tyH5Tf8%2Bs%2Fm5goqbon%2FcEsR0URO7BROkBUUjDCdDH6fFi99m6anNMMC3yTBpzypaFWio0u2qu5TgjABUfO1wzxb4ofA56BNKjoxttb7E%2F%7Ctkp%3ABk9SR5DKgLD9YA"
  },
  # ...
]

Disclaimer: I work for SerpApi.

vcirk6k62#

You can use the code below to get those details. In addition, you can use pandas to store the data in a CSV file.

Code:

ebay_url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720'

html = requests.get(ebay_url)

# print(html.text)

driver = wd.Chrome(executable_path=r'/Users/mburley/Downloads/chromedriver')
driver.maximize_window()
driver.implicitly_wait(30)
driver.get(ebay_url)

wait = WebDriverWait(driver, 20)
sold_date = []
title = []
price = []
i = 1
for item in driver.find_elements(By.XPATH, "//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']"):
    sold_date.append(item.text)
    title.append(driver.find_element(By.XPATH, f"(//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']/ancestor::div[contains(@class,'tag')]/following-sibling::a/h3)[{i}]").text)
    price.append(item.find_element(By.XPATH, f"(//div[contains(@class,'title--tagblock')]/span[@class='POSITIVE']/ancestor::div[contains(@class,'tag')]/following-sibling::div[contains(@class,'details')]/descendant::span[@class='POSITIVE'])[{i}]").text)
    i = i + 1

print(sold_date)
print(title)
print(price)

data = {
         'Sold_date': sold_date,
         'title': title,
         'price': price
        }
df = pd.DataFrame.from_dict(data)
df.to_csv('out.csv', index=False)
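If the prices are needed as numbers rather than strings (e.g. to compute an average), the currency symbol can be stripped before or after building the DataFrame. A minimal sketch; the sample rows and the `price_usd` column name are illustrative, not part of the original answer:

```python
import pandas as pd

# Example rows in the same shape the scraping loop above produces;
# the values here are illustrative.
data = {
    "Sold_date": ["Sold Oct 30, 2022", "Sold Nov 1, 2022"],
    "title": ["Oakley Juliet", "Oakley Probation"],
    "price": ["$280.00", "$120.00"],
}

df = pd.DataFrame.from_dict(data)
# Strip the currency symbol and thousands separators, then convert to float.
df["price_usd"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)
print(df["price_usd"].mean())  # -> 200.0
```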

Imports:

import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
