pandas: Using Beautiful Soup to scrape sold items on eBay

iqxoj9l9 · asked 2023-01-28

I am looking at sold items on eBay, specifically this search:
https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720
Below is my code, where I load the HTML and convert it into a soup object:

import requests
from bs4 import BeautifulSoup as bs

ebay_url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=oakley+sunglasses&_sacat=0&Brand=Oakley&rt=nc&LH_Sold=1&LH_Complete=1&_ipg=200&_oaa=1&_fsrp=1&_dcat=79720'
response = requests.get(ebay_url)

soup = bs(response.text, 'html.parser')
#print(soup.prettify())

I am working on getting the title, price, and date sold, and then loading them into a CSV file. Below is the code I have for the titles:

title = soup.find_all("h3", "s-item__title s-item__title--has-tags")
print(title)

listing_titles = []

for i in range(1, len(title)):
    listing_titles.append(title[i].text)

print(listing_titles)

It only returns empty square brackets, like []. The HTML soup object prints correctly, and the response prints as 200. It seems like my code should work, and finding the listing price and sold date should be similar. I am wondering whether this is a job for Selenium. Hope someone can help! Thanks!

9avjhtql · answer 1

First, you can find all the divs based on their class, then loop over them to get the title, price, and date:

# skip the first result with [1:] (a placeholder row on eBay search pages)
main_data = soup.find_all("div", class_="s-item__info clearfix")[1:]
for i in main_data:
    print(i.find("span", class_="POSITIVE").get_text())
    print(i.find("h3", class_="s-item__title s-item__title--has-tags").get_text())
    print(i.find("span", class_="s-item__price").get_text())

Output:

Sold  Aug 15, 2021
Oakley A Wire 2.0  Sunglasses Brushed Thick Frames Green Lenses
$185.00
...
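To get these into the CSV the question asks for, here is a minimal sketch building on the same main_data from above; the None checks (so listings missing a field are skipped) and the sold_items.csv filename are my own additions:

import pandas as pd

rows = []
for i in main_data:
    date_tag = i.find("span", class_="POSITIVE")
    title_tag = i.find("h3", class_="s-item__title s-item__title--has-tags")
    price_tag = i.find("span", class_="s-item__price")
    if date_tag and title_tag and price_tag:   # skip listings missing any field
        rows.append({
            "title": title_tag.get_text(),
            "price": price_tag.get_text(),
            "sold_date": date_tag.get_text(),
        })

pd.DataFrame(rows).to_csv("sold_items.csv", index=False)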

jexiocij · answer 2

The response can be empty because the request may be blocked: the default user-agent in the requests library is python-requests, which tells the website that a bot or script is sending the request. Check what user agent you have.
An additional step, besides providing a browser user-agent, could be to rotate the user-agent, for example to switch between PC, mobile, and tablet devices, as well as between browsers such as Chrome, Firefox, Safari, Edge, and so on; a minimal sketch of this follows.
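Here is a minimal sketch of user-agent rotation, picking a random browser string per request. The user_agents list is illustrative; any set of real browser strings works:

import random
import requests

# illustrative desktop and mobile user-agent strings; swap in your own list
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1",
]

# a fresh random pick per request
headers = {"User-Agent": random.choice(user_agents)}
page = requests.get("https://www.ebay.com/sch/i.html",
                    params={"_nkw": "oakley sunglasses", "LH_Sold": "1"},
                    headers=headers, timeout=30)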
You can also use pagination to get all results from all pages. The solution is to use an infinite while loop and test for something (a button, an element) that will cause it to exit.
In our case, this is a button on the page (the .pagination__next selector).
Check the full code in the online IDE.

from bs4 import BeautifulSoup
import requests, lxml
import pandas as pd

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}

params = {
    '_nkw': 'oakley sunglasses',      # search query (requests URL-encodes the space as '+')
    'LH_Sold': '1',                   # shows sold items
    '_pgn': 1                         # page number
}

data = []

while True:
    page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')

    print(f"Extracting page: {params['_pgn']}")
    print("-" * 10)

    for products in soup.select(".s-item__pl-on-bottom"):
        title = products.select_one(".s-item__title span").text
        price = products.select_one(".s-item__price").text
        try:
            sold_date = products.select_one(".s-item__title--tagblock .POSITIVE").text
        except AttributeError:   # the sold-date tag is missing on this listing
            sold_date = None

        data.append({
            "title": title,
            "price": price,
            "sold_date": sold_date
        })

    # keep paging until the "next" button disappears
    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

# save to CSV (requires pandas)
pd.DataFrame(data=data).to_csv("ebay_products.csv", index=False)

Output: the file "ebay_products.csv" is created.
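Since the question is tagged pandas, here is a short follow-up sketch (my addition, assuming the ebay_products.csv produced above) that turns the scraped strings into typed columns:

import pandas as pd

df = pd.read_csv("ebay_products.csv")

# "$185.00" -> 185.0; price ranges like "$20.00 to $35.00" become NaN here
df["price_usd"] = pd.to_numeric(
    df["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# "Sold  Aug 15, 2021" -> datetime64
df["sold_date"] = pd.to_datetime(
    df["sold_date"].str.replace("Sold", "", regex=False).str.strip(),
    format="%b %d, %Y", errors="coerce"
)

print(df.dtypes)
print(df.head())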
As an alternative, you can use SerpApi's Ebay Organic Results API. It is a paid API with a free plan that handles blocks and parsing on its backend.
Example code:

from serpapi import EbaySearch
import os
import pandas as pd

params = {
    "api_key": os.getenv("API_KEY"),      # serpapi key, https://serpapi.com/manage-api-key
    "engine": "ebay",                     # search engine
    "ebay_domain": "ebay.com",            # ebay domain
    "_nkw": "oakley+sunglasses",          # search query
    "LH_Sold": "1",                       # shows sold items
    "_pgn": 1                             # page number
}

search = EbaySearch(params)        # where data extraction happens

page_num = 0

data = []

while True:
    results = search.get_dict()     # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break

    for organic_result in results.get("organic_results", []):
        title = organic_result.get("title")
        price = organic_result.get("price")

        data.append({
            "title": title,
            "price": price
        })

    page_num += 1
    print(page_num)

    # advance to the next page while the API reports one
    if "next" in results.get("pagination", {}):
        params['_pgn'] += 1
    else:
        break

pd.DataFrame(data=data).to_csv("ebay_products.csv", index=False)

Output: the file "ebay_products.csv" is created.
If you want to know more about scraping websites, there is a blog post, 13 ways to scrape any public data from any website.
