Scraping products into pandas from a website that auto-loads on scroll

Asked by mnemlml8 on 2022-12-28 · 2 answers

My problem is that I am scraping products from a website that loads more products automatically as you scroll down. My scrape only returns the first 24 items, so my question is: what code can I use to loop over all of the products? I looked at the link below, but the URL contains nothing that indicates which page I am on.

from bs4 import BeautifulSoup
import requests
import pandas as pd

product_name = []
product_brand = []
product_price = []
product_img = []
relative_url = []

website = 'https://en-saudi.ounass.com/women/beauty/fragrance'
response = requests.get(website)
soup = BeautifulSoup(response.content, 'html.parser')
results = soup.find_all('div', {'class': 'Product-contents'})

for result in results:
    # name
    try:
        product_name.append(result.find('div', {'class': 'Product-name'}).get_text())
    except:
        product_name.append('n/a')

    # brand
    try:
        product_brand.append(result.find('div', {'class': 'Product-brand'}).get_text())
    except:
        product_brand.append('n/a')

    # price
    try:
        product_price.append(result.find('span', {'class': 'Product-minPrice'}).get_text())
    except:
        product_price.append('n/a')

    # pics
    try:
        product_img.append(result.find('img', {'class': 'Product-image'}).get('data-src'))
    except:
        product_img.append('n/a')

    # relative_url
    try:
        relative_url.append(result.find('a', {'class': 'Product-link'}).get('href'))
    except:
        relative_url.append('n/a')
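
The lists are presumably meant to end up in a DataFrame at the end (hence the pandas import); a minimal sketch of that final step, with arbitrary column names:

df = pd.DataFrame({
    'Name': product_name,
    'Brand': product_brand,
    'Price': product_price,
    'Image': product_img,
    'Link': relative_url,
})
print(df)  # only the first 24 products appear, because the rest load on scroll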

wyyhbhjk1#

You just need to use the public API. It has all the information you need, and it also works much faster than Selenium. Here is an example covering the fields from your question:

import requests
import pandas as pd

results = []
page = 0
while True:
    # The 'p' query parameter selects the page; the site URL shows no paging,
    # but the API behind it paginates explicitly.
    url = f"https://en-saudi.ounass.com/api/women/beauty/fragrance?sortBy=popularity-asc&p={page}&facets=0"
    hits = requests.get(url).json()['hits']
    if hits:
        page += 1
        for hit in hits:
            results.append({
                'Name': hit['analytics']['name'],
                'Brand': hit['analytics']['brand'],
                'Price': hit['price'],
                'Image': hit['_imageurl'],
                'Link': f"https://en-saudi.ounass.com/{hit['slug']}.html"
            })
    else:
        # An empty 'hits' list means we ran past the last page.
        break
df = pd.DataFrame(results)
print(df)

Output:

Name  ...                                               Link
0           Cœur de Jardin Eau de Parfum, 100ml  ...  https://en-saudi.ounass.com/shop-miller-harris...
1        Patchouli Intense Eau de Parfum, 100ml  ...  https://en-saudi.ounass.com/shop-nicolai-parfu...
2            Blue Sapphire Eau de Parfum, 100ml  ...  https://en-saudi.ounass.com/shop-boadicea-the-...
3           Ambre Vanillé Eau de Toilette, 50ml  ...  https://en-saudi.ounass.com/shop-laura-mercier...
4     Baccarat Rouge 540 Scented Body Oil, 70ml  ...  https://en-saudi.ounass.com/shop-maison-franci...
...                                         ...  ...                                                ...
2368               Olene Eau de Toilette, 100ml  ...  https://en-saudi.ounass.com/shop-diptyque-olen...
2369  Magnolia Nobile Leather Purse Spray, 20ml  ...  https://en-saudi.ounass.com/shop-acqua-di-parm...
2370           Eau du Soir Eau de Parfum, 100ml  ...  https://en-saudi.ounass.com/shop-sisley-eau-du...
2371              Yvresse Eau de Toilette, 80ml  ...  https://en-saudi.ounass.com/shop-ysl-beauty-yv...
2372               Lalibela Eau de Parfum, 75ml  ...  https://en-saudi.ounass.com/shop-memo-paris-la...
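
If the data is meant for later analysis, the DataFrame can be written straight to disk; a one-line follow-up (the filename here is just an example):

# Persist the scraped catalogue; index=False drops the synthetic row numbers.
df.to_csv('ounass_fragrances.csv', index=False)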

mnowg1ta2#

You need Selenium for this task. Selenium opens a web page (using a driver) and performs the actions you specify, such as scrolling.
The code itself will depend on the structure of the website, but here are the main steps to get you started:
1. Download the Chrome or Firefox driver
2. Import selenium
3. Configure selenium to use the driver
4. Open the website
5. Find the element with the scroll bar and scroll down using the arrow-down key
6. Get the information you need from the loaded products; use Python's sleep to make sure everything has loaded, and scroll again as needed (see the sketch after the code below)

# Imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Open a driver (using Firefox in the example)
profile = webdriver.FirefoxProfile()
profile.set_preference('intl.accept_languages', 'en-us')
profile.update_preferences()
driver = webdriver.Firefox(firefox_profile=profile, executable_path='executable_path')

# Open the site
driver.get('https://www.example.com/products')

# Find the element with the scroll and scroll using the arrow-down key (10 times)
elem = driver.find_element_by_xpath('xpath_to_element_with_scroll')
for _ in range(10):
    elem.send_keys(Keys.ARROW_DOWN)

# Here you would collect the products, save them somewhere, and repeat as needed.
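
The ten key presses above are a fixed guess. To implement step 6 (wait, then scroll again until nothing new loads), a common pattern is to scroll the window itself and compare the document height between passes; a sketch, assuming 2 seconds is enough for each batch of products to load:

import time

# Scroll to the bottom repeatedly until the page height stops growing,
# i.e. no more products are being lazy-loaded.
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the next batch of products time to load
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break  # nothing new was loaded; we reached the end
    last_height = new_height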
