selenium Problem scraping title, price and date from a website using Selenium

ohfgkhjo  posted on 2022-11-10  in Other

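Hi guys, I'm trying to collect some information about shoes on Zalando with the Selenium webdriver and save the price, title, date and time in different variables. This is my code: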

from selenium import webdriver
from selenium.webdriver.common.by import By
import csv

DRIVER_PATH = r'C:\chromedriver.exe'  # raw string so the backslash is taken literally
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')

# Get the data of product 1 (if I change /div/div[1]/div to another number, it gets the data of another shoe)

product_1 = driver.find_element(By.XPATH, '//*[@id="release-calendar"]/div/div[1]/div')

element_text = product_1.text

print(element_text)

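When I print element_text with the code above, I get a lot of information about the product. I want to save it in different variables, so I tried something (keep reading):
109,95 € Nike Sportswear WMNS DUNK LOW CZ 10 de noviembre de 2022, 8:15 Recordar
So the problem is this: once that small piece of code worked, I tried to split the data and store each kind of data in its own variable by adding the code below, but I ran into a problem: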

from selenium import webdriver
from selenium.webdriver.common.by import By
import csv

DRIVER_PATH = r'C:\chromedriver.exe'  # raw string so the backslash is taken literally
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')

# Select product 1

product_1 = driver.find_element(By.XPATH, '//*[@id="release-calendar"]/div/div[1]/div')

element_text = product_1.text

# Split the data

element_text_split = element_text.split()  

# Price 1 --> Result=109,95

price_1 = element_text_split[0]
print(price_1)

# Title 1 --> Result=€ (I expected the brand name here)

title_1 = element_text_split[1]
print(title_1)

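The results of these two prints are: "109,95" and "€".
I thought element_text_split[1] would be Nike Sportswear, but it isn't; it is the € symbol, because I'm splitting the data on the spaces between the words.
That is a big problem if I want to get the name of the sneakers, because the names don't all contain the same number of spaces, for example: Nike Dunk Low Cz or Air Jordan One Mid 1.
How can I solve this?? Thanks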


cygmwpex1#

I think you might be looking for something like this?


# Needed libs

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# We create the driver

DRIVER_PATH = r'C:\chromedriver.exe'  # raw string so the backslash is taken literally
driver = webdriver.Chrome(executable_path=DRIVER_PATH)

# We maximize the window

driver.maximize_window()

# We navigate to the url

url='https://www.zalando.es/release-calendar/zapatillas-mujer/'
driver.get(url)

# We save a list of elements that are products (search for that xpath in the page and you will see what kind of element it is)

products = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@id='release-calendar']//div[contains(@data-cid,'cid')]")))

# We loop over that list and for each product we take the price, the brand, the model, the date, the link and the image.

for i, product in enumerate(products):
    price = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i+1}']/div[2]"))).text
    brand = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i+1}']/div[3]"))).text
    model = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i+1}']/div[4]"))).text
    date = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i+1}']/div[5]"))).text
    url = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i+1}']//a"))).get_attribute("href")
    image = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, f"//div[@data-cid='cid{i+1}']//img"))).get_attribute("src")
    print(f"""{price}
{brand}
{model}
{date}
{url}
{image}
""")

bn31dyow2#

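One idea is to look at the element_text variable for many different products and decide on a different way of splitting the text: the split method can take a smaller string on which to split the longer string.
If that doesn't work, you can also iterate over element_text_split (which is just a list of strings) and break that list up by looking for certain smaller strings or by using a regex.
For example, to find the price you can look for digits, then a period, then more digits (on this page the decimal separator is a comma, e.g. 109,95). I'd guess the product name comes either right before or right after it. Good luck!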
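
A minimal sketch of both suggestions, under stated assumptions: the sample string below is hypothetical and assumes that element_text puts each field on its own line (which is how Selenium's .text usually renders separate block elements); the variable names are for illustration only. Splitting on line breaks instead of spaces keeps multi-word names together, and a regex pulls the number out of the price.

```python
import re

# Hypothetical sample: assumes element_text has one field per line.
element_text = (
    "109,95 €\n"
    "Nike Sportswear\n"
    "WMNS DUNK LOW CZ\n"
    "10 de noviembre de 2022, 8:15"
)

# Split on line breaks, not on every space, so names stay in one piece.
lines = element_text.splitlines()
price, brand, model, release = lines[:4]

# The date and the time are separated by the first comma.
date, hour = [part.strip() for part in release.split(",", 1)]

# Digits, then a comma or period, then more digits.
match = re.search(r"\d+[.,]\d+", price)
price_value = float(match.group().replace(",", ".")) if match else None

print(price_value, brand, model, date, hour, sep=" | ")
```

If the real text turns out to have a different number of lines per product, the unpacking in `lines[:4]` would need adjusting, which is why inspecting element_text for several products first is worthwhile.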


5kgi1eie3#

You can get the required data in a robust way by combining Selenium with bs4 (BeautifulSoup):

from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
import pandas as pd
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)

d = []
driver.get('https://www.zalando.es/release-calendar/zapatillas-mujer/')
driver.maximize_window()
time.sleep(5)

soup = BeautifulSoup(driver.page_source,"html.parser")
price= [x.get_text(strip=True) for x in soup.select('.Wqd6Qu + div')]

# print(price)

title= [x.get_text(strip=True) for x in soup.select('.Wqd6Qu + div + div + div')]

# print(title)

date = [x.get_text(strip=True).split(',')[0] for x in soup.select('.Wqd6Qu + div + div + div + div')]

# print(date)

hour = [x.get_text(strip=True).split(',')[1] for x in soup.select('.Wqd6Qu + div + div + div + div')]

# print(hour)

cols = ['title', 'price', 'date', 'hour']

df = pd.DataFrame(data=list(zip(title,price,date,hour)), columns=cols)
print(df)

Output:

                   title     price                     date    hour
0       WMNS DUNK LOW CZ  109,95 €  10 de noviembre de 2022   14:15
1    HYPERTURF ADVENTURE  139,95 €  11 de noviembre de 2022   14:00
2       W AIR MAX 95 ESS  189,95 €  11 de noviembre de 2022   14:00
3           CITY CLASSIC  119,95 €  11 de noviembre de 2022   14:00
4           CITY CLASSIC  119,95 €  11 de noviembre de 2022   14:00
5         WMNS AIR 1 MID  129,95 €  11 de noviembre de 2022   14:15
6   DUNK LOW NEXT NATURE  109,95 €  11 de noviembre de 2022   14:15
7            CROSS WOMEN  295,00 €  14 de noviembre de 2022   14:00
8            CROSS WOMEN  295,00 €  14 de noviembre de 2022   14:00
9            CROSS WOMEN  295,00 €  14 de noviembre de 2022   14:00
10           W DUNK HIGH  119,95 €  14 de noviembre de 2022   14:15
11                 MT410   99,95 €  16 de noviembre de 2022   14:00
12                 MT410   99,95 €  16 de noviembre de 2022   14:00
13                 MT410   99,95 €  16 de noviembre de 2022   14:00
14                 MT410   99,95 €  16 de noviembre de 2022   14:00
15                 MT410   94,95 €  16 de noviembre de 2022   14:00
16                 WL574  109,95 €  18 de noviembre de 2022   14:00
17                 WS327  119,95 €  18 de noviembre de 2022   14:00
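
The values in this output are still plain strings. If numbers and real timestamps are needed, something like the following sketch could convert them; `parse_price`, `parse_release` and the `MONTHS` table are hypothetical helpers written for this example, not part of the answer above, and they assume the formats shown in the output (comma as decimal separator, Spanish month names).

```python
from datetime import datetime

# Assumption: Spanish month names as they appear in the output above.
MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}

def parse_price(text):
    """Turn a string like '109,95 €' into the float 109.95."""
    number = text.replace("€", "").strip()
    # Drop thousands separators, then use a dot as the decimal mark.
    return float(number.replace(".", "").replace(",", "."))

def parse_release(date_text, hour_text):
    """Combine '10 de noviembre de 2022' and ' 14:15' into a datetime."""
    day, _, month, _, year = date_text.split()
    hour, minute = hour_text.strip().split(":")
    return datetime(int(year), MONTHS[month.lower()], int(day),
                    int(hour), int(minute))

print(parse_price("109,95 €"))
print(parse_release("10 de noviembre de 2022", " 14:15"))
```

With the DataFrame from the answer, these helpers could presumably be applied per column, for example `df["price"].map(parse_price)`.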
