Python - Web scraping Walmart category names

mnemlml8  posted on 2023-03-06  in  Python
Follow (0) | Answers (4) | Views (87)

I am trying to get the department names from a Walmart link. Initially, the Departments panel on the left lists 7 departments (Chocolate Cookies, Cookies, Butter Cookies, ...). When I click See All Departments, 9 more categories are added, so the total becomes 16. I am trying to fetch all 16 departments automatically. I wrote the following code:

from selenium import webdriver

n_links = []

driver = webdriver.Chrome(executable_path='D:/Desktop/demo/chromedriver.exe')
url = "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391" 
driver.get(url)

search = driver.find_element_by_xpath("//*[@id='Departments']/div/div/ul").text
driver.find_element_by_xpath("//*[@id='Departments']/div/div/button/span").click()
search2 = driver.find_element_by_xpath("//*[@id='Departments']/div/div/div/div").text

sep = search.split('\n')
sep2 = search2.split('\n')

lngth = len(sep)
lngth2 = len(sep2)

for i in range (1,lngth):
    path = "//*[@id='Departments']/div/div/ul/li"+"["+ str(i) + "]/a"
    nav_links = driver.find_element_by_xpath(path).get_attribute('href')
    n_links.append(nav_links)
    
for i in range (1,lngth2):
    path = "//*[@id='Departments']/div/div/div/div/ul/li"+"["+ str(i) + "]/a"
    nav_links2 = driver.find_element_by_xpath(path).get_attribute('href')
    n_links.append(nav_links2)   
    
print(n_links)
print(len(n_links))

When I run the code, I can see the links inside the n_links list at the end. The problem is that sometimes there are 13 links and sometimes 14. There should be 16, but I have never gotten 16, only 13 or 14. I tried adding time.sleep(3) before the search2 line, but it didn't help. Can you help me?

xkftehaa1#

I think you are making this more complicated than it needs to be. You are right that you may need to wait for the departments to load if you are clicking the button.

# Get all the departments currently shown
departments = driver.find_elements_by_xpath("//li[contains(@class,'department')]")

# Click the "See all Departments" button
driver.find_element_by_xpath("//button[@data-automation-id='button']//span[contains(text(),'all Departments')]").click()

# Get the departments shown after expanding
departments = driver.find_elements_by_xpath("//li[contains(@class,'department')]")

# Iterate through the departments and print their labels
for d in departments:
    print(d.text)
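The `contains(@class,'department')` predicate is what picks up both the initially visible and the expanded entries, because it matches any `li` whose class attribute merely contains the word "department". A small offline sketch of the same XPath evaluated with lxml, against invented markup (the `department-single-level` class name is borrowed from the last answer on this page; the rest of the HTML is hypothetical):

```python
from lxml import etree

# Hypothetical markup with the class pattern the answer's XPath targets
html = etree.HTML("""
<ul>
  <li class="department-single-level"><a href="/a">Cookies</a></li>
  <li class="department-single-level"><a href="/b">Biscotti</a></li>
  <li class="other"><a href="/c">Chips</a></li>
</ul>
""")

# Same predicate as the answer: any <li> whose class contains "department"
matches = html.xpath("//li[contains(@class,'department')]")
print([li.findtext("a") for li in matches])  # the third <li> is excluded
```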
ldioqlga2#

To print all 16 departments, you can try selecting them with the CSS selectors `.collapsible-content > ul a, .sometimes-shown a`.
Applied to your example:

from selenium import webdriver

driver = webdriver.Chrome()
url = (
    "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391"
)
driver.get(url)

search = driver.find_element_by_xpath("//*[@id='Departments']/div/div/ul").text
driver.find_element_by_xpath("//*[@id='Departments']/div/div/button/span").click()

all_departments = [
    link.get_attribute("href")
    for link in driver.find_elements_by_css_selector(
        ".collapsible-content > ul a, .sometimes-shown a"
    )
]

print(len(all_departments))
print(all_departments)

Output:

16
['https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138', 'https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066', 'https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640', 'https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949', 'https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757', 'https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302', 'https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825', 'https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095', 'https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193', 'https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971', 'https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174', 'https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932', 'https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167', 'https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659', 'https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535', 'https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359']
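To see what that selector group matches, here is a small offline sketch using BeautifulSoup's `select` on invented markup (the two class names come from the answer above; the HTML structure itself is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the two containers the selector targets:
# the always-present list and the "sometimes-shown" expanded list
html = """
<div class="collapsible-content">
  <ul>
    <li><a href="/browse/food/cookies">Cookies</a></li>
    <li><a href="/browse/food/biscotti">Biscotti</a></li>
  </ul>
</div>
<div class="sometimes-shown">
  <ul><li><a href="/browse/food/vegan-cookies">Vegan Cookies</a></li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# The comma makes this a selector group: links from either container match
links = [a["href"] for a in soup.select(".collapsible-content > ul a, .sometimes-shown a")]
print(links)
```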
ffvjumwh3#

Using only BeautifulSoup:

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept-Language": "en-US,en;q=0.5",
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = json.loads(soup.select_one("#searchContent").contents[0])

# uncomment to see all data:
# print(json.dumps(data, indent=4))

def find_departments(data):
    if isinstance(data, dict):
        if "name" in data and data["name"] == "Departments":
            yield data
        else:
            for v in data.values():
                yield from find_departments(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_departments(v)

departments = next(find_departments(data), {})

for d in departments.get("values", []):
    print(
        "{:<30} {}".format(
            d["name"], "https://www.walmart.com" + d["baseSeoURL"]
        )
    )

Prints:

Chocolate Cookies              https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138
Cookies                        https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066
Butter Cookies                 https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640
Shortbread Cookies             https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949
Coconut Cookies                https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757
Healthy Cookies                https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302
Keebler Cookies                https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825
Biscotti                       https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095
Gluten-Free Cookies            https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193
Molasses Cookies               https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971
Peanut Butter Cookies          https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174
Pepperidge Farm Cookies        https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932
Snickerdoodle Cookies          https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167
Sugar-Free Cookies             https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659
Tate's Cookies                 https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535
Vegan Cookies                  https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359
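The recursive generator above works regardless of how deeply the "Departments" facet is nested in the embedded JSON. A tiny offline example with a made-up payload (the keys `name`, `values`, and `baseSeoURL` follow the answer; the surrounding structure is invented):

```python
def find_departments(data):
    # Recursively walk nested dicts/lists, yielding any dict named "Departments"
    if isinstance(data, dict):
        if "name" in data and data["name"] == "Departments":
            yield data
        else:
            for v in data.values():
                yield from find_departments(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_departments(v)

# Hypothetical payload mimicking the shape of the #searchContent JSON
sample = {
    "facets": [
        {"name": "Brand", "values": []},
        {"name": "Departments",
         "values": [{"name": "Cookies", "baseSeoURL": "/browse/food/cookies"}]},
    ]
}

departments = next(find_departments(sample), {})
for d in departments.get("values", []):
    print(d["name"], "https://www.walmart.com" + d["baseSeoURL"])
```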
xu3bshqb4#

Why not use `.visibility_of_all_elements_located`?

texts = []
links =[]

driver.get('https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391')
wait = WebDriverWait(driver, 60)
wait.until(EC.element_to_be_clickable((By.XPATH, "//span[text()='See all Departments']/parent::button"))).click()
elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.department-single-level a")))
for element in elements:
    #to get text
    texts.append(element.text)
    #to get link by attribute name
    links.append(element.get_attribute('href'))
    
print(texts)
print(links)

Console output:

[u'Chocolate Cookies', u'Cookies', u'Butter Cookies', u'Shortbread Cookies', u'Coconut Cookies', u'Healthy Cookies', u'Keebler Cookies', u'Biscotti', u'Gluten-Free Cookies', u'Molasses Cookies', u'Peanut Butter Cookies', u'Pepperidge Farm Cookies', u'Snickerdoodle Cookies', u'Sugar-Free Cookies', u"Tate's Cookies", u'Vegan Cookies']
[u'https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138', u'https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066', u'https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640', u'https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949', u'https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757', u'https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302', u'https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825', u'https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095', u'https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193', u'https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971', u'https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174', u'https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932', u'https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167', u'https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659', u'https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535', u'https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359']

The following imports are needed:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
