Selenium Python - Need help web scraping a dynamic website

Posted by a11xaf1n on 2022-12-13 in Python


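I'm very new to web scraping and would appreciate any advice on the following scenario:
I'm trying to build a list of home loans using the data from https://www.canstar.com.au/home-loans/
Mainly, I want to get list values like the ones below: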

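Homestar Finance | Star Essentials P&I 80% | Variable
Unloan | Home Loan LVR <80% | Variable
TicToc Home Loans | Live-in Variable P&I | Variable
ubank | Neat Home Loan Owner Occupied P&I 70-80% | Variable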

Then push them into a nested list: results = [[Homestar Finance, Star Essentials P&I 80%, Variable], etc.]
For my first attempt, I used BeautifulSoup only and practised on an offline copy of the site.

import pandas as pd
from bs4 import BeautifulSoup

with open('/local/path/canstar.html', 'r') as canstar_offline :
    content = canstar_offline.read()

results = [['Affiliate', 'Product Name', 'Product Type']]
    
soup = BeautifulSoup(content, 'lxml')

for listing in soup.find_all('div', class_='table-cards-container') :
    for listing1 in listing.find_all('a') :
        if listing1.text.strip() != 'More details' and listing1.text.strip() != '' :
            results.append(listing1.text.strip().split(' | '))
   
df = pd.DataFrame(results[1:], columns=results[0]).to_dict('list')
df2 = pd.DataFrame(df)

print(df2)

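This got me very close to what I want, but unfortunately it doesn't work against the live site, because it looks like I'm being blocked for making repeated requests.
So I tried again with Selenium, but now I'm stuck.
I tried to carry over the filtering logic I used with BeautifulSoup, but I can't get anywhere close to the same result with Selenium.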

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.canstar.com.au/home-loans'

results = []

driver = webdriver.Chrome()
driver.get(url)
# content = driver.page_source
# soup = BeautifulSoup(content)

time.sleep(3)
tables = driver.find_elements(By.CLASS_NAME, 'table-cards-container')
for table in tables :
    listing = table.find_element(By.TAG_NAME, 'a')
    print(listing.text)

This version (above) returns only one listing (I was trying to get the whole table by iterating).
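A likely explanation is that `find_element` (singular) returns only the first matching element, while `find_elements` (plural) returns every match. The sketch below shows the plural form applied inside each container. The helper names `is_listing` and `collect_listing_texts` are hypothetical, and the class name and the 'More details' filter are simply carried over from the BeautifulSoup attempt, so they are assumptions about the live page rather than verified facts.

```python
def is_listing(text):
    """Keep link texts that look like a listing, mirroring the BeautifulSoup filter."""
    text = text.strip()
    return text != "" and text != "More details"


def collect_listing_texts(driver):
    """Return the text of every listing link inside each table container."""
    # Imported here so is_listing can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    texts = []
    for table in driver.find_elements(By.CLASS_NAME, "table-cards-container"):
        # find_elements (plural) returns all <a> tags, not just the first one.
        for link in table.find_elements(By.TAG_NAME, "a"):
            if is_listing(link.text):
                texts.append(link.text.strip())
    return texts
```

Calling `collect_listing_texts(driver)` after `driver.get(url)` and a wait should then give one string per listing, provided the page uses the same markup as the offline copy.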

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.canstar.com.au/home-loans'

results = []

driver = webdriver.Chrome()
driver.get(url)
# content = driver.page_source
# soup = BeautifulSoup(content)

time.sleep(3)
tables = driver.find_elements(By.CLASS_NAME, 'table-cards-container')
for table in tables :
#     listing = table.find_element(By.TAG_NAME, 'a')
    print(table.text)

This version (above) seems to get all the text from the 'table-cards-container' class, but I can't work out how to filter it down to just the listings.
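One way to filter the container text, sketched here under a loud assumption: that each listing appears on its own line of `table.text` in the same `A | B | C` form that the BeautifulSoup version split on. The function name `rows_from_container_text` and the sample string are hypothetical and only illustrate the idea; the real text layout may differ.

```python
def rows_from_container_text(container_text, separator=" | ", fields=3):
    """Return one list per line that splits into exactly `fields` non-empty parts."""
    rows = []
    for line in container_text.splitlines():
        parts = [part.strip() for part in line.split(separator)]
        # Lines such as "More details" or blank lines do not have three parts.
        if len(parts) == fields and all(parts):
            rows.append(parts)
    return rows


# Illustrative input only; the real container text may be laid out differently.
sample = "\n".join([
    "Homestar Finance | Star Essentials P&I 80% | Variable",
    "More details",
    "",
])
print(rows_from_container_text(sample))
```

In the Selenium loop this would be used as `results.extend(rows_from_container_text(table.text))`, which builds the nested list directly.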


Source: https://stackoverflow.com/questions/74674619/python-need-help-web-scraping-dynamic-website

1 Answer


unftdfkk1#

I think you can try something like this; I hope the comments in the code explain what it does.

# Needed libs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initiate the driver and navigate
driver = webdriver.Chrome()
url = 'https://www.canstar.com.au/home-loans'
driver.get(url)

# Wait until the loan cards are present, then save the list
loans = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//cnslib-table-card")))

# Loop once per loan in the list. XPath positions are 1-based,
# so the range must run up to len(loans) inclusive to reach the last card.
for i in range(1, len(loans) + 1):
    # With this XPath I save the title of the loan
    loan_title = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//a)[1]"))).text
    print(loan_title)
    # With this XPath I save the first percentage we see for the loan
    loan_first_percentage = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[1]"))).text
    print(loan_first_percentage)
    # With this XPath I save the second percentage we see for the loan
    loan_second_percentage = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[3]"))).text
    print(loan_second_percentage)
    # With this XPath I save the amount we see for the loan
    loan_amount = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, f"((//cnslib-table-card)[{i}]//span)[5]"))).text
    print(loan_amount)
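
Since the question asks for a nested list rather than printed values, the same lookups could be collected into rows. This is only a sketch of that idea: `card_xpath` and `scrape_cards` are hypothetical helper names, the XPath patterns are the ones used above, and whether the spans at positions 1, 3 and 5 hold the expected values on the live page is an assumption.

```python
def card_xpath(index, tag, position):
    """Build the XPath for the `position`-th <tag> inside the `index`-th card.

    XPath positions are 1-based, so valid card indexes run from 1 to N.
    """
    return f"((//cnslib-table-card)[{index}]//{tag})[{position}]"


def scrape_cards(driver, timeout=30):
    """Return one [title, first percentage, second percentage, amount] row per card."""
    # Imported lazily so card_xpath can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    wait = WebDriverWait(driver, timeout)
    cards = wait.until(
        EC.presence_of_all_elements_located((By.XPATH, "//cnslib-table-card"))
    )
    rows = []
    for index in range(1, len(cards) + 1):  # 1..N inclusive
        row = []
        for tag, position in (("a", 1), ("span", 1), ("span", 3), ("span", 5)):
            locator = (By.XPATH, card_xpath(index, tag, position))
            row.append(wait.until(EC.presence_of_element_located(locator)).text)
        rows.append(row)
    return rows
```

With a driver that has already loaded the page, `results = scrape_cards(driver)` would give the nested list, which can be passed to `pd.DataFrame` as in the question.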

Answered 2022-12-13
