csv: Problem extracting dollar amounts from text scraped from a web page

wgx48brx  posted 11 months ago in Other

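I am trying to extract dollar amounts from a web page that I am scraping with Python and Selenium. I have implemented the following code to scrape data from the page and extract the amounts from the text: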

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import csv
import re

url = 'https://skinsmonkey.com/trade'

driver = webdriver.Firefox()
driver.get(url)
driver.maximize_window()

scroll_script = """
var element = arguments[0];
element.scrollTop += 681;  
"""

time.sleep(1)
scrollable_box = driver.find_element(By.CSS_SELECTOR, 'div.trade-inventory:nth-child(3) > div:nth-child(3) > div:nth-child(1) > div:nth-child(1)')

csv_file_path = 'output.csv'
with open(csv_file_path, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Item Text', 'Alt Text', 'Price'])

while True:
    elements = driver.find_elements(By.CLASS_NAME, 'item-card__body')
    
    for element in elements:
        
        if element.text.strip() != "Quick View":
            item_text = element.text.replace('\n', '')
            
            price_matches = re.findall(r'\$([\d,.]+)', item_text)
            price = "Price not found"
            if price_matches:
                price = ', '.join(match[0] for match in price_matches)
        
            img_elements = element.find_elements(By.TAG_NAME, 'img')
            alt_text = ""
            for img_element in img_elements:
                alt_text = img_element.get_attribute('alt')
                if alt_text:
                    break
            
            with open(csv_file_path, mode='a', newline='', encoding='utf-8') as file:
                writer = csv.writer(file)
                writer.writerow([item_text, alt_text, price])
    
    for _ in range(2):  
        driver.execute_script(scroll_script, scrollable_box)
        time.sleep(1)  
    
    time.sleep(1)

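However, the regular expression call re.findall(r'\$([\d,.]+)', item_text) does not correctly extract the amounts from the scraped text. Can someone help me identify the problem and suggest a solution that extracts the dollar amounts accurately?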

b4lqfgs4  #1

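First, you don't need a regexp in this case. You can keep only the cards that have a price by selecting the item-card__body elements that contain an inner element with the class item-card__price.
Then you just need to read the innerText property of the card and price elements, and the alt attribute of the image.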

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for cards that contain a price element; cards without one are not matched.
elements = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@class="item-card__body" and .//*[contains(@class,"item-card__price")]]')))

for element in elements:
    item_text = element.get_property('innerText').replace('\n', '')
    # Read the price from its own element instead of parsing it out of item_text.
    price_element = element.find_element(By.CLASS_NAME, 'item-card__price')
    price = price_element.get_property('innerText')

    # Use the first non-empty alt text among the card's images.
    img_elements = element.find_elements(By.TAG_NAME, 'img')
    alt_text = ""
    for img_element in img_elements:
        alt_text = img_element.get_attribute('alt')
        if alt_text:
            break

    with open(csv_file_path, mode='a', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow([item_text, alt_text, price])
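
If you would rather keep the regex, the likely culprit in the question's code is match[0] rather than the pattern itself. With a single capture group, re.findall returns a list of strings, so match[0] is only the first character of each amount (for example "1" instead of "12.34"). Below is a minimal sketch of the fix; the helper names extract_prices and join_prices and the sample card text are made up for illustration, and the slightly stricter pattern is my own assumption about what the prices look like.

```python
import re

# One capture group: findall returns plain strings such as "12.34", not tuples.
# Assumed price shape: digits with optional thousands commas and an optional decimal part.
PRICE_RE = re.compile(r'\$([\d,]+(?:\.\d+)?)')


def extract_prices(text):
    """Return every dollar amount found in text, without the leading '$'."""
    return PRICE_RE.findall(text)


def join_prices(text, default="Price not found"):
    """Join all amounts with ', ', or return the default when none are found."""
    matches = extract_prices(text)
    # Join the matches themselves; indexing each one with [0] would keep
    # only its first character.
    return ', '.join(matches) if matches else default


# Hypothetical card text after the newlines have been removed.
sample = "AK-47 | Redline$12.34Quick View"
print(join_prices(sample))
```

Note that removing the newlines before matching can glue neighbouring values together, so reading the price from its own element as shown above is probably still the more robust option.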
