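I'm trying to extract dollar amounts from a web page that I'm scraping with Python and Selenium. I've implemented the following code to scrape the data and pull the amounts out of the text: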
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import csv
import re

url = 'https://skinsmonkey.com/trade'
driver = webdriver.Firefox()
driver.get(url)
driver.maximize_window()
scroll_script = """
    var element = arguments[0];
    element.scrollTop += 681;
"""
time.sleep(1)
scrollable_box = driver.find_element(By.CSS_SELECTOR, 'div.trade-inventory:nth-child(3) > div:nth-child(3) > div:nth-child(1) > div:nth-child(1)')
csv_file_path = 'output.csv'
with open(csv_file_path, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Item Text', 'Alt Text', 'Price'])
    while True:
        elements = driver.find_elements(By.CLASS_NAME, 'item-card__body')
        for element in elements:
            if element.text.strip() != "Quick View":
                item_text = element.text.replace('\n', '')
                price_matches = re.findall(r'\$([\d,.]+)', item_text)
                price = "Price not found"
                if price_matches:
                    price = ', '.join(match[0] for match in price_matches)
                img_elements = element.find_elements(By.TAG_NAME, 'img')
                alt_text = ""
                for img_element in img_elements:
                    alt_text = img_element.get_attribute('alt')
                    if alt_text:
                        break
                with open(csv_file_path, mode='a', newline='', encoding='utf-8') as file:
                    writer = csv.writer(file)
                    writer.writerow([item_text, alt_text, price])
        for _ in range(2):
            driver.execute_script(scroll_script, scrollable_box)
            time.sleep(1)
        time.sleep(1)
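However, the regular expression call re.findall(r'\$([\d,.]+)', item_text) does not extract the amounts from the scraped text correctly. Can someone help me identify the problem and suggest a way to extract the dollar amounts accurately?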
1 Answer
b4lqfgs4 #1
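First, you don't need a regexp in this case. You can filter the card elements that have a price by selecting the elements that contain an inner element with the class item-card__price. Then you only need to read the innerText property from the card, image, and price elements.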
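The approach described above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name card_rows is made up, the price element is assumed to sit inside each item-card__body card, and the image is read through its alt attribute (as in the question) because an img element has no useful innerText. The locator strings are the values behind Selenium's By.CLASS_NAME and By.TAG_NAME, written out so that the helper works with any object exposing find_elements and get_attribute.

```python
# Locator strategy strings; these are the values of selenium's
# By.CLASS_NAME and By.TAG_NAME constants.
CLASS_NAME = "class name"
TAG_NAME = "tag name"


def card_rows(driver):
    """Return (item_text, alt_text, price) tuples for cards that show a price.

    `driver` is expected to be a Selenium WebDriver (or any object with a
    compatible find_elements method). The class names come from the question
    and the answer; their nesting is an assumption.
    """
    rows = []
    for card in driver.find_elements(CLASS_NAME, "item-card__body"):
        # Skip cards without a price element (for example "Quick View" overlays).
        prices = card.find_elements(CLASS_NAME, "item-card__price")
        if not prices:
            continue

        # Use the alt text of the first image, if the card has one.
        images = card.find_elements(TAG_NAME, "img")
        alt_text = (images[0].get_attribute("alt") or "") if images else ""

        # innerText is read as a DOM property through get_attribute.
        item_text = (card.get_attribute("innerText") or "").replace("\n", " ").strip()
        price = (prices[0].get_attribute("innerText") or "").strip()
        rows.append((item_text, alt_text, price))
    return rows
```

The returned tuples can be passed straight to csv.writer's writerows. If the regex route is kept instead, note that re.findall with a single capturing group returns plain strings, so match[0] in the question's code yields only the first character of each amount; joining price_matches directly avoids that.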