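I am trying to extract tagged items from HTML saved in a .txt file and export them to a CSV. For some reason, my code only takes the last row and writes that to the CSV. It does not scrape the other listed items, and I don't know why. I have tried several solutions with no luck.
Here is my code: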
import pandas as pd
from bs4 import BeautifulSoup
import schedule
import time
#import urllib.parse
import requests
baseurl = 'https://www.soxboxmtl.com'
dataset = []
with open(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate%20share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.txt', "r") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
for imgurl in soup.find_all('img', class_='grid-item-image'):(imgurl['data-src'])
for name in soup.find_all('div', class_='grid-title'):(name.text)
for link in soup.find_all('a', class_='grid-item-link'):(link['href'])
for price in soup.find_all('div', class_='product-price'):(price.text)
dataset.append({'Field_01':(imgurl['data-src']),'Field_02':name.text,'Field_03':(baseurl + link['href']),'Field_04':price.text})
print(dataset)
df = pd.DataFrame(dataset).to_csv(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate%20share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.csv', index = False)
Here is a sample of the HTML data:
<div class="grid-item hentry tag-paddle tag-brush tag-bristle tag-wide tag-detangle tag-kitsch tag-anti-frizz tag-black author-jill-kessner post-type-store-item article-index-45 sqs-product-quick-view-button-hover-area" data-controller="ProductListImageLoader" data-item-id="625ef30d651884142d5a2dc2" id="thumb-kitsch-paddle-hair-brush">
<a aria-label="Kitsch Paddle Hair Brush" class="grid-item-link" href="/home-bath-body/p/kitsch-paddle-hair-brush">
</a>
<figure class="grid-image" data-animation-role="image" data-test="plp-grid-image">
<div class="grid-image-wrapper has-hover-img">
<img alt="Screenshot 2022-04-19 at 1.31.04 PM.png" class="grid-item-image grid-image-cover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png" data-image-dimensions="1341x1335" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png"/>
<img alt="Screenshot 2022-04-19 at 1.31.24 PM.png" class="grid-item-image grid-image-hover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot+2022-04-19+at+1.31.24+PM.png" data-image-dimensions="1338x1338" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot+2022-04-19+at+1.31.24+PM.png"/>
<div class="list-quick-view-wrapper sqs-product-quick-view-button-wrapper">
<span class="sqs-product-quick-view-button" data-group="5ec69b56a188e3129c377b33" data-id="625ef30d651884142d5a2dc2" role="button" tabindex="0">Quick View</span>
</div>
</div>
</figure>
<section class="grid-meta-wrapper" data-animation-role="content">
<div class="grid-main-meta">
<div class="grid-title" data-test="plp-grid-title">
Kitsch Paddle Hair Brush
</div>
<div class="grid-prices" data-test="plp-grid-prices">
<div class="product-price">
CA$24.00
</div>
</div>
</div>
<div class="grid-meta-status" data-test="plp-grid-status">
<div class="product-scarcity">
Only 2 left in stock
</div>
</div>
</section>
</div>
<div class="grid-item hentry tag-blanket tag-plush tag-cozy-plush tag-pj-salvage tag-embroidered tag-blush tag-pink tag-luxe-plush tag-luxe author-jill-kessner post-type-store-item article-index-46 sqs-product-quick-view-button-hover-area" data-controller="ProductListImageLoader" data-item-id="635031c65ac9872b4ba44f5a" id="thumb-pj-salvage-luxe-plush-embroidered-blanket-blush">
<a aria-label="PJ Salvage Luxe Plush Embroidered Blanket - Blush" class="grid-item-link" href="/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush">
</a>
<figure class="grid-image" data-animation-role="image" data-test="plp-grid-image">
<div class="grid-image-wrapper has-hover-img">
<img alt="Screenshot 2022-10-17 at 12.03.06 AM.png" class="grid-item-image grid-image-cover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png" data-image-dimensions="891x1340" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png"/>
<img alt="Screenshot 2022-10-17 at 12.02.56 AM.png" class="grid-item-image grid-image-hover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot+2022-10-17+at+12.02.56+AM.png" data-image-dimensions="890x1339" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot+2022-10-17+at+12.02.56+AM.png"/>
<div class="list-quick-view-wrapper sqs-product-quick-view-button-wrapper">
<span class="sqs-product-quick-view-button" data-group="5ec69b56a188e3129c377b33" data-id="635031c65ac9872b4ba44f5a" role="button" tabindex="0">Quick View</span>
</div>
</div>
</figure>
<section class="grid-meta-wrapper" data-animation-role="content">
<div class="grid-main-meta">
<div class="grid-title" data-test="plp-grid-title">
PJ Salvage Luxe Plush Embroidered Blanket - Blush
</div>
<div class="grid-prices" data-test="plp-grid-prices">
<div class="product-price">
CA$118.00
</div>
</div>
</div>
<div class="grid-meta-status" data-test="plp-grid-status">
<div class="product-scarcity">
Only 1 left in stock
</div>
</div>
2 Answers
Answer 1 (x4shl7ld)
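This happens because your for loops iterate over all the values but overwrite the loop variable on every pass, so only the last value is kept and then appended to `dataset`.
Instead, select the container elements with class `grid-item`, iterate over all of those containers, and append the data from each one to your `dataset`. This way you only need one for loop, and it is easier to control.
I prefer `css selectors` (`select()` and `select_one()`), but you can also use `find_all()` or `find()` instead. Check `get_text()` and its parameters to get rid of line breaks or whitespace.
For the sample HTML, this results in: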
| Field_01 | Field_02 | Field_03 | Field_04 |
| --- | --- | --- | --- |
| https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png | Kitsch Paddle Hair Brush | https://www.soxboxmtl.com/home-bath-body/p/kitsch-paddle-hair-brush | CA$24.00 |
| https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png | PJ Salvage Luxe Plush Embroidered Blanket - Blush | https://www.soxboxmtl.com/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush | CA$118.00 |
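The container-first approach described in this answer could look roughly like the sketch below. It is not the answerer's original code. The `html` string is a trimmed-down stand-in for the saved page with placeholder `example.com` image URLs, and `parse_items` is a hypothetical helper name; only the class names and `baseurl` come from the question.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.soxboxmtl.com"  # same value as `baseurl` in the question

# Assumption: shortened sample markup; the image URLs are placeholders.
html = """
<div class="grid-item hentry">
  <a class="grid-item-link" href="/home-bath-body/p/kitsch-paddle-hair-brush"></a>
  <img class="grid-item-image grid-image-cover" data-src="https://example.com/cover-1.png"/>
  <img class="grid-item-image grid-image-hover" data-src="https://example.com/hover-1.png"/>
  <div class="grid-title"> Kitsch Paddle Hair Brush </div>
  <div class="product-price"> CA$24.00 </div>
</div>
<div class="grid-item hentry">
  <a class="grid-item-link" href="/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush"></a>
  <img class="grid-item-image grid-image-cover" data-src="https://example.com/cover-2.png"/>
  <img class="grid-item-image grid-image-hover" data-src="https://example.com/hover-2.png"/>
  <div class="grid-title"> PJ Salvage Luxe Plush Embroidered Blanket - Blush </div>
  <div class="product-price"> CA$118.00 </div>
</div>
"""


def parse_items(markup, base_url=BASE_URL):
    """Return one dict per product container (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    # One loop over the containers; every lookup is scoped to the current item.
    for item in soup.select("div.grid-item"):
        rows.append({
            # select_one returns the first match, i.e. the cover image here.
            "Field_01": item.select_one("img.grid-item-image")["data-src"],
            "Field_02": item.select_one("div.grid-title").get_text(strip=True),
            "Field_03": base_url + item.select_one("a.grid-item-link")["href"],
            "Field_04": item.select_one("div.product-price").get_text(strip=True),
        })
    return rows


dataset = parse_items(html)
```

The resulting list of dicts can be passed to `pd.DataFrame(dataset).to_csv(...)` exactly as in the question. Note that `select_one` returns `None` when an element is missing, so a product without a price would need an explicit check.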
Answer 2 (lkaoscv7)
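There are two problems with your current implementation:
Problem 1
The loops do not actually do anything with the data that bs4 finds. The only thing that adds data to the dataset is the single call to `dataset.append()`, which produces the single row of data you are seeing.
Problem 2
Even if the loops worked, the script would probably fail, because a pandas DataFrame requires columns of consistent length. For example, there are more images than titles (each product has a cover image and a hover image), so you would end up with columns of different lengths.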
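As a small illustration of Problem 2, the sketch below assumes each field had been collected into its own list and passed to pandas as separate columns. The list contents are made-up placeholders that mirror the sample page (two images per product, one title per product).

```python
import pandas as pd

# Assumption: placeholder values; four images but only two titles.
images = ["cover-1.png", "hover-1.png", "cover-2.png", "hover-2.png"]
titles = [
    "Kitsch Paddle Hair Brush",
    "PJ Salvage Luxe Plush Embroidered Blanket - Blush",
]

try:
    pd.DataFrame({"Field_01": images, "Field_02": titles})
    error_message = None
except ValueError as exc:
    # pandas refuses to build a frame from columns of unequal length.
    error_message = str(exc)

print(error_message)
```

This is why matching the fields up per product, rather than per tag type, matters before the data ever reaches pandas.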
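Solution
Besides making sure we actually append the data correctly, we also need to make sure every column is well-formed and consistent. Instead of searching for any and all pieces of information that have no relationship to each other, we search for all the parent elements that contain the information we need.
We then iterate over the list of parent elements. On each iteration, we search only for the data available inside that parent element and format it for use in a DataFrame. That DataFrame is appended to a list of DataFrames, which is concatenated into a single DataFrame once the iteration finishes and is finally exported.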
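The steps above might be sketched as follows. This is a reconstruction under assumptions, not the answerer's original code. The `html` string is a shortened stand-in with placeholder image URLs, `item_to_frame` is a hypothetical helper, and the CSV is returned as a string rather than written to the network path from the question.

```python
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://www.soxboxmtl.com"  # same value as `baseurl` in the question

# Assumption: shortened sample markup; the image URLs are placeholders.
html = """
<div class="grid-item hentry">
  <a class="grid-item-link" href="/home-bath-body/p/kitsch-paddle-hair-brush"></a>
  <img class="grid-item-image grid-image-cover" data-src="https://example.com/cover-1.png"/>
  <img class="grid-item-image grid-image-hover" data-src="https://example.com/hover-1.png"/>
  <div class="grid-title"> Kitsch Paddle Hair Brush </div>
  <div class="product-price"> CA$24.00 </div>
</div>
<div class="grid-item hentry">
  <a class="grid-item-link" href="/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush"></a>
  <img class="grid-item-image grid-image-cover" data-src="https://example.com/cover-2.png"/>
  <img class="grid-item-image grid-image-hover" data-src="https://example.com/hover-2.png"/>
  <div class="grid-title"> PJ Salvage Luxe Plush Embroidered Blanket - Blush </div>
  <div class="product-price"> CA$118.00 </div>
</div>
"""


def item_to_frame(item, base_url=BASE_URL):
    """Build a one-row DataFrame from a single parent element (hypothetical helper)."""
    return pd.DataFrame([{
        # find() returns the first match, i.e. the cover image here.
        "Field_01": item.find("img", class_="grid-item-image")["data-src"],
        "Field_02": item.find("div", class_="grid-title").get_text(strip=True),
        "Field_03": base_url + item.find("a", class_="grid-item-link")["href"],
        "Field_04": item.find("div", class_="product-price").get_text(strip=True),
    }])


soup = BeautifulSoup(html, "html.parser")

# 1. Find every parent element that holds one product's data.
parents = soup.find_all("div", class_="grid-item")

# 2. Search only inside each parent and collect one DataFrame per product.
frames = [item_to_frame(parent) for parent in parents]

# 3. Concatenate once the iteration is done, then export.
df = pd.concat(frames, ignore_index=True)
csv_text = df.to_csv(index=False)  # pass a file path instead to write to disk
```

Building one small DataFrame per product follows the description in this answer; collecting plain dicts and calling `pd.DataFrame` once at the end would give the same table with less overhead. `pd.concat` raises a `ValueError` on an empty list, so a page with no matching parents needs a guard.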