How do I scrape HTML from a TXT file and store all items in a CSV?

hpxqektj posted on 2023-01-06 in Other

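I'm trying to export tagged items from HTML that is saved in a TXT file. For some reason, my code only takes the last row and exports that to the CSV. It doesn't scrape the other listed items, and I'm not sure why. I've tried several solutions without success.
Here is my code...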

import pandas as pd
from bs4 import BeautifulSoup
import schedule
import time
#import urllib.parse
import requests

baseurl = 'https://www.soxboxmtl.com'

dataset = []

with open(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate%20share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.txt', "r") as f:

        
        soup = BeautifulSoup(f.read(), "html.parser")
        for imgurl in soup.find_all('img', class_='grid-item-image'):(imgurl['data-src'])
        for name in soup.find_all('div', class_='grid-title'):(name.text)    
        for link in soup.find_all('a', class_='grid-item-link'):(link['href'])  
        for price in soup.find_all('div', class_='product-price'):(price.text)
       
        dataset.append({'Field_01':(imgurl['data-src']),'Field_02':name.text,'Field_03':(baseurl + link['href']),'Field_04':price.text})
        
        print(dataset)

        df = pd.DataFrame(dataset).to_csv(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate%20share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.csv', index = False)

Here is a sample of the HTML data:

<div class="grid-item hentry tag-paddle tag-brush tag-bristle tag-wide tag-detangle tag-kitsch tag-anti-frizz tag-black author-jill-kessner post-type-store-item article-index-45 sqs-product-quick-view-button-hover-area" data-controller="ProductListImageLoader" data-item-id="625ef30d651884142d5a2dc2" id="thumb-kitsch-paddle-hair-brush">
    <a aria-label="Kitsch Paddle Hair Brush" class="grid-item-link" href="/home-bath-body/p/kitsch-paddle-hair-brush">
    </a>
    <figure class="grid-image" data-animation-role="image" data-test="plp-grid-image">
    <div class="grid-image-wrapper has-hover-img">
    <img alt="Screenshot 2022-04-19 at 1.31.04 PM.png" class="grid-item-image grid-image-cover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png" data-image-dimensions="1341x1335" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png"/>
    <img alt="Screenshot 2022-04-19 at 1.31.24 PM.png" class="grid-item-image grid-image-hover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot+2022-04-19+at+1.31.24+PM.png" data-image-dimensions="1338x1338" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot+2022-04-19+at+1.31.24+PM.png"/>
    <div class="list-quick-view-wrapper sqs-product-quick-view-button-wrapper">
    <span class="sqs-product-quick-view-button" data-group="5ec69b56a188e3129c377b33" data-id="625ef30d651884142d5a2dc2" role="button" tabindex="0">Quick View</span>
    </div>
    </div>
    </figure>
    <section class="grid-meta-wrapper" data-animation-role="content">
    <div class="grid-main-meta">
    <div class="grid-title" data-test="plp-grid-title">
            Kitsch Paddle Hair Brush
          </div>
    <div class="grid-prices" data-test="plp-grid-prices">
    <div class="product-price">
    CA$24.00
    </div>
    </div>
    </div>
    <div class="grid-meta-status" data-test="plp-grid-status">
    <div class="product-scarcity">
        Only 2 left in stock
      </div>
    </div>
    </section>
    </div>
    <div class="grid-item hentry tag-blanket tag-plush tag-cozy-plush tag-pj-salvage tag-embroidered tag-blush tag-pink tag-luxe-plush tag-luxe author-jill-kessner post-type-store-item article-index-46 sqs-product-quick-view-button-hover-area" data-controller="ProductListImageLoader" data-item-id="635031c65ac9872b4ba44f5a" id="thumb-pj-salvage-luxe-plush-embroidered-blanket-blush">
    <a aria-label="PJ Salvage Luxe Plush Embroidered Blanket - Blush" class="grid-item-link" href="/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush">
    </a>
    <figure class="grid-image" data-animation-role="image" data-test="plp-grid-image">
    <div class="grid-image-wrapper has-hover-img">
    <img alt="Screenshot 2022-10-17 at 12.03.06 AM.png" class="grid-item-image grid-image-cover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png" data-image-dimensions="891x1340" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png"/>
    <img alt="Screenshot 2022-10-17 at 12.02.56 AM.png" class="grid-item-image grid-image-hover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot+2022-10-17+at+12.02.56+AM.png" data-image-dimensions="890x1339" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot+2022-10-17+at+12.02.56+AM.png"/>
    <div class="list-quick-view-wrapper sqs-product-quick-view-button-wrapper">
    <span class="sqs-product-quick-view-button" data-group="5ec69b56a188e3129c377b33" data-id="635031c65ac9872b4ba44f5a" role="button" tabindex="0">Quick View</span>
    </div>
    </div>
    </figure>
    <section class="grid-meta-wrapper" data-animation-role="content">
    <div class="grid-main-meta">
    <div class="grid-title" data-test="plp-grid-title">
            PJ Salvage Luxe Plush Embroidered Blanket - Blush
          </div>
    <div class="grid-prices" data-test="plp-grid-prices">
    <div class="product-price">
    CA$118.00
    </div>
    </div>
    </div>
    <div class="grid-meta-status" data-test="plp-grid-status">
    <div class="product-scarcity">
        Only 1 left in stock
      </div>
    </div>

Answer #1 (x4shl7ld)

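This happens because each for loop iterates over all the values but overwrites its variable on every pass, so only the last value survives, and that single value is what gets appended to dataset.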
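To make the overwriting visible, here is a minimal sketch that uses plain strings as hypothetical stand-ins for the tags `find_all()` would return. It only illustrates the loop behaviour and does not involve BeautifulSoup at all.

```python
# Hypothetical stand-in values for the tags that find_all() would return.
prices = ["CA$24.00", "CA$118.00"]

dataset = []

# Pattern from the question: the loop body only evaluates an expression,
# so `price` is simply rebound on every iteration and nothing is stored.
for price in prices: (price)

# After the loop, `price` still refers to the last element only.
dataset.append({"Field_04": price})

# Appending inside the loop keeps every value instead.
fixed = []
for price in prices:
    fixed.append({"Field_04": price})

print(dataset)
print(fixed)
```

The first list ends up with one entry (the last price), while the second has one entry per price.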

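  • Recommendation: simplify by targeting the container elements with the class grid-item, which hold all the information. Iterate over those containers and append the data to your dataset. That way you need only one for loop, which is easier to control.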

The example below uses CSS selectors, because I prefer working with them:

...
soup = BeautifulSoup(f.read(), "html.parser")
for e in soup.select('.grid-item'):
    dataset.append({
        'Field_01':e.img.get('data-src'),
        'Field_02':e.select_one('.grid-title').get_text(strip=True),
        'Field_03':baseurl + e.a.get('href'),
        'Field_04':e.select_one('.product-price').get_text(strip=True)
    })

You can also use find_all() or find() instead. Take a look at get_text() and its parameters to strip line breaks and whitespace.

for e in soup.find_all('div', class_='grid-item'):
    dataset.append({
        'Field_01':e.find('img', class_='grid-item-image').get('data-src'),
        'Field_02':e.find('div', class_='grid-title').get_text(strip=True),
        'Field_03':baseurl + e.find('a', class_='grid-item-link').get('href'),
        'Field_04':e.find('div', class_='product-price').get_text(strip=True)
    })

This results in:
| Field_01 | Field_02 | Field_03 | Field_04 |
| --- | --- | --- | --- |
| https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png | Kitsch Paddle Hair Brush | https://www.soxboxmtl.com/home-bath-body/p/kitsch-paddle-hair-brush | CA$24.00 |
| https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png | PJ Salvage Luxe Plush Embroidered Blanket - Blush | https://www.soxboxmtl.com/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush | CA$118.00 |
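For anyone who wants to try the one-loop-per-container idea without the original file, here is a self-contained sketch. The inline markup is a trimmed, hypothetical version of the sample from the question (the example.com image URLs and the second item without a price are made up), and the `text_or_none` helper is my own addition to show one possible way of coping with a missing element. It is not part of the answer above.

```python
from bs4 import BeautifulSoup
import pandas as pd

baseurl = "https://www.soxboxmtl.com"

# Trimmed-down, hypothetical markup modelled on the sample in the question.
html = """
<div class="grid-item">
  <a class="grid-item-link" href="/home-bath-body/p/kitsch-paddle-hair-brush"></a>
  <img class="grid-item-image" data-src="https://example.com/cover-1.png"/>
  <img class="grid-item-image" data-src="https://example.com/hover-1.png"/>
  <div class="grid-title">
      Kitsch Paddle Hair Brush
  </div>
  <div class="product-price">
  CA$24.00
  </div>
</div>
<div class="grid-item">
  <a class="grid-item-link" href="/home-bath-body/p/no-price-item"></a>
  <img class="grid-item-image" data-src="https://example.com/cover-2.png"/>
  <div class="grid-title">Item Without Price</div>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(html, "html.parser")
dataset = []

# One loop over the containers; every lookup is scoped to a single product.
for e in soup.select(".grid-item"):
    img = e.select_one("img.grid-item-image")   # first image only
    link = e.select_one("a.grid-item-link")
    dataset.append({
        "Field_01": img.get("data-src") if img is not None else None,
        "Field_02": text_or_none(e, ".grid-title"),
        "Field_03": baseurl + link["href"] if link is not None else None,
        "Field_04": text_or_none(e, ".product-price"),
    })

# With no path argument, to_csv() returns the CSV as a string.
csv_text = pd.DataFrame(dataset).to_csv(index=False)
print(csv_text)
```

Because each row is built from a single container, every product yields exactly one row, and a missing price simply shows up as an empty cell instead of raising an AttributeError.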


Answer #2 (lkaoscv7)

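There are two problems with your current implementation:

Problem 1

The loops don't actually do anything with the data that bs4 finds. The only thing that adds data to the dataset is a single call to dataset.append(), which produces the single row you are seeing.

Problem 2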

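Even if the loops did work, the script would probably fail, because a pandas DataFrame requires columns of consistent length. For example, there are more images than titles, so you would end up with columns of different lengths.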
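The length mismatch is easy to reproduce in isolation. The snippet below is a rough sketch with made-up file names: four image entries against two titles, which is the ratio you would get from two products that each have a cover and a hover image.

```python
import pandas as pd

# Hypothetical column data: four image names but only two titles.
columns = {
    "Field_01": ["cover-1.png", "hover-1.png", "cover-2.png", "hover-2.png"],
    "Field_02": [
        "Kitsch Paddle Hair Brush",
        "PJ Salvage Luxe Plush Embroidered Blanket - Blush",
    ],
}

# Building a DataFrame from a dict of unequal-length lists raises ValueError.
try:
    pd.DataFrame(columns)
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```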

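Solution

Besides making sure we actually append the data correctly, we also need to make sure all columns are well-formed and consistent. Instead of searching for any and all pieces of information that have no relationship to each other, search for all the parent elements that contain the information we need.
We then iterate over the list of parent elements. In each iteration, we search only within that parent for the available data and format it for use in a DataFrame. That DataFrame is appended to a list of DataFrames, which is concatenated into a single DataFrame once the iteration finishes and is then exported.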

# Find all the grid-items first.
sections = soup.find_all('div', {'class': 'grid-item'}, recursive=True)

# We will append our formatted data to this list, then
# provide it to the DataFrame on creation
df_items = []

# Format and add the data from each grid-item to the DataFrame.
for section in sections:
    title = section.find('a', {'class': 'grid-item-link'})
    imgs = section.find_all('img')
    price = section.find('div', {'class': 'product-price'})

    data = {
        'Field_01': [img['data-src'] for img in imgs],
        'Field_02': [title['aria-label']],
        'Field_03': [baseurl + title['href']],
        'Field_04': [''.join(price.text.split())],
    }

    # DataFrames require all arrays to be the same length.
    # This automatically fills in any missing cells.
    df = pd.DataFrame.from_dict(data, orient='index')
    df = df.transpose()

    # Append the DataFrame to our list of DataFrames.
    df_items.append(df)

# Concatenate all dataframes.
result = pd.concat(df_items)

# Export
result.to_csv('data.csv', index=False)
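The padding step in the loop above may not be obvious, so here is a small sketch of just that part, using made-up values (two images and one title for a single product). It only shows how `from_dict(..., orient='index')` followed by `transpose()` treats lists of different lengths.

```python
import pandas as pd

# Hypothetical data for one product: two images, one title.
data = {
    "Field_01": ["cover.png", "hover.png"],
    "Field_02": ["Kitsch Paddle Hair Brush"],
}

# orient='index' turns each key into a row, so shorter lists are padded
# with missing values; transposing turns the keys back into columns.
df = pd.DataFrame.from_dict(data, orient="index").transpose()
print(df)
```

Note the consequence: a product with two images produces two rows, and the second row has empty title, link, and price cells. If you would rather have one row per product, keep only the first image, as the other answer does.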
