Is there a way to combine HTML files and download them as an Excel file in Python?

Posted by rmbxnbpk on 2023-02-06 in Python


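I'm very new to Python, so my question may sound silly. I have downloaded several "Well Completion" files from this link: https://wwwapps.emnrd.nm.gov/OCD/OCDPermitting/Reporting/Activity/WeeklyActivity.aspx. Now I want to use Python to combine all of the files into a single Excel sheet and export it. So far I have been unsuccessful, and I hope I can get an answer here. The problem is that the files download in a form that opens in Excel, but they are actually in HTML format.
The code I used to combine the files is: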

import os
from bs4 import BeautifulSoup
output_doc = BeautifulSoup()
output_doc.append(output_doc.new_tag("html"))
output_doc.html.append(output_doc.new_tag("body"))
data_folder= r'C:\Users\dtsar\OneDrive\Desktop\another well completion'
for file in os.listdir(data_folder):
    if not file.lower().endswith('.html'):
        continue

    with open(file, 'r') as html_file:
        output_doc.body.extend(BeautifulSoup(html_file.read(), "html.parser").body)

print(output_doc.prettify())

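But the response I get is not what I expect (the exact output is not preserved in this copy).
I don't understand where I went wrong. The next step is to export the data to Excel format, but I can't even seem to get all the files combined in the first place. Any ideas?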


Source: https://stackoverflow.com/questions/75306424/is-there-a-way-to-combine-html-files-and-download-them-as-excel-file-in-python

1 Answer


oknwwptz1#

So, I figured out a solution that converts the corrupted Excel files into proper .xlsx files. The code is below in case anyone needs it:

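import os
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

folder_path = r'path to the folder'

for filename in os.listdir(folder_path):
    # The downloads carry an .xls extension but actually contain HTML.
    if not filename.endswith(".xls"):
        continue
    file_path = os.path.join(folder_path, filename)
    with open(file_path) as f:
        soup = BeautifulSoup(f, 'html.parser')
    tables = soup.find_all('table')
    if not tables:
        continue  # nothing to write; a workbook needs at least one sheet
    # Build the output name from the stem so only the extension changes.
    out_path = os.path.splitext(file_path)[0] + ".xlsx"
    # The context manager saves the workbook once, after all sheets are written.
    with pd.ExcelWriter(out_path, engine='openpyxl') as writer:
        for i, table in enumerate(tables):
            caption = table.find('caption')
            if caption:
                # Excel limits sheet names to 31 characters.
                sheet_name = caption.get_text().strip()[:31]
            else:
                sheet_name = 'Sheet{}'.format(i + 1)
            # Wrap the markup in StringIO; newer pandas deprecates literal HTML strings.
            df = pd.read_html(StringIO(str(table)))[0]
            df.to_excel(writer, sheet_name=sheet_name, index=False)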
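
The code above writes one .xlsx per downloaded file, while the question asked for everything in a single sheet. A minimal sketch of that variant follows. It assumes the tables in every file share the same columns (not confirmed anywhere in the thread), and the function name `combine_html_tables`, the `source_file` column, and the output name `combined.xlsx` are made up for illustration.

```python
from io import StringIO
from pathlib import Path

import pandas as pd


def combine_html_tables(folder, pattern="*.xls"):
    """Read every HTML table in the matching files and stack them into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        # errors="replace" is an assumption: the encoding of the downloads is unknown.
        html = path.read_text(errors="replace")
        try:
            tables = pd.read_html(StringIO(html))
        except ValueError:
            continue  # pd.read_html raises ValueError when it finds no tables
        for table in tables:
            # Hypothetical extra column recording which file each row came from.
            table["source_file"] = path.name
            frames.append(table)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Example usage (writing .xlsx requires openpyxl):
# combine_html_tables(r"path to the folder").to_excel("combined.xlsx", index=False)
```

If the files turn out to have different columns, `pd.concat` still works but fills the gaps with NaN, so it is worth checking the combined column list before exporting.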

Posted 2023-02-06
