Converting an .html file to a .csv file when the .html file contains logs and has nested tables

Posted by shstlldc on 2023-09-28 in Other

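I am trying to convert an .html file that contains logs in table form, including nested tables, into a .csv file. One column holds the error report, and inside that column the report is itself a new table. I want to convert the whole table to plain text. I tried doing this in Python with BeautifulSoup but have not succeeded yet: the data from the nested table is spread across all the columns of the parent table instead of staying in its original column. Is there something I can do?
Using Python and the BeautifulSoup library does not give the desired output.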

csv

Source: https://stackoverflow.com/questions/76815142/converting-html-file-to-csv-file-but-html-file-contains-logs-and-it-has-neste

1 Answer

2lpgd9681#

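Converting an HTML file with nested tables to CSV while preserving the structure can be a bit challenging. BeautifulSoup is a good library for parsing HTML, but it needs some extra handling to deal with nested tables correctly.
To get the desired output, you can use BeautifulSoup along with some custom Python code to parse the HTML, extract the data, and organize it into CSV format. Here is a step-by-step approach:
1. Parse the HTML file with BeautifulSoup.
2. Find the parent table and extract its headers.
3. Find all the rows that belong to the parent table.
4. For each row, look for a nested table in the relevant column (if one exists).
5. Extract the data from the nested table and append it to the corresponding cell of the parent table.
Here is a Python code snippet to get you started: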

from bs4 import BeautifulSoup
import csv

def direct_rows(table):
    # Return only the rows that belong to this table, not to tables nested inside it.
    # Rows may sit directly under <table> or inside <thead>/<tbody>/<tfoot>.
    rows = []
    for section in [table] + table.find_all(['thead', 'tbody', 'tfoot'], recursive=False):
        rows.extend(section.find_all('tr', recursive=False))
    return rows

def nested_table_text(table):
    # Flatten a nested table to plain text: cells joined by commas, rows by newlines
    lines = []
    for row in direct_rows(table):
        cells = row.find_all(['td', 'th'], recursive=False)
        lines.append(','.join(cell.get_text(' ', strip=True) for cell in cells))
    return '\n'.join(lines)

def cell_text(cell):
    # Replace every nested table with its plain-text form so the data stays in this cell
    nested = cell.find('table')
    while nested is not None:
        nested.replace_with(nested_table_text(nested))
        nested = cell.find('table')
    return cell.get_text(' ', strip=True)

def convert_html_to_csv(html_filename, csv_filename):
    with open(html_filename, 'r', encoding='utf-8') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')

    # The first <table> in the document is treated as the parent table
    parent_table = soup.find('table')

    with open(csv_filename, 'w', newline='', encoding='utf-8') as csv_file:
        csv_writer = csv.writer(csv_file)
        # The header row is written like any other row; recursive=False keeps
        # the nested table's rows and cells out of the parent's columns
        for row in direct_rows(parent_table):
            cells = row.find_all(['td', 'th'], recursive=False)
            csv_writer.writerow([cell_text(cell) for cell in cells])

if __name__ == '__main__':
    html_filename = 'input.html'
    csv_filename = 'output.csv'
    convert_html_to_csv(html_filename, csv_filename)

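This code joins the cells of each nested-table row with commas and joins the rows with newlines. csv.writer quotes fields that contain commas or newlines, so the file stays valid, but if the nested data itself contains commas, consider a different delimiter for the nested text so it stays readable.
Keep in mind that complex HTML structures may require further adjustments to this code, depending on the specifics of your data. Still, this should be a good starting point for the task.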
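To see how a flattened nested table behaves inside a single CSV field, here is a minimal sketch using only the standard library. The cell value `nested_text` is made up for illustration; it stands in for whatever text your nested table produces.

```python
import csv
import io

# Hypothetical cell value: a nested table flattened to text,
# containing both commas and a newline
nested_text = 'code,42\nmessage,disk full'

# Write one row with two fields to an in-memory file
buffer = io.StringIO()
csv.writer(buffer).writerow(['10:00', nested_text])

# Read the row back: the quoted field comes back as a single value,
# so the nested text stayed in its own column
buffer.seek(0)
row = next(csv.reader(buffer))
print(row)
```

The quoting is what keeps the nested text in one column; a spreadsheet or `csv.reader` will treat it as a single multi-line cell.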
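One thing worth checking first when nested data ends up scattered across the parent's columns: `find_all` searches all descendants by default, so a search on the parent table also returns rows and cells of the nested table. Passing `recursive=False` limits the search to direct children. The sketch below uses a made-up two-column log table; it assumes the rows are direct children of `<table>` (if your file wraps them in `<tbody>`, search inside that element instead).

```python
from bs4 import BeautifulSoup

# Made-up sample: one log row whose "Error" cell holds a nested table
html = """
<table>
  <tr><th>Time</th><th>Error</th></tr>
  <tr><td>10:00</td><td><table><tr><td>code</td><td>42</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
parent = soup.find('table')

all_rows = parent.find_all('tr')                   # also picks up the nested table's row
own_rows = parent.find_all('tr', recursive=False)  # only the parent's own rows

data_row = own_rows[1]
all_cells = data_row.find_all('td')                   # nested cells leak in
own_cells = data_row.find_all('td', recursive=False)  # just the parent's two cells

print(len(all_rows), len(own_rows))
print(len(all_cells), len(own_cells))
```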
