合并CSV文件的标头,并在缺少标头时填充0

8fq7wneg  于 2023-09-27  发布在  其他
关注(0)|答案(2)|浏览(120)

你好

我想合并多个CSV文件。

我有一个参考.csv文件与所有的标题。
我有几个CSV文件,可以有不同的标题。
我想合并引用头到所有其他CSV文件。此外,我想为所有已添加的下一行填充值0。
有一个例子:
reference.csv(我的头)

a;b;c;d

File1.csv

a;c 
45;68

预期结果:
File1.csv

a;b;c;d
45;0;68;0

我试过这个:

(Also尝试Pandas,但没有工作,因为我想)

import os
import csv

# Get the directory where the Python script is located
script_directory = os.path.dirname(__file__)

# Define the folder containing the CSV files
folder_path = script_directory

# Define the path to the reference text file
reference_file_path = os.path.join(folder_path, "reference.txt")

# Read the reference columns from the text file
with open(reference_file_path, "r") as reference_file:
    reference_columns = reference_file.read().splitlines()

# Read the header from the file with the most columns
with open(os.path.join(folder_path, reference_file_path ), "r", newline="") as file:
    header_to_copy = next(csv.reader(file, delimiter=";"))

# Iterate through each CSV file and copy the header
for file_name in reference_file_path:
    file_path = os.path.join(folder_path, file_name)
    data = []

    with open(file_path, "r", newline="") as file:
        reader = csv.reader(file, delimiter=";")
        for row in reader:
            data.append(row)

    # Update the header in the current file
    original_header = data[0]
    data[0] = header_to_copy

    # Print the differences between the original header and the copied header
    differences = set(original_header) ^ set(header_to_copy)
    print(f"Processed: {file_name}")
    print("Differences:")
    print("+----------------+----------------+")
    print("| Original Header | Copied Header  |")
    print("+----------------+----------------+")

    for item in differences:
        original_present = "Yes" if item in original_header else "No"
        copied_present = "Yes" if item in header_to_copy else "No"
        print(f"| {item:<16}| {original_present:<16}| {copied_present:<16}|")

    print("+----------------+----------------+")

    # Print the position of each element in the header if it was not in the original header
    position_dict = {element: position for position, element in enumerate(header_to_copy, start=1)}
    for element in differences:
        if element in header_to_copy:
            position = position_dict[element]
            print(f"Element '{element}' is at position {position} in the header.")

    # Add a new column with the value "0" from the second row onwards at the specified position
    for i in range(1, len(data)):
        for element in differences:
            if element in header_to_copy:
                position = position_dict[element]
                data[i].insert(position - 1, "0")

    # Write the modified data back to the CSV file
    with open(file_path, "w", newline="") as file:
        writer = csv.writer(file, delimiter=";")
        writer.writerows(data)
  • 我在想用的位置或号码“;并使用它添加新的“0”值。但当有一个特定的情况下,它不工作:*

试过的Pandas:

import pandas as pd
import os

# Get the directory where the Python script is located
script_directory = os.path.dirname(__file__)

# Define the folder containing the CSV files (same location as the script)
folder_path = script_directory

# Path to the reference.csv file
reference_csv = os.path.join(script_directory, "ref", "reference.csv")

# Read the reference.csv file
reference_df = pd.read_csv(reference_csv)

# List all CSV files in the folder
csv_files = [file for file in os.listdir(folder_path) if file.endswith(".csv")]

# Initialize an empty list to store DataFrames
dfs = []

# Iterate through the CSV files in the folder
for csv_file in csv_files:
    # Construct the full path to the current CSV file
    csv_file_path = os.path.join(folder_path, csv_file)
    
    # Read the current CSV file
    current_df = pd.read_csv(csv_file_path)
    
    # Copy the header row from the reference DataFrame
    reference_header = reference_df.iloc[0].copy()
    
    # Find the positions where the headers differ
    diff_positions = [i for i, (ref_col, cur_col) in enumerate(zip(reference_header, current_df.columns)) if ref_col != cur_col]
    
    # Print the positions if there are differences
    if diff_positions:
        print(f"Differences in headers for {csv_file}:")
        print(diff_positions)
    
    # Concatenate the reference header row with the current DataFrame and fill missing columns with 0
    current_df = pd.concat([reference_header, current_df], ignore_index=True, axis=0).fillna(0)
    
    # Append the merged DataFrame to the list
    dfs.append(current_df)

# Concatenate all DataFrames in the list along columns
result_df = pd.concat(dfs, axis=1, ignore_index=True)

# Save the merged DataFrame to a new CSV file
result_df.to_csv(os.path.join(script_directory, "merged_data.csv"), index=False)

但有这个错误:

File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 904, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 51 fields in line 18153, saw 52

谢谢

yshpjwxd

yshpjwxd1#

Pandas的concat是最简单的选择,只要我们确保 * 列名 * 是相同的。在第二个文件中,标题行有尾随空格。

>>> from io import StringIO
>>> file1="""a;b;c;d"""
>>> file2="""
... a;c
... 45;68"""
>>>
>>> df_ref=pd.read_csv(StringIO(file1),sep=';')
>>> df_data=pd.read_csv(StringIO(file2),sep=';')

第二个文件中的c列包含尾随空格。

>>> df_data.columns
Index(['a', 'c '], dtype='object')

我们得把这个修剪一下这样柱子才能匹配

>>> # Trim whitespace in the column names
>>> df_data.columns=[c.strip() for c in df_data.columns]

在此之后,对pd.concat的单个调用将合并这两个嵌套框。

>>> df_combined=pd.concat([df_ref,df_data])
>>> df_combined
    a    b   c    d
0  45  NaN  68  NaN

缺失值可以使用fillna替换为0

df_combined=df_combined.fillna(0)

我们可以使用to_csv将合并后的数据写入文件

df_combined.to_csv(target_path,sep=';')

拼接大量文件

pd.concat可以处理任何 Dataframe 序列。我们可以使用Path.rglob按顺序加载所有文件并将它们连接起来。

from pathlib import Path
root=Path('/path/to/folder')

df_all=pd.concat([pd.read_csv(csv_path,sep=';') for csv_path in root.rglob("*.csv")])

为了清理数据,我们可以使用一个单独的函数:

def load_with_trim(csv_path) :
    df=pd.read_csv(csv_path,sep=';')
    df.columns=[c.strip() for c in df.columns]
    return df

all_files=[load_with_trim(csv_path) for csv_path in root.rglob("*.csv")]
df_all=pd.concat(all_files)
ijnw1ujt

ijnw1ujt2#

尝试以下步骤。

1) create a directory and put both csv file in same directory 
2) create new python file like test.py and put below code in python file
3) run test.py and output file is df_merged.csv

验证码

import pandas as pd
#read reference csv
df_reference=pd.read_csv('reference.csv', sep=';')
# trim space from header columns
df_reference.rename(columns=lambda x: x.strip(), inplace=True)

#read data csv
df_data=pd.read_csv('File1.csv', sep=';')
# trim space from header columns
df_data.rename(columns=lambda x: x.strip(), inplace=True)

# concat both csv data
df_merged = pd.concat([df_reference,df_data])
df_merged['b'] = df_merged['b'].fillna(0)
df_merged['d'] = df_merged['d'].fillna(0)
#df_merged.drop(labels=[0], axis=0, inplace=True)
print(df_merged.head())
# write merged data in csv file
df_merged.to_csv('df_merged.csv',index=False, sep=";")

相关问题