你好
我想合并多个CSV文件。
我有一个参考.csv文件与所有的标题。
我有几个CSV文件,可以有不同的标题。
我想合并引用头到所有其他CSV文件。此外,我想为所有已添加的下一行填充值0。
有一个例子:
reference.csv(我的头)
a;b;c;d
File1.csv
a;c
45;68
预期结果:
File1.csv
a;b;c;d
45;0;68;0
我试过这个:
(Also尝试Pandas,但没有工作,因为我想)
import os
import csv
# Get the directory where the Python script is located
script_directory = os.path.dirname(__file__)
# Define the folder containing the CSV files
folder_path = script_directory
# Define the path to the reference text file
reference_file_path = os.path.join(folder_path, "reference.txt")
# Read the reference columns from the text file
with open(reference_file_path, "r") as reference_file:
reference_columns = reference_file.read().splitlines()
# Read the header from the file with the most columns
with open(os.path.join(folder_path, reference_file_path ), "r", newline="") as file:
header_to_copy = next(csv.reader(file, delimiter=";"))
# Iterate through each CSV file and copy the header
for file_name in reference_file_path:
file_path = os.path.join(folder_path, file_name)
data = []
with open(file_path, "r", newline="") as file:
reader = csv.reader(file, delimiter=";")
for row in reader:
data.append(row)
# Update the header in the current file
original_header = data[0]
data[0] = header_to_copy
# Print the differences between the original header and the copied header
differences = set(original_header) ^ set(header_to_copy)
print(f"Processed: {file_name}")
print("Differences:")
print("+----------------+----------------+")
print("| Original Header | Copied Header |")
print("+----------------+----------------+")
for item in differences:
original_present = "Yes" if item in original_header else "No"
copied_present = "Yes" if item in header_to_copy else "No"
print(f"| {item:<16}| {original_present:<16}| {copied_present:<16}|")
print("+----------------+----------------+")
# Print the position of each element in the header if it was not in the original header
position_dict = {element: position for position, element in enumerate(header_to_copy, start=1)}
for element in differences:
if element in header_to_copy:
position = position_dict[element]
print(f"Element '{element}' is at position {position} in the header.")
# Add a new column with the value "0" from the second row onwards at the specified position
for i in range(1, len(data)):
for element in differences:
if element in header_to_copy:
position = position_dict[element]
data[i].insert(position - 1, "0")
# Write the modified data back to the CSV file
with open(file_path, "w", newline="") as file:
writer = csv.writer(file, delimiter=";")
writer.writerows(data)
- 我在想用的位置或号码“;并使用它添加新的“0”值。但当有一个特定的情况下,它不工作:*
试过的Pandas:
import pandas as pd
import os
# Get the directory where the Python script is located
script_directory = os.path.dirname(__file__)
# Define the folder containing the CSV files (same location as the script)
folder_path = script_directory
# Path to the reference.csv file
reference_csv = os.path.join(script_directory, "ref", "reference.csv")
# Read the reference.csv file
reference_df = pd.read_csv(reference_csv)
# List all CSV files in the folder
csv_files = [file for file in os.listdir(folder_path) if file.endswith(".csv")]
# Initialize an empty list to store DataFrames
dfs = []
# Iterate through the CSV files in the folder
for csv_file in csv_files:
# Construct the full path to the current CSV file
csv_file_path = os.path.join(folder_path, csv_file)
# Read the current CSV file
current_df = pd.read_csv(csv_file_path)
# Copy the header row from the reference DataFrame
reference_header = reference_df.iloc[0].copy()
# Find the positions where the headers differ
diff_positions = [i for i, (ref_col, cur_col) in enumerate(zip(reference_header, current_df.columns)) if ref_col != cur_col]
# Print the positions if there are differences
if diff_positions:
print(f"Differences in headers for {csv_file}:")
print(diff_positions)
# Concatenate the reference header row with the current DataFrame and fill missing columns with 0
current_df = pd.concat([reference_header, current_df], ignore_index=True, axis=0).fillna(0)
# Append the merged DataFrame to the list
dfs.append(current_df)
# Concatenate all DataFrames in the list along columns
result_df = pd.concat(dfs, axis=1, ignore_index=True)
# Save the merged DataFrame to a new CSV file
result_df.to_csv(os.path.join(script_directory, "merged_data.csv"), index=False)
但有这个错误:
File "parsers.pyx", line 843, in pandas._libs.parsers.TextReader.read_low_memory
File "parsers.pyx", line 904, in pandas._libs.parsers.TextReader._read_rows
File "parsers.pyx", line 879, in pandas._libs.parsers.TextReader._tokenize_rows
File "parsers.pyx", line 890, in pandas._libs.parsers.TextReader._check_tokenize_status
File "parsers.pyx", line 2058, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 51 fields in line 18153, saw 52
谢谢
2条答案
按热度按时间yshpjwxd1#
Pandas的concat是最简单的选择,只要我们确保 * 列名 * 是相同的。在第二个文件中,标题行有尾随空格。
第二个文件中的
c
列包含尾随空格。我们得把这个修剪一下这样柱子才能匹配
在此之后,对
pd.concat
的单个调用将合并这两个嵌套框。缺失值可以使用fillna替换为
0
:我们可以使用to_csv将合并后的数据写入文件
拼接大量文件
pd.concat
可以处理任何 Dataframe 序列。我们可以使用Path.rglob
按顺序加载所有文件并将它们连接起来。为了清理数据,我们可以使用一个单独的函数:
ijnw1ujt2#
尝试以下步骤。
验证码