Group/sum rows of a CSV file [closed]

Posted by qmelpv7a on 2023-05-26 in Other

**Closed.** This question needs debugging details. It is not currently accepting answers.

Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 days ago.
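I need a script that groups the rows of a CSV file by columns 1 to 4 and sums column 5. The grouping/summing only needs to happen on rows where column 1 = "2". Below is an example of the input and the desired output.
Input (input CSV):

Desired output (correct output CSV):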

I tried the following Python script:

import csv

input_file = r'C:\Users\User\OneDrive\Desktop\test.csv'
output_file = r'C:\Users\User\OneDrive\Documents\Flat file tests\output4.csv'

# Dictionary to store rows with unique identifiers
identifier_rows = {}

# Read the input CSV file
with open(input_file, 'r') as file:
    reader = csv.reader(file)
    header = next(reader)  # Read the header

    for row in reader:
        if row[0] == '1':
            # Do nothing for rows where column 1 is '1'
            pass
        elif row[0] == '2':
            # Columns 2, 3, and 4 are unique identifiers
            identifier = tuple(row[1:4])
            value = float(row[4])

            if identifier in identifier_rows:
                # Update the sum of column 5 for existing rows
                identifier_rows[identifier] += value
            else:
                # Add the row to the dictionary
                identifier_rows[identifier] = value

# Write the output CSV file
with open(output_file, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(header)  # Write the header

    # Write the rows with column 1 as '1'
    with open(input_file, 'r') as file:
        reader = csv.reader(file)
        next(reader)  # Skip the header

        for row in reader:
            if row[0] == '1':
                writer.writerow(row)

    # Write the aggregated rows with column 1 as '2'
    for identifier, value in identifier_rows.items():
        writer.writerow(['2'] + list(identifier) + [str(value)])

print("New CSV file output 4 has been created successfully.")

and got the following output (incorrect output):

But the rows need to be grouped under each header row (i.e., each row where column 1 = "1").

oyxsuwqo #1

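You need to think about the data structure that will hold each group and keep track of its sum.
For the first subset of rows, it looks to me like: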

{
    ("2", "1010", "2105", "NONE"): -200.0,
    ("2", "1050", "2105", "NONE"): -150.0,
    ("2", "1010", "2105", "COSF"): -150.2,
    ("2", "1010", "2104", "NONE"): -75.0,
    ("2", "1010", "2104", "COSF"): -75.1,
}

This is a dict whose keys are tuples of the first four fields and whose values are the sums.
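As a minimal sketch of how such a dict could be built up, assuming a few made-up sample rows (the values below are illustrative and not taken from the question's input):

```python
# Hypothetical sample rows; the real input rows are not reproduced here.
rows = [
    ["2", "1010", "2105", "NONE", "-100.0"],
    ["2", "1010", "2105", "NONE", "-100.0"],
    ["2", "1050", "2105", "NONE", "-150.0"],
]

grouped_sums: dict[tuple[str, ...], float] = {}
for row in rows:
    key = tuple(row[:4])  # the first four fields identify the group
    # start from 0.0 the first time a key is seen, then keep adding
    grouped_sums[key] = grouped_sums.get(key, 0.0) + float(row[4])

print(grouped_sums)
```

Tuples work as dict keys because they are hashable; a list of the same four fields would not.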
You also need to store the header for this subset:

header = ['1', 'AAA', 'BBB', 'CCC', 'DDD']

Then you need to sort the groups and format the floats, and append them as rows to a list of final rows.
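A rough sketch of that sort-and-format step, assuming made-up sums and one possible ordering (by the second and third fields, then by the sum); the three-decimal format is likewise just an assumption about the desired output:

```python
# Hypothetical grouped sums for one subset (illustrative values only).
grouped_sums = {
    ("2", "1050", "2105", "NONE"): -150.0,
    ("2", "1010", "2105", "NONE"): -200.0,
    ("2", "1010", "2104", "NONE"): -75.0,
}

final_rows: list[list[str]] = []

# Sort by the second and third key fields, then by the sum itself.
ordered = sorted(grouped_sums.items(), key=lambda kv: (kv[0][1], kv[0][2], kv[1]))
for key, total in ordered:
    # Format the float with three decimals and rebuild a plain CSV row.
    final_rows.append(list(key) + [f"{total:.3f}"])

for row in final_rows:
    print(",".join(row))
```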
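Then repeat the above for every subset of rows; the subsets are delimited by the header rows that start with "1".
Every time a header row is found in the input, update the final rows with the last set of grouped sums, and reset the state variables for the next group: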

...
    for row in reader:
        if row[0] == "1":
            update_final(header)
            header = row
            grouped_sums = {}
            continue
...

Every other row (that is, any row that is not a header) should be data to group and sum:

...
        key = tuple(row[:4])
        val = float(row[4])

        # add to the running total for this key (avoids shadowing the built-in sum())
        grouped_sums[key] = grouped_sums.get(key, 0.0) + val

You also need to handle the data that is still sitting in the state/global variables after the last row:

...
    update_final(header)
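To see why that last call matters, here is a small sketch that only splits rows into subsets. The helper name `split_subsets` and the sample rows are hypothetical; the point is that the last subset is never followed by another header, so it has to be flushed after the loop:

```python
def split_subsets(rows):
    """Split rows into (header, data_rows) pairs; header rows start with "1"."""
    subsets = []
    header, data = None, []
    for row in rows:
        if row[0] == "1":
            # A new header closes the previous subset, if there was one.
            if header is not None:
                subsets.append((header, data))
            header, data = row, []
        else:
            data.append(row)
    # Final flush: without it the last subset would be silently dropped.
    if header is not None:
        subsets.append((header, data))
    return subsets

# Hypothetical, shortened rows for illustration.
rows = [
    ["1", "AAA"], ["2", "x"], ["2", "y"],
    ["1", "AAB"], ["2", "z"],
]
subsets = split_subsets(rows)
print(len(subsets))
```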

Here's the whole script. I use type hints, which really helped me keep all the lists and tuples straight here:

import csv

Row = list[str]
Group_Key = tuple[str, str, str, str]
Subset_Row = tuple[str, str, str, str, float]

# global vars to update in the main loop of the input CSV; and read from in update_final()
header: Row = []
grouped_sums: dict[Group_Key, float] = {}
# global var to update in update_final(); and read from to write to output CSV
final_rows: list[Row] = []

def update_final(header: Row):
    # initial call for first row of CSV, nothing collected/grouped/summed yet
    if len(grouped_sums) == 0:
        return

    # create intermediate list of groups to sort
    subset: list[Subset_Row] = []
    for key, total in grouped_sums.items():
        subset.append(key + (total,))
    # sort by the second and third fields, then by the sum
    subset.sort(key=lambda x: (x[1], x[2], x[4]))

    # add to final
    final_rows.append(header)
    for row in subset:
        final_rows.append(list(row[:4]) + [f"{row[4]:.3f}"])

with open("input.csv", newline="") as f:
    reader = csv.reader(f)

    for row in reader:
        if row[0] == "1":
            update_final(header)
            # reset state vars
            header = row
            grouped_sums = {}
            # no values to sum, skip to next row
            continue

        key = tuple(row[:4])
        val = float(row[4])

        # add to the running total for this key (avoids shadowing the built-in sum())
        grouped_sums[key] = grouped_sums.get(key, 0.0) + val

    update_final(header)

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f, lineterminator="\n")
    writer.writerows(final_rows)

I get final output that is very close to what you show in your image; there is a small sorting difference in the last group:

1,AAA,BBB,CCC,DDD
2,1010,2104,COSF,-75.100
2,1010,2104,NONE,-75.000
2,1010,2105,NONE,-200.000
2,1010,2105,COSF,-150.200
2,1050,2105,NONE,-150.000
1,AAB,BBB,CCC,DDD
2,1010,2104,NONE,-75.000
2,1010,2105,NONE,-275.100
2,1050,2105,NONE,-150.000
1,AAC,BBB,CCC,DDD
2,1010,2104,COSF,-75.100
2,1010,2104,NONE,-75.000
2,1010,2105,NONE,-200.000
2,1010,2105,COSF,-150.200
2,1050,2105,NONE,-150.000
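If you would rather avoid the global variables, the same idea can be wrapped in functions. This is only a sketch under a few assumptions: the names `flush` and `group_and_sum` are hypothetical, the sample data is made up, and data rows that appear before any header are simply ignored, which may or may not be what you want:

```python
import csv
import io

def flush(header, sums, out):
    """Append one header and its sorted, formatted group rows to out."""
    if header is None or not sums:
        return
    out.append(header)
    ordered = sorted(sums.items(), key=lambda kv: (kv[0][1], kv[0][2], kv[1]))
    for key, total in ordered:
        out.append(list(key) + [f"{total:.3f}"])

def group_and_sum(rows):
    """Group data rows under each header row (column 1 == "1"), summing column 5."""
    out = []
    header, sums = None, {}
    for row in rows:
        if not row:
            continue  # skip blank lines
        if row[0] == "1":
            flush(header, sums, out)  # close the previous subset
            header, sums = row, {}
            continue
        key = tuple(row[:4])
        sums[key] = sums.get(key, 0.0) + float(row[4])
    flush(header, sums, out)  # the last subset has no header after it
    return out

# Made-up sample input, read from memory instead of a file.
sample = io.StringIO(
    "1,AAA,BBB,CCC,DDD\n"
    "2,1010,2105,NONE,-100\n"
    "2,1010,2105,NONE,-100\n"
    "1,AAB,BBB,CCC,DDD\n"
    "2,1050,2105,NONE,-150\n"
)
result = group_and_sum(csv.reader(sample))
for row in result:
    print(",".join(row))
```

Because `group_and_sum` takes any iterable of rows, it can be fed a `csv.reader` over a real file just as well as the in-memory sample used here.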
