How do I remove duplicates from a CSV file?

flvtvl50 asked on 2023-01-18

I downloaded a CSV file from Hotmail, but it has a lot of duplicates in it. These duplicates are complete copies, and I don't know why my phone created them.
I want to get rid of the duplicates.

Specs:

Windows XP SP 3
Python 2.7
CSV file with 400 contacts

zd287kbt · Answer #1

Update (2016):

If you're happy to use the helpful more_itertools external library:

from more_itertools import unique_everseen
with open('1.csv', 'r') as f, open('2.csv', 'w') as out_file:
    out_file.writelines(unique_everseen(f))
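
For reference, unique_everseen yields each element the first time it appears, preserving order. It is roughly equivalent to this stdlib-only generator (a minimal sketch based on the itertools recipes; the real more_itertools function also accepts a key argument):

def unique_everseen(iterable):
    # yield elements in order, skipping any element already seen
    seen = set()
    for element in iterable:
        if element not in seen:
            seen.add(element)
            yield element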

A more efficient version of @IcyFlame's solution:

with open('1.csv', 'r') as in_file, open('2.csv', 'w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)

To edit the same file in place, you could use the following (old Python 2 code):

import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen: continue # skip duplicate

    seen.add(line)
    print line, # standard output is now redirected to the file
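
On Python 3, the same in-place approach works with the print function; a sketch using fileinput.input with inplace=True, which is the Python 3 spelling of the above:

import fileinput

seen = set()  # set for fast O(1) amortized lookup
for line in fileinput.input('1.csv', inplace=True):
    if line in seen:
        continue  # skip duplicate
    seen.add(line)
    print(line, end='')  # standard output is redirected into the file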

vawmfj5a · Answer #2

You can remove duplicates efficiently using Pandas, which can be installed with pip or comes with the Anaconda distribution of Python.
See pandas.DataFrame.drop_duplicates.

pip install pandas

The code:

import pandas as pd
file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"

df = pd.read_csv(file_name)  # default sep is ","; pass sep="\t" for tab-separated input

# Notes:
# - the `subset=None` means that every column is used 
#    to determine if two rows are different; to change that specify
#    the columns as an array
# - the `inplace=True` means that the data structure is changed and
#   the duplicate rows are gone  
df.drop_duplicates(subset=None, inplace=True)

# Write the results to a different file
df.to_csv(file_name_output, index=False)

For encoding problems, set encoding=... to the appropriate type from Python's standard encodings.
See Import CSV file as a pandas DataFrame for details on pd.read_csv.
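
For example, a sketch (the encoding and the "Email" column name are placeholders, not from the original answer):

# hypothetical example: explicit encoding, de-duplicating on one column only
df = pd.read_csv(file_name, encoding="utf-8")       # or e.g. "latin-1" if utf-8 fails
df.drop_duplicates(subset=["Email"], inplace=True)  # "Email" is a placeholder column name
df.to_csv(file_name_output, index=False)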


xv8emn3q · Answer #3

You can use the following script:

Prerequisites:

  1. 1.csv is the file that contains the duplicates.
  2. 2.csv is the output file, which will be rid of the duplicates once this script is executed.

Code:

inFile = open('1.csv', 'r')
outFile = open('2.csv', 'w')

listLines = []
for line in inFile:
    if line in listLines:
        continue
    else:
        outFile.write(line)
        listLines.append(line)

outFile.close()
inFile.close()

Explanation of the algorithm

What I am doing here:
  1. Open a file in read mode. This is the file that has the duplicates.
  2. Then, in a loop that runs until the end of the file, check whether the line has already been encountered.
  3. If it has already been encountered, don't write it to the output file.
  4. If not, write it to the output file and add it to the list of records encountered so far.
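
Note that this script (and the set-based variants above) compares raw text lines, so two rows that differ only in quoting or trailing whitespace count as distinct. If that matters, here is a sketch that compares parsed fields instead via the csv module (my addition, Python 3 style, not part of the original answer):

import csv

with open('1.csv', 'r', newline='') as in_file, open('2.csv', 'w', newline='') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    seen = set()
    for row in reader:
        key = tuple(row)  # compare parsed fields, not the raw line text
        if key in seen:
            continue      # skip duplicate row
        seen.add(key)
        writer.writerow(row)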


ehxuflar · Answer #4

I know this was settled long ago, but I had a closely related problem where I needed to remove duplicates based on one column. The input CSV file was too large to open on my PC with MS Excel / LibreOffice Calc / Google Sheets: 147 MB, with about 2.5 million records. Since I did not want to install a whole external library for such a simple thing, I wrote the Python script below, which did the job in less than 5 minutes. I did not focus on optimization, but I believe it could be optimized to run faster and more efficiently for even bigger files. The algorithm is similar to @IcyFlame's above, except that I am removing duplicates based on a column ('CCC') rather than the whole row/line.

import csv

with open('results.csv', 'r') as infile, open('unique_ccc.csv', 'a') as outfile:
    # this list will hold unique ccc numbers
    ccc_numbers = []
    # read the input file into dictionaries; there were some null bytes in the infile
    results = csv.DictReader(infile)
    writer = csv.writer(outfile)

    # write column headers to output file
    writer.writerow(
        ['ID', 'CCC', 'MFLCode', 'DateCollected', 'DateTested', 'Result', 'Justification']
    )
    for result in results:
        ccc_number = result.get('CCC')
        # if the value already exists in the list, skip writing its whole row to the output file
        if ccc_number in ccc_numbers:
            continue
        writer.writerow([
            result.get('ID'),
            ccc_number,
            result.get('MFLCode'),
            result.get('datecollected'),
            result.get('DateTested'),
            result.get('Result'),
            result.get('Justification')
        ])

        # add the value to the list so that it is skipped subsequently
        ccc_numbers.append(ccc_number)
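
One design tweak worth noting: with ~2.5 million records, each `ccc_number in ccc_numbers` test scans a Python list, which is O(n). Swapping the list for a set (as the other answers here do) should speed this up considerably; a minimal sketch of the change:

# use a set instead of a list for O(1) membership tests
ccc_numbers = set()
# ... and inside the loop, replace ccc_numbers.append(ccc_number) with:
ccc_numbers.add(ccc_number)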

t5fffqht · Answer #5

A more efficient version of @jamylak's solution (one instruction fewer):

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen: 
            seen.add(line)
            out_file.write(line)

To edit the same file in place, you can use the following (again, old Python 2 code):

import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line not in seen:
        seen.add(line)
        print line, # standard output is now redirected to the file
