我可以导入CSV文件并自动推断分隔符吗?

6xfqseft  于 2023-09-28  发布在  其他
关注(0)|答案(6)|浏览(84)

我想导入两种CSV文件,有的用“;”作为分隔符,其他人则使用“,”。到目前为止,我一直在下面两行之间切换:

reader=csv.reader(f,delimiter=';')

reader=csv.reader(f,delimiter=',')

有没有可能不指定这个参数,而让程序检查正确的参数?
下面的解决方案(Blender和sharth)似乎适用于逗号分隔的文件(使用Librooffice生成),但不适用于分号分隔的文件(使用MS Office生成)。以下是一个以分号分隔的文件的第一行:

ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
x3naxklr

x3naxklr1#

csv模块似乎建议使用csv嗅探器来解决这个问题。
他们给予了下面的例子,我已经为您的情况进行了调整。

with open('example.csv', 'rb') as csvfile:  # python 3: 'r',newline=""
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...

我们来试试

[9:13am][wlynch@watermelon /tmp] cat example 
#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)

        for line in reader:
            print line

def main():
    print 'Comma Version:'
    parse('comma_separated.csv')

    print
    print 'Semicolon Version:'
    parse('semicolon_separated.csv')

    print
    print 'An example from the question (kingdom.csv)'
    parse('kingdom.csv')

if __name__ == '__main__':
    main()

和我们的样本输入

[9:13am][wlynch@watermelon /tmp] cat comma_separated.csv 
test,box,foo
round,the,bend

[9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv 
round;the;bend
who;are;you

[9:22am][wlynch@watermelon /tmp] cat kingdom.csv 
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

如果我们执行示例程序:

[9:14am][wlynch@watermelon /tmp] ./example 
Comma Version:
['test', 'box', 'foo']
['round', 'the', 'bend']

Semicolon Version:
['round', 'the', 'bend']
['who', 'are', 'you']

An example from the question (kingdom.csv)
['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes']
['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']

另外,我还需要注意我使用的是什么版本的python。

[9:20am][wlynch@watermelon /tmp] python -V
Python 2.7.2
hxzsmxv2

hxzsmxv22#

给定一个同时处理,(逗号)和|(竖线)分隔的CSV文件,格式良好,我尝试了以下方法(如https://docs.python.org/2/library/csv.html#csv.Sniffer所示):

dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')

然而,在A|-delimited文件时,返回“无法确定分隔符”异常。我们有理由推测,如果每一行都有相同数量的分隔符(不包括引号中的分隔符),嗅探启发式算法可能会工作得最好。因此,我没有阅读文件的前1024个字节,而是尝试完整地读取前两行:

temp_lines = csvfile.readline() + '\n' + csvfile.readline()
dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|')

到目前为止,这对我来说很好。

oaxa6hgo

oaxa6hgo3#

为了解决这个问题,我创建了一个函数,它读取文件(头)的第一行并检测分隔符。

def detectDelimiter(csvFile):
    with open(csvFile, 'r') as myCsvfile:
        header=myCsvfile.readline()
        if header.find(";")!=-1:
            return ";"
        if header.find(",")!=-1:
            return ","
    #default delimiter (MS Office export)
    return ";"
azpvetkf

azpvetkf4#

如果你使用DictReader,你可以这样做:

#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.DictReader(csvfile, dialect=dialect)

        for line in reader:
            print(line['ReleveAnnee'])

我在Python 3.5中使用了这个,它是这样工作的。

djmepvbi

djmepvbi5#

我不认为有一个完美的通用解决方案(我可能使用,作为分隔符的原因之一是我的一些数据字段需要能够包含; ...)。一个简单的启发式决定可能是简单地阅读第一行(或更多行),计算它包含多少,;字符(可能忽略引号内的字符,如果创建.csv文件的任何东西正确且一致地引用条目),并猜测两者中更频繁的是正确的分隔符。

dzjeubhm

dzjeubhm6#

如果csv.Sniffer不符合您的需求,请遵循@twalberg的想法,这里有两种可能的实现来识别正确的分隔符,但不仅仅是检查常见的,;|分隔符,而是普遍识别类csv文件中的任何类型的奇怪分隔符。

天真的方法

下面的代码读取csv文件的前10行,获取任何非字母数字字符,并计算频率。
此代码完全基于大数定律,即最常用的非字母数字字符一般应作为分隔符。

import re
from collections import Counter
def delimiter_detector(file_path):
    sample_data = ""
    with open(file_path, 'r') as file:
        # Not loading the whole CSV into memory, just the first 10 rows
        i = 0
        while i < 10:
            try:
                sample_data += file.readline()
                i += 1
            except StopIteration:
                break

    non_alnum_chars = re.findall(r'[^a-zA-Z0-9]', sample_data)
    delimiters_frequency = Counter(non_alnum_chars)
    if len(delimiters_frequency) == 0:
        return None

    # Find and return the most common delimiter
    most_common_delimiter = delimiters_frequency.most_common(1)[0][0]
    return most_common_delimiter

print(delimiter_detector('test.csv'))

当然,如果我们有5列由|(4次/行)分隔,但在接下来的9行中的每一行中也有5+ . char,这将打破。{'|': 10*4, '.': 9*5}

不那么幼稚的方法

因此,更好的方法是首先检查和计数标题/第一行中的所有特殊字符,然后在后续行中执行相同的操作。
在识别第一行中的特殊字符之后,然后检查这些字符中,哪一个的频率在其余行中匹配得最多。
继续上面的例子,即使在最坏的情况下,如果我们在标题行中有4个|和4个.,这意味着|.可以作为定界符,通过检查接下来的n行,通常'|':4将是最常见的,而.和其他特殊字符将有所不同。

import re
from collections import Counter

def frequency_counter(sample_data):
    non_alnum_chars = re.findall(r'[^a-zA-Z0-9]', sample_data)
    return dict(Counter(non_alnum_chars))
def delimiter_detector(file_path):
    possible_delimiters = []

    with open(file_path, 'r') as file:
        # Not loading the whole CSV into memory, just the first 10 rows
        i = 0
        while i < 10:
            try:
                freqeunt_nonalpha = frequency_counter(file.readline().strip())
                possible_delimiters.append(freqeunt_nonalpha)
                i += 1
            except StopIteration:
                break

    if len(possible_delimiters) == 0:
        return None

    # Find the most common delimiter in the header row
    potential_delimiters = []
    header_row = possible_delimiters[0]
    # adding potential delimiter to the list if it's in the header row and the frequencies are equal
    for data_row in possible_delimiters[1:]:
        for data_row_delim in data_row:
            if data_row_delim in header_row:
                # if the header and data row frequencies are equal, it's a potential delimiter
                if header_row[data_row_delim] == data_row[data_row_delim]:
                    potential_delimiters.append(data_row_delim)

    # identify the most common potential delimiter
    most_common_delimiter = Counter(potential_delimiters).most_common()
    print(most_common_delimiter)
    return most_common_delimiter[0][0][0]

print(delimiter_detector('test.csv'))

在第一种天真的方法失败的地方,这种方法将起作用。

c1|c2|c3|c4|c5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.

唯一会失败的情况是,如果另一个特殊字符出现在标题行中,并且也出现在其他行中,与实际定界符字符的次数完全相同

c.1|c.2|c.3|c.4|c.5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.

在这种情况下,.|一样是有效的分隔符。然而,如果存在甚至一个行,其中计数与标题行中的计数不完全相同,则后一种方法将成功地标识正确的分隔符。因此,增加要检查的行数可能是值得的。

c.1|c.2|c.3|c.4|c.5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.
a.|b.|c.|d..|e.

需要考虑的其他事项

上述方法假设CSV文件具有标题行。如果文件没有标头,则该方法将第一个数据行视为标头,从而导致分隔符检测中的潜在错误。
另一件经常给我带来问题的事情是文件编码。特别是,如果你在Windows上工作,Excel喜欢用Windows-1250编码保存它。因此,请确保在阅读文件时定义了正确的编码。要检测编码,可以使用chardet库。

相关问题