Python: using chardet to find the encoding of a very large file

Posted by inkz8wg9 on 2023-03-21 in Python


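I'm trying to use chardet to deduce the encoding of a very large file (> 4 million rows) in tab-delimited format.
At the moment my script struggles, presumably because of the size of the file. I'd like to narrow it down to loading only the first x rows of the file, but I ran into difficulty when I tried to use readline().
The script as it stands is: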

import chardet
import os
filepath = os.path.join(r"O:\Song Pop\01 Originals\2017\FreshPlanet_SongPop_0517.txt")
rawdata = open(filepath, 'rb').readline()

print(rawdata)
result = chardet.detect(rawdata)
print(result)

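I have tried using a simple loop to call readline() multiple times, but that didn't work well (possibly because the script opens the file in binary mode).
The output for one line is {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
I was wondering whether increasing the number of lines it reads would improve the encoding confidence.
Any help would be greatly appreciated.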

python

Source: https://stackoverflow.com/questions/46037058/using-chardet-to-find-encoding-of-very-large-file

4 Answers


enyaitl31#

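I'm by no means particularly experienced with chardet, but I came across this post while debugging my own issue and was surprised that it didn't have any answers. Sorry if this is too late to be of any help to the OP, but for anyone else who stumbles upon this:
I'm not sure whether reading more of the file will improve the guessed encoding, but all you need to do to find out is test it: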

import chardet
testStr = b''
count = 0
with open('Huge File!', 'rb') as x:
    line = x.readline()
    while line and count < 50:  #Set based on lines you'd want to check
        testStr = testStr + line
        count = count + 1
        line = x.readline()
print(chardet.detect(testStr))
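
To check whether a larger sample actually changes the guess, the same idea can be wrapped in a small helper and run for several sample sizes. This is only a sketch: `read_head` and `compare_sample_sizes` are names made up for illustration, and the detector is passed in as a callable so that `chardet.detect` (or anything else that accepts bytes) can be plugged in.

```python
from itertools import islice

def read_head(path, max_lines):
    """Return the first max_lines lines of the file as a single bytes object."""
    with open(path, 'rb') as f:
        # Iterating a binary file yields bytes lines; islice stops after max_lines
        return b''.join(islice(f, max_lines))

def compare_sample_sizes(path, sizes, detect):
    """Run detect() on progressively larger samples and collect the results.

    detect is any callable that takes bytes and returns a result,
    for example chardet.detect (supplied by the caller).
    """
    return {n: detect(read_head(path, n)) for n in sizes}

# Example call (hypothetical file name, requires chardet to be installed):
#   import chardet
#   compare_sample_sizes('Huge File!', (1, 50, 1000, 10000), chardet.detect)
```

If the reported encoding and confidence stop changing as the sample grows, reading further is unlikely to help.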

In my case, I had a file that I believed contained multiple encodings, so I wrote the following to test it line by line.

import chardet
with open('Huge File!', 'rb') as x:
    curChar = None
    for line in x:
        detected = chardet.detect(line)  # run detection once per line
        if detected != curChar:  # print only when the guess changes
            curChar = detected
            print(curChar)
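
A different way to hunt for mixed encodings, without chardet, is to try decoding each line with the encoding you expect and report the lines that fail. Note that this is a different technique from the per-line detection above, and the function name `undecodable_lines` is hypothetical. It is most useful with a strict encoding such as UTF-8; single-byte encodings like Windows-1252 accept almost any byte sequence, so they will rarely raise an error.

```python
def undecodable_lines(path, encoding='utf-8'):
    """Yield (line_number, raw_bytes) for each line that fails to decode."""
    with open(path, 'rb') as f:
        # Line numbers are 1-based to match what a text editor shows
        for number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                yield number, raw
```

Because it is a generator, only one line is held in memory at a time, so it can be run over the whole file.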


57hvy0tb2#

Another example, using UniversalDetector:

#!/usr/bin/env python
from chardet.universaldetector import UniversalDetector

def detect_encode(file):
    detector = UniversalDetector()
    detector.reset()
    with open(file, 'rb') as f:
        for row in f:
            detector.feed(row)
            if detector.done: break

    detector.close()
    return detector.result

if __name__ == '__main__':
    print(detect_encode('example_file.csv'))

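The loop breaks as soon as the detector is confident in its result (detector.done is set), so the rest of the file is never read. Useful for very large files.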
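
One caveat: if the detector never becomes confident, the loop above still reads the whole file. A possible variation, sketched below, feeds fixed-size chunks and gives up after a byte budget. The function name `detect_encoding_capped` and the `chunk_size` and `max_bytes` parameters (and their defaults) are assumptions for illustration, not part of chardet.

```python
from chardet.universaldetector import UniversalDetector

def detect_encoding_capped(path, chunk_size=64 * 1024, max_bytes=10 * 1024 * 1024):
    """Feed the file to chardet in chunks, stopping early or at max_bytes."""
    detector = UniversalDetector()
    read_so_far = 0
    with open(path, 'rb') as f:
        # Stop when the detector is done or the byte budget is used up
        while read_so_far < max_bytes and not detector.done:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            detector.feed(chunk)
            read_so_far += len(chunk)
    detector.close()  # finalizes the guess from whatever was fed so far
    return detector.result
```

Reading by chunk rather than by line also avoids problems with files that contain very long lines or no line breaks at all.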


yfjy0ee73#

Another example that avoids loading the file into memory, using the python-magic package:

import magic

def detect(
    file_path,
):
    return magic.Magic(
        mime_encoding=True,
    ).from_file(file_path)


7gyucuyw4#

import chardet

# filepath is the path defined in the question; only the first 100,000 bytes are read
with open(filepath, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
print(result)
