sqlite: How to fix "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>"?

slsn1g29  posted on 2023-08-06  in  SQLite

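I am currently trying to get a Python 3 program to do some manipulations on a text file full of information, running it through the Spyder IDE/GUI. However, when I try to read the file, I get the following error: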

File "<ipython-input-13-d81e1333b8cd>", line 77, in <module>
    parser(f)

  File "<ipython-input-13-d81e1333b8cd>", line 18, in parser
    data = infile.read()

  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 29815: character maps to <undefined>

The program code is as follows:

import os

os.getcwd()

import glob
import re
import sqlite3
import csv

def parser(file):

    # Open a TXT file. Store all articles in a list. Each article is an item
    # of the list. Split articles based on the location of such string as
    # 'Document PRN0000020080617e46h00461'

    articles = []
    with open(file, 'r') as infile:
        data = infile.read()
    start = re.search(r'\n HD\n', data).start()
    for m in re.finditer(r'Document [a-zA-Z0-9]{25}\n', data):
        end = m.end()
        a = data[start:end].strip()
        a = '\n   ' + a
        articles.append(a)
        start = end

    # In each article, find all used Intelligence Indexing field codes. Extract
    # content of each used field code, and write to a CSV file.

    # All field codes (order matters)
    fields = ['HD', 'CR', 'WC', 'PD', 'ET', 'SN', 'SC', 'ED', 'PG', 'LA', 'CY', 'LP',
              'TD', 'CT', 'RF', 'CO', 'IN', 'NS', 'RE', 'IPC', 'IPD', 'PUB', 'AN']

    for a in articles:
        used = [f for f in fields if re.search(r'\n   ' + f + r'\n', a)]
        unused = [[i, f] for i, f in enumerate(fields) if not re.search(r'\n   ' + f + r'\n', a)]
        fields_pos = []
        for f in used:
            f_m = re.search(r'\n   ' + f + r'\n', a)
            f_pos = [f, f_m.start(), f_m.end()]
            fields_pos.append(f_pos)
        obs = []
        n = len(used)
        for i in range(0, n):
            used_f = fields_pos[i][0]
            start = fields_pos[i][2]
            if i < n - 1:
                end = fields_pos[i + 1][1]
            else:
                end = len(a)
            content = a[start:end].strip()
            obs.append(content)
        for f in unused:
            obs.insert(f[0], '')
        obs.insert(0, file.split('/')[-1].split('.')[0])  # insert Company ID, e.g., GVKEY
        # print(obs)
        cur.execute('''INSERT INTO articles
                       (id, hd, cr, wc, pd, et, sn, sc, ed, pg, la, cy, lp, td, ct, rf,
                       co, ina, ns, re, ipc, ipd, pub, an)
                       VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                       ?, ?, ?, ?, ?, ?, ?, ?)''', obs)

# Write to SQLITE
conn = sqlite3.connect('factiva.db')
with conn:
    cur = conn.cursor()
    cur.execute('DROP TABLE IF EXISTS articles')
    # Mirror all field codes, except that 'IN' becomes 'ina' because IN is a reserved SQL keyword
    cur.execute('''CREATE TABLE articles
                   (nid integer primary key, id text, hd text, cr text, wc text, pd text,
                   et text, sn text, sc text, ed text, pg text, la text, cy text, lp text,
                   td text, ct text, rf text, co text, ina text, ns text, re text, ipc text,
                   ipd text, pub text, an text)''')
    for f in glob.glob('*.txt'):
        print(f)
        parser(f)

# Write to CSV to feed Stata
with open('factiva.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    with conn:
        cur = conn.cursor()
        cur.execute('SELECT * FROM articles WHERE hd IS NOT NULL')
        colname = [desc[0] for desc in cur.description]
        writer.writerow(colname)
        for obs in cur.fetchall():
            writer.writerow(obs)

piztneat #1

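As you can see at https://en.wikipedia.org/wiki/Windows-1252, the byte 0x9D is not defined in CP1252.
The "error" is in your open call: you do not specify an encoding, so Python falls back to the system's default encoding, which on Windows is typically a legacy code page such as cp1252 (as your traceback shows). In general, if you read a file that may not have been created on the same machine, it is better to specify the encoding.
I recommend also specifying an encoding on the open call that writes the CSV. It is better to be explicit.
I do not know the original file's format, but adding , encoding='utf-8' to open is usually a good choice (UTF-8 is the usual default on Linux and macOS).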
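
A minimal sketch of the advice above, assuming the input files really are UTF-8 (the question does not confirm this). The helper names read_text and write_rows and the sample string are hypothetical, not part of the original program. It also shows where a 0x9d byte can come from: a curly closing quote (U+201D) is encoded in UTF-8 as E2 80 9D, and cp1252 has no character at 0x9D.

```python
import csv
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file with an explicit encoding (assumed UTF-8)."""
    with open(path, "r", encoding=encoding) as infile:
        return infile.read()


def write_rows(path, rows, encoding="utf-8"):
    """Write rows to a CSV file, again naming the encoding explicitly."""
    with open(path, "w", newline="", encoding=encoding) as csvfile:
        csv.writer(csvfile).writerows(rows)


# Hypothetical sample text containing curly quotes; its UTF-8 bytes
# include 0x9d, which cp1252 cannot decode.
sample = "He said \u201chello\u201d"
raw = sample.encode("utf-8")

try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "sample.txt")
    with open(txt_path, "wb") as f:
        f.write(raw)
    # Reading with encoding="utf-8" recovers the original text.
    round_trip = read_text(txt_path)

    csv_path = os.path.join(tmp, "out.csv")
    write_rows(csv_path, [["id", "hd"], ["1", sample]])
    csv_text = read_text(csv_path)
```

In the question's parser, the equivalent change would be open(file, 'r', encoding='utf-8'), and likewise for the open call that writes factiva.csv. If the files turn out not to be UTF-8, the read will still fail, but with an error that names the real problem.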

7jmck4yq #2

Add the encoding to the open statement, for example:

f=open("filename.txt","r",encoding='utf-8')


mec1mxoz #3

The above did not work for me. Try this instead: adding , errors='ignore' to the open call worked wonders!
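
To make the trade-off concrete, here is a small sketch of what errors='ignore' does compared with errors='replace'. The byte strings are made-up examples, not data from the question. With 'ignore' the undecodable byte disappears silently; with 'replace' it becomes U+FFFD, so the loss stays visible.

```python
# Made-up bytes containing 0x9d, which cp1252 cannot decode.
data = b"abc\x9ddef"

# 'ignore' drops the offending byte without any trace.
ignored = data.decode("cp1252", errors="ignore")

# 'replace' substitutes U+FFFD (the replacement character) instead.
replaced = data.decode("cp1252", errors="replace")

# If the file is really UTF-8, 'ignore' does not recover the text:
# the remaining bytes of each multi-byte character are still decoded
# as cp1252 and come out as mojibake.
utf8_text = "He said \u201chello\u201d"
mojibake = utf8_text.encode("utf-8").decode("cp1252", errors="ignore")

print(repr(ignored))
print(repr(replaced))
print(repr(mojibake))
```

So errors='ignore' makes the exception go away, but when the real cause is a wrong encoding, the resulting text can still be damaged.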

vddsk6oq #4

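If you do not need to decode the file, you can also try file = open(filename, 'rb'), where 'rb' means read binary. This is enough if, for example, you just want to upload the file to a website.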
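
A short sketch of the binary-mode approach, using a hypothetical read_bytes helper and a throwaway temporary file. No decoding happens, so no UnicodeDecodeError can be raised; the result is a bytes object, which suits uploading or hashing but not the regex-based text parsing in the question.

```python
import hashlib
import os
import tempfile


def read_bytes(path):
    """Return the raw contents of a file without decoding them."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "upload.txt")
    # Made-up content that includes the problematic 0x9d byte.
    with open(path, "wb") as f:
        f.write(b"abc\x9ddef")

    payload = read_bytes(path)
    # Bytes can be passed on as-is, for example to a hash function.
    digest = hashlib.sha256(payload).hexdigest()
```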

b91juud3 #5

errors='ignore' solved my headache.
Here is how I search for the word "coma" in a directory and its subdirectories:

import os
rootdir=('K:\\0\\000.THU.EEG.nedc_tuh_eeg\\000edf.01_tcp_ar\\01_tcp_ar\\')
for folder, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith('.txt'):
            fullpath = os.path.join(folder, file)
            with open(fullpath, 'r', errors='ignore') as f:
                for line in f:
                    if "coma" in line:
                        print(fullpath)
                        break


z8dt9xmd #6

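I do not think decoding with errors='ignore' is a good idea, even if it works. Since you do not know what else is being ignored, you should look for a way around the problem that does not cut pieces out of the file.
I also ran into this problem when I tried to append HTML as text to a file. You can try what I did: first get the content as bytes, then convert it to a string by decoding it with 'utf-8':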

converted_file = binary_file.decode('utf-8')
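
As a sketch of this bytes-first approach, the hypothetical helper below decodes strictly and, on failure, reports where the bad byte is instead of discarding it. It relies on the start, end and object attributes that UnicodeDecodeError carries; the input byte strings are made-up examples.

```python
def decode_or_report(raw, encoding="utf-8"):
    """Decode bytes strictly; on failure return the offending offset and bytes.

    Returns (text, None) on success, or (None, (offset, bad_bytes)) on failure.
    """
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the undecodable slice of exc.object.
        return None, (exc.start, exc.object[exc.start:exc.end])


# Valid UTF-8 (including a curly quote) decodes cleanly.
ok_text, ok_err = decode_or_report("caf\u00e9 \u201d".encode("utf-8"))

# A lone 0x9d byte is not valid UTF-8, so the position is reported.
bad_text, bad_err = decode_or_report(b"abc\x9ddef")
```

Knowing the offset lets you inspect the surrounding bytes and work out the file's real encoding, rather than guessing or dropping data.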

