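I have two folders, each containing words in various .txt files. One folder is named "good" and the other is named "bad". I want to write a Python script that imports all the data into a dataframe with an "Id" column, a "word" column, and a "label" column. Depending on the folder name, the label will be "good" or "bad".
I have written the Python script below, but I seem to have a problem with file encodings. I installed the 'chardet' library to detect each file's encoding, but I still get this error: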
UnicodeDecodeError: 'cp949' codec can't decode byte 0xb7 in position 1400: illegal multibyte sequence
import os

import chardet
import pandas as pd

good_path = "myfolder/good"
bad_path = "myfolder/bad"
ids = []
words = []
labels = []
for filename in os.listdir(good_path):
    # Read the raw bytes first so chardet can guess the encoding
    with open(os.path.join(good_path, filename), "rb") as f:
        result = chardet.detect(f.read())
    encoding = result["encoding"]
    # Re-open the file as text using the detected encoding
    with open(os.path.join(good_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("good")
for filename in os.listdir(bad_path):
    with open(os.path.join(bad_path, filename), "rb") as f:
        result = chardet.detect(f.read())
    encoding = result["encoding"]
    with open(os.path.join(bad_path, filename), "r", encoding=encoding) as f:
        word_content = f.read()
    ids.append(filename)
    words.append(word_content)
    labels.append("bad")
# Create a dataframe from the lists
df = pd.DataFrame({"Id": ids, "words": words, "label": labels})
2 Answers
Answer #1
You can try setting the encoding to UTF-8 directly when opening the files.
Python 3 fully supports it.
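As a minimal sketch of this suggestion, the snippet below opens a file with an explicit encoding="utf-8" so that Python does not fall back to the platform default codec (cp949 in the error above). The helper name read_text_utf8 is hypothetical, and the sketch assumes the files really are UTF-8 encoded.

```python
def read_text_utf8(path):
    """Read a text file as UTF-8 rather than the platform default codec."""
    # Assumption: the file is UTF-8. Passing encoding explicitly prevents
    # open() from using the locale's preferred encoding (e.g. cp949).
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

If a file is not valid UTF-8, this call still raises UnicodeDecodeError, so it may need to be combined with error handling or with open(..., errors="replace").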
Answer #2
Thanks, everyone. I was able to filter out the text files with (character) encoding errors by using a try-except block to catch the UnicodeDecodeError exception.
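The approach described here might look roughly like the sketch below. The original post does not show the final code, so the function name load_labeled_words, the choice of UTF-8 as the trial encoding, and the returned list of skipped files are all assumptions made for illustration.

```python
import os


def load_labeled_words(folder, label, encoding="utf-8"):
    """Return (rows, skipped): decoded files as dicts, and names that failed to decode."""
    rows, skipped = [], []
    for filename in sorted(os.listdir(folder)):
        path = os.path.join(folder, filename)
        if not os.path.isfile(path):
            continue  # ignore sub-directories
        try:
            with open(path, "r", encoding=encoding) as f:
                content = f.read()
        except UnicodeDecodeError:
            # The file is not valid in the assumed encoding: record it and move on.
            skipped.append(filename)
            continue
        rows.append({"Id": filename, "word": content, "label": label})
    return rows, skipped
```

Calling the function once for "myfolder/good" and once for "myfolder/bad", then passing the concatenated rows to pd.DataFrame, should give the "Id", "word", and "label" columns described in the question. The skipped list shows which files were filtered out.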