LP3THW ex17: Why is there garbled text at the end of a text file when copying files with Python in PowerShell?

Posted by qqrboqgw on 2023-03-08 in Shell


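I am working through Learn Python 3 the Hard Way, and I am on exercise 17.
I typed the code exactly as it appears in the book (including the comments) and then ran the program in PowerShell.
The file size of the text file is 46 bytes.
This is where my output differs from the book's (aside from the weird gunk): the book's output says the file is 21 bytes long.
I created the file with this command (also from the book):
echo "This is a test file." > test.txt
This is a straight copy and paste.
Contents of test.txt (2 lines):
This is a text file.
Contents of test1.txt (2 lines):
This is a text file.
So, counting the line break, there is a bit of extra gunk at the end of the first line of the copied file.
This is the code I used.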

from sys import argv
from os.path import exists

script, from_file, to_file = argv

print(f"Copying from {from_file} to {to_file}")

# we could do these two on one line, how?
in_file = open(from_file)
indata = in_file.read()

print(f"The input file is {len(indata)} bytes long")

print(f"Does the output file exist? {exists(to_file)}")
print("Ready, hit RETURN to continue, CTRL-C to abort.")
input()

out_file = open(to_file, 'w')
out_file.write(indata)

print("Alright, all done.")

out_file.close()
in_file.close()

Here are the PowerShell command and the result.

PS D:\Pythonlearn\lpthw> python ex17cp.py test.txt test1.txt
Copying from test.txt to test1.txt
The input file is 46 bytes long
Does the output file exist? True
Ready, hit RETURN to continue, CTRL-C to abort.

Alright, all done.
PS D:\Pythonlearn\lpthw> cat test1.txt
This is a text file.਍ഀ
PS D:\Pythonlearn\lpthw> cat test.txt
This is a text file.

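The book assumes Python 3.6. I am using 3.9.13, hoping I can work through any problems I run into. However, I could not find anything online about this issue that I understood. I cannot even tell whether what I am looking at is related to this problem, no matter which keywords I use.
I would like four answers.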

1: Is this a Python or a PowerShell problem?

2: How can I fix the code so it doesn't do this.

3: Why does that fix the problem.

4: What caused the problem in the first place?

powershell

Source: https://stackoverflow.com/questions/75497063/lp3thw-ex17-why-is-there-garbled-text-at-the-end-of-a-text-file-when-copying-fil

1 Answer


jq6vz3qz1#

In Windows PowerShell,
"This is a test file." > test.txt
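creates an output file that uses "Unicode" (UTF-16LE) encoding, because the > operator is in effect an alias of the Out-File cmdlet. (Note that, more sensibly, PowerShell (Core) now defaults to BOM-less UTF-8 across all cmdlets.)
Few applications and non-PowerShell APIs recognize this encoding by default, and Python is no exception: by default it expects "ANSI" encoding, that is, the encoding specified by the system's active ANSI legacy code page (which itself deviates from what console applications are expected to do, namely use the system's active OEM legacy code page).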
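To see which encoding Python falls back to on a given machine, a quick check along these lines may help. It assumes Python 3.9, where open() without an encoding argument uses the value returned by locale.getpreferredencoding(False). The printed name depends on the system; cp1252 is a common value on Western-European Windows installations.

```python
import locale

# The encoding open() uses in text mode when no encoding= argument is given
# (Python 3.9 behaviour; the value is system-dependent).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```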
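As a result, Python misinterprets the contents of test.txt and treats each byte as a character of its own (whereas in UTF-16LE a single character is encoded by (at least) two bytes). This is also why the script reports 46: the file holds a 2-byte BOM, 20 characters at 2 bytes each, and a 4-byte CRLF.
While Python mostly preserves the input bytes as-is and therefore also passes them through on writing, it applies special newline handling, which is the source of the problem and in effect results in a corrupted output file:

When Python encounters what it thinks is a stand-alone CR or LF character, it translates it to a Windows-appropriate CRLF newline sequence (in text mode, each one is read as LF, and each LF is written as CRLF).
The NUL bytes that result from misinterpreting the UTF-16LE-encoded file prevent Python from recognizing the 0D 00 0A 00 byte sequence from the input file as a CRLF sequence, so it writes the byte sequence 0D 0A 00 0D 0A 00 instead (0D and 0A are each translated to the ANSI 0D 0A byte sequence), which is what corrupts the file.
When PowerShell, which does recognize UTF-16LE based on the file's BOM (Unicode signature), tries to interpret the resulting file, the unexpected byte sequences that the newline "fix" introduced are decoded as arbitrary Unicode characters.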
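The corruption described above can be reproduced in memory, without PowerShell or Windows. The following is only a sketch under stated assumptions: cp1252 stands in for the system's "ANSI" code page, and newline="\r\n" is passed on the write side to imitate what text-mode writing does on Windows, so that the result is the same on any platform.

```python
import io

# Bytes that Windows PowerShell's `>` writes for "This is a text file.":
# a UTF-16LE BOM, the text, then CRLF, all as 2-byte code units.
original = b"\xff\xfe" + "This is a text file.\r\n".encode("utf-16-le")

# Read the way open(from_file) does on Windows: a single-byte "ANSI" code page
# (cp1252 is an assumption) with universal-newline translation (newline=None).
reader = io.TextIOWrapper(io.BytesIO(original), encoding="cp1252", newline=None)
indata = reader.read()

# Write the way open(to_file, 'w') does on Windows: every "\n" becomes "\r\n".
# newline="\r\n" forces that translation so the sketch is platform-independent.
sink = io.BytesIO()
writer = io.TextIOWrapper(sink, encoding="cp1252", newline="\r\n")
writer.write(indata)
writer.flush()
copied = sink.getvalue()

print(len(original), len(indata))     # 46 46
print(original[-4:].hex(" "))         # 0d 00 0a 00
print(copied[-6:].hex(" "))           # 0d 0a 00 0d 0a 00
print(repr(copied.decode("utf-16")))  # text, then U+0A0D and U+0D00, then "\n"
```

Decoding the copied bytes as UTF-16 pairs 0D 0A into U+0A0D and 00 0D into U+0D00, which are the two stray characters that cat shows after "This is a text file." in the question.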
Solutions:
Either: create test.txt with "ANSI" encoding on the PowerShell side.
In Windows PowerShell, simply use Set-Content, which uses that encoding by default:

"This is a test file." | Set-Content test.txt

Unfortunately, in PowerShell (Core) 7+ the solution is more involved:

"This is a test file." |
  Set-Content -Encoding ([cultureinfo]::CurrentUICulture.TextInfo.AnsiCodePage) test.txt

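This cumbersome way of requesting "ANSI" encoding should not be necessary; that is the subject of GitHub issue #6562.
Or: make Python use UTF-16LE encoding explicitly.
Add the argument encoding='utf-16le' to the open() calls.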
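As a minimal sketch of the Python-side fix, the copy step from ex17 could look like the following. The helper name copy_text and the temporary file setup are hypothetical, added only so the example runs on its own; the essential change is the encoding='utf-16le' argument on both open() calls.

```python
import os
import tempfile

def copy_text(from_file, to_file, encoding="utf-16le"):
    """Copy a text file, decoding and re-encoding with an explicit encoding."""
    with open(from_file, encoding=encoding) as in_file:
        indata = in_file.read()
    with open(to_file, "w", encoding=encoding) as out_file:
        out_file.write(indata)
    return indata

# Stand-in for the file that `>` produced: BOM + UTF-16LE text + CRLF.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "test.txt")
dst = os.path.join(workdir, "test1.txt")
with open(src, "wb") as f:
    f.write(b"\xff\xfe" + "This is a text file.\r\n".encode("utf-16-le"))

indata = copy_text(src, dst)

# Read the copy back as UTF-16 (the BOM selects the byte order).
with open(dst, encoding="utf-16") as f:
    copied_text = f.read()

print(repr(indata))
print(repr(copied_text))
```

With 'utf-16le' the BOM is not stripped on reading; it arrives as a leading U+FEFF character and is written back unchanged, so len(indata) is 22 here rather than the book's 21.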

Answered 2023-03-08
