LP3THW ex17 Why is there garbled text at the end of a text file when copying files with python in powershell?


Question


I am following Learn Python 3 the hard way, and I am on example 17.

I typed the code in the book exactly (comments included) and then ran the program in PowerShell.

The reported size of the text file is 46 bytes.

This is where my output differs from the book (besides the weird gunk): the book's output says the file is 21 bytes long.

I created the file with this command (also from the book.)

echo "This is a test file." > test.txt

This is a direct copy paste.

Contents of test.txt (is 2 lines):

This is a text file.

Contents of test1.txt (is 2 lines):

This is a text file.਍ഀ

So including the returns there's a bit extra gunk at the end of the first line of the copied file.

Here's the code I used.

from sys import argv
from os.path import exists

script, from_file, to_file = argv

print(f"Copying from {from_file} to {to_file}")

# we could do these two on one line, how?
in_file = open(from_file)
indata = in_file.read()

print(f"The input file is {len(indata)} bytes long")

print(f"Does the output file exist? {exists(to_file)}")
print("Ready, hit RETURN to continue, CTRL-C to abort.")
input()

out_file = open(to_file, 'w')
out_file.write(indata)

print("Alright, all done.")

out_file.close()
in_file.close()

And here are the PowerShell commands and the result.

PS D:\Pythonlearn\lpthw> python ex17cp.py test.txt test1.txt
Copying from test.txt to test1.txt
The input file is 46 bytes long
Does the output file exist? True
Ready, hit RETURN to continue, CTRL-C to abort.

Alright, all done.
PS D:\Pythonlearn\lpthw> cat test1.txt
This is a text file.਍ഀ
PS D:\Pythonlearn\lpthw> cat test.txt
This is a text file.

The book assumes Python 3.6. I'm using 3.9.13 and hoped I'd be able to resolve any problems I came across. However, I can't find anything I understand about this problem online. I can't even tell whether what I'm looking at is related to this problem, no matter what keywords I use.

I'd like four answers please.

1: Is this a Python or a PowerShell problem?

2: How can I fix the code so it doesn't do this?

3: Why does that fix the problem?

4: What caused the problem in the first place?

Answer 1

Score: 0


In Windows PowerShell,

"This is a test file." > test.txt

produces an output file that uses "Unicode" (UTF-16LE) encoding, because the > operator is, in effect, an alias of the Out-File cmdlet. (Note that, more sensibly, PowerShell (Core) now defaults to BOM-less UTF-8 encoding across all cmdlets.)

Few applications and non-PowerShell APIs recognize this encoding by default, and Python is no exception: it expects "ANSI" encoding by default, i.e. the encoding specified by the system's active ANSI legacy code page (which is in itself a deviation from what console applications are expected to do, namely to use the system's active OEM legacy code page).
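To check which default a given machine applies, a small sketch like the following may help. It assumes Python 3.9 as in the question, where open() without an encoding argument falls back to locale.getpreferredencoding(False); the helper name default_text_encoding is made up for illustration.

```python
import locale


def default_text_encoding():
    """Return the encoding open() uses in text mode when none is given."""
    # Hypothetical helper, not part of the book's script. On a Western
    # European Windows setup this typically reports 'cp1252'; Linux and
    # macOS usually report 'UTF-8'.
    return locale.getpreferredencoding(False)


print(default_text_encoding())
```

If this prints a code page such as cp1252, a UTF-16LE file will be misread in the way described here.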

Therefore, Python misinterprets the content of file test.txt and treats each byte as its own character (whereas in UTF-16LE a single character is encoded by (at least) two bytes).
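The misreading can be illustrated without touching any files. This is only a sketch: it assumes the file holds a UTF-16LE BOM, the test line, and a CRLF newline, and it uses cp1252 as a stand-in for whatever the active "ANSI" code page is.

```python
# The 46 bytes that Windows PowerShell's `>` is assumed to have written:
# a UTF-16LE BOM (FF FE), the text, and CRLF, two bytes per character.
raw = b"\xff\xfe" + "This is a test file.\r\n".encode("utf-16-le")

as_ansi = raw.decode("cp1252")   # one character per byte, NULs included
as_utf16 = raw.decode("utf-16")  # honors the BOM and pairs the bytes up

print(len(raw), len(as_ansi), len(as_utf16))
print(repr(as_ansi[:6]))
```

Decoded as cp1252, the data is 46 characters long, which matches the length the script reported; decoded as UTF-16, it is the 20-character sentence plus CR and LF.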

While Python mostly preserves the input bytes as-is and therefore also passes them through on writing, it applies special newline handling, which is the root of the problem and effectively results in a corrupted output file:

  • On encountering what Python thinks is a stand-alone CR or LF character, it translates it to a Windows-appropriate CRLF newline sequence.

  • The NUL bytes that result from misinterpreting the UTF-16LE-encoded file prevent Python from recognizing the 00 0D 00 0A byte sequence from the input file as a CRLF character sequence. It therefore transforms it into the byte sequence 00 0D 0A 00 0D 0A (0D and 0A were each translated to the ANSI 0D 0A byte sequence), which is what corrupted the file:

    • When PowerShell, which is UTF-16LE-aware based on an input file's BOM (Unicode signature), tries to interpret the resulting file, the accidental UTF-16LE byte sequences that the newline "fix" introduced into the rewritten file turn into arbitrary Unicode characters.
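The corruption described above can be reproduced on any platform with a sketch along these lines. The function name simulate_copy is hypothetical; cp1252 stands in for the "ANSI" code page, and newline="\r\n" on the output file imitates what text mode does by default on Windows.

```python
import os
import tempfile

# Assumed content of test.txt: UTF-16LE BOM, the text, and a CRLF newline.
original = b"\xff\xfe" + "This is a test file.\r\n".encode("utf-16-le")


def simulate_copy(raw):
    """Copy raw through text mode the way the book's script does on Windows."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "test.txt")
        dst = os.path.join(tmp, "test1.txt")
        with open(src, "wb") as f:
            f.write(raw)
        # Reading: every byte becomes one character, and the lone CR and
        # lone LF (separated by NULs) are each turned into "\n".
        with open(src, encoding="cp1252") as f:
            indata = f.read()
        # Writing: each "\n" is expanded to CRLF, as on Windows.
        with open(dst, "w", encoding="cp1252", newline="\r\n") as f:
            f.write(indata)
        with open(dst, "rb") as f:
            return len(indata), f.read()


length, copied = simulate_copy(original)
print(length)               # number of characters the script would report
print(copied[-6:].hex(" "))  # tail of the rewritten file
print(repr(copied.decode("utf-16")))
```

Decoding the rewritten bytes as UTF-16 yields the sentence followed by U+0A0D and U+0D00, which are the two stray characters shown in the question's output.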

Solutions:

  • Either: Create test.txt on the PowerShell side with "ANSI" encoding:

    • In Windows PowerShell, simply use Set-Content, which uses that encoding by default:

      "This is a test file." | Set-Content test.txt
      
    • In PowerShell (Core) 7+, the solution is more complicated, unfortunately:

      "This is a test file." |
        Set-Content -Encoding ([cultureinfo]::CurrentUICulture.TextInfo.AnsiCodePage) test.txt
      
      • This convoluted way of requesting "ANSI" encoding shouldn't be necessary, which is the subject of GitHub issue #6562.
  • Or: Make Python explicitly use UTF-16LE encoding:

    • Add the argument encoding='utf-16le' to the open() calls.
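For the Python-side fix, a minimal sketch might look like this. The function name copy_text is made up for illustration, and passing newline="" is an extra assumption beyond the advice above: it switches off newline translation so the copy stays byte-for-byte identical on every platform.

```python
import os
import tempfile


def copy_text(from_file, to_file, encoding="utf-16-le"):
    """Copy a text file, decoding and encoding with an explicit encoding."""
    # newline="" passes CR and LF through untouched on both sides.
    with open(from_file, encoding=encoding, newline="") as in_file:
        indata = in_file.read()
    with open(to_file, "w", encoding=encoding, newline="") as out_file:
        out_file.write(indata)
    return len(indata)


# Demo with the bytes Windows PowerShell's `>` is assumed to produce.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.txt")
    dst = os.path.join(tmp, "test1.txt")
    raw = b"\xff\xfe" + "This is a test file.\r\n".encode("utf-16-le")
    with open(src, "wb") as f:
        f.write(raw)
    n_chars = copy_text(src, dst)
    with open(dst, "rb") as f:
        copied = f.read()

print(n_chars, copied == raw)
```

With 'utf-16-le' the BOM is read as an ordinary U+FEFF character and written back unchanged, so it is included in the character count. Using encoding='utf-16' instead would strip the BOM on reading and emit a fresh one on writing.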

  • Posted by huangapple on February 19, 2023, 07:34:43
  • When reposting, please keep this link: https://go.coder-hub.com/75497063.html