February 19, 2023, 07:34:43


LP3THW ex17 Why is there garbled text at the end of a text file when copying files with python in powershell?

Question


I am following Learn Python 3 the Hard Way, and I am on exercise 17.

I typed the code from the book exactly (comments included) and then ran the program in PowerShell.

The file size of the text file is 46 bytes.

This is where my output differs from the book: besides the weird gunk, the book's output says the file is 21 bytes long.

I created the file with this command (also from the book):

echo "This is a test file." > test.txt

This is a direct copy paste.

Contents of test.txt (is 2 lines):

This is a text file.

Contents of test1.txt (is 2 lines):

This is a text file.਍ഀ

So, including the line breaks, there's a bit of extra gunk at the end of the first line of the copied file.

Here's the code I used.

from sys import argv
from os.path import exists

script, from_file, to_file = argv

print(f"Copying from {from_file} to {to_file}")

# we could do these two on one line, how?
in_file = open(from_file)
indata = in_file.read()

print(f"The input file is {len(indata)} bytes long")

print(f"Does the output file exist? {exists(to_file)}")
print("Ready, hit RETURN to continue, CTRL-C to abort.")
input()

out_file = open(to_file, 'w')
out_file.write(indata)

print("Alright, all done.")

out_file.close()
in_file.close()

And here are the PowerShell commands and the result.

PS D:\Pythonlearn\lpthw> python ex17cp.py test.txt test1.txt
Copying from test.txt to test1.txt
The input file is 46 bytes long
Does the output file exist? True
Ready, hit RETURN to continue, CTRL-C to abort.

Alright, all done.
PS D:\Pythonlearn\lpthw> cat test1.txt
This is a text file.਍ഀ
PS D:\Pythonlearn\lpthw> cat test.txt
This is a text file.

The book assumes Python 3.6. I'm using 3.9.13 and hope I'll be able to resolve any problems I come across. However, I can't find anything I understand about this problem online. No matter what keywords I use, I can't even tell whether what I'm looking at is related to this problem.

I'd like four answers please.

1: Is this a Python or a PowerShell problem?

2: How can I fix the code so it doesn't do this?

3: Why does that fix the problem?

4: What caused the problem in the first place?

Answer 1

Score: 0


In Windows PowerShell,

"This is a test file." > test.txt

produces an output file that uses "Unicode" (UTF-16LE) encoding, because the > operator is, in effect, an alias of the Out-File cmdlet. (Note that - more sensibly - PowerShell (Core) now defaults to BOM-less UTF-8 encoding, across all cmdlets.)

Few applications and non-PowerShell APIs recognize this encoding by default, and Python is no exception: it expects "ANSI" encoding by default, i.e. the encoding specified by the system's active ANSI legacy code page (which is in itself a deviation from what console applications are expected to do, namely to use the system's active OEM legacy code page).

Therefore, Python misinterprets the content of test.txt and treats each byte as its own character (whereas in UTF-16LE a single character is encoded by at least two bytes).
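To see which situation a given file is in, one can look at the default encoding Python reports and at the file's leading bytes. The following is a minimal sketch, not part of the original answer: `sniff_bom` is a hypothetical helper that only knows the UTF-8 and UTF-16 BOMs, and cp1252 is assumed to be the active ANSI code page (yours may differ).

```python
import codecs
import locale


def sniff_bom(raw: bytes) -> str:
    """Return a codec name suggested by a leading BOM, or '' if none is found.

    Hypothetical helper: it only checks for UTF-8 and UTF-16 BOMs.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return ""


# The encoding open() falls back to when no encoding= argument is given.
print(locale.getpreferredencoding(False))

# The bytes Windows PowerShell's > operator would produce for the test line.
sample = codecs.BOM_UTF16_LE + "This is a test file.\r\n".encode("utf-16-le")
print(sniff_bom(sample))

# One character per byte under cp1252 (assumed ANSI code page) versus
# one character per two bytes, BOM consumed, under UTF-16.
print(len(sample.decode("cp1252")), len(sample.decode("utf-16")))  # 46 22
```

The 46 matches the length the script reported in the question; the 22 still includes the CR and the LF, which Python's text mode would fold into a single newline on reading.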

While Python mostly preserves the input bytes as-is and therefore also passes them through on writing, it applies special newline handling, which is the root of the problem and effectively results in a corrupted output file:

- On encountering what Python thinks is a stand-alone CR or LF character, it translates it to a Windows-appropriate CRLF newline sequence.
- The NUL bytes that result from misinterpreting the UTF-16LE-encoded file prevent Python from recognizing the 00 0D 00 0A byte sequence from the input file as a CRLF character sequence, so it transforms it into the byte sequence 00 0D 0A 00 0D 0A (0D and 0A were each translated to an ANSI 0D 0A byte sequence), which is what caused the file corruption:
  - When PowerShell, which is UTF-16LE-aware based on an input file's BOM (Unicode signature), interprets the rewritten file, the accidental UTF-16LE byte sequences produced by the newline "fix" turn into arbitrary Unicode characters.
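The byte-level effect described above can be reproduced in memory on any platform. This is a sketch under stated assumptions rather than the exact code path Python takes on Windows: `copy_like_ex17` is a hypothetical helper, cp1252 stands in for the active ANSI code page, and `newline="\r\n"` on the writer emulates the Windows default of writing each newline as CRLF.

```python
import io


def copy_like_ex17(raw: bytes, encoding: str = "cp1252") -> bytes:
    """Emulate the script's text-mode read and write for the given raw bytes."""
    # Reading: newline=None is universal-newline mode, so CRLF, a lone CR
    # and a lone LF all arrive in the program as "\n".
    reader = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline=None)
    text = reader.read()

    # Writing: every "\n" is emitted as CRLF (assumed Windows behavior).
    sink = io.BytesIO()
    writer = io.TextIOWrapper(sink, encoding=encoding, newline="\r\n")
    writer.write(text)
    writer.flush()
    return sink.getvalue()


# What Windows PowerShell's > operator writes: BOM plus UTF-16LE text.
original = b"\xff\xfe" + "This is a test file.\r\n".encode("utf-16-le")
copied = copy_like_ex17(original)

print(len(original), original[-6:].hex(" "))  # 46 2e 00 0d 00 0a 00
print(len(copied), copied[-8:].hex(" "))      # 48 2e 00 0d 0a 00 0d 0a 00

# Decoded as UTF-16 again, the two extra bytes shift the pairing and yield
# U+0A0D and U+0D00 in front of the final newline.
print(ascii(copied.decode("utf-16")))
```

The two stray code points are the same characters that show up after "file." in the copied file in the question.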

Solutions:

Either: Create test.txt on the PowerShell side with "ANSI" encoding:
- In Windows PowerShell, simply use Set-Content, which uses that encoding by default:
```
"This is a test file." | Set-Content test.txt
```
- In PowerShell (Core) 7+, the solution is more complicated, unfortunately:
```
"This is a test file." |
  Set-Content -Encoding ([cultureinfo]::CurrentUICulture.TextInfo.AnsiCodePage) test.txt
```
  - This convoluted way of requesting "ANSI" encoding shouldn't be necessary, which is the subject of GitHub issue #6562.
Or: Make Python explicitly use UTF-16LE encoding:
- Add the argument encoding='utf-16le' to both open() calls.
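As a rough illustration of the Python-side fix, the copy step could look like the sketch below. `copy_text` is a hypothetical function name, not something from the book or the answer, and "utf-16-le" is Python's canonical spelling of the codec named above.

```python
def copy_text(from_file: str, to_file: str, encoding: str = "utf-16-le") -> int:
    """Copy a text file, decoding and re-encoding it with an explicit encoding.

    Returns the number of characters read.
    """
    # With the correct encoding, CRLF is decoded as a real line break
    # instead of as stray bytes separated by NULs.
    with open(from_file, encoding=encoding) as in_file:
        indata = in_file.read()

    # Use the same encoding on the way out so the copy stays UTF-16LE.
    with open(to_file, "w", encoding=encoding) as out_file:
        out_file.write(indata)

    return len(indata)
```

It would be called as `copy_text("test.txt", "test1.txt")`. With "utf-16-le" the BOM is read as a leading U+FEFF character and written back unchanged, so it counts toward the returned length; passing "utf-16" instead lets Python consume the BOM on reading and emit a fresh one on writing.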
