March 4, 2023 04:03:24

Python str vs unicode on Windows, Python 2.7, why does 'á' become '\xa0'

Question

Background

I'm using a Windows machine. I know Python 2.* is not supported anymore, but I'm still learning Python 2.7.16. I also have Python 3.7.1. I know that in Python 3.* "unicode was renamed to str".

I use Git Bash as my main shell.

I read this question. I feel like I understand the difference between Unicode (code points) and encodings (different encoding systems; bytes).

Question

When I evaluate 'á', I expect to get '\xc3\xa1', as shown in this answer.
When I evaluate len('á'), I expect to get 2, as shown in this answer.

But I don't get the expected results when running C:\Python27\python.exe from Git Bash:

Python 2.7.16 (v2.7.16:413a49145e, Mar  4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32

>>> 'á'
'\xa0'
# '\xc3\xa1' expected

>>> len('á')
1
# 2 expected

# one more for reference:
>>> 'à'
'\x85'
# '\xc3\xa0' expected

Can you help me understand why I get the output shown above?

Specifically, why does 'á' become '\xa0'?

What I tried

I can use a unicode object to get the results I expect:

>>> u'á'.encode('utf-8')
'\xc3\xa1'
>>> len(u'á'.encode('utf-8'))
2

I can open IDLE and I get different results -- not expected results, but at least I understand these results.

Python 2.7.16 (v2.7.16:413a49145e, Mar  4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xe1'
>>> len('á')
1
>>> 'à'
'\xe0'

The IDLE results are unexpected, but I still understand them; Martijn Pieters explains why 'á' becomes '\xe1' in the Latin 1 encoding.

So why does IDLE give different results from running my Python 2.7.16 executable directly in Git Bash? In other words, if IDLE is using Latin 1 to encode my input, what encoding is used by my Python 2.7.16 executable in Git Bash, such that 'á' becomes '\xa0'?

What I'm wondering

Is my default encoding the problem? I'm too scared to change the default encoding.

>>> import sys; sys.getdefaultencoding()
'ascii'

I suspect the problem is my terminal's encoding (I use Git Bash). Should I try changing the PYTHONIOENCODING environment variable?

I checked the Git Bash locale; the result is:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

Also, I'm using interactive Python. Should I try a file instead, using this?

# -*- coding: utf-8 -*- (this sets the source file's encoding, not the output encoding)

I know upgrading to Python 3 is a solution, but I'm still curious about why my Python 2.7.16 behaves differently.

Answer 1

Score: 0

Thanks @dan04, @MarkTolonen and @ (see the comments to the question above). As @MarkTolonen says:

> command prompt uses the default OEM code page (cp437 for US Windows ....)

This seems clear from checking code page 437 for the values I'm trying to encode:

>>> 'á' #-> '\xa0' expected in code page 437
>>> 'à' #-> '\x85' expected in code page 437

I highlight those values in the screenshot below.
[Screenshot: code page 437 table with the values for á and à highlighted]
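
The table lookup can also be reproduced in code. This is a minimal sketch written for Python 3 (where str is Unicode text and .encode() returns bytes), not the Python 2.7 session from the question; the helper name encoded_bytes is made up for illustration.

```python
# Python 3 sketch: encode the same characters with the three encodings
# discussed in this thread and compare the resulting bytes.

def encoded_bytes(text, encodings=("cp437", "latin-1", "utf-8")):
    """Return {encoding name: bytes} for the given text (hypothetical helper)."""
    return {name: text.encode(name) for name in encodings}

for char in ("á", "à"):
    for name, raw in encoded_bytes(char).items():
        # ascii() keeps the output printable on any terminal encoding.
        print(ascii(char), name, raw, len(raw))
```

Under these assumptions, the cp437 bytes match what Git Bash showed, the latin-1 bytes match what IDLE showed, and the utf-8 bytes match the two-byte result the question expected.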

I used @MarkTolonen's suggestion of running the chcp command to get or set the encoding used by my shell/terminal. chcp is short for "change code page". If you're using Git Bash, use chcp.com instead. Sure enough, when I run chcp, the output is Active code page: 437:

[Screenshot: chcp output showing Active code page: 437]

Then I tried @juanpa.arrivillaga's suggestion of using a file. First I tried a file that explicitly used the 437 code page.

I added the "magic comment" to specify code page 437: # -*- coding: cp437 -*-. That alone does not change how the file is stored; the magic comment only tells Python how to decode the file.
I also had to change the encoding of the file itself (tell my editor, VS Code, to save it as CP437).
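
The two steps can be imitated in memory. This is a rough sketch in Python 3, assuming the standard-library tokenize.detect_encoding function as a stand-in for how the interpreter reads the magic comment; the names SOURCE, on_disk, declared and decoded are only for illustration.

```python
import io
import tokenize

# Python 3 sketch. The source text carries a cp437 magic comment.
SOURCE = "# -*- coding: cp437 -*-\ns = 'á'\n"

# Step 1: what the editor does when saving the file as CP437.
on_disk = SOURCE.encode("cp437")

# Step 2: what Python does when reading the file: find the magic comment,
# then decode the bytes with the encoding it names.
declared, _ = tokenize.detect_encoding(io.BytesIO(on_disk).readline)
decoded = on_disk.decode(declared)

print(declared)
print(decoded == SOURCE)
```

The round trip only succeeds because step 1 and step 2 agree on the encoding.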

Once I do both of those things with a Python file (encode and decode with CP437), I get the same "unexpected" results as in my original question, which confirms that CP437 is indeed the encoding used by my terminal/shell.

[Screenshot]

In general you must both encode and include the "decode magic comment", and make sure your shell uses the same encoding!

If I include the cp437 "magic comment" but don't save the file as CP437 (VS Code's default encoding is UTF-8), the length of 'á' is 2, as in UTF-8! (Note that the results are printed in my CP437 shell, so they look strange; I see the character ├, which is \xc3 in CP437!)
If I save the file as CP437 but don't include the magic comment, I get an error: SyntaxError: Non-ASCII character '\xa0' in file 437_encoding.py on line 4

[Screenshot]
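
Both mismatches can be reproduced on raw bytes. This is a small Python 3 sketch under the assumption that decoding by hand is a fair stand-in for what the Python 2 source reader does; the variable names are illustrative only.

```python
# Python 3 sketch of the two mismatches described above.

# Case 1: file saved as UTF-8, but decoded as cp437.
utf8_bytes = "á".encode("utf-8")       # two bytes: 0xC3 0xA1
misread = utf8_bytes.decode("cp437")   # each byte becomes one cp437 character
print(len(misread), ascii(misread))

# Case 2: file saved as cp437, with no magic comment. Python 2 assumes ASCII
# source, and 0xA0 is not an ASCII byte. Python 2 reports this as a
# SyntaxError; decoding by hand raises UnicodeDecodeError instead.
cp437_bytes = "á".encode("cp437")      # one byte: 0xA0
try:
    cp437_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)
```

In case 1 the first misread character is ├, the same one that showed up in my shell.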

If I encode in utf-8, and I include the "magic comment" for utf-8, and I change my shell to use utf-8 (chcp.com 65001), then I get the results I expect!

[Screenshot]

Finally, if I follow @MarkTolonen's suggestion and check sys.stdout.encoding, it reports 'cp437'!

Please note that sys.stdout.encoding (which for me had the value cp437)...
is not the same as sys.getdefaultencoding() (which for me had the value ascii).

[Screenshot]
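
To see the two settings side by side, something like the following sketch could be run; report_encodings is a hypothetical helper name, and the values it prints depend on the interpreter and terminal, so none are shown here.

```python
import sys

def report_encodings():
    """Collect the two settings side by side (hypothetical helper)."""
    return {
        # Encoding of the stream that print writes to; may be None if
        # stdout is redirected or replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Default used by implicit str/unicode conversions in Python 2;
        # always 'utf-8' in Python 3.
        "default": sys.getdefaultencoding(),
    }

for name, value in sorted(report_encodings().items()):
    print(name, value)
```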

And if I check sys.stdout.encoding after using chcp.com to change the code page to UTF-8 (value 65001), I get the error LookupError: unknown encoding: cp65001, which is described in more detail here.

[Screenshot]
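
A LookupError like this means the interpreter has no codec registered under that name. One way to check ahead of time is sketched below with the standard-library codecs.lookup function; codec_available is a hypothetical helper, and the answer for cp65001 depends on the Python version, so no output is shown.

```python
import codecs

def codec_available(name):
    """Return True if this interpreter has a codec registered under name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

for name in ("cp437", "utf-8", "cp65001"):
    print(name, codec_available(name))
```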
