英文:
Python str vs unicode on Windows, Python 2.7, why does 'á' become '\xa0'
问题
背景
我正在使用Windows计算机。我知道Python 2.不再受支持,但我仍然在学习Python 2.7.16。我还安装了Python 3.7.1。我知道在Python 3. 中"unicode
已更名为str
"。
我使用Git Bash作为我的主要Shell。
我阅读了这个问题。我感觉我理解了Unicode(代码点)和编码(不同的编码系统;字节)之间的区别。
问题
- 当我评估
'á'
时,我希望得到'\xc3\xa1'
就像这个答案中显示的。 - 当我评估
len('á')
时,我希望得到2
,就像这个答案中显示的。
但我没有得到预期的结果。
当在Git Bash中运行C:\Python27\python.exe时...:
Python 2.7.16 (v2.7.16:413a49145e, Mar 4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xa0'
# 预期结果为 ''\xc3\xa1''
>>> len('á')
1
# 预期结果为 2
# 为了参考再多一个:
>>> 'à'
''\x85'
# 预期结果为 ''\xc3\xa0''
你能帮我理解为什么我得到了上面显示的输出吗?
具体而言,为什么'á'
变成了'\xa0'
?
我尝试过的事情
我可以使用unicode
对象来获得我期望的结果:
>>> u'á'.encode('utf-8')
'\xc3\xa1'
>>> len(u'á'.encode('utf-8'))
2
我可以打开IDLE,我得到不同的结果--不是预期的结果,但至少我理解这些结果。
Python 2.7.16 (v2.7.16:413a49145e, Mar 4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xe1'
>>> len('á')
1
>>> 'à'
'\xe0'
IDLE的结果令人意外,但我仍然理解这些结果; Martijn Peters解释了为什么'á'
在Latin 1编码中变成了'\xe1'
。
那么,为什么IDLE给出了与直接运行我的Git Bash Python 2.7.1可执行文件不同的结果?换句话说,如果IDLE使用Latin 1来编码我的输入,那么我的Git Bash Python 2.7.1可执行文件使用什么编码,以致'á'
变成了'\xa0'
?
我在想什么
是我的默认编码有问题吗?我太害怕更改默认编码。
>>> import sys; sys.getdefaultencoding()
'ascii'
我感觉是我的终端的编码有问题吗?(我使用git bash)我应该尝试更改PYTHONIOENCODING
环境变量吗?
我尝试检查git bash的locale
,结果是:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
此外,我正在使用交互式Python,我是否应该尝试使用文件呢?
# -*- coding: utf-8 -*- 设置源文件的编码,而不是输出编码。
我知道升级到Python 3是一个解决方案,但我仍然好奇为什么我的Python 2.7.16行为不同。
英文:
Background
I'm using a Windows machine. I know Python 2.* is not supported anymore, but I'm still learning Python 2.7.16. I also have Python 3.7.1. I know in Python 3.* "unicode
was renamed to str
"
I use Git Bash as my main shell.
I read this question. I feel like I understand the difference between Unicode (code points) and encodings (different encoding systems; bytes).
Question
- When I evaluate
'á'
, I expect to get'\xc3\xa1'
as shown in this answer - When I evaluate
len('á')
, I expect to get2
, as shown in this answer
But I don't get expected results.
When running git bash C:\Python27\python.exe...:
Python 2.7.16 (v2.7.16:413a49145e, Mar 4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xa0'
#'\xc3\xa1' expected
>>> len('á')
1
#2 expected
# one more for reference:
>>> 'à'
'\x85'
#'\xc3\xa0' expected
Can you help me understand why I get the output shown above?
Specifically why does 'á'
become '\xa0'
?
What I tried
I can use unicode
object to get the results I expect:
>>> u'á'.encode('utf-8')
'\xc3\xa1'
>>> len(u'á'.encode('utf-8'))
2
I can open IDLE and I get different results -- not expected results, but at least I understand these results.
Python 2.7.16 (v2.7.16:413a49145e, Mar 4 2019, 01:37:19) [MSC v.1500 64 bit (AMD64)] on win32
>>> 'á'
'\xe1'
>>> len('á')
1
>>> 'à'
'\xe0'
The IDLE results are unexpected but I still understand the results; Martijn Peters explains why 'á'
become '\xe1'
in the Latin 1 encoding.
So why does IDLE give different results from running my Git Bash Python 2.7.1 executable directly? In other words, if IDLE is using Latin 1 to encoding for my input, what encoding is used by my Git Bash Python 2.7.1. executable, such that 'á'
becomes '\xa0'
What I'm wondering
Is my default encoding the problem? I'm too scared to change the default encoding.
>>> import sys; sys.getdefaultencoding()
'ascii'
I feel like it's my terminal's encoding that's the problem? (I use git bash) Should I try to change the PYTHONIOENCODING
environment variable?
I try to check the git bash locale
, the result is:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
Also I'm using interactive Python , should I try a file instead, using this?
# -*- coding: utf-8 -*- sets the source file's encoding, not the output encoding.
I know upgrading to Python 3 is a solution., but I'm still curious about why my Python 2.7.16 behaves differently.
答案1
得分: 0
以下是你要翻译的内容:
"Thanks @dan04, @MarkTolonen and @ (see the comments to the question above). As @MarkTolonen says:
> command prompt uses the default OEM code page (cp437 for US Windows ....)"
这似乎是来自 code page 437 的值,我尝试编码:
>>> 'á' #-> '\xa0' 在 code page 437 中预期的值
>>> 'à' #-> '\x85' 在 code page 437 中预期的值
我使用了 @MarkTolonen 的建议运行 chcp 命令 来获取或设置我的 shell/终端使用的编码。chcp
是 "change code page" 的缩写。如果你使用的是 Git Bash,请使用 chcp.com
。果然,当我运行 chcp
时,输出是 Active code page: 437
:
然后,我尝试了 @juanpa.arrivillaga 的建议,使用一个文件。首先,我尝试了一个明确使用 437 编码的文件。
- 我添加了 “magic comment” 以指定编码 437:
# -*- coding: cp437 -*-
,但这还不足以编码文件。"Magic comment" 解释给 Python 如何 解码 文件。 - 我还必须更改文件的编码(告诉我的编辑器 VS Code 如何 编码 成 CP437)。
一旦我对一个 Python 文件 做了这两件事(使用 CP437 编码和解码),我得到与我的初始问题相同的 "意外" 结果,这证实了 CP437 确实是我的 终端/Shell 使用的编码。
通常情况下,你必须同时编码和包含"decode magic comment",并确保你的终端使用相同的编码!
- 如果我包含 CP437 "magic comment" 而不使用 CP437 编码(VS Code 的默认编码是 UTF-8),
'á'
的长度为 2,就像 UTF-8 一样!(请注意,结果在我的 CP437 终端中打印出来,所以看起来很奇怪;我看到的字符是├
,在 CP437 中是\xc3
!) - 如果我使用 CP437 编码但不包含魔法注释,我会得到一个错误:
(SyntaxError: Non-ASCII character '\xa0' in file 437_encoding.py on line 4)
。
如果我使用 UTF-8 编码,并包含 UTF-8 "magic comment",并将我的终端更改为使用 UTF-8(chcp.com 65001
),那么我会得到我期望的结果!
最后,如果我尝试 @MarkTolonen 的建议来使用 sys.stdout.encoding
,它会告诉我结果是 'cp437'!
- 请注意
sys.stdout.encoding
(对我来说的值是cp437
)... - 不同于
sys.getdefaultencoding()
(对我来说的值是ascii
...)
如果我尝试在使用 chcp.com
将代码页更改为 UTF-8(值为 65001)时检查 sys.stdout.encoding
,我会得到一个错误 LookupError: unknown encoding: cp65001
,这在这里有更详细的描述。
[![Git Bash 终端的截图,使用 chcp.com 65001
更改终端编
英文:
Thanks @dan04, @MarkTolonen and @ (see the comments to the question above). As @MarkTolonen says:
> command prompt uses the default OEM code page (cp437 for US Windows ....)"
This seems clear from checking code page 437 for the values I'm trying to encode:
>>> 'á' #-> '\xa0' expected in code page 437
>>> 'à' #-> '\x85' expected in code page 437
I highlight those values in the screenshot below.
I used @MarkTolonen's suggestion of running the chcp
command to get or set the encoding used by my shell/terminal. chcp
is short for "change code page". If you're using Git Bash, use chcp.com
instead. Sure enough, when I run chcp
, the output is Active code page: 437
:
Then I tried @juanpa.arrivillaga's suggestion of using a file. First I tried a file that explicitly used the 437 code page.
- I added the "magic comment" to specify encoding 437:
# -*- coding: cp437 -*-
, but that's not enough to encode the file. The magic comment explains to Python how to decode the file. - I also had to change the encoding of the file (tell my editor, VS Code, how to encode in CP437).
Once I do both those things with a Python file (encode and decode with CP437), I get the same "unexpected" results as my OP, which confirms that CP437 is indeed the encoding used by my terminal/shell.
In general you must both encode and include the "decode magic comment", and make sure your shell uses the same encoding!
- If I include the cp437 "magic comment" without encoding in CP437 (VS Code default encoding is UTF-8), the length of
'á'
is 2; as in UTF-8! (Note the results are printed in my CP437 shell so they look strange; I see character├
, which is\xc3
in CP437!) - If I encode in CP437 but I don't include the magic comment, I get an error:
(SyntaxError: Non-ASCII character '\xa0' in file 437_encoding.py on line 4)
If I encode in utf-8, and I include the "magic comment" for utf-8, and I change my shell to use utf-8 (chcp.com 65001
), then I get the results I expect!
Finally, if I try @MarkTolonen's suggestion to use sys.stdout.encoding
, it will tell me the results 'cp437'!
- Please note
sys.stdout.encoding
(which for me had the valuecp437
)... - is not the same as
sys.getdefaultencoding()
(which for me had the valueascii
...
And if I try to check sys.stdout.encoding
when I used chcp.com
to change the code page to UTF-8 (value 65001), I get an error LookupError: unknown encoding: cp65001
which is described in more detail here
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论