Python write bytes to file using redirect of print
Question
Using Perl,
$ perl -e 'print "\xca"' > out
$ xxd out
gives
00000000: ca
But with Python, I tried
$ python3 -c 'print("\xca", end="")' > out
$ xxd out
and got
00000000: c38a
I'm not sure what is going on.
Answer 1
Score: 3
In Python, a str object is a sequence of Unicode code points. How it is written out depends on the encoding of your sys.stdout. That encoding is picked based on your locale (various environment variables can also affect it, but by default it is your locale). So yours must be set to UTF-8. That's my default too:
(py311) Juans-MBP:~ juan$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
(py311) Juans-MBP:~ juan$ python -c "print('\xca', end='')" | xxd
00000000: c38a
However, if I override my locale and tell it to use en_US.ISO8859-1 (latin-1), a single-byte encoding, we get what you expect:
(py311) Juans-MBP:~ juan$ LC_ALL="en_US.ISO8859-1" python -c "print('\xca', end='')" | xxd
00000000: ca
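To see the same effect without touching your locale, here is a minimal sketch that wraps an in-memory byte buffer in a text stream with an explicit encoding, which is roughly what sys.stdout does for you. The helper name printed_bytes is made up for this illustration:

```python
import io


def printed_bytes(text: str, encoding: str) -> bytes:
    """Return the bytes print() emits for `text` on a stream using `encoding`."""
    raw = io.BytesIO()
    # TextIOWrapper encodes str to bytes, like sys.stdout does over its buffer.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    print(text, end="", file=stream)
    stream.flush()  # push the encoded bytes down into the BytesIO
    return raw.getvalue()


utf8_bytes = printed_bytes("\xca", "utf-8")
latin1_bytes = printed_bytes("\xca", "latin-1")
print(utf8_bytes.hex(), latin1_bytes.hex())  # c38a ca
```

The str is the same in both calls; only the stream's encoding changes what ends up in the output.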
The solution is to work with raw bytes if you want raw bytes. The way to do that in Python source code is to use a bytes literal (or a string literal that you then .encode). We can write to the raw buffer at sys.stdout.buffer:
(py311) Juans-MBP:~ juan$ python -c "import sys; sys.stdout.buffer.write(b'\xca')" | xxd
00000000: ca
Or by encoding a string to a bytes object:
(py311) Juans-MBP:~ juan$ python -c "import sys; sys.stdout.buffer.write('\xca'.encode('latin'))" | xxd
00000000: ca
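If the goal is just to get the byte into a file, you can also skip the shell redirect and open the file yourself. This is a rough sketch, not the only way to do it; the temporary path is a placeholder chosen for the example:

```python
import os
import tempfile

# Hypothetical scratch path for the demo; any writable location works.
path = os.path.join(tempfile.mkdtemp(), "out")

# Binary mode: the bytes go to disk unchanged, with no encoding step.
with open(path, "wb") as f:
    f.write(b"\xca")
with open(path, "rb") as f:
    binary_result = f.read()

# Text mode with an explicit single-byte encoding: U+00CA becomes byte 0xca.
with open(path, "w", encoding="latin-1", newline="") as f:
    print("\xca", end="", file=f)
with open(path, "rb") as f:
    text_result = f.read()

print(binary_result.hex(), text_result.hex())  # ca ca
```

If you want to keep using print with a redirect, setting the PYTHONIOENCODING environment variable (for example PYTHONIOENCODING=latin-1) for that one command should have the same effect as the locale override above, though I have only reasoned about it for this single-character case.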
Answer 2
Score: 1
> In Python, "\xca" in a str literal is the single code point U+00CA. When
> it is written to a UTF-8 stream it is encoded as two bytes, which is why
> the file ends up containing c3 8a.
>
> In Perl, \xca is treated as a single byte with the hexadecimal value
> 0xca (when no encoding layer is set on the output), so it is written to
> the file without encoding.