Why does `.decode("utf-16")` on an ASCII-encoded string sometimes crash?

Question

I wanted to show how encoding conversion can reduce the number of characters needed to write a Python script, and I used the obfuscated Mandelbrot set one-liner from the Python programming FAQ as an example.

code = b"""print((lambda Ru,Ro,Iu,Io,IM,Sx,Sy:reduce(lambda x,y:x+'\n'+y,map(lambda y,
Iu=Iu,Io=Io,Ru=Ru,Ro=Ro,Sy=Sy,L=lambda yc,Iu=Iu,Io=Io,Ru=Ru,Ro=Ro,i=IM,
Sx=Sx,Sy=Sy:reduce(lambda x,y:x+y,map(lambda x,xc=Ru,yc=yc,Ru=Ru,Ro=Ro,
i=i,Sx=Sx,F=lambda xc,yc,x,y,k,f=lambda xc,yc,x,y,k,f:(k<=0)or (x*x+y*y
>=4.0) or 1+f(xc,yc,x*x-y*y+xc,2.0*x*y+yc,k-1,f):f(xc,yc,x,y,k,f):chr(
64+F(Ru+x*(Ro-Ru)/Sx,yc,0,0,i)),range(Sx))):L(Iu+y*(Io-Iu)/Sy),range(Sy
))))(-2.1, 0.7, -1.2, 1.2, 30, 80, 24))"""

shorter_code = code.decode("u16")  # crash here
print(shorter_code)
code_back = shorter_code.encode("u16")[2:]
print(code_back)
print(code_back == code)

However, the code crashed unexpectedly during execution.

Traceback (most recent call last):
  File "C:\Users\lancet\AppData\Roaming\JetBrains\PyCharm2022.3\scratches\scratch_24.py", line 9, in <module>
    shorter_code = code.decode("u16")
                   ^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x29 in position 472: truncated data

I have already used this kind of trick successfully for challenges in CodinGame's code golf mode. So I tried another example from the documentation, the "First 10 Fibonacci numbers" one-liner, and it worked.

code = b"""print(list(map(lambda x,f=lambda x,f:(f(x-1,f)+f(x-2,f)) if x>1 else 1:
f(x,f), range(10))))"""

shorter_code = code.decode("u16")
print(shorter_code)
# 牰湩⡴楬瑳洨灡氨浡摢⁡ⱸ㵦慬扭慤砠昬⠺⡦⵸ⰱ⥦昫砨㈭昬⤩椠⁦㹸‱汥敳ㄠ਺⡦ⱸ⥦‬慲杮⡥〱⤩⤩
code_back = shorter_code.encode("u16")[2:]
print(code_back)
# b'print(list(map(lambda x,f=lambda x,f:(f(x-1,f)+f(x-2,f)) if x>1 else 1:\nf(x,f), range(10))))'
print(code_back == code)
# True

Why is the first string considered truncated?

Answer 1

Score: 3

An even-length string is required: add a trailing space to the string

Each ASCII character is represented as a single 8-bit byte with the most significant bit set to 0. UTF-16 encodes text as 16-bit code units (two bytes each), so you need an even number of bytes to decode the data as UTF-16. If the length is odd, the last code unit is missing 8 bits of data and the decoder reports it as truncated.
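As a minimal sketch of that rule, the snippet below uses two short made-up byte strings (the names `even` and `odd` are illustrative, not from the question). It names the `utf-16-le` codec explicitly, which is the codec shown in the traceback; the `"u16"` alias would pick the platform's native byte order when no BOM is present.

```python
even = b"abcd"  # 4 bytes -> two 16-bit code units
odd = b"abc"    # 3 bytes -> one code unit plus one dangling byte

decoded = even.decode("utf-16-le")
print(len(decoded))  # 2

error_reason = None
try:
    odd.decode("utf-16-le")
except UnicodeDecodeError as exc:
    # The decoder cannot build a 16-bit unit from the single leftover byte.
    error_reason = exc.reason
print(error_reason)  # truncated data
```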

The length of the Mandelbrot code is 473, and the length of the Fibonacci code is 92.

To fix the script, you need a string with an even length, so just add a trailing space.

code = b"""print((lambda Ru,Ro,Iu,Io,IM,Sx,Sy:reduce(lambda x,y:x+'\n'+y,map(lambda y,
Iu=Iu,Io=Io,Ru=Ru,Ro=Ro,Sy=Sy,L=lambda yc,Iu=Iu,Io=Io,Ru=Ru,Ro=Ro,i=IM,
Sx=Sx,Sy=Sy:reduce(lambda x,y:x+y,map(lambda x,xc=Ru,yc=yc,Ru=Ru,Ro=Ro,
i=i,Sx=Sx,F=lambda xc,yc,x,y,k,f=lambda xc,yc,x,y,k,f:(k<=0)or (x*x+y*y
>=4.0) or 1+f(xc,yc,x*x-y*y+xc,2.0*x*y+yc,k-1,f):f(xc,yc,x,y,k,f):chr(
64+F(Ru+x*(Ro-Ru)/Sx,yc,0,0,i)),range(Sx))):L(Iu+y*(Io-Iu)/Sy),range(Sy
))))(-2.1, 0.7, -1.2, 1.2, 30, 80, 24)) """

print(len(code))
# 474
shorter_code = code.decode("u16")
print(shorter_code)
# 牰湩⡴氨浡摢⁡畒刬Ɐ畉䤬Ɐ䵉匬ⱸ祓爺摥捵⡥慬扭慤砠礬砺✫✊礫洬灡氨浡摢⁡ⱹ䤊㵵畉䤬㵯潉刬㵵畒刬㵯潒匬㵹祓䰬氽浡摢⁡捹䤬㵵畉䤬㵯潉刬㵵畒刬㵯潒椬䤽ⱍ匊㵸硓匬㵹祓爺摥捵⡥慬扭慤砠礬砺礫洬灡氨浡摢⁡ⱸ捸刽Ⱶ捹礽Ᵽ畒刽Ⱶ潒刽Ɐ椊椽匬㵸硓䘬氽浡摢⁡捸礬ⱣⱸⱹⱫ㵦慬扭慤砠Ᵽ捹砬礬欬昬⠺㱫〽漩⁲砨砪礫礪㸊㐽〮
牯ㄠ昫砨Ᵽ捹砬砪礭礪砫Ᵽ⸲⨰⩸⭹捹欬ㄭ昬㨩⡦捸礬ⱣⱸⱹⱫ⥦挺牨ਨ㐶䘫刨⭵⩸刨ⵯ畒⼩硓礬Ᵽⰰⰰ⥩Ⱙ慲杮⡥硓⤩㨩⡌畉礫⠪潉䤭⥵匯⥹爬湡敧匨੹⤩⤩⴨⸲ⰱ〠㜮‬ㄭ㈮‬⸱ⰲ㌠ⰰ㠠ⰰ㈠⤴

code_back = shorter_code.encode("u16")[2:]
print(code_back)
# b"print((lambda Ru,Ro,Iu,Io,IM,Sx,Sy:reduce(lambda x,y:x+'\n'+y,map(lambda y,\nIu=Iu,Io=Io,Ru=Ru,Ro=Ro,Sy=Sy,L=lambda yc,Iu=Iu,Io=Io,Ru=Ru,Ro=Ro,i=IM,\nSx=Sx,Sy=Sy:reduce(lambda x,y:x+y,map(lambda x,xc=Ru,yc=yc,Ru=Ru,Ro=Ro,\ni=i,Sx=Sx,F=lambda xc,yc,x,y,k,f=lambda xc,yc,x,y,k,f:(k<=0)or (x*x+y*y\n>=4.0) or 1+f(xc,yc,x*x-y*y+xc,2.0*x*y+yc,k-1,f):f(xc,yc,x,y,k,f):chr(\n64+F(Ru+x*(Ro-Ru)/Sx,yc,0,0,i)),range(Sx))):L(Iu+y*(Io-Iu)/Sy),range(Sy\n))))(-2.1, 0.7, -1.2, 1.2, 30, 80, 24)) "
print(code_back == code)
# True
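
The padding step can also be wrapped in a small helper so that the length check happens automatically. This is only a sketch: `pack` and `unpack` are hypothetical names, and it assumes the source is pure ASCII and that an extra trailing space is acceptable. Because it encodes with `utf-16-le` rather than `u16`, no BOM is written, so the `[2:]` slice is not needed.

```python
def pack(source: bytes) -> str:
    """Decode ASCII bytes as UTF-16-LE, padding to an even length first."""
    if len(source) % 2:
        source += b" "  # pad with one trailing space, as in the fix above
    return source.decode("utf-16-le")


def unpack(packed: str) -> bytes:
    """Recover the (possibly padded) bytes; utf-16-le emits no BOM."""
    return packed.encode("utf-16-le")


original = b"print(12+3)"  # 11 bytes, an odd length
packed = pack(original)
restored = unpack(packed)

print(len(original), len(packed))   # 11 6
print(restored == original + b" ")  # True
```

The round trip returns the padded bytes, so the comparison is against `original + b" "` rather than `original` itself.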

huangapple
  • Published on February 10, 2023, 03:58:12
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75403816.html