subprocess.run command with non-utf-8 characters (UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb)
Question
Encoding honestly continues to confuse me, so hopefully this isn't a totally daft question.
I have a Python script that calls metaflac to compare the FLAC fingerprints listed in a file to the FLAC fingerprints of the files themselves. Recently I came across files with » (https://bytetool.web.app/en/ascii/code/0xbb/) in the file name. This broke the way I was handling the file names as strings, so I'm trying to work around that. My first thought was that I needed to treat the names as bytes objects. But when I do that and then call subprocess.run, I get a UnicodeDecodeError.
Here's the snippet of code that is giving me errors:
def test():
    directory = b'<redacted>'
    ffp_open = open(directory + b'<redacted>.ffp','rb')
    ffp_lines = ffp_open.readlines()
    print(ffp_lines)
    for line in ffp_lines:
        if not line.startswith(b';') and b':' in line:
            txt = line.split(b':')
            ffp_cmd = b'/usr/bin/metaflac --show-md5sum \'' + directory + b'/' + txt[0]+ b'\''
            print(ffp_cmd)
            get_ffp_process = subprocess.run(ffp_cmd, stdout=PIPE, stderr=PIPE, universal_newlines=True,shell=True)
With that, I get the following output (shortened to make more sense):
[b'01 - Intro.flac:eee7ca01db887168ce8312e7a3bdf8d6\r\n', b'04 - Song title \xbb Other Song \xbb.flac:98d2d03f47790d234052c6c9a2ca5cfd\r\n']
b"/usr/bin/metaflac --show-md5sum '<redacted>/01 - Intro.flac'"
b"/usr/bin/metaflac --show-md5sum '<redacted>/04 - Song title \xbb Other Song \xbb.flac'"
    get_ffp_process = subprocess.run(ffp_cmd, stdout=PIPE, stderr=PIPE, universal_newlines=True,shell=True)
  File "<redacted>/python/lib/python3.9/subprocess.py", line 507, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "<redacted>/python/lib/python3.9/subprocess.py", line 1134, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "<redacted>/python/lib/python3.9/subprocess.py", line 2021, in _communicate
    stderr = self._translate_newlines(stderr,
  File "<redacted>/python/lib/python3.9/subprocess.py", line 1011, in _translate_newlines
    data = data.decode(encoding, errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 85: invalid start byte
If I run this directly on the command line it works just fine (using tab completion to fill in the file name):
metaflac --show-md5sum 04\ -\ Song\ title\ »\ Other Song\ ».flac
98d2d03f47790d234052c6c9a2ca5cfd
The FFP file, through nano, looks like this:
01 - Intro.flac:eee7ca01db887168ce8312e7a3bdf8d6
04 - Song title � Other Song �.flac:98d2d03f47790d234052c6c9a2ca5cfd
I have no control over the file names, so I'm trying to be as flexible as possible to handle them, which is why I thought a bytes object would be best. I'd appreciate any direction. Thanks!
Answer 1
Score: 1
I believe an encoding of "latin1" or "cp1252" will decode that successfully, since byte 0xbb is » in both. Also, it is easier to deal with strings than with bytes, so here is my suggestion:
import pathlib
import subprocess

directory = pathlib.Path("/tmp")

with open(directory / "data.ffp", "r", encoding="latin1") as stream:
    for line in stream:
        # Skip comment lines and lines without a "name:md5" pair.
        if line.startswith(";"):
            continue
        if ":" not in line:
            continue

        # Split on the last colon so a colon in the file name does not break parsing.
        file_name, expected_md5sum = line.strip().rsplit(":", 1)
        print(f"{file_name=}")
        print(f"{expected_md5sum=}")

        # A list of arguments with no shell avoids any quoting problems.
        command = [
            "/usr/bin/metaflac",
            "--show-md5sum",
            str(directory / file_name),
        ]
        print(f"{command=}")

        # Now you can run the command. I assume that the command will return an MD5 sum back.
        completed_process = subprocess.run(
            command,
            encoding="latin1",
            capture_output=True,
        )
        # Now, completed_process.stdout will hold the output
        # as a string, not bytes.

Here is a sample output:
file_name='01 - Intro.flac'
expected_md5sum='eee7ca01db887168ce8312e7a3bdf8d6'
command=['/usr/bin/metaflac', '--show-md5sum', '/tmp/01 - Intro.flac']
file_name='04 - Song title » Other Song ».flac'
expected_md5sum='98d2d03f47790d234052c6c9a2ca5cfd'
command=['/usr/bin/metaflac', '--show-md5sum', '/tmp/04 - Song title » Other Song ».flac']
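The decoding behaviour can be checked in isolation. The snippet below is a small sketch that uses the second file name from the question; the variable names are only for illustration. Latin-1 assigns a character to every byte value, so decoding cannot fail, whereas 0xbb on its own is not valid UTF-8.

```python
# The raw file name from the question, containing the single byte 0xBB.
raw = b"04 - Song title \xbb Other Song \xbb.flac"

# Latin-1 maps each byte 0x00-0xFF to the code point with the same number,
# so 0xBB becomes U+00BB and the decode never raises.
name = raw.decode("latin1")

# The same bytes are not valid UTF-8: 0xBB is a continuation byte and cannot
# start a character, which matches "invalid start byte" in the traceback.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Encoding back with Latin-1 restores the original bytes exactly.
round_trip = name.encode("latin1")
print(utf8_ok)            # False
print(round_trip == raw)  # True
```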
Since my system does not have the metaflac command, I cannot test it. Please forgive any errors that come up. If you find an error, please post it in a comment and I will try to fix it.
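One caveat with decoding to str, offered as a sketch rather than a tested fix: on a POSIX system, str arguments are encoded with the filesystem encoding (usually UTF-8) before they reach the child process, so a » decoded from the single byte 0xbb would arrive as the two bytes 0xc2 0xbb and might not match the name on disk. If that turns out to be the case, the bytes can be kept untouched by passing a list of bytes arguments and capturing the output as bytes. The helper names parse_ffp, build_command, and run_capture below are hypothetical, and the directory is whatever you use in your own script.

```python
import subprocess


def parse_ffp(lines):
    """Yield (file_name_bytes, md5_text) pairs from raw FFP lines."""
    for line in lines:
        line = line.rstrip(b"\r\n")
        # Skip comments and anything that is not a "name:md5" pair.
        if line.startswith(b";") or b":" not in line:
            continue
        # Split on the last colon; the MD5 itself is plain ASCII hex.
        name, _, md5 = line.rpartition(b":")
        yield name, md5.decode("ascii")


def build_command(directory, name, metaflac=b"/usr/bin/metaflac"):
    """Build an argument list of bytes; no shell and no quoting needed."""
    return [metaflac, b"--show-md5sum", directory + b"/" + name]


def run_capture(command):
    """Run a command and return (returncode, stdout, stderr).

    Output is captured as bytes and decoded with errors="replace", so an
    undecodable byte in an error message cannot raise UnicodeDecodeError.
    """
    done = subprocess.run(command, capture_output=True)
    return (
        done.returncode,
        done.stdout.decode("utf-8", errors="replace").strip(),
        done.stderr.decode("utf-8", errors="replace").strip(),
    )
```

With these helpers, each entry from the FFP file could be checked by calling run_capture(build_command(directory, name)) and comparing the returned stdout with the MD5 read from the file. Leaving out shell=True also means the file name never has to be quoted.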