file.exe on Windows cannot identify the file encoding and the file seems to be corrupted. What can be done?
Question
For some files, `chardet.detect(f.read())['encoding']` from Python's chardet library returns None.
```PYTHON
import codecs
import chardet

path = r"C:\A chinese novel.TXT"
with codecs.open(path, 'rb') as f:
    encoding = chardet.detect(f.read())
print(encoding)
# RETURN {'encoding': None, 'confidence': 0.0, 'language': None}
```
I also used `os.popen("file -bi \"%s\" | gawk -F'[ =]' '{print $3}'" % f).read()` to check the file encoding, and it reports the encoding as unknown-8bit.
Running `file xxx.txt` outputs `xxx.txt: Non-ISO extended-ASCII text, with very long lines (560), with CRLF line terminators`.
Here is a GIF showing the situation: https://i.imgur.com/5kvmnRL.gif
However, Notepad++ opens the file normally, Notepad reports the encoding as GB2312, and the characters mostly display correctly.
Could the file have been corrupted into a mixed-encoding file that the chardet library cannot recognize?
ChatGPT suggested using iconv to re-encode the bad file, but the text editor (Notepad++) could not confirm the file's encoding before opening it. Is there a more reliable way to identify file encodings with Python on Windows 10?
Answer 1
Score: 0
- `chardet`: A very popular Python package for detecting encodings.
- `cchardet`: A Python module written in C++, similar to the chardet package.
- `file-magic`: A Python wrapper around the libmagic library, which recognizes file types and encodings.
```PYTHON
import chardet
import cchardet
import magic

# Read the raw bytes once; both detectors work on bytes, not decoded text.
with open('your_file_path', 'rb') as f:
    rawdata = f.read()

# chardet
result = chardet.detect(rawdata)
print(result['encoding'])

# cchardet
result = cchardet.detect(rawdata)
print(result['encoding'])

# file-magic (this with-statement API appears to match the `filemagic` package on PyPI)
with magic.Magic() as m:
    file_type = m.id_filename('your_file_path')
print(file_type)
```
After testing, cchardet gave good results: it correctly identified the file's encoding.
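If every detector still returns None, one possible fallback is to try a short list of likely codecs with strict decoding, and to locate the bytes that refuse to decode. The sketch below uses only the standard library. The function names `first_strict_decoding` and `bad_byte_offsets` are made up for illustration, and the candidate list is an assumption based on Notepad showing GB2312 (GB18030 is a superset of GB2312 and GBK). A strict decode that succeeds does not prove the codec is the right one, so treat the result as a hint.

```python
def first_strict_decoding(raw, candidates=("utf-8", "gb18030", "big5")):
    """Return the first candidate codec that decodes raw without errors, else None."""
    for name in candidates:
        try:
            raw.decode(name)  # strict mode: raises on any invalid byte
        except UnicodeDecodeError:
            continue
        return name
    return None


def bad_byte_offsets(raw, encoding):
    """Return the absolute offsets where raw fails to decode as encoding."""
    offsets = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode(encoding)
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice raw[pos:]
            offsets.append(pos + exc.start)
            pos += max(exc.end, exc.start + 1)  # skip past the bad bytes
    return offsets


if __name__ == "__main__":
    clean = "中文小说".encode("gb2312")
    damaged = clean + b"\xff" + b" ok"  # simulate one stray byte
    print(first_strict_decoding(clean))
    print(bad_byte_offsets(damaged, "gb18030"))
```

For the real file, read it with `open(path, 'rb')` and pass the bytes in. A handful of offsets would suggest a mostly intact GB-family file with a few damaged bytes, in which case `raw.decode("gb18030", errors="replace")` recovers the readable text. This loop re-slices the data after each error, so it is meant for files with few bad bytes.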