file.exe on Windows cannot identify the file encoding and the file seems to be corrupted. What can be done?

Question

For some files, `chardet.detect(f.read())['encoding']` from Python's chardet library returns `None`.

```python
import codecs

import chardet

path = r"C:\A chinese novel.TXT"
with codecs.open(path, 'rb') as f:
    encoding = chardet.detect(f.read())
    print(encoding)
# Prints: {'encoding': None, 'confidence': 0.0, 'language': None}
```

I then ran `os.popen("file -bi \"%s\" | gawk -F'[ =]' '{print $3}'" % f).read()` to check the file encoding; it reports the encoding as `unknown-8bit`.

`file xxx.txt` outputs `xxx.txt: Non-ISO extended-ASCII text, with very long lines (560), with CRLF line terminators`.

Here is a GIF that shows the situation: https://i.imgur.com/5kvmnRL.gif

However, Notepad++ opens the file normally. The editor shows the encoding as GB2312, and the characters display mostly correctly.

Could the file have become corrupted, turning it into a mixed-encoding file that the chardet library cannot recognize?

ChatGPT suggested that I use iconv to re-encode the bad file, but I cannot confirm the file's encoding without first opening it in a text editor (Notepad++). Is there a more reliable way to identify file encodings with Python on Windows 10?

Answer 1

Score: 0

  • chardet: A very popular Python package for detecting text encodings.

  • cchardet: A Python module implemented in C++, with an interface similar to the chardet package.

  • file-magic: A Python wrapper around the libmagic library, which identifies file types and encodings.

```python
import chardet
import cchardet
import magic

# chardet
with open('your_file_path', 'rb') as f:
    rawdata = f.read()
    result = chardet.detect(rawdata)
    encoding = result['encoding']
    print(encoding)

# cchardet
with open('your_file_path', 'rb') as f:
    rawdata = f.read()
    result = cchardet.detect(rawdata)
    encoding = result['encoding']
    print(encoding)

# file-magic
# Note: Magic() used as a context manager with id_filename() appears to
# match the API of the `filemagic` package; check which package is installed.
with magic.Magic() as m:
    file_type = m.id_filename('your_file_path')
    print(file_type)
```

In my testing, cchardet worked well: it reported the correct encoding.
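
When a detector returns `None`, one fallback is to test a short list of likely encodings directly. The sketch below is not part of the answer above. It assumes the file is mostly valid text with a few damaged bytes, and it uses only the standard library. The function name `rank_encodings` and the candidate list are illustrative choices. It decodes with `errors="replace"` and counts the replacement characters each candidate produces. `gb18030` is listed because it is a superset of GBK and GB2312.

```python
def rank_encodings(data: bytes, candidates=("utf-8", "gb18030", "big5")):
    """Return (encoding, bad_count) pairs, fewest undecodable spots first.

    Hypothetical helper: the candidate list is an assumption and should be
    adapted to the languages you expect. Ties keep the candidate order,
    because Python's sort is stable.
    """
    scores = []
    for name in candidates:
        # errors="replace" turns every undecodable byte sequence into U+FFFD
        text = data.decode(name, errors="replace")
        scores.append((name, text.count("\ufffd")))
    return sorted(scores, key=lambda pair: pair[1])


# Simulate a GBK-encoded text with one corrupted byte in the middle.
sample = "这是一本中文小说。".encode("gbk")
damaged = sample[:4] + b"\xff" + sample[4:]

best, bad = rank_encodings(damaged)[0]
print(best, bad)

# To apply this to a real file, read it in binary mode first:
# with open(path, "rb") as f:
#     print(rank_encodings(f.read()))
```

A low count is a hint rather than proof: other double-byte encodings such as Big5 can also decode many of the same byte pairs, so it is worth reading a few decoded lines before converting the file.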

huangapple
  • Posted on February 18, 2023, at 10:15:21
  • When reposting, please keep the link to this article: https://go.coder-hub.com/75490773.html