February 18, 2023, 10:15:21

file.exe on Windows cannot identify the file encoding; the file seems to be corrupted. What can be done?

Question

For some files, Python's chardet library returns None from `chardet.detect(f.read())['encoding']`.

```python
import codecs

import chardet

path = r"C:\A chinese novel.TXT"
# Read the raw bytes and let chardet guess the encoding.
with codecs.open(path, 'rb') as f:
    encoding = chardet.detect(f.read())
    print(encoding)
# Prints: {'encoding': None, 'confidence': 0.0, 'language': None}
```

I also ran `os.popen("file -bi \"%s\" | gawk -F'[ =]' '{print $3}'" % f).read()` to check the file's encoding; it reports the encoding as `unknown-8bit`.

`file xxx.txt` outputs `xxx.txt: Non-ISO extended-ASCII text, with very long lines (560), with CRLF line terminators`.

Here is a GIF showing the situation: https://i.imgur.com/5kvmnRL.gif

However, Notepad++ opens the file normally; Notepad shows that the file is GB2312-encoded, and the characters display mostly correctly.

Could the file have been corrupted, so that it is now a mixed-encoding file that chardet cannot recognize?

ChatGPT suggested using iconv to re-encode the bad file, but the text editor (Notepad++) cannot confirm the file's encoding before opening it. Is there a more reliable way to identify file encodings with Python on Windows 10?
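One way to test the corruption hypothesis is to decode the raw bytes as GB18030 (a superset of GB2312) and record every position where decoding fails. The sketch below is only an illustration: the helper name `find_undecodable_spans` and the error-handler name `record_bad_bytes` are made up for this example, and GB18030 is assumed only because the editor reported GB2312.

```python
import codecs


def find_undecodable_spans(data: bytes, encoding: str = "gb18030"):
    """Return (offset, bad_bytes) for each byte span that fails to decode.

    Hypothetical helper; the GB18030 default is an assumption.
    """
    spans = []

    def record(exc):
        # Only decoding errors are expected here; re-raise anything else.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        spans.append((exc.start, exc.object[exc.start:exc.end]))
        # Substitute U+FFFD and resume decoding right after the bad span.
        return ("\ufffd", exc.end)

    # Registering under the same name again replaces the previous handler.
    codecs.register_error("record_bad_bytes", record)
    data.decode(encoding, errors="record_bad_bytes")
    return spans


if __name__ == "__main__":
    # Two valid characters, one stray byte, then another valid character.
    sample = "中文".encode("gb18030") + b"\xff" + "字".encode("gb18030")
    for offset, bad in find_undecodable_spans(sample):
        print(f"offset {offset}: {bad!r}")
```

A handful of isolated bad spans would point to local damage in an otherwise GB-encoded file, while failures spread throughout the file would suggest that it uses a different encoding altogether.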

Answer 1

Score: 0

chardet: a very popular Python package for detecting encodings.
cchardet: a Python module written in C++, similar to the chardet package.
file-magic: a Python wrapper around the libmagic library, which identifies file types and encodings.

```python
import chardet
import cchardet
import magic

# chardet: pure-Python detector
with open('your_file_path', 'rb') as f:
    rawdata = f.read()
    result = chardet.detect(rawdata)
    encoding = result['encoding']
    print(encoding)

# cchardet: same detect() interface, implemented in C++
with open('your_file_path', 'rb') as f:
    rawdata = f.read()
    result = cchardet.detect(rawdata)
    encoding = result['encoding']
    print(encoding)

# libmagic wrapper. Magic() used as a context manager with id_filename()
# appears to be the interface of the `filemagic` package; adjust the calls
# if the wrapper you have installed exposes a different API.
with magic.Magic() as m:
    file_type = m.id_filename('your_file_path')
    print(file_type)
```

After testing, cchardet performed well: it reported the correct encoding.
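If the detectors above still return None, a cruder fallback is trial decoding: decode the bytes with each likely encoding and keep the one that needs the fewest replacement characters. This is a minimal sketch, assuming the file is either UTF-8 or GB18030; the function names, the candidate list, and the 1% tolerance are illustrative assumptions, not part of any library.

```python
def replacement_ratio(data: bytes, encoding: str) -> float:
    """Fraction of decoded characters that had to be replaced with U+FFFD."""
    text = data.decode(encoding, errors="replace")
    return text.count("\ufffd") / max(len(text), 1)


def guess_encoding(data: bytes, candidates=("utf-8", "gb18030"),
                   max_bad_ratio=0.01):
    """Return (encoding, bad_ratio) for the best candidate, or None.

    Candidates are tried in order. A clean decode wins immediately;
    otherwise the candidate with the lowest ratio of replaced characters
    is kept, provided the ratio stays within max_bad_ratio.
    Caveat: a file that legitimately contains U+FFFD is counted as damaged.
    """
    best = None
    for name in candidates:
        ratio = replacement_ratio(data, name)
        if ratio == 0:
            return name, 0.0
        if ratio <= max_bad_ratio and (best is None or ratio < best[1]):
            best = (name, ratio)
    return best
```

It could be called as `guess_encoding(open(path, 'rb').read())`. A clean decode only shows that the bytes are valid in that encoding, not that it is the intended one, so the order of the candidates matters and the result is best checked by eye.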
