2023-06-15 17:24:29

Trouble with writing to a csv using utf-8 encoding

Question

I'm trying to analyse some Facebook Messenger data and I'm having trouble with UTF-8 encoding.

import os
import json
import datetime
from tqdm import tqdm
import csv
from datetime import datetime
directory = "facebook-100071636101603/messages/inbox"
folders = os.listdir(directory)
if ".DS_Store" in folders:
    folders.remove(".DS_Store")
for folder in tqdm(folders):
    print(folder)
    for filename in os.listdir(os.path.join(directory,folder)):
        if filename.startswith("message"):
            data = json.load(open(os.path.join(directory,folder,filename), "r"))
            for message in data["messages"]:
                try:
                    date = datetime.fromtimestamp(message["timestamp_ms"] / 1000).strftime("%Y-%m-%d %H:%M:%S")
                    sender = message["sender_name"]
                    content = message["content"]
                    with open('output.csv', 'w', encoding="utf-8") as csv_file:
                        writer = csv.writer(csv_file)
                        writer.writerow([date,sender,content])
                except KeyError:
                    pass

This script runs, but the output CSV doesn't show the accented characters correctly.

I'm very new to this, so I haven't tried a lot. I've read the Python csv documentation and found this passage:
> Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getencoding()). To decode a file using a different encoding, use the encoding argument of open:
>
>     import csv
>     with open('some.csv', newline='', encoding='utf-8') as f:
>         reader = csv.reader(f)
>         for row in reader:
>             print(row)

But this doesn't seem to work.

Edit:
This is the output I'm getting, but it should be Jørn, not JÃ¸rn, and quête, not quÃªte.

Answer 1

Score: 0

Try adding encoding="utf-8" to this line:

json.load(open(os.path.join(directory, folder, filename), "r", encoding="utf-8"))

This ensures that every file you load is decoded as UTF-8 instead of the system default encoding.
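If the output still shows JÃ¸rn after that change, one plausible cause is that the damage is already inside the JSON text rather than in how the file is opened. The sketch below assumes the export stores each UTF-8 byte of a character as its own \u00XX escape; the one-message sample string is hypothetical and only there to reproduce the symptom from the question.

```python
import json

# Hypothetical one-message sample (an assumption, not real export data).
# The two UTF-8 bytes of "ø" (0xC3 0xB8) appear as two separate \u escapes.
raw = r'{"sender_name": "J\u00c3\u00b8rn", "content": "qu\u00c3\u00aate"}'

message = json.loads(raw)
sender = message["sender_name"]
print(sender)  # JÃ¸rn

# Turn each character back into the byte it stands for, then decode as UTF-8.
fixed = sender.encode("latin-1").decode("utf-8")
print(fixed)  # Jørn
```

In that situation the file itself contains only ASCII characters, so the encoding passed to open() cannot change the result and the strings have to be repaired after loading.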

EDIT:

Install ftfy with pip install ftfy. This package repairs text whose encoding has been mangled, such as JÃ¸rn in place of Jørn.

Then pass sender and content through ftfy.fix_text:

import ftfy
# Your other code
sender = message["sender_name"]
content = message["content"]
sender = ftfy.fix_text(sender)
content = ftfy.fix_text(content)

You can use ftfy.fix_text(string) on any other string with this kind of encoding damage as well.
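If you would rather not add a dependency, here is a minimal standard-library sketch of the same idea. It is not ftfy and it only handles the single case of UTF-8 text that was read as Latin-1. The names repair_mojibake and write_messages and the sample messages are hypothetical, and the date column from the question is left out to keep it short.

```python
import csv
import io


def repair_mojibake(text):
    """Undo UTF-8-read-as-Latin-1 damage; return the text unchanged otherwise."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1 or the bytes are not
        # valid UTF-8, so it was probably not damaged in this particular way.
        return text


def write_messages(messages, csv_file):
    """Write one row per message to a file object that is already open."""
    writer = csv.writer(csv_file)
    for message in messages:
        try:
            sender = repair_mojibake(message["sender_name"])
            content = repair_mojibake(message["content"])
        except KeyError:
            continue  # skip messages without text, as the question's code does
        writer.writerow([sender, content])


# Hypothetical sample data standing in for data["messages"].
sample = [
    {"sender_name": "J\u00c3\u00b8rn", "content": "qu\u00c3\u00aate"},
    {"sender_name": "Anna"},  # no "content" key, so this one is skipped
]

buffer = io.StringIO(newline="")  # in-memory stand-in for the CSV file
write_messages(sample, buffer)
print(buffer.getvalue())
```

With a real file, open it once before the loops, for example with open("output.csv", "w", newline="", encoding="utf-8"), and pass the file object in. Opening it with "w" inside the message loop, as in the question, truncates the file on every message, so only the last row survives. The round trip can also misfire on correctly encoded text that happens to form valid UTF-8 bytes, which is one reason ftfy's more careful heuristics may be the safer choice.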
