Trouble with writing to a CSV using UTF-8 encoding


Question

I'm trying to analyse some Facebook Messenger data and I'm having trouble with UTF-8 encoding.

import os
import json
import datetime
from tqdm import tqdm
import csv
from datetime import datetime 

directory = "facebook-100071636101603/messages/inbox"
folders = os.listdir(directory)

if ".DS_Store" in folders:
    folders.remove(".DS_Store")

for folder in tqdm(folders):
    print(folder)
    for filename in os.listdir(os.path.join(directory,folder)):
        if filename.startswith("message"):
            data = json.load(open(os.path.join(directory,folder,filename), "r"))
            for message in data["messages"]:
                try:
                    date = datetime.fromtimestamp(message["timestamp_ms"] / 1000).strftime("%Y-%m-%d %H:%M:%S")
                    sender = message["sender_name"]
                    content = message["content"]
                    with open('output.csv', 'w', encoding="utf-8") as csv_file:
                        writer = csv.writer(csv_file)
                        writer.writerow([date,sender,content])

                except KeyError:
                    pass

This script runs, but the output CSV doesn't show the accented characters correctly.

I'm very new to this, so I haven't tried a lot. I've read the Python csv documentation and found this passage:
> Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getencoding()). To decode a file using a different encoding, use the encoding argument of open:
>
> import csv
> with open('some.csv', newline='', encoding='utf-8') as f:
>     reader = csv.reader(f)
>     for row in reader:
>         print(row)

But this doesn't seem to work.

Edit:
This is the output I'm getting: the accented characters come out garbled. It should read Jørn and quête.

Answer 1

Score: 0

Try adding encoding="utf-8" to this line:

json.load(open(os.path.join(directory, folder, filename), "r", encoding="utf-8"))

This ensures that every file you load is decoded as UTF-8 rather than with the system default encoding.
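As a minimal sketch of that suggestion, the file can also be opened in a with block so that it is closed once loading finishes. The helper name load_messages is hypothetical and not part of the original script.

```python
import json


def load_messages(path):
    """Load one Messenger JSON file, decoding it explicitly as UTF-8."""
    # Without encoding=..., open() falls back to the platform default,
    # which is not UTF-8 on every system.
    with open(path, "r", encoding="utf-8") as json_file:
        return json.load(json_file)
```

In the question's loop this would replace the json.load(open(...)) call, for example data = load_messages(os.path.join(directory, folder, filename)).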

EDIT:

If the text is still garbled after that, install ftfy with pip install ftfy. This package repairs text whose encoding has been broken.

Pass sender and content through ftfy to fix them:

import ftfy
# Your other code
sender = message["sender_name"]
content = message["content"]
sender = ftfy.fix_text(sender)
content = ftfy.fix_text(content)

You can use ftfy.fix_text(string) for any other broken encoding as well.
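If installing a package is not an option, a similar repair can be sketched with the standard library alone. This assumes the damage is the common case where UTF-8 bytes were read as Latin-1, so that one accented letter shows up as two characters; it may not cover every kind of breakage that ftfy handles. The names fix_mojibake and write_rows are hypothetical. The sketch also opens the CSV a single time, because opening it in 'w' mode inside the loop, as the question's script does, overwrites the file for every message.

```python
import csv


def fix_mojibake(text):
    """Undo UTF-8 text that was mis-decoded as Latin-1; otherwise return it unchanged."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8: assume the text was fine already.
        return text


def write_rows(rows, path="output.csv"):
    """Write (date, sender, content) rows to one CSV file that is opened once."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        for date, sender, content in rows:
            writer.writerow([date, fix_mojibake(sender), fix_mojibake(content)])
```

With this approach the loop would collect (date, sender, content) tuples in a list and call write_rows on that list after all folders have been processed.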

huangapple
  • Posted on June 15, 2023 at 17:24:29
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76481033.html