Trouble with writing to a csv using utf-8 encoding
Question
I'm trying to analyse some Facebook Messenger data and I'm having trouble with UTF-8 encoding.
    import os
    import json
    import datetime
    from tqdm import tqdm
    import csv
    from datetime import datetime
    directory = "facebook-100071636101603/messages/inbox"
    folders = os.listdir(directory)
    if ".DS_Store" in folders:
        folders.remove(".DS_Store")
    for folder in tqdm(folders):
        print(folder)
        for filename in os.listdir(os.path.join(directory,folder)):
            if filename.startswith("message"):
                data = json.load(open(os.path.join(directory,folder,filename), "r"))
                for message in data["messages"]:
                    try:
                        date = datetime.fromtimestamp(message["timestamp_ms"] / 1000).strftime("%Y-%m-%d %H:%M:%S")
                        sender = message["sender_name"]
                        content = message["content"]
                        with open('output.csv', 'w', encoding="utf-8") as csv_file:
                            writer = csv.writer(csv_file)
                            writer.writerow([date,sender,content])
                    except KeyError:
                        pass
This script runs, but the output CSV doesn't show the accented characters correctly.
I'm very new to this, so I haven't tried much. I've read the Python csv documentation and found this passage:
> Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getencoding()). To decode a file using a different encoding, use the encoding argument of open:
>
>     import csv
>     with open('some.csv', newline='', encoding='utf-8') as f:
>         reader = csv.reader(f)
>         for row in reader:
>             print(row)
But this doesn't seem to work.
Edit:
This is the output I'm getting: the names should read Jørn and quête, but the accented characters come out garbled.
Answer 1
Score: 0
Try adding encoding="utf-8" to this line:
    json.load(open(os.path.join(directory, folder, filename), "r", encoding="utf-8"))
This makes Python decode every file you load as UTF-8 instead of falling back to the system default encoding.
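A minimal sketch of that change, assuming the export files are UTF-8 JSON. `load_messages` is a hypothetical helper name, not part of the original script. It also uses a `with` block so each file is closed after it has been read.

```python
import json


def load_messages(path):
    """Return the "messages" list from one export file, decoded as UTF-8."""
    # Passing encoding explicitly avoids depending on the system default
    # (for example cp1252 on many Windows installations).
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: a file without a "messages" key is treated as empty.
    return data.get("messages", [])
```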
EDIT:
You need to install ftfy with pip install ftfy. This package repairs text whose encoding has been broken (mojibake).
Pass sender and content through ftfy to fix them:
    import ftfy
    # Your other code
    sender = message["sender_name"]
    content = message["content"]
    sender = ftfy.fix_text(sender)
    content = ftfy.fix_text(content)
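If installing a package is not an option, the same repair can be sketched with the standard library. This assumes the damage is the common pattern where UTF-8 bytes were decoded as Latin-1, so that ø (bytes C3 B8) shows up as Ã¸. `fix_mojibake` is a hypothetical helper name. Strings that do not fit that pattern are returned unchanged.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        # Re-encode to recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or already correct): leave it alone.
        return text


print(fix_mojibake("J\u00c3\u00b8rn"))   # Jørn
print(fix_mojibake("qu\u00c3\u00aate"))  # quête
```

Unlike ftfy, this handles only that one pattern, so it is a narrower tool.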
You can use ftfy.fix_text(string) on any other text with broken encoding as well.
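Separately from the encoding problem, the script in the question reopens output.csv in 'w' mode for every message, so each row overwrites the previous one and only the last message survives. Below is a sketch of opening the file once, assuming the rows have already been collected as (date, sender, content) tuples. `write_rows` is a hypothetical helper name.

```python
import csv


def write_rows(rows, path="output.csv"):
    """Write all rows to one UTF-8 CSV file, opened a single time."""
    # newline="" lets the csv module control line endings itself,
    # as the csv documentation recommends.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerows(rows)
```

If the file is meant to be opened in Excel, encoding="utf-8-sig" may help, since it writes a byte order mark that Excel uses to detect UTF-8.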