Splitting a text CSV file into another CSV with the variables the text represents, using Python
Question
I have a CSV file like the one below. Each line is just text, and I would like to split each line into the three variables it actually represents. The file holds comments made by customers on a specific date, together with their identification number: each line shows what a customer said on a given date about their bank. So I would like to transform this CSV file into another CSV file which, instead of lines of text, has three variables (`customer_id`, `date`, `comments`). The first row holds these three prospective column names. The file looks like this:
"customer_id date comments",,,
"216604 2022-08-22 Overall", this bank is satisfactory.,,
"259276 2022-11-23 Easy to find zhe bank ' s branches ",,,
"58770 2022-03-13 ",,,
"318031 2022-08-08 ",,,
"380865 2022-11-20 considering a different bank..",,,
I'm an absolute beginner in Python; I started just a month ago. This is probably a simple task, but I can't find a way to convert the file above into a three-column file like this:
customer_id | date | comments |
---|---|---|
216604 | 2022-08-22 | Overall, this bank is satisfactory.,,, |
259276 | 2022-11-23 | Easy to find zhe bank ' s branches ,,, |
58770 | 2022-03-13 | ,,, |
318031 | 2022-08-08 | ,,, |
380865 | 2022-11-20 | considering a different bank.. |
In other words, I have to separate the original text into three fields: an `ID`, a `date`, and the text of the comment.
Any suggestion is very welcome.
Thank you.
Answer 1
Score: 2
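You need to split the text so that the parts you want end up in distinct variables, which you can then work on as you wish. Give this a try, and modify as required:
line = "380865 2022-11-20 Seriously considerin switching to a rival bank.."
# Split at every space: the first two pieces are the ID and the date.
sp = line.split(" ")
# Re-join the remaining pieces to rebuild the comment text.
# (`customer_id` is used instead of `id` to avoid shadowing the built-in.)
customer_id, date, text = sp[0], sp[1], " ".join(sp[2:])
print(customer_id)
print(date)
print(text)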
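The same idea can be wrapped in a small helper so that it also copes with lines that have no comment at all (such as the `58770` row in the question). This is only a sketch: `split_record` is a hypothetical name, and it assumes the ID and the date never contain whitespace.

```python
def split_record(line):
    """Split 'ID DATE free text' into a (customer_id, date, comments) tuple."""
    # maxsplit=2 stops after the second run of whitespace, so the comment
    # keeps its internal spaces.
    parts = line.strip().split(maxsplit=2)
    # Pad with empty strings when the comment (or more) is missing.
    parts += [""] * (3 - len(parts))
    return tuple(parts)


print(split_record("380865 2022-11-20 considering a different bank.."))
print(split_record("58770 2022-03-13 "))
```

Padding the list keeps the unpacking `customer_id, date, text = split_record(line)` from raising a `ValueError` on rows with an empty comment.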
Answer 2
Score: 1
Since your text does not have a simple structure (spaces both separate the fields and appear inside one of them), I am sharing this code in case it helps. I have included comments in the code explaining each step; if they are not enough, don't hesitate to ask!
First of all, you need to install the `pandas` and `regex` modules:
pip install pandas
pip install regex
import regex as re
import pandas as pd


def split_line(line):
    # The date is the only element with a fixed structure (YYYY-MM-DD)
    # in every entry, so we use it as the separator.
    date_pattern = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"
    # Splitting on the first date leaves `customer_id` on the left
    # and `comments` on the right.
    customer_id, comments = re.split(date_pattern, line, maxsplit=1)
    # The date itself is recovered with a regex search.
    date = re.search(date_pattern, line).group(0)
    return {
        "customer_id": customer_id.strip(),
        "date": date.strip(),
        "comments": comments.strip(),
    }


if __name__ == "__main__":
    # If you have the text in a Python multi-line string
    text = """customer_id date comments
216604 2022-08-22 Overall, this bank is satisfactory,
259276 2022-11-23 Easy to find zhe bank ' s branches
380865 2022-11-20 Seriously considerin switching to a rival bank
"""
    all_lines = text.split("\n")[1:]
    # If you have the text as a .txt file
    # with open("path/to/txt/file", "r") as f:
    #     all_lines = f.readlines()[1:]
    # Note that we slice the lines with [1:] to skip the header
    all_parsed_lines = []
    for line in all_lines:
        # .strip() removes surrounding whitespace, so an empty line
        # has length 0 afterwards and is skipped.
        if len(line.strip()) > 0:
            extracted_fields = split_line(line)
            all_parsed_lines.append(extracted_fields)
    # We convert the list of dictionaries into an ordered and readable
    # dataframe using the pandas module.
    df = pd.DataFrame(all_parsed_lines)
    print(df)
This prints:
customer_id date comments
0 216604 2022-08-22 Overall, this bank is satisfactory,
1 259276 2022-11-23 Easy to find zhe bank ' s branches
2 380865 2022-11-20 Seriously considerin switching to a rival bank
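If installing the third-party `regex` package is not an option, the standard-library `re` module should be enough for this pattern. The sketch below is an alternative to the split-then-search approach, not this answer's own code: it matches all three fields at once with named groups. `LINE_PATTERN` and `parse_line` are hypothetical names, and lines that do not match (the header, for instance) are reported as `None`.

```python
import re

# ID (digits), a YYYY-MM-DD date, then whatever is left as the comment.
LINE_PATTERN = re.compile(
    r"^\s*(?P<customer_id>\d+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s*(?P<comments>.*?)\s*$"
)


def parse_line(line):
    """Return a dict with the three fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


print(parse_line("259276 2022-11-23 Easy to find zhe bank ' s branches "))
print(parse_line("customer_id date comments"))
```

A list of such dictionaries can be passed to `pd.DataFrame` in the same way as the output of `split_line`.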
Answer 3
Score: 1
Your question is quite vague. Next time, please state what you have already tried and what your specific goal is: for example, whether you want the output as nested lists or as a dictionary. Here is some code that would work for your specific problem. Normally, though, such a file should have separators to distinguish which value belongs to which column.
The code first reads the lines of your file and splits each one at the spaces. This creates a list whose first two values are the ID and the date. The rest of the list is then joined back together to form the comment.
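data = []
filename = "yourfile.txt"
with open(filename) as f:
    # The first line holds the column names.
    header = f.readline().rstrip("\n").split(" ")
    data.append(header)
    for line in f:
        # Skip blank lines, which have no ID or date to extract.
        if not line.strip():
            continue
        # Drop the trailing newline, then split at the spaces.
        parts = line.rstrip("\n").split(" ")
        v1 = parts[0]
        v2 = parts[1]
        # Everything after the date belongs to the comment.
        v3 = " ".join(parts[2:])
        data.append([v1, v2, v3])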
The second block saves the data to a file with tabs as the separator. The separator can be changed to a semicolon as well.
filename = "output.csv"
with open(filename, "w") as f:
    for line in data:
        # Join the values with tabs so that no trailing tab is written.
        f.write("\t".join(line))
        f.write("\n")
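Because the comments themselves can contain commas, writing the output through the standard `csv` module may be safer than concatenating strings by hand, since it quotes such fields automatically. A minimal sketch, assuming rows shaped like the `data` list above; `rows_to_csv` is a hypothetical helper, and it writes to an in-memory buffer only so that the result is easy to inspect.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of [customer_id, date, comments] rows as CSV text."""
    buffer = io.StringIO()
    # lineterminator="\n" replaces the writer's default "\r\n" line endings.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


data = [
    ["customer_id", "date", "comments"],
    ["216604", "2022-08-22", "Overall, this bank is satisfactory."],
    ["58770", "2022-03-13", ""],
]
print(rows_to_csv(data))
```

To write to disk instead, open the target file with `newline=""` and pass the file object to `csv.writer` in place of the buffer.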