Splitting a text CSV file into another CSV with the variables the text represents, using Python







I have a CSV file like the one shown below. Each line is a single text string, and I would like to split each line into the three variables it actually represents. The file records comments made by customers on a specific date, together with their identification number: each line shows what a customer said about their bank on a given date. I would like to transform this CSV file into another CSV file that has three columns (customer_id, date, comments) instead of lines of text. The first row contains the names of these three columns. The file looks like this:

"customer_id date comments",,,
"216604 2022-08-22 Overall", this bank is satisfactory.,,
"259276 2022-11-23 Easy to find zhe bank ' s branches ",,,
"58770 2022-03-13 ",,,
"318031 2022-08-08 ",,,
"380865 2022-11-20 considering a different bank..",,,

I'm an absolute beginner in Python; I started just a month ago. This is probably a simple task, but I can't find a way to convert the file above into a three-column file like this:

|customer_id |  date     | comments                                  |
|:---------- |:--------:| -----------------------------------------:|
| 216604     |2022-08-22| Overall, this bank is satisfactory.,,,    |
| 259276     |2022-11-23| Easy to find zhe bank ' s branches ,,,    |
| 58770      |2022-03-13| ,,,                                       |
| 318031     |2022-08-08| ,,,                                       |
| 380865     |2022-11-20| considering a different bank..            |

In other words, I need to separate the original text into three fields: an ID, a date, and a text field holding the body of the comment.

Any suggestions are very welcome.

Thank you.


Score: 2




You need to split the text to separate the parts you want into distinct variables, which you can then work with as you wish. Give this a try and modify it as required:

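line = "380865 2022-11-20 Seriously considering switching to a rival bank.."

# Split on spaces: the first two pieces are the ID and the date,
# and everything after them is joined back into the comment text.
sp = line.split(" ")
customer_id, date, text = sp[0], sp[1], " ".join(sp[2:])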
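
A possible variation on the same idea, offered as a sketch rather than a definitive version: `str.split` accepts a `maxsplit` argument, so the comment can be kept intact without re-joining it. The helper name `split_record` and the padding of rows that have no comment are assumptions made for illustration; they are not part of the answer above.

```python
def split_record(line):
    """Split 'id date free text' into exactly three fields."""
    # maxsplit=2 stops after two splits, so the comment keeps its spaces.
    parts = line.strip().split(maxsplit=2)
    # Pad rows that have no comment so that unpacking never fails.
    parts += [""] * (3 - len(parts))
    return parts


customer_id, date, text = split_record(
    "380865 2022-11-20 considering a different bank.."
)
# A row from the question that has an ID and a date but no comment.
empty_comment = split_record("58770 2022-03-13 ")
```

Calling `split()` without a separator also collapses repeated whitespace between the ID and the date, which `split(" ")` does not.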



Score: 1


Your text does not have a simple structure: spaces separate the fields, but they also appear inside one of the fields. I am sharing this code in case it helps. The comments in the code explain each step; if they are not enough, don't hesitate to ask!

First of all, you need to install the pandas and regex modules:

pip install pandas
pip install regex

Then, in the script:

import regex as re
import pandas as pd

def split_line(line):
    # Every entry contains a date in YYYY-MM-DD form, so we use it
    # as the anchor for splitting the line.
    date_pattern = r"[0-9]{4}-[0-9]{2}-[0-9]{2}"

    # Splitting on the date leaves `customer_id` before it and
    # `comments` after it. maxsplit=1 keeps a date-like string
    # inside a comment from producing a third piece.
    customer_id, comments = re.split(date_pattern, line, maxsplit=1)

    # We recover the date itself with a regex search.
    date = re.search(date_pattern, line).group(0)

    return {
        "customer_id": customer_id.strip(),
        "date": date.strip(),
        "comments": comments.strip(),
    }

if __name__ == "__main__":

    # If you have the text as a Python triple-quoted string
    text = """customer_id date comments
    216604 2022-08-22 Overall, this bank is satisfactory,
    259276 2022-11-23 Easy to find zhe bank ' s branches
    380865 2022-11-20 Seriously considering switching to a rival bank
    """
    all_lines = text.split("\n")[1:]

    # If you have the text as a .txt file
    # with open("path/to/txt/file", "r") as f:
    #     all_lines = f.readlines()[1:]

    # Note that we index the text lines from [1:] to remove the header
    all_parsed_lines = []
    for line in all_lines:
        # Strip whitespace and check the length to skip empty lines.
        if len(line.strip()) > 0:
            extracted_fields = split_line(line)
            all_parsed_lines.append(extracted_fields)

    # We convert the list of dictionaries into an ordered and readable
    # DataFrame using the pandas module.
    df = pd.DataFrame(all_parsed_lines)
    print(df)
Which returns as output:

  customer_id        date                                        comments
0      216604  2022-08-22              Overall, this bank is satisfactory,
1      259276  2022-11-23              Easy to find zhe bank ' s branches
2      380865  2022-11-20  Seriously considering switching to a rival bank
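
For comparison, a single pattern with named groups can capture all three fields in one pass. This is only a sketch using the standard-library `re` module; the names `LINE_PATTERN` and `parse_line` are hypothetical, and it assumes that every data line starts with a numeric ID followed by a `YYYY-MM-DD` date.

```python
import re

# One pattern for the whole line: ID, date, then whatever text remains.
LINE_PATTERN = re.compile(
    r"^\s*(?P<customer_id>\d+)\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s*"
    r"(?P<comments>.*?)\s*$"
)


def parse_line(line):
    """Return a dict of the three fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None


lines = [
    "customer_id date comments",  # header: does not start with digits
    "216604 2022-08-22 Overall, this bank is satisfactory.",
    "58770 2022-03-13 ",          # valid row with an empty comment
    "",                           # blank line
]
records = [rec for rec in map(parse_line, lines) if rec is not None]
```

Lines that do not match, such as the header or blank lines, return `None` and are dropped, so no `[1:]` slicing is needed. The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as above.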


Score: 1




Your question is quite vague. Next time, please state what you have already tried and what your specific goal is: for example, do you want the output as nested lists or as a dictionary? Here is some code that would work for your specific problem. Normally, though, such a file should have separators that distinguish which value belongs to which column.

The code first reads the lines of your file and splits each one at the spaces. This creates a list whose first two values are your ID and date. The rest of the list is then joined back together to form the comment.

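data = []
filename = "yourfile.txt"
with open(filename) as f:
    # The first line holds the column names.
    header = f.readline().rstrip("\n").split(" ")
    for line in f:
        # rstrip("\n") removes only the newline; slicing with [:-1] would
        # cut a real character off a last line with no trailing newline.
        line = line.rstrip("\n").split(" ")
        if len(line) < 2:
            continue  # skip blank or malformed lines
        v1 = line[0]             # customer ID
        v2 = line[1]             # date
        v3 = " ".join(line[2:])  # the rest is the comment
        data.append([v1, v2, v3])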
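
A side note on the reading step: the file shown in the question is itself CSV-formatted, with the text inside double quotes and empty cells at the end of each line. A hedged sketch of reading such a file with the standard `csv` module follows. The helper `parse_row` is a hypothetical name, the sample is held in a string so that the example runs on its own, and stripping trailing commas assumes that no real comment ends with a comma.

```python
import csv
import io


def parse_row(fields):
    """Rebuild one record from the cells that csv.reader produced."""
    # Re-join the cells: a comma inside a comment made csv split it.
    # Trailing commas are the empty cells at the end of the line.
    text = ",".join(fields).rstrip(",").strip()
    parts = text.split(maxsplit=2)
    # Pad rows that have no comment to exactly three fields.
    parts += [""] * (3 - len(parts))
    return parts


# Sample lines copied from the question; a real script would open the file.
raw = '''"customer_id date comments",,,
"216604 2022-08-22 Overall", this bank is satisfactory.,,
"58770 2022-03-13 ",,,
"380865 2022-11-20 considering a different bank..",,,
'''

reader = csv.reader(io.StringIO(raw))
header = next(reader)[0].split()
rows = [parse_row(fields) for fields in reader if fields]
```

With a real file, `io.StringIO(raw)` would be replaced by a file object opened with `newline=""`, as the `csv` documentation recommends.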

The block below saves the data to a file with tabs as the separator. This can be changed to a semicolon as well.

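filename = "output.csv"
with open(filename, "w") as f:
    for line in data:
        # Assumed completion, based on the description above:
        # write each row as tab-separated values, one row per line.
        f.write("\t".join(line) + "\n")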
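
If the output should be a comma-separated file instead, comments that contain commas need quoting. The standard `csv` module handles that automatically. The sketch below assumes a `data` list shaped like the one built earlier and uses a hypothetical helper `write_rows`; it writes to an in-memory buffer so that it runs without touching the disk.

```python
import csv
import io

# Example rows in the same [id, date, comment] shape as `data`.
data = [
    ["216604", "2022-08-22", "Overall, this bank is satisfactory."],
    ["58770", "2022-03-13", ""],
]


def write_rows(handle, header, rows):
    """Write a header and rows as CSV, quoting fields only where needed."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


buffer = io.StringIO()
write_rows(buffer, ["customer_id", "date", "comments"], data)
output = buffer.getvalue()
```

For a real file, the `csv` documentation recommends opening it with `newline=""`, for example `open("output.csv", "w", newline="")`, and passing that handle to `write_rows`.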

  • This article was published on May 11, 2023 at 18:30:40
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76226652.html



