分隔列表,数据混乱,pandas 和 pyparsing 无法解析

huangapple go评论66阅读模式
英文:

Delimited list, data is messy, pandas and pyparsing can't parse

问题

编辑 3:我在记事本中打开了文件(纯文本格式),它显示数据行用双引号括起来,导致 Pandas 将其解释为单个字符串。有趣的是,在 Excel 中没有显示这一点,当我将数据从 Excel 复制并粘贴到这篇帖子中时,双引号也没有复制过来。感谢你们所有人的想法,对浪费时间感到抱歉。

编辑 2:这里是实际的 [测试文件链接。][1]

编辑 1:我正在使用 Python 版本 3.11.4 和 Pandas 版本 2.0.2。我还应该指出,该文件被保存为 Microsoft Excel 逗号分隔值文件。

免责声明:我不是程序员,这是我第一次发帖。

我尝试使用 pandas 将 .csv 文件读取到一个数据框中进行分析。pandas 中的 .read_csv 属性可以解析标题,但似乎无法解析数据。相反,它给我提供了适当数量的带有适当标题的列,但将所有数据连接成一个长字符串放在第一列中,并将 NaN 分配给所有其他索引。我尝试使用 pyparsing 解析第一行中的字符串。但它返回了一个错误,让我相信问题出现在数据的第一列中的连字符上,尽管后来的行中该字段是空白的,所以我不能确定。值得注意的是,csv 模块似乎可以读取该文件,但我不清楚如何将 _csv.reader 对象转换为可用于分析的工作数据框。我在下面包含了在这里讨论的代码和输出。它都很基础,但我已经查阅了文档,似乎找不到正确的解决方案。

我一直在使用一个测试文件,其中包含前 5 行数据,但实际的数据集有超过 1500 万行,因此任何解决方案都需要“大数据集友好”。

数据片段

gateway_din;charger_vin;bucket;bucket_dt;post;operative_status;alert_codes;alert_codes_cabinet
STST-SM-12964;GF2191430006CD;1682503200000;"2023-04-26 10:00:00";2;Operative;;
STST-SM-13783;GF2193050000CM;1682199000000;"2023-04-22 21:30:00";1;Operative;;
STST-SM-2978;GF2200050000W5;1681243200000;"2023-04-11 20:00:00";2;Operative;;
STST-SM-3227;GF2200190001EC;1680750900000;"2023-04-06 03:15:00";1;Operative;;
STST-SM-3184;GF22002500014D;1682155800000;"2023-04-22 09:30:00";0;Operative;;

尝试使用 Pandas 读取 CSV

import pandas as pd
import pyparsing as pp
import csv

Test = pd.read_csv("C:/.../Test.csv", delimiter=';', header=0)

终端输出

Test
                                         gateway_din  ...  alert_codes_cabinet
0  STST-SM-12964;GF2191430006CD;1682503200000;"20...  ...                  NaN
1  STST-SM-13783;GF2193050000CM;1682199000000;"20...  ...                  NaN
2  STST-SM-2978;GF2200050000W5;1681243200000;"202...  ...                  NaN
3  STST-SM-3227;GF2200190001EC;1680750900000;"202...  ...                  NaN
4  STST-SM-3184;GF22002500014D;1682155800000;"202...  ...                  NaN

[5 rows x 8 columns]
Test.iat[0,0]
'STST-SM-12964;GF2191430006CD;1682503200000;"2023-04-26 10:00:00";2;Operative;;'

Test.iat[0,1]
nan

尝试在终端中使用 PyParse 解析第一列中的字符串

ParseTest = pp.DelimitedList(';').parse_string(Test.iat[0,0])
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    ParseTest = pp.DelimitedList(';').parse_string(Test.iat[0,0])
  File "C:\Users\dschell\AppData\Roaming\Python\Python311\site-packages\pyparsing\core.py", line 1190, in parse_string
    raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: , found 'STST'  (at char 0), (line:1, col:1)

尝试使用 CSV 模块读取 .CSV 文件

with open('C:/Users/dschell/Desktop/Reliability and Usability/Survey Data/Test.csv', newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=';', quotechar='"')
    for row in spamreader:
        print(', '.join(row))

终端输出

gateway_din, charger_vin, bucket, bucket_dt, post, operative_status, alert_codes, alert_codes_cabinet
STST-SM-12964;GF2191430006CD;1682503200000;&quot;2023-04-26 10:00:00&quot;;2;Operative;;
STST-SM-13783;GF2193050000CM;1682199000000;&quot;2023-04-22 21:30:00&quot;;1;Operative;;
STST-SM-2978;GF2200050000W5;1681243200000;&quot;2023-04-11 20:00:00&quot;;2;Operative;;
STST-SM-3227;GF2200190001EC;1680750900000;&quot;2023-04-06

<details>
<summary>英文:</summary>

Edit 3: I opened the file in notepad (plain text format) it showed that the data lines were enclosed with double quotes, causing Pandas to interpret it as a single string. What&#39;s interesting is that that didn&#39;t show in Excel and the double quotes didn&#39;t port over when I copied and pasted the data from Excel into this post. Thanks for all of your thoughts and sorry for the waste of time. 

Edit 2: Here is a link to the actual [test file.][1]

Edit 1: I am using Python version 3.11.4 and Pandas version 2.0.2. I should also that the file is saved as a Microsoft Excel Comma Separated Values File.

Disclaimer: I am not a programmer and this is my first time posting. 

I am attempting to use pandas to read a .csv to a dataframe for analysis. The .read_csv attribute in pandas can parse the header but cannot seem to parse the data. Instead it gives me the appropriate number of columns with the appropriate headers but concatenates all the data into one long str in the first column and assigns NaN to all other indexes. I tried using pyparsing to parse the string in the first row. But it returned an error that makes me believe the problem lies with the hyphens in the first column of data although there are later rows where that field is blank, so I can&#39;t be certain. It&#39;s worth noting that the csv module does seem to be able to read the file but I am unclear on how to convert the _csv.reader object to a workable dataframe for analysis. I&#39;ve included the code and outputs discussed here below. It&#39;s all pretty basic, but I&#39;ve poured over the documentation and can&#39;t seem to find the right solution. 

I&#39;ve been working with a test file that is the first 5 lines of data but the actual dataset is &gt;15 million lines long, so any solution will need to be &quot;large dataset friendly&quot;

 
**DATA SNIPPET**

gateway_din;charger_vin;bucket;bucket_dt;post;operative_status;alert_codes;alert_codes_cabinet
STST-SM-12964;GF2191430006CD;1682503200000;"2023-04-26 10:00:00";2;Operative;;
STST-SM-13783;GF2193050000CM;1682199000000;"2023-04-22 21:30:00";1;Operative;;
STST-SM-2978;GF2200050000W5;1681243200000;"2023-04-11 20:00:00";2;Operative;;
STST-SM-3227;GF2200190001EC;1680750900000;"2023-04-06 03:15:00";1;Operative;;
STST-SM-3184;GF22002500014D;1682155800000;"2023-04-22 09:30:00";0;Operative;;


 

**Attempt to Read CSV using Pandas**

import pandas as pd
import pyparsing as pp
import csv

Test = pd.read_csv("C:/.../Test.csv", delimiter=';', header=0)



**Output in terminal**

Test
gateway_din ... alert_codes_cabinet
0 STST-SM-12964;GF2191430006CD;1682503200000;"20... ... NaN
1 STST-SM-13783;GF2193050000CM;1682199000000;"20... ... NaN
2 STST-SM-2978;GF2200050000W5;1681243200000;"202... ... NaN
3 STST-SM-3227;GF2200190001EC;1680750900000;"202... ... NaN
4 STST-SM-3184;GF22002500014D;1682155800000;"202... ... NaN

[5 rows x 8 columns]


Test.iat[0,0]
'STST-SM-12964;GF2191430006CD;1682503200000;"2023-04-26 10:00:00";2;Operative;;'

Test.iat[0,1]
nan


**Attempt to parse string in Column 1 using PyParse in the terminal**

ParseTest = pp.DelimitedList(';').parse_string(Test.iat[0,0])
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
ParseTest = pp.DelimitedList(';').parse_string(Test.iat[0,0])
File "C:\Users\dschell\AppData\Roaming\Python\Python311\site-packages\pyparsing\core.py", line 1190, in parse_string
raise exc.with_traceback(None)
pyparsing.exceptions.ParseException: , found 'STST' (at char 0), (line:1, col:1)



**Attempt to read .CSV using CSV module**

with open('C:/Users/dschell/Desktop/Reliability and Usability/Survey Data/Test.csv', newline='') as csvfile:
spamreader = csv.reader(csvfile, delimiter=';', quotechar='"')
for row in spamreader:
print(', '.join(row))



**Output in terminal**

gateway_din, charger_vin, bucket, bucket_dt, post, operative_status, alert_codes, alert_codes_cabinet
STST-SM-12964;GF2191430006CD;1682503200000;"2023-04-26 10:00:00";2;Operative;;
STST-SM-13783;GF2193050000CM;1682199000000;"2023-04-22 21:30:00";1;Operative;;
STST-SM-2978;GF2200050000W5;1681243200000;"2023-04-11 20:00:00";2;Operative;;
STST-SM-3227;GF2200190001EC;1680750900000;"2023-04-06 03:15:00";1;Operative;;
STST-SM-3184;GF22002500014D;1682155800000;"2023-04-22 09:30:00";0;Operative;;



  [1]: https://www.dropbox.com/s/yoh3v5f6044ztl1/Test.csv?dl=0

</details>


# 答案1
**得分**: 1

看起来你的数据中有一些额外的隐藏引号。在记事本中打开你发送的文件,第2行如下:

"STST-SM-12964;GF2191430006CD;1682503200000;""2023-04-26 10:00:00"";2;Operative;;"


你可以看到在日期时间部分之前和之后有额外的引号。去掉额外的引号后,如下所示:

STST-SM-12964;GF2191430006CD;1682503200000;"2023-04-26 10:00:00";2;Operative;;


看起来它似乎可以正常工作。

<details>
<summary>英文:</summary>

It seems like your data has some extra hidden quotes. Opening the file you sent in notepad, line 2 has this

    &quot;STST-SM-12964;GF2191430006CD;1682503200000;&quot;&quot;2023-04-26 10:00:00&quot;&quot;;2;Operative;;&quot;

As you can see there&#39;s extra quotes before and after the date-time section. With the extra quotes removed like this

    STST-SM-12964;GF2191430006CD;1682503200000;&quot;2023-04-26 10:00:00&quot;;2;Operative;;

 it seems to work correctly.

</details>



huangapple
  • 本文由 发表于 2023年6月29日 03:52:12
  • 转载请务必保留本文链接:https://go.coder-hub.com/76576315.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定