pandas的read_csv在每行的第一个和最后一个项目上显示引号。

huangapple go评论69阅读模式
英文:

pandas read_csv displays quotation marks on the first and last item of every row

问题

我的 .csv 文件看起来像这样:

"col1","col2"
"1","text1"
"2","This a "TEXT". However, I cannot parse it."

即它包含在一个值内的逗号和引号。在 read_csv() 函数中使用 sep 参数会在每一行的开头和结尾产生引号:

import pandas as pd

df = pd.read_csv('test.csv', sep = '","', engine = 'python')
df

结果会是这样:

      "col1      col2"
0      "1          text1"
1      "2          This a "TEXT". However, I cannot parse it.""

我该如何才能正确读取我的文件?

英文:

My .csv file looks like this:<br>

&quot;col1&quot;,&quot;col2&quot;
&quot;1&quot;,&quot;text1&quot;
&quot;2&quot;,&quot;This a &quot;TEXT&quot;. However, I cannot parse it.&quot;

I.e. it contains commas and quotation marks within a value.
Using sep parameter in read_csv() function gives quotation marks in the beginning and ending on each line:

import pandas as pd

df = pd.read_csv(&#39;test.csv&#39;, sep = &#39;&quot;,&quot;&#39;, engine = &#39;python&#39;)
df
 

    &quot;col1 	col2&quot;
0 	&quot;1 	    text1&quot;
1 	&quot;2 	    This a &quot;TEXT&quot;. However, I cannot parse it.&quot;

What can I do to read my file correctly?

答案1

得分: 0

问题在于您的CSV中的逗号和引号都没有进行转义。使用&quot;,&quot;作为分隔符是一个巧妙的解决方法,但它会在开头和结尾保留引号。

df.columns = ['col1', 'col2']
df['col1'] = df['col1'].str[1:].astype(int)
df['col2'] = df['col2'].str[:-1]

   col1                                        col2
0     1                                       text1
1     2  This a &quot;TEXT&quot;. However, I cannot parse it.

另一种方法是,如果不是寻找&quot;,&quot;,而是在引号上使用前瞻和后顾:

df = pd.read_csv('test.csv', sep = r'(?<="),(?=")', engine = 'python')

(df.applymap(lambda x: x.strip('"')) # 删除所有值开头和结尾的引号
    .rename(columns = lambda x: x.strip('"')) # 对列名进行相同处理
    .assign(col1 = lambda x: x.col1.astype(int)) # 将col1更改为整数列)
英文:

The issue is that none of the commas or quotation marks within your CSV are escaped. Using &quot;,&quot; as the delimeter is a smart way around it, but it leaves the quotation marks on the start and end.

df.columns = [&#39;col1&#39;, &#39;col2&#39;]
df[&#39;col1&#39;] = df[&#39;col1&#39;].str[1:].astype(int)
df[&#39;col2&#39;] = df[&#39;col2&#39;].str[:-1]

   col1                                        col2
0     1                                       text1
1     2  This a &quot;TEXT&quot;. However, I cannot parse it.

Here's another way, if instead of looking for &quot;,&quot;, you instead had a lookahead and a lookbehind for quotation marks:

df = pd.read_csv(&#39;test.csv&#39;, sep = r&#39;(?&lt;=\&quot;),(?=\&quot;)&#39;, engine = &#39;python&#39;)

(df.applymap(lambda x: x.strip(&#39;&quot;&#39;)) # remove quotation marks from the start and end of all values
    .rename(columns = lambda x: x.strip(&#39;&quot;&#39;)) # same with column names
    .assign(col1 = lambda x: x.col1.astype(int)) # change col1 to be a column of ints)

答案2

得分: 0

根据您的有趣想法,您还可以将第一个和最后一个引号作为分隔符,然后删除不需要的列:

data = io.StringIO('''"col1","col2"
"1","text1"
"2","This a &quot;TEXT&quot;. However, I cannot parse it."
''')

df = pd.read_csv(data, sep=r'"|,\s*|^"|"$', engine='python').iloc[:, 1:-1]

输出:

   col1                                        col2
0     1                                       text1
1     2  This a &quot;TEXT&quot;. However, I cannot parse it.

优点是您可以直接获取正确的数据类型(如果需要):

df.dtypes

col1     int64
col2    object
dtype: object

正则表达式演示

英文:

Building on your interesting idea, you can also add the first and last quotes as separators, then drop the unwanted columns:

data = io.StringIO(&#39;&#39;&#39;&quot;col1&quot;,&quot;col2&quot;
&quot;1&quot;,&quot;text1&quot;
&quot;2&quot;,&quot;This a &quot;TEXT&quot;. However, I cannot parse it.&quot;
&#39;&#39;&#39;)

df = pd.read_csv(data, sep=r&#39;&quot;,&quot;|^&quot;|&quot;$&#39;, engine=&#39;python&#39;).iloc[:, 1:-1]

Output:

   col1                                        col2
0     1                                       text1
1     2  This a &quot;TEXT&quot;. However, I cannot parse it.

The advantage is that you directly get the correct types (if needed):

df.dtypes

col1     int64
col2    object
dtype: object

regex demo

huangapple
  • 本文由 发表于 2023年7月10日 18:03:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/76652688.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定