July 10, 2023, 18:03:32

pandas read_csv displays quotation marks on the first and last item of every row

Question

My .csv file looks like this:

"col1","col2"
"1","text1"
"2","This a "TEXT". However, I cannot parse it."

That is, a value can contain commas and quotation marks. Passing a custom sep to read_csv() leaves a quotation mark at the start and end of every row:

import pandas as pd
df = pd.read_csv('test.csv', sep='","', engine='python')
df

The result looks like this:

    "col1   col2"
0     "1    text1"
1     "2    This a "TEXT". However, I cannot parse it."

How can I read my file correctly?

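Answer 1

Score: 0

The issue is that none of the commas or quotation marks inside your CSV values are escaped. Using "," as the delimiter is a smart workaround, but it leaves a quotation mark at the start and end of each row. You can strip those afterwards:

df.columns = ['col1', 'col2']
df['col1'] = df['col1'].str[1:].astype(int)  # drop the leading quote, then convert to int
df['col2'] = df['col2'].str[:-1]             # drop the trailing quote

   col1                                        col2
0     1                                       text1
1     2  This a "TEXT". However, I cannot parse it.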
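
To see where the stray quotation marks come from, it can help to split a single row by hand. The sketch below uses Python's re.split as a stand-in, on the assumption that the python engine splits each line on a multi-character sep in much the same way; the variable names are illustrative only.

```python
import re

# One row from the question's file, with unescaped quotes and a comma inside the value.
row = '"2","This a "TEXT". However, I cannot parse it."'

# Splitting on the literal three-character sequence "," consumes the quotes
# between fields but leaves the row's opening and closing quote attached.
fields = re.split('","', row)
print(fields)

# Removing one character from each edge mirrors the .str[1:] and .str[:-1] fix.
col1 = int(fields[0][1:])
col2 = fields[1][:-1]
print(col1, col2)
```

Only the first field keeps a leading quote and only the last field keeps a trailing one, which is why the two columns need different slices.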

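Another way is to split on a comma that has a quotation mark on both sides, using a lookbehind and a lookahead instead of matching "," itself:

df = pd.read_csv('test.csv', sep=r'(?<="),(?=")', engine='python')
# on pandas >= 2.1, DataFrame.map replaces the deprecated applymap
df = (
    df.applymap(lambda x: x.strip('"'))           # remove quotation marks from the start and end of all values
      .rename(columns=lambda x: x.strip('"'))     # same for the column names
      .assign(col1=lambda x: x.col1.astype(int))  # convert col1 to a column of ints
)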
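
The effect of the lookaround separator can be checked on one row without pandas. This is a minimal sketch with the standard re module, assuming the row text from the question; LOOKAROUND_SEP and the other names are made up for illustration.

```python
import re

# A comma that is preceded by a quote and followed by a quote.
# Lookarounds are zero-width, so the quotes are not consumed by the split.
LOOKAROUND_SEP = re.compile(r'(?<="),(?=")')

row = '"2","This a "TEXT". However, I cannot parse it."'

# Every field keeps both of its surrounding quotes.
raw_fields = LOOKAROUND_SEP.split(row)
print(raw_fields)

# Because each field is wrapped the same way, one strip('"') cleans them all.
clean_fields = [field.strip('"') for field in raw_fields]
print(clean_fields)
```

One caveat: str.strip('"') removes every leading and trailing quotation mark, so a value that legitimately begins or ends with a quote would lose it.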

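Answer 2

Score: 0

Building on your interesting idea, you can also treat the first and last quotation marks of each row as separators, then drop the unwanted (empty) columns:

import io
import pandas as pd

data = io.StringIO('''"col1","col2"
"1","text1"
"2","This a "TEXT". However, I cannot parse it."
''')
# split on "," between fields, and on the quote at the start (^") and end ("$) of each row
df = pd.read_csv(data, sep=r'","|^"|"$', engine='python').iloc[:, 1:-1]

Output:

   col1                                        col2
0     1                                       text1
1     2  This a "TEXT". However, I cannot parse it.

The advantage is that you get the correct dtypes directly (if needed):

df.dtypes
col1     int64
col2    object
dtype: object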
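
The reason for .iloc[:, 1:-1] becomes visible when the pattern ","|^"|"$ is applied to one row by hand. The following is a rough sketch using re.split, assuming it approximates how the python engine splits a line; EDGE_SEP is a hypothetical name.

```python
import re

# Three alternatives: the "," between fields, a quote at the very start,
# and a quote at the very end of the row.
EDGE_SEP = re.compile(r'","|^"|"$')

header = '"col1","col2"'
row = '"2","This a "TEXT". However, I cannot parse it."'

# Matching at both edges yields an empty string before the first field
# and after the last one.
header_fields = EDGE_SEP.split(header)
row_fields = EDGE_SEP.split(row)
print(header_fields)
print(row_fields)

# Dropping the first and last entries keeps only the real fields.
print(row_fields[1:-1])
```

The empty strings at both ends presumably become the extra first and last columns, and since the remaining fields carry no quotes, col1 can be inferred as an integer column.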
