pandas的read_csv在每行的第一个和最后一个项目上显示引号。

huangapple go评论98阅读模式
英文:

pandas read_csv displays quotation marks on the first and last item of every row

问题

我的 .csv 文件看起来像这样:

  1. "col1","col2"
  2. "1","text1"
  3. "2","This a "TEXT". However, I cannot parse it."

即它包含在一个值内的逗号和引号。在 read_csv() 函数中使用 sep 参数会在每一行的开头和结尾产生引号:

  1. import pandas as pd
  2. df = pd.read_csv('test.csv', sep = '","', engine = 'python')
  3. df

结果会是这样:

  1. "col1 col2"
  2. 0 "1 text1"
  3. 1 "2 This a "TEXT". However, I cannot parse it.""

我该如何才能正确读取我的文件?

英文:

My .csv file looks like this:<br>

  1. &quot;col1&quot;,&quot;col2&quot;
  2. &quot;1&quot;,&quot;text1&quot;
  3. &quot;2&quot;,&quot;This a &quot;TEXT&quot;. However, I cannot parse it.&quot;

I.e. it contains commas and quotation marks within a value.
Using sep parameter in read_csv() function gives quotation marks in the beginning and ending on each line:

  1. import pandas as pd
  2. df = pd.read_csv(&#39;test.csv&#39;, sep = &#39;&quot;,&quot;&#39;, engine = &#39;python&#39;)
  3. df
  4. &quot;col1 col2&quot;
  5. 0 &quot;1 text1&quot;
  6. 1 &quot;2 This a &quot;TEXT&quot;. However, I cannot parse it.&quot;

What can I do to read my file correctly?

答案1

得分: 0

问题在于您的CSV中的逗号和引号都没有进行转义。使用&quot;,&quot;作为分隔符是一个巧妙的解决方法,但它会在开头和结尾保留引号。

  1. df.columns = ['col1', 'col2']
  2. df['col1'] = df['col1'].str[1:].astype(int)
  3. df['col2'] = df['col2'].str[:-1]
  4. col1 col2
  5. 0 1 text1
  6. 1 2 This a &quot;TEXT&quot;. However, I cannot parse it.

另一种方法是,如果不是寻找&quot;,&quot;,而是在引号上使用前瞻和后顾:

  1. df = pd.read_csv('test.csv', sep = r'(?<="),(?=")', engine = 'python')
  2. (df.applymap(lambda x: x.strip('"')) # 删除所有值开头和结尾的引号
  3. .rename(columns = lambda x: x.strip('"')) # 对列名进行相同处理
  4. .assign(col1 = lambda x: x.col1.astype(int)) # 将col1更改为整数列)
英文:

The issue is that none of the commas or quotation marks within your CSV are escaped. Using &quot;,&quot; as the delimeter is a smart way around it, but it leaves the quotation marks on the start and end.

  1. df.columns = [&#39;col1&#39;, &#39;col2&#39;]
  2. df[&#39;col1&#39;] = df[&#39;col1&#39;].str[1:].astype(int)
  3. df[&#39;col2&#39;] = df[&#39;col2&#39;].str[:-1]
  4. col1 col2
  5. 0 1 text1
  6. 1 2 This a &quot;TEXT&quot;. However, I cannot parse it.

Here's another way, if instead of looking for &quot;,&quot;, you instead had a lookahead and a lookbehind for quotation marks:

  1. df = pd.read_csv(&#39;test.csv&#39;, sep = r&#39;(?&lt;=\&quot;),(?=\&quot;)&#39;, engine = &#39;python&#39;)
  2. (df.applymap(lambda x: x.strip(&#39;&quot;&#39;)) # remove quotation marks from the start and end of all values
  3. .rename(columns = lambda x: x.strip(&#39;&quot;&#39;)) # same with column names
  4. .assign(col1 = lambda x: x.col1.astype(int)) # change col1 to be a column of ints)

答案2

得分: 0

根据您的有趣想法,您还可以将第一个和最后一个引号作为分隔符,然后删除不需要的列:

  1. data = io.StringIO('''"col1","col2"
  2. "1","text1"
  3. "2","This a &quot;TEXT&quot;. However, I cannot parse it."
  4. ''')
  5. df = pd.read_csv(data, sep=r'"|,\s*|^"|"$', engine='python').iloc[:, 1:-1]

输出:

  1. col1 col2
  2. 0 1 text1
  3. 1 2 This a &quot;TEXT&quot;. However, I cannot parse it.

优点是您可以直接获取正确的数据类型(如果需要):

  1. df.dtypes
  2. col1 int64
  3. col2 object
  4. dtype: object

正则表达式演示

英文:

Building on your interesting idea, you can also add the first and last quotes as separators, then drop the unwanted columns:

  1. data = io.StringIO(&#39;&#39;&#39;&quot;col1&quot;,&quot;col2&quot;
  2. &quot;1&quot;,&quot;text1&quot;
  3. &quot;2&quot;,&quot;This a &quot;TEXT&quot;. However, I cannot parse it.&quot;
  4. &#39;&#39;&#39;)
  5. df = pd.read_csv(data, sep=r&#39;&quot;,&quot;|^&quot;|&quot;$&#39;, engine=&#39;python&#39;).iloc[:, 1:-1]

Output:

  1. col1 col2
  2. 0 1 text1
  3. 1 2 This a &quot;TEXT&quot;. However, I cannot parse it.

The advantage is that you directly get the correct types (if needed):

  1. df.dtypes
  2. col1 int64
  3. col2 object
  4. dtype: object

regex demo

huangapple
  • 本文由 发表于 2023年7月10日 18:03:32
  • 转载请务必保留本文链接:https://go.coder-hub.com/76652688.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定