pandas read_csv displays quotation marks on the first and last item of every row
Question
My .csv file looks like this:
"col1","col2"
"1","text1"
"2","This a "TEXT". However, I cannot parse it."
That is, it contains commas and quotation marks inside a value. Using the sep parameter of read_csv() leaves quotation marks at the beginning and end of each row:
import pandas as pd
df = pd.read_csv('test.csv', sep = '","', engine = 'python')
df
The result looks like this:
"col1 col2"
0 "1 text1"
1 "2 This a "TEXT". However, I cannot parse it."
What can I do to read my file correctly?
Answer 1
Score: 0
The issue is that none of the commas or quotation marks within your CSV are escaped. Using "," as the delimiter is a smart way around it, but it leaves a quotation mark at the start and end of each row, which you can strip afterwards:
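# df is the frame read with sep = '","' in the question
df.columns = ['col1', 'col2']
df['col1'] = df['col1'].str[1:].astype(int)  # drop the leading quote, then convert to int
df['col2'] = df['col2'].str[:-1]  # drop the trailing quote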
col1 col2
0 1 text1
1 2 This a "TEXT". However, I cannot parse it.
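To see why slicing off one character on each side is enough, here is a minimal plain-Python sketch of the same cleanup on a single row. It only illustrates the idea with str.split and makes no claim about how pandas handles the row internally; the variable names are made up for the example.

```python
# Illustration only: plain-Python version of the slicing cleanup.
line = '"2","This a "TEXT". However, I cannot parse it."'

# Splitting on the three-character separator "," leaves one stray quote
# at the start of the first field and one at the end of the last field.
first, last = line.split('","')
print(first)
print(last)

# Same idea as .str[1:] and .str[:-1] in the answer above.
col1 = int(first[1:])   # drop the leading quote, then convert
col2 = last[:-1]        # drop the trailing quote
print(col1, col2)
```

The quotation marks around TEXT survive because they are never next to a comma, so they are not part of any separator match.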
Here's another way: instead of matching "," literally, split on a comma that has a lookbehind and a lookahead for a quotation mark:
df = pd.read_csv('test.csv', sep = r'(?<=\"),(?=\")', engine = 'python')
# Note: pandas 2.1+ deprecates applymap in favor of DataFrame.map
df = (
    df.applymap(lambda x: x.strip('"'))  # remove quotation marks from the start and end of all values
    .rename(columns = lambda x: x.strip('"'))  # same with column names
    .assign(col1 = lambda x: x.col1.astype(int))  # change col1 to be a column of ints
)
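The effect of the lookaround separator can be checked without pandas. The sketch below applies the same pattern with re.split to the sample rows and then strips the surrounding quotes; it is an illustration of the regular expression only, under the assumption that each row is split independently.

```python
import re

lines = [
    '"col1","col2"',
    '"1","text1"',
    '"2","This a "TEXT". However, I cannot parse it."',
]

# Split only on a comma that sits between a closing and an opening quote.
# The lookbehind and lookahead do not consume the quotes themselves.
pattern = re.compile(r'(?<="),(?=")')

# Each field still carries its own quotes, so strip them afterwards.
rows = [[field.strip('"') for field in pattern.split(line)] for line in lines]
for row in rows:
    print(row)
```

The comma after "However" is left alone because it is not surrounded by quotation marks. Note that strip('"') removes every leading and trailing quote, so a value that legitimately ends in a quotation mark would lose it.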
Answer 2
Score: 0
Building on your interesting idea, you can also add the first and last quotes as separators, then drop the unwanted columns:
import io
import pandas as pd

data = io.StringIO('''"col1","col2"
"1","text1"
"2","This a "TEXT". However, I cannot parse it."
''')
df = pd.read_csv(data, sep=r'","|^"|"$', engine='python').iloc[:, 1:-1]
Output:
col1 col2
0 1 text1
1 2 This a "TEXT". However, I cannot parse it.
The advantage is that you directly get the correct types (if needed):
df.dtypes
col1 int64
col2 object
dtype: object
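The reason for .iloc[:, 1:-1] becomes visible when the same separator is applied to one row with re.split. This is a small sketch of the pattern's behavior on a single string, not a description of pandas internals.

```python
import re

# Same separator as above: "," between fields, a quote at the very start,
# or a quote at the very end of the row.
pattern = re.compile(r'","|^"|"$')

line = '"2","This a "TEXT". However, I cannot parse it."'
fields = pattern.split(line)
print(fields)

# The matches at the start and end of the row produce an empty first and
# last field, which is what the column slice 1:-1 throws away.
kept = fields[1:-1]
print(kept)
```

Because no quotes remain in the kept fields, the first column contains plain digits, which is consistent with it being inferred as int64 in the output above.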