英文:
Pandas dataframe: Replace values of columns addressed by index and based on condition
问题
我需要仅在特定列(无名称,因为我有一个没有标题的csv)中替换值,基于某些条件。具体来说,我需要用\N
替换""
。举例来说,假设我需要更改第8和第9列,csv的以下行:
964,64448,Alen,,2,1998,A45,,,,(Italy),e02d7543d85d91a772dc9f1cac542751
应该变成:
964,64448,Alen,,2,1998,A45,\N,\N,,(Italy),e02d7543d85d91a772dc9f1cac542751
我不能用SED
做这个,我必须使用Python
所以我正在加载csv:
df = pd.read_csv(filename, quotechar='"', escapechar='\\"', dtype=str, header=None)
并且,假设columns
是我必须更改的列的索引列表,我会这样做:
columns = [8, 9]
df.iloc[:, columns] = np.where(
df.iloc[:, columns] == '""', '\\N', df.iloc[:, columns]
)
df.to_csv(...)
这种方法不会抛出任何错误,但在输出文件中根本不起作用,不会更改任何内容。我认为这是因为iloc返回的是视图而不是df的副本,但不能确定。我尝试过df.iloc[:, columns].replace('""', '\\N', inplace=True)
,但结果相同,可能是因为操作必须在同一个df.iloc[...]
对象上执行。
我该如何完成这个任务?
英文:
I need to replace values only in certain columns (unnamed, as I have a csv without header) based on some condition.
In particular, I need to replace ""
with \N
.
For instance, suppose I need to change columns 8 and 9, the following row of the csv:
964,64448,Alen,,2,1998,A45,,,,(Italy),e02d7543d85d91a772dc9f1cac542751
Should become:
964,64448,Alen,,2,1998,A45,\N,\N,,(Italy),e02d7543d85d91a772dc9f1cac542751
I CANNOT DO THIS WITH SED
, I must use python
So I am loading the csv:
df = pd.read_csv(filename, quotechar='"', escapechar="\\", dtype=str, header=None)
and, suppose columns
is the list of indices of the columns I must change, I would do the following:
columns = [8, 9]
df.iloc[:, columns] = np.where(
df.iloc[:, columns] == "", "\\N", df.iloc[:, columns]
)
df.to_csv(...)
This approach does not throw any error but simply does not work and changes nothing in the output file. I think because iloc returns a view and not a copy of the df, but not sure.
I have tried with df.iloc[:, columns].replace("", "\\N", inplace=True)
but the result is the same, probably because the operation must be done on the same df.iloc[...]
object.
How can I get this done?
答案1
得分: 1
Pandas读取你的CSV文件时似乎将其视为具有标题但没有数据。至少需要将header
设置为None:
df = pd.read_csv('data', header=None, dtype=str)
然后,你可以通过它们的索引/默认名称(从0开始)来替换列,并写出输出:
df[7].fillna('\\N', inplace=True)
df[8].fillna('\\N', inplace=True)
# 去除任何其他的'NaN',因为它们被读取为空白
df.fillna('')
# 写出输出,同时去掉标题:
df.to_csv('out', header=None)
# 输出:
# 0,964,64448,Alen,,2,1998,A45,\N,\N,,(Italy),e02d7543d85d91a772dc9f1cac542751
在这里的概念相同,但现在不会将NaN插入到空白位置:
df = pd.read_csv('data', header=None, keep_default_na=False)
df[7].replace('', '\\N', inplace=True)
df[8].replace('', '\\N', inplace=True)
df.to_csv('out', header=None)
英文:
Pandas is reading your csv as if it has a header but no data. You at the very least need to set header
to None:
df = pd.read_csv('data', header=None, dtype=str)
Then you can go about replacing the columns by their index/default name (starts at 0), and writing the output:
df[7].fillna('\\N', inplace=True)
df[8].fillna('\\N', inplace=True)
# Get rid of any other 'NaN' as they were read in blank
df.fillna('')
# Write the output, also stripping the header:
df.to_csv('out', header=None)
# Output:
# 0,964,64448,Alen,,2,1998,A45,\N,\N,,(Italy),e02d7543d85d91a772dc9f1cac542751
Same concept here, but now doesn't inject NaN into the blank places:
df = pd.read_csv('data', header=None, keep_default_na=False)
df[7].replace('', '\\N', inplace=True)
df[8].replace('', '\\N', inplace=True)
df.to_csv('out', header=None)
答案2
得分: 0
你可以在读取文件后使用 fillna
方法:
df = pd.read_csv(filename, quotechar='"', escapechar="\\", dtype=str, header=None)
df = df.fillna({7: r'\N', 8: r'\N'}).fillna('')
df.to_csv('output.csv', index=False, header=False)
输出:
964,64448,Alen,,2,1998,A45,\N,\N,,(Italy),e02d7543d85d91a772dc9f1cac542751
英文:
You can use fillna
after reading your file:
df = pd.read_csv(filename, quotechar='"', escapechar="\\", dtype=str, header=None)
df = df.fillna({7: r'\N', 8: r'\N'}).fillna('')
df.to_csv('output.csv', index=False, header=False)
Output:
964,64448,Alen,,2,1998,A45,\N,\N,,(Italy),e02d7543d85d91a772dc9f1cac542751
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论