2023-06-06 02:10:51

Pandas dataframe: Replace values of columns addressed by index and based on condition

Question

I need to replace values only in certain columns (unnamed, as I have a csv without header) based on some condition.
In particular, I need to replace "" with \N.
For instance, suppose I need to change columns 8 and 9, the following row of the csv:

964,64448,Alen,,2,1998,A45,,,,(Italy),e02d7543d85d91a772dc9f1cac542751

Should become:

964,64448,Alen,,2,1998,A45,\N,\N,,(Italy),e02d7543d85d91a772dc9f1cac542751

I cannot do this with sed; I must use Python.

So I am loading the csv:

df = pd.read_csv(filename, quotechar='"', escapechar="\\", dtype=str, header=None)

and, suppose columns is the list of indices of the columns I must change, I would do the following:

columns = [8, 9]
df.iloc[:, columns] = np.where(
    df.iloc[:, columns] == "", "\\N", df.iloc[:, columns]
)
df.to_csv(...)

This approach does not throw any error, but it simply does not work: nothing changes in the output file. I think this is because iloc returns a view and not a copy of the df, but I am not sure.
I have also tried df.iloc[:, columns].replace("", "\\N", inplace=True), but the result is the same, probably because the operation must be done on the same df.iloc[...] object.

How can I get this done?

Answer 1

Score: 1

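By default, pandas treats the first line of a csv as a header, so for a headerless file you need at the very least to set header to None. Note also that pandas reads empty fields as NaN rather than as empty strings, which is why a comparison against "" matches nothing: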

df = pd.read_csv('data', header=None, dtype=str)

Then you can fill the missing values in the columns, addressed by their default integer labels (starting at 0), and write the output:

# Assign the result back; calling fillna(..., inplace=True) on df[7] may only modify a temporary copy
df[7] = df[7].fillna('\\N')
df[8] = df[8].fillna('\\N')
# Any other NaN is written as an empty field, because to_csv uses na_rep='' by default
# Write the output without the header and without the row index:
df.to_csv('out', header=False, index=False)
# Output:
# 964,64448,Alen,,2,1998,A45,\N,\N,,(Italy),e02d7543d85d91a772dc9f1cac542751
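As a rough illustration of why the comparison against "" in the question finds nothing, the sketch below loads the sample row from an in-memory string (a stand-in for the real file, used here only so the example is self-contained) and counts both empty-string matches and NaN values in the two target columns. The variable names are illustrative and not part of the original answer.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the headerless csv file from the question.
raw = "964,64448,Alen,,2,1998,A45,,,,(Italy),e02d7543d85d91a772dc9f1cac542751\n"

df = pd.read_csv(io.StringIO(raw), dtype=str, header=None)

# 0-based labels of the columns to change (the 8th and 9th fields of each row).
columns = [7, 8]

# Empty fields were read as NaN, so comparing against "" is False everywhere.
matches_empty_string = int((df[columns] == "").sum().sum())
missing_count = int(df[columns].isna().sum().sum())
print(matches_empty_string, missing_count)

# Fill NaN only in the chosen columns and assign the result back to the frame.
df[columns] = df[columns].fillna("\\N")

# With no path argument, to_csv returns the csv text instead of writing a file.
out = df.to_csv(header=False, index=False)
print(out, end="")
```

Selecting with df[columns] and assigning the filled result back avoids relying on inplace=True through a chained selection, which is the pattern that silently did nothing in the question.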

The same concept applies here, but with keep_default_na=False pandas no longer turns blank fields into NaN, so they stay empty strings that replace can match:

df = pd.read_csv('data', header=None, dtype=str, keep_default_na=False)
# Assign the result back instead of using inplace=True on a selected column
df[7] = df[7].replace('', '\\N')
df[8] = df[8].replace('', '\\N')
df.to_csv('out', header=False, index=False)

Answer 2

Score: 0

You can use fillna after reading your file:

df = pd.read_csv(filename, quotechar='"', escapechar="\\", dtype=str, header=None)
df = df.fillna({7: r'\N', 8: r'\N'}).fillna('')
df.to_csv('output.csv', index=False, header=False)

Output:

964,64448,Alen,,2,1998,A45,\N,\N,,(Italy),e02d7543d85d91a772dc9f1cac542751
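The question speaks of "columns 8 and 9" while the expected output changes the fields at 0-based positions 7 and 8, so the dictionary passed to fillna can be built from 1-based positions. The helper below, fill_missing_in_columns, is a hypothetical name introduced only for this sketch, and the second sample row is invented to show that existing values are left alone.

```python
import io

import pandas as pd


def fill_missing_in_columns(df, one_based_positions, marker="\\N"):
    """Return a new frame with NaN replaced by marker in the given 1-based columns."""
    # With header=None the column labels are 0, 1, 2, ..., so shift each position by one.
    fill_map = {position - 1: marker for position in one_based_positions}
    return df.fillna(fill_map)


# Hypothetical sample: the row from the question plus one invented row.
raw = (
    "964,64448,Alen,,2,1998,A45,,,,(Italy),e02d7543d85d91a772dc9f1cac542751\n"
    "965,64449,Bob,x,3,1999,B12,foo,,bar,(Spain),abc\n"
)

df = pd.read_csv(io.StringIO(raw), dtype=str, header=None)
filled = fill_missing_in_columns(df, [8, 9])

# NaN values in the other columns are written as empty fields by default.
csv_text = filled.to_csv(header=False, index=False)
print(csv_text, end="")
```

Because fillna returns a new DataFrame here, the original df is left untouched and the result has to be captured in a variable, as the answer does with df = df.fillna(...).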
