May 11, 2023 15:04:02

Pandas replace(np.nan, value) vs fillna(value) which is faster?

Question

For rows that have no match, join fills the new columns with pandas' default missing-value marker. For the string (object) columns in this example, that marker is np.nan.
In terms of speed, the difference between fillna and replace when filling NaN values is typically negligible for small to moderately sized DataFrames. fillna is sometimes said to be slightly faster on very large DataFrames, but the timings in the answer below show replace marginally ahead on this small example, so measure on your own data if it matters. The difference is rarely large enough to drive the choice; prioritize readability and ease of use.

I'm trying to replace NaNs in different columns and I wanted to know which one is better (faster) for this task, replace or fillna.

Here's some sample code for the fillna option:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])

result = df.join([other])

After this line the joined dataframe looks like this:

  key   A key_2    B
0  K0  A0  K0.1   B0
1  K1  A1   NaN  NaN
2  K2  A2  K1.1   B1
3  K3  A3  K2.1   B2
4  K4  A4   NaN  NaN
5  K5  A5   NaN  NaN

and after doing the fillna with

result[['key', 'key_2']] = result[['key', 'key_2']].fillna('K0.0')
result[['A', 'B']] = result[['A', 'B']].fillna('B0.0')

it looks like this:

  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0

Using the replace instead,

result[['key', 'key_2']] = result[['key', 'key_2']].replace(np.nan, 'K0.0')
result[['A', 'B']] = result[['A', 'B']].replace(np.nan, 'B0.0')

The resulting dataframe is:

  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0

As you can see, they both achieve the same result, at least as far as I've been able to test.

I have 2 questions:

1. What kind of NaN does join create? (Seeing as np.nan is found, I think it's that one, but I want to be sure to catch every NaN created by the join method.)
2. Which one is faster, fillna or replace?

Answer 1

Score: 0

Missing values in pandas are usually represented with np.nan. Datetime columns use NaT instead, and pandas treats both as missing. From the pandas documentation on missing data:

> The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. Starting from pandas 1.0, some optional data types start experimenting with a native NA scalar using a mask-based approach. See here for more.
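To check what join actually puts in the unmatched rows, one option is to inspect a missing cell directly. This is a minimal sketch that rebuilds the frames from the question; it calls `df.join(other)` with a single frame rather than a list, and the variable name `missing` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])

result = df.join(other)

# Row 1 has no counterpart in `other`, so its 'B' cell is missing.
missing = result.loc[1, 'B']
print(type(missing))      # shows which object pandas used for the gap
print(pd.isna(missing))   # pd.isna recognises NaN, None, NaT and pd.NA

# Count the missing cells per column to confirm nothing else slipped in.
print(result.isna().sum())
```

Using `pd.isna` (or `DataFrame.isna`) for the check avoids depending on which marker a given dtype uses.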

For efficiency, they seem fairly similar:

fillna
- time for 10000 runs: 24.815383911132812
- average time per run: 0.0024815383911132812
replace
- time for 10000 runs: 20.818645477294922
- average time per run: 0.002081864547729492
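The original timing script is not shown, so the following is only a sketch of how such a comparison could be reproduced with the standard-library `timeit` module. The helper names `with_fillna` and `with_replace` and the run count are assumptions, and the absolute numbers will vary with the machine, the pandas version, and the size of the data.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join(other)


def with_fillna():
    # Fill the gaps in the joined columns with fillna.
    return result[['key_2', 'B']].fillna('K0.0')


def with_replace():
    # Fill the same gaps by replacing np.nan.
    return result[['key_2', 'B']].replace(np.nan, 'K0.0')


n_runs = 200  # kept small here; raise it for more stable numbers
t_fillna = timeit.timeit(with_fillna, number=n_runs)
t_replace = timeit.timeit(with_replace, number=n_runs)

print(f"fillna:  {t_fillna:.4f} s total, {t_fillna / n_runs:.6f} s per run")
print(f"replace: {t_replace:.4f} s total, {t_replace / n_runs:.6f} s per run")
```

A six-row frame mostly measures per-call overhead, so it is worth repeating the measurement on data of realistic size before drawing conclusions.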

However, since the documentation notes that "some optional data types start experimenting with a native NA scalar using a mask-based approach", not every missing value is guaranteed to be np.nan. It is therefore safer to use fillna and let pandas decide what counts as missing. From a readability standpoint, fillna is also shorter and clearer than replace(np.nan, ...).
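As an illustration of both points, here is a short sketch. It assumes the frames from the question; the `nullable` Series is a made-up example of a column using the opt-in "string" dtype. `fillna` accepts a dict mapping column names to fill values, so the per-column assignments collapse into one call, and the same method also fills gaps in a nullable-dtype column.

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join(other)

# One call, one fill value per column.
filled = result.fillna({'key': 'K0.0', 'key_2': 'K0.0',
                        'A': 'B0.0', 'B': 'B0.0'})
print(filled)

# Hypothetical column with the nullable "string" dtype: the gap is still
# detected and filled without naming a specific missing-value marker.
nullable = pd.Series(['K0.1', None, 'K1.1'], dtype='string')
print(nullable.fillna('K0.0'))
```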
