Pandas replace(np.nan, value) vs fillna(value) which is faster?

Question

  1. For the object (string) columns in this example, join fills rows that have no match with np.nan, so both fillna and replace(np.nan, ...) catch them.

  2. For small to moderately sized DataFrames, the speed difference between fillna and replace is typically negligible. It is usually not large enough to drive the choice, so prioritize readability and ease of use.

I'm trying to replace NaNs in different columns and I wanted to know which one is better (faster) for this task, replace or fillna.

Here's some sample code for the fillna option:

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])

# Left join on the index; rows 1, 4 and 5 have no match in `other`
result = df.join([other])

After this line the joined dataframe looks like this:

  key   A key_2    B
0  K0  A0  K0.1   B0
1  K1  A1   NaN  NaN
2  K2  A2  K1.1   B1
3  K3  A3  K2.1   B2
4  K4  A4   NaN  NaN
5  K5  A5   NaN  NaN

and after doing the fillna with

result[['key','key_2']] = result[['key','key_2']].fillna('K0.0')
result[['A','B']] = result[['A','B']].fillna('B0.0')

it looks like this:

  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0

Using the replace instead,

result[['key','key_2']] = result[['key','key_2']].replace(np.nan,'K0.0')
result[['A','B']] = result[['A','B']].replace(np.nan,'B0.0')

The resulting dataframe is:

  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0

As you can see, they both achieve the same result, at least as far as I've been able to test.
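The equivalence can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming the same sample data as above; the names joined, via_fillna, via_replace and via_dict are illustrative and not part of the original post. It also tries the dictionary form of fillna, which takes one fill value per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
joined = df.join(other)

# Variant 1: fillna on column groups, as in the question
via_fillna = joined.copy()
via_fillna[['key', 'key_2']] = via_fillna[['key', 'key_2']].fillna('K0.0')
via_fillna[['A', 'B']] = via_fillna[['A', 'B']].fillna('B0.0')

# Variant 2: replace(np.nan, ...) on the same column groups
via_replace = joined.copy()
via_replace[['key', 'key_2']] = via_replace[['key', 'key_2']].replace(np.nan, 'K0.0')
via_replace[['A', 'B']] = via_replace[['A', 'B']].replace(np.nan, 'B0.0')

# Variant 3: a single fillna call with one fill value per column
via_dict = joined.fillna({'key': 'K0.0', 'key_2': 'K0.0', 'A': 'B0.0', 'B': 'B0.0'})

# Element-wise comparison; no NaN is left, so == is safe to use here
same_result = bool((via_fillna == via_replace).all().all()
                   and (via_fillna == via_dict).all().all())
print(same_result)
```

The dictionary form avoids the repeated column selection and assignment, which may be easier to read when many columns need different fill values.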

I have 2 questions:

  1. What kind of NaN does join create? Since np.nan is found, I think it's that one, but I want to be sure to catch every NaN created by the join method.
  2. Which one is faster, fillna or replace?

Answer 1

Score: 0

Missing values in pandas are usually represented with np.nan. Datetime columns use NaT instead, and pandas treats both as missing. From the pandas documentation on working with missing data:

> The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. Starting from pandas 1.0, some optional data types start experimenting with a native NA scalar using a mask-based approach. See here for more.
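As a rough illustration of the point about different missing-value markers, the sketch below uses isna(), which does not depend on which marker a column uses. The small frames and variable names are made up for this example, and the Int64 column is just one instance of the optional mask-based dtypes the quote refers to.

```python
import pandas as pd

# A join that leaves row 1 without a match in `other`
df = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
other = pd.DataFrame({'B': ['B0', 'B2']}, index=[0, 2])
joined = df.join(other)
join_missing = joined['B'].isna().tolist()

# Datetime data marks missing entries with NaT
dates = pd.Series(pd.to_datetime(['2023-05-11', None]))
date_missing = dates.isna().tolist()

# The nullable Int64 dtype marks missing entries with pd.NA
nullable = pd.Series([1, None], dtype='Int64')
nullable_missing = nullable.isna().tolist()
nullable_filled = nullable.fillna(0).tolist()

print(join_missing, date_missing, nullable_missing, nullable_filled)
```

In all three cases isna() flags the missing entry and fillna fills it, without the caller having to name np.nan, NaT or pd.NA explicitly.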

For efficiency, they seem fairly similar:

  • fillna
    • time for 10000 runs: 24.815383911132812
    • average time per run: 0.0024815383911132812
  • replace
    • time for 10000 runs: 20.818645477294922
    • average time per run: 0.002081864547729492

However, given the documentation's note that "some optional data types start experimenting with a native NA scalar using a mask-based approach", it is safer to use fillna and let pandas decide what counts as missing. From a readability standpoint, fillna is also shorter and clearer than replace(np.nan, ...).
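The exact benchmark script behind the timings above is not shown, so the following is only a sketch of how a comparable measurement could be made with the standard library's timeit. The helper names use_fillna and use_replace and the run count n_runs are assumptions made for this example, and the numbers will vary with the machine, the pandas version and the size of the data.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join(other)
cols = ['key_2', 'B']  # the columns that contain NaN after the join


def use_fillna():
    # Fill missing values, whatever marker pandas uses internally
    return result[cols].fillna('K0.0')


def use_replace():
    # Replace np.nan explicitly
    return result[cols].replace(np.nan, 'K0.0')


n_runs = 200  # assumed run count; raise it for more stable numbers
fillna_seconds = timeit.timeit(use_fillna, number=n_runs)
replace_seconds = timeit.timeit(use_replace, number=n_runs)

print(f"fillna:  {fillna_seconds:.4f} s total, {fillna_seconds / n_runs:.6f} s per run")
print(f"replace: {replace_seconds:.4f} s total, {replace_seconds / n_runs:.6f} s per run")
```

On a six-row frame the measurement is dominated by per-call overhead, so it is worth repeating it on data of realistic size before drawing conclusions about speed.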

huangapple
  • Published on May 11, 2023, 15:04:02
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76224917.html