Pandas replace(np.nan, value) vs fillna(value): which is faster?
Question
- The join method fills rows that have no match with the missing-value marker for each column's dtype. In your case the columns hold strings (object dtype), so the missing values are np.nan.
- In terms of speed, the difference between fillna and replace for replacing NaN values in a Pandas DataFrame is typically negligible for small to moderately sized DataFrames. For very large DataFrames, fillna may be slightly faster because it is written specifically for missing values, but this should be measured on your own data. The difference is usually not large enough to be the main factor in your choice, so prioritize readability and ease of use.
I'm trying to replace NaNs in different columns and I wanted to know which one is better (faster) for this task, replace or fillna.
Here's some sample code for the fillna option:
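import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
# `other` only has index labels 0, 2 and 3, so rows 1, 4 and 5 get no match
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join([other])  # left join on the index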
After this line the joined dataframe looks like this:
  key   A key_2    B
0  K0  A0  K0.1   B0
1  K1  A1   NaN  NaN
2  K2  A2  K1.1   B1
3  K3  A3  K2.1   B2
4  K4  A4   NaN  NaN
5  K5  A5   NaN  NaN
and after doing the fillna with
result[['key','key_2']] = result[['key','key_2']].fillna('K0.0')
result[['A','B']] = result[['A','B']].fillna('B0.0')
it looks like this:
  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0
Using the replace instead,
result[['key','key_2']] = result[['key','key_2']].replace(np.nan,'K0.0')
result[['A','B']] = result[['A','B']].replace(np.nan,'B0.0')
The resulting dataframe is:
  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0
As you can see, they both achieve the same result, at least as far as I've been able to test.
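One way to make that check explicit is to compare the two outputs with DataFrame.equals. The following is a minimal sketch that rebuilds the frames from the question; the names `filled` and `replaced` are illustrative only, and it passes `other` to join directly rather than wrapped in a list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join(other)

# Variant 1: fillna on each pair of columns
filled = result.copy()
filled[['key', 'key_2']] = filled[['key', 'key_2']].fillna('K0.0')
filled[['A', 'B']] = filled[['A', 'B']].fillna('B0.0')

# Variant 2: replace(np.nan, ...) on the same pairs of columns
replaced = result.copy()
replaced[['key', 'key_2']] = replaced[['key', 'key_2']].replace(np.nan, 'K0.0')
replaced[['A', 'B']] = replaced[['A', 'B']].replace(np.nan, 'B0.0')

# equals() compares shape, dtypes and values element by element
same = filled.equals(replaced)
print(same)
```

This only confirms equality for this particular data (string columns with np.nan); it says nothing about other dtypes.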
I have 2 questions:
- What kind of NaN does join create? Since np.nan is matched, I think it's that one, but I want to be sure to catch every NaN created by the join method.
- Which one is faster, fillna or replace?
Answer 1
Score: 0
Empty values in pandas are often represented with np.nan. Datetime columns use NaT instead, and pandas treats the two as compatible. Also, from the documentation linked above:
> The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. Starting from pandas 1.0, some optional data types start experimenting with a native NA scalar using a mask-based approach. See here for more.
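To see which marker join actually produces, the unmatched cells can be inspected directly. This is a small sketch, not part of the original answer: the `when` datetime column is added purely for illustration, to show NaT next to NaN.

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5']})
other = pd.DataFrame(
    {'B': ['B0', 'B1', 'B2'],
     # hypothetical datetime column, added only to show NaT
     'when': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03'])},
    index=[0, 2, 3],
)
joined = df.join(other)

# Row 1 has no match in `other`, so both cells are missing
b_missing = joined['B'].iloc[1]
when_missing = joined['when'].iloc[1]
print(type(b_missing), b_missing)
print(type(when_missing), when_missing)

# isna() flags both kinds of missing value
na_mask = joined.isna()
print(na_mask)
```

Because isna() (and therefore fillna) recognises both markers, there is no need to know in advance which one a given column received.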
For efficiency, they seem fairly similar:
- fillna
  - time for 10000 runs: 24.815383911132812
  - average time per run: 0.0024815383911132812
- replace
  - time for 10000 runs: 20.818645477294922
  - average time per run: 0.002081864547729492
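The answer does not show how these timings were taken. A comparable measurement could be sketched with the standard-library timeit module as below; the frame size, the share of missing values, and the run count are all assumptions chosen to keep the example quick, so the numbers it prints will not match the ones above.

```python
import timeit

import numpy as np
import pandas as pd

# Build an illustrative frame of strings with roughly 30% missing values
rng = np.random.default_rng(0)
n_rows = 10_000
labels = np.array(['A0', 'A1', 'A2'], dtype=object)
col = rng.choice(labels, size=n_rows)
col[rng.random(n_rows) < 0.3] = np.nan
frame = pd.DataFrame({'A': col, 'B': col.copy()})

n_runs = 200  # far fewer than the 10000 runs quoted above

# Each call returns a new frame, so `frame` is never modified
t_fillna = timeit.timeit(lambda: frame.fillna('B0.0'), number=n_runs)
t_replace = timeit.timeit(lambda: frame.replace(np.nan, 'B0.0'), number=n_runs)

print(f"fillna : {t_fillna:.4f} s total, {t_fillna / n_runs:.6f} s per run")
print(f"replace: {t_replace:.4f} s total, {t_replace / n_runs:.6f} s per run")
```

Results vary with the pandas version, the column dtypes, and the hardware, so it is worth running this on data shaped like your own before drawing conclusions.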
However, since the documentation notes that "some optional data types start experimenting with a native NA scalar using a mask-based approach", it is safer to just use fillna and let pandas handle the missing values. Also, from a readability standpoint, fillna is shorter and clearer than replace(np.nan, ...).
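As a small illustration of the mask-based case, here is a sketch using the nullable Int64 dtype, where the missing marker is pd.NA rather than np.nan. The series `s` is made up for this example; how replace(np.nan, ...) treats pd.NA is deliberately not shown, since that behaviour is not something this sketch verifies.

```python
import pandas as pd

# Nullable integer dtype: the missing entry is stored via a mask as pd.NA
s = pd.Series([1, None, 3], dtype='Int64')
missing = s.iloc[1]
print(missing is pd.NA)

# fillna does not need to know which marker is used
filled = s.fillna(0)
print(filled.tolist())
print(filled.dtype)
```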