Pandas replace(np.nan, value) vs fillna(value): which is faster?
Question
- The join method marks rows that have no match as missing. For the string (object) columns in your example, that marker is np.nan, so both fillna and replace(np.nan, ...) will catch it.
- In terms of speed, the difference between fillna and replace for replacing NaN values in a Pandas DataFrame is typically negligible for small to moderately sized DataFrames. On very large DataFrames, fillna may be slightly faster since it is purpose-built for this operation, though this depends on the data and the pandas version. The difference is rarely large enough to drive the choice, so prioritize readability and ease of use.
I'm trying to replace NaNs in different columns and I wanted to know which one is better (faster) for this task, replace or fillna.
Here's some sample code for the fillna option:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
# Left join on the index; rows 1, 4 and 5 have no match in `other`
result = df.join([other])
After this line the joined dataframe looks like this:
  key   A key_2    B
0  K0  A0  K0.1   B0
1  K1  A1   NaN  NaN
2  K2  A2  K1.1   B1
3  K3  A3  K2.1   B2
4  K4  A4   NaN  NaN
5  K5  A5   NaN  NaN
and after doing the fillna with
result[['key','key_2']] = result[['key','key_2']].fillna('K0.0')
result[['A','B']] = result[['A','B']].fillna('B0.0')
it looks like this:
  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0
Using replace instead:
result[['key','key_2']] = result[['key','key_2']].replace(np.nan,'K0.0')
result[['A','B']] = result[['A','B']].replace(np.nan,'B0.0')
The resulting dataframe is:
  key   A key_2     B
0  K0  A0  K0.1    B0
1  K1  A1  K0.0  B0.0
2  K2  A2  K1.1    B1
3  K3  A3  K2.1    B2
4  K4  A4  K0.0  B0.0
5  K5  A5  K0.0  B0.0
As you can see, they both achieve the same result, at least as far as I've been able to test.
I have 2 questions:
- What kind of NaN does join create? Since np.nan is matched, I think it's that one, but I want to be sure I catch every NaN created by the join method.
- Which one is faster, fillna or replace?
Answer 1
Score: 0
Empty values in pandas are often represented with np.nan. Datetime columns use NaT instead, but pandas treats the two as compatible missing-value markers. Also from the documentation linked above:
> The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. Starting from pandas 1.0, some optional data types start experimenting with a native NA scalar using a mask-based approach. See here for more.
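To check this on the question's own data, here is a minimal sketch that inspects what join leaves in the unmatched rows. It uses df.join(other) rather than df.join([other]), and the `dates` frame with its `when` column is a hypothetical addition, there only to show the NaT case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join(other)

# Row 1 has no match in `other`, so its 'B' value is a missing marker
missing = result.loc[1, 'B']
print(type(missing), pd.isna(missing))

# Count of missing values per column after the join
print(result.isna().sum())

# Hypothetical datetime column: unmatched rows are filled with NaT
dates = pd.DataFrame({'when': pd.to_datetime(['2020-01-01'])}, index=[0])
joined = df.join(dates)
print(joined['when'].dtype, joined.loc[1, 'when'])
```

pd.isna (and therefore fillna) recognises both markers, which is one reason to rely on it instead of comparing against np.nan by hand.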
For efficiency, they seem fairly similar, with replace slightly ahead in this run:
- fillna
  - time for 10000 runs: 24.815383911132812
  - average time per run: 0.0024815383911132812
- replace
  - time for 10000 runs: 20.818645477294922
  - average time per run: 0.002081864547729492
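The benchmark code behind these numbers is not shown, so the following is only a sketch of how such a comparison could be reproduced with the standard timeit module. The helper names with_fillna and with_replace and the run count are assumptions, and the timings will vary with the machine, the data, and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join(other)


def with_fillna():
    # Fill missing values in the joined columns with fillna
    return result[['key_2', 'B']].fillna('K0.0')


def with_replace():
    # Same operation expressed through replace
    return result[['key_2', 'B']].replace(np.nan, 'K0.0')


n = 200  # hypothetical run count; raise it for more stable numbers
t_fillna = timeit.timeit(with_fillna, number=n)
t_replace = timeit.timeit(with_replace, number=n)

print(f"fillna:  {t_fillna:.4f} s total, {t_fillna / n:.6f} s per run")
print(f"replace: {t_replace:.4f} s total, {t_replace / n:.6f} s per run")
```

On a six-row frame most of the time is fixed overhead, so it is worth repeating the measurement on data of realistic size before drawing conclusions.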
However, since the documentation notes that "some optional data types start experimenting with a native NA scalar using a mask-based approach", it is safer to just use fillna and let pandas decide what counts as missing. Also, from a readability standpoint, fillna is shorter and clearer than replace(np.nan, ...).
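As an illustration of both points, here is a small sketch. fillna accepts a dict mapping column names to fill values, so the per-column fills from the question fit in one call. The second part uses the opt-in 'string' dtype, chosen here as an example of a mask-based type whose missing marker is pd.NA, not np.nan.

```python
import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key_2': ['K0.1', 'K1.1', 'K2.1'],
                      'B': ['B0', 'B1', 'B2']}, index=[0, 2, 3])
result = df.join(other)

# One call, a different fill value per column
filled = result.fillna({'key_2': 'K0.0', 'B': 'B0.0'})
print(filled)

# With the nullable 'string' dtype, None is stored as pd.NA;
# fillna handles it without naming the marker explicitly
s = pd.Series(['B0', None, 'B1'], dtype='string')
s_filled = s.fillna('B0.0')
print(s_filled)
```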