Efficient way to fill missing values using groupby
Question
I have a dataframe of a million rows. The dataframe includes the columns ID, FDAT, and LACT. For each ID there may be multiple FDAT and LACT values. The FDAT should be the same for each LACT of a given ID. Occasionally an FDAT is missing, and I want to fill it with the matching FDAT for that ID and LACT.

Example data:
```
ID  FDAT      LACT
1   1/1/2020  1
1   1/1/2020  1
1   1/1/2021  2
1   NA        2
1   1/1/2021  2
1   1/1/2022  3
```
In this example the NA should be filled with 1/1/2021.

I am using the following code to do this. Unfortunately, it is very slow. I only want to fill the missing values; I do not want to change any of the non-null FDAT entries.
```python
df.sort_values(["ID", "FDAT"], inplace=True)
df.loc[:, 'FDAT'] = df.groupby(['ID', 'LACT']).fillna(method="ffill")
df.loc[:, 'FDAT'] = df.groupby(['ID', 'LACT']).fillna(method="bfill")
```
I was looking for code that would do the same thing but run faster.
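A minimal sketch that rebuilds the sample frame above (`None` stands in for the NA), so the snippets below can be run as-is:

```python
import pandas as pd

# Sample data from the table above; None marks the missing FDAT.
df = pd.DataFrame({
    'ID':   [1, 1, 1, 1, 1, 1],
    'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
    'LACT': [1, 1, 2, 2, 2, 3],
})
```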
Answer 1

Score: 1
As you can see below, I give you a much faster alternative, together with your original code and the computation times:
```python
import pandas as pd

data = {'ID': [1, 1, 1, 1, 1, 1],
        'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
        'LACT': [1, 1, 2, 2, 2, 3]}
df = pd.DataFrame(data)

import time
start_time = time.time()

df.sort_values(["ID", "FDAT", "LACT"], inplace=True)
df["FDAT"] = df.groupby(["ID", "LACT"])["FDAT"].transform(lambda x: x.fillna(method="ffill"))

print(df)
end_time = time.time()
print("Execution time:", end_time - start_time, "seconds")
```
returning:
```
   ID      FDAT  LACT
0   1  1/1/2020     1
1   1  1/1/2020     1
2   1  1/1/2021     2
4   1  1/1/2021     2
5   1  1/1/2022     3
3   1  1/1/2021     2
Execution time: 0.008013486862182617 seconds
```
while your solution:
```python
import pandas as pd

data = {'ID': [1, 1, 1, 1, 1, 1],
        'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
        'LACT': [1, 1, 2, 2, 2, 3]}
df = pd.DataFrame(data)

import time
start_time = time.time()

df.loc[:, 'FDAT'] = df.groupby(['ID', 'LACT']).fillna(method="ffill")
df.loc[:, 'FDAT'] = df.groupby(['ID', 'LACT']).fillna(method="bfill")

print(df)
end_time = time.time()
print("Execution time:", end_time - start_time, "seconds")
```
returns:
```
   ID      FDAT  LACT
0   1  1/1/2020     1
1   1  1/1/2020     1
2   1  1/1/2021     2
3   1  1/1/2021     2
4   1  1/1/2021     2
5   1  1/1/2022     3
Execution time: 0.011833429336547852 seconds
```
So, using `transform` together with `ffill` is approximately 1.5 times faster. Note that `sort_values()` is excluded from the timing in your code example, so I'd reckon the method I suggest can be up to 2.5 times faster.
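Worth noting: in recent pandas (2.1+), `fillna(method=...)` is deprecated. A sketch of the same fill using the dedicated `GroupBy.ffill()` and `GroupBy.bfill()` methods, which also avoids the Python-level lambda (assumed equivalent here, not benchmarked):

```python
# Sketch, assuming pandas >= 2.1: fill within each (ID, LACT) group,
# forward then backward, touching only the NaN slots.
df = df.sort_values(["ID", "FDAT"])
df["FDAT"] = df.groupby(["ID", "LACT"])["FDAT"].ffill()
df["FDAT"] = df.groupby(["ID", "LACT"])["FDAT"].bfill()
```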
Answer 2

Score: 1
Here is some vectorized code to handle this. It processes 1 million rows in under a second.
```python
def fillna_fdat(df):
    a = df.set_index(['ID', 'LACT'])['FDAT']
    b = a.dropna()
    return df.assign(
        FDAT=a.fillna(b[~b.index.duplicated(keep='first')]).to_numpy()
    )
```
Applied to your example input data:
```python
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1],
    'FDAT': [
        '1/1/2020', '1/1/2020', '1/1/2021', float('nan'),
        '1/1/2021', '1/1/2022'],
    'LACT': [1, 1, 2, 2, 2, 3],
})

>>> fillna_fdat(df)
   ID      FDAT  LACT
0   1  1/1/2020     1
1   1  1/1/2020     1
2   1  1/1/2021     2
3   1  1/1/2021     2
4   1  1/1/2021     2
5   1  1/1/2022     3
```
Explanation
The basic idea is to make a clean mapping `(ID, LACT) -> FDAT`. To do that efficiently, we use a version of `df` where the index is made of `['ID', 'LACT']`:
```python
a = df.set_index(['ID', 'LACT'])['FDAT']

>>> a
ID  LACT
1   1       1/1/2020
    1       1/1/2020
    2       1/1/2021
    2            NaN
    2       1/1/2021
    3       1/1/2022
```
We drop `NaN` values and duplicated indices:
```python
b = a.dropna()
c = b[~b.index.duplicated(keep='first')]

>>> c
ID  LACT
1   1       1/1/2020
    2       1/1/2021
    3       1/1/2022
```
Now we can replace all `NaN`s in `a` by the values from `c` for the same `['ID', 'LACT']` index:
```python
d = a.fillna(b[~b.index.duplicated(keep='first')])

>>> d
ID  LACT
1   1       1/1/2020
    1       1/1/2020
    2       1/1/2021
    2       1/1/2021  <-- this was filled from c.loc[(1, 2)]
    2       1/1/2021
    3       1/1/2022
```
At this point, we just want those values, in the same order as in the original `df`, and we ignore the index as we replace `df['FDAT']` with them (hence the `.to_numpy()` part). In order to leave the original `df` unmodified (I strongly resent any code that changes my inputs unless explicitly stated), we derive a new `df` using the idiom `df.assign(FDAT=...)` and return that. Putting it all together gives the function above.
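As a quick sanity check (a sketch, not part of the original answer), one can verify the two properties the question asks for: non-null `FDAT` entries come through unchanged, and the input frame is not mutated:

```python
# Non-null FDAT values must survive unchanged, and df must keep its NaNs
# (i.e. fillna_fdat did not modify its input in place).
out = fillna_fdat(df)
mask = df['FDAT'].notna()
assert out.loc[mask, 'FDAT'].equals(df.loc[mask, 'FDAT'])
assert df['FDAT'].isna().any()
```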
Other observations
Note that other columns, if any, are preserved. To show this, and to measure performance, let's write a generator of random `df`s:
```python
import numpy as np

def gen(n, k=None):
    nhalf = n // 2
    k = n // 3 if k is None else k
    df = pd.DataFrame({
        'ID': np.random.randint(0, k, nhalf),
        'FDAT': [f'1/1/{y}' for y in np.random.randint(2010, 2012 + k, nhalf)],
        'LACT': np.random.randint(0, k, nhalf),
    })
    df = pd.concat([
        df,
        df.assign(FDAT=np.nan),
    ]).sample(frac=1).reset_index(drop=True).assign(
        other=np.random.uniform(size=2 * nhalf)
    )
    return df
```
Small example:
```python
np.random.seed(0)  # reproducible example
df = gen(10)

>>> df
   ID      FDAT  LACT     other
0   0  1/1/2010     2  0.957155
1   1  1/1/2014     0  0.140351
2   1  1/1/2010     2  0.870087
3   1       NaN     1  0.473608
4   0       NaN     2  0.800911
5   0  1/1/2012     2  0.520477
6   1       NaN     2  0.678880
7   1       NaN     0  0.720633
8   0       NaN     2  0.582020
9   1  1/1/2014     1  0.537373
```
```python
>>> fillna_fdat(df)
   ID      FDAT  LACT     other
0   0  1/1/2010     2  0.957155
1   1  1/1/2014     0  0.140351
2   1  1/1/2010     2  0.870087
3   1  1/1/2014     1  0.473608
4   0  1/1/2010     2  0.800911
5   0  1/1/2012     2  0.520477
6   1  1/1/2010     2  0.678880
7   1  1/1/2014     0  0.720633
8   0  1/1/2010     2  0.582020
9   1  1/1/2014     1  0.537373
```
Speed
```python
np.random.seed(0)
df = gen(1_000_000)

%timeit fillna_fdat(df)
# 806 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Under a second for 1 million rows.
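If the index-alignment trick feels opaque, the same `(ID, LACT) -> FDAT` lookup idea can be spelled out with an explicit merge. A sketch under the same assumptions (`fillna_fdat_merge` is a hypothetical name; performance not verified here):

```python
import pandas as pd

def fillna_fdat_merge(df):
    # One FDAT per (ID, LACT), taken from the first non-null occurrence.
    lookup = (
        df.dropna(subset=['FDAT'])
          .drop_duplicates(['ID', 'LACT'])
          .set_index(['ID', 'LACT'])['FDAT']
          .rename('FDAT_fill')
    )
    # how='left' preserves df's row order, so realign the result by position.
    fill = (df.merge(lookup, left_on=['ID', 'LACT'], right_index=True, how='left')
              ['FDAT_fill'].set_axis(df.index))
    return df.assign(FDAT=df['FDAT'].fillna(fill))
```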