Efficient way to fill missing values using groupby

Question


I have a dataframe of a million rows.

The dataframe includes the columns ID, FDAT, and LACT.

For each ID there may be multiple FDAT and LACT values. The FDAT should be the same for each LACT for a given ID. Occasionally an FDAT is missing, and I want to fill it with the matching FDAT for that ID and LACT.

Example data:

ID  FDAT      LACT
1   1/1/2020    1
1   1/1/2020    1
1   1/1/2021    2
1   NA          2
1   1/1/2021    2
1   1/1/2022    3

In this example, the NA should be filled with 1/1/2021.

I am using the following code to do this. Unfortunately, it is very slow. I only want to fill the missing values; I do not want to change any of the non-null FDAT entries.

df.sort_values(["ID", "FDAT"], inplace=True)

df.loc[:, 'FDAT'] = df.groupby(['ID','LACT']).fillna(method="ffill")
df.loc[:, 'FDAT'] = df.groupby(['ID','LACT']).fillna(method="bfill")

I was looking for code that would do the same thing but run faster.

Answer 1

Score: 1


Below is a much faster alternative, together with your original code and the computation times:

import pandas as pd

data = {'ID': [1, 1, 1, 1, 1, 1],
        'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
        'LACT': [1, 1, 2, 2, 2, 3]}

df = pd.DataFrame(data)
import time

start_time = time.time()

df.sort_values(["ID", "FDAT", "LACT"], inplace=True)
df["FDAT"] = df.groupby(["ID", "LACT"])["FDAT"].transform(lambda x: x.fillna(method="ffill"))

print(df)

end_time = time.time()
print("Execution time:", end_time - start_time, "seconds")


returning:

   ID      FDAT  LACT
0   1  1/1/2020     1
1   1  1/1/2020     1
2   1  1/1/2021     2
4   1  1/1/2021     2
5   1  1/1/2022     3
3   1  1/1/2021     2
Execution time: 0.008013486862182617 seconds

while your solution:

import pandas as pd

data = {'ID': [1, 1, 1, 1, 1, 1],
        'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
        'LACT': [1, 1, 2, 2, 2, 3]}

df = pd.DataFrame(data)
import time

start_time = time.time()

df.loc[:, 'FDAT'] = df.groupby(['ID','LACT']).fillna(method="ffill")
df.loc[:, 'FDAT'] = df.groupby(['ID','LACT']).fillna(method="bfill")
print(df)

end_time = time.time()
print("Execution time:", end_time - start_time, "seconds")

returns:

   ID      FDAT  LACT
0   1  1/1/2020     1
1   1  1/1/2020     1
2   1  1/1/2021     2
3   1  1/1/2021     2
4   1  1/1/2021     2
5   1  1/1/2022     3
Execution time: 0.011833429336547852 seconds

So, using transform together with ffill is approximately 1.5 times faster. Note that sort_values() is excluded from the timing in your code example, so I'd reckon the method I suggest could be up to 2.5 times faster.
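Note that fillna(method="ffill") has been deprecated since pandas 2.1. As a minimal sketch under that assumption, the same fill can be written with the dedicated GroupBy ffill/bfill methods, which also avoids the Python-level lambda; the bfill pass covers groups whose first row is the missing one:

import pandas as pd

data = {'ID': [1, 1, 1, 1, 1, 1],
        'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
        'LACT': [1, 1, 2, 2, 2, 3]}
df = pd.DataFrame(data)

# GroupBy.ffill()/GroupBy.bfill() run in optimized code and return a
# Series aligned to the original index, so no prior sort is needed and
# non-null FDAT entries are left untouched.
df['FDAT'] = df.groupby(['ID', 'LACT'])['FDAT'].ffill()
df['FDAT'] = df.groupby(['ID', 'LACT'])['FDAT'].bfill()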

Answer 2

Score: 1


Here is some vectorized code to handle this. It processes 1 million rows in under a second.

import pandas as pd

def fillna_fdat(df):
    a = df.set_index(['ID', 'LACT'])['FDAT']
    b = a.dropna()
    return df.assign(
        FDAT=a.fillna(b[~b.index.duplicated(keep='first')]).to_numpy()
    )

Applied to your example input data:

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1],
    'FDAT': [
        '1/1/2020', '1/1/2020', '1/1/2021', float('nan'),
        '1/1/2021', '1/1/2022'],
    'LACT': [1, 1, 2, 2, 2, 3],
})

&gt;&gt;&gt; fillna_fdat(df)
   ID      FDAT  LACT
0   1  1/1/2020     1
1   1  1/1/2020     1
2   1  1/1/2021     2
3   1  1/1/2021     2
4   1  1/1/2021     2
5   1  1/1/2022     3

Explanation

The basic idea is to make a clean mapping of (ID, LACT): FDAT. To do that efficiently, we use a version of df where the index is made of [ID, LACT]:

a = df.set_index(['ID', 'LACT'])['FDAT']
>>> a
ID  LACT
1   1       1/1/2020
    1       1/1/2020
    2       1/1/2021
    2            NaN
    2       1/1/2021
    3       1/1/2022

We drop NaN values, and duplicated indices:

b = a.dropna()
c = b[~b.index.duplicated(keep='first')]
>>> c
ID  LACT
1   1       1/1/2020
    2       1/1/2021
    3       1/1/2022

Now we can replace all NaNs in a with the values from c at the same ['ID', 'LACT'] index:

d = a.fillna(c)
>>> d
ID  LACT
1   1       1/1/2020
    1       1/1/2020
    2       1/1/2021
    2       1/1/2021  <-- filled from c.loc[(1, 2)]
    2       1/1/2021
    3       1/1/2022

At this point, we just want those values, in the same order as in the original df, ignoring the index, since we replace df['FDAT'] with them (hence the .to_numpy() part). To leave the original df unmodified (I strongly resent any code that changes my inputs unless explicitly stated), we derive a new df with the idiom df.assign(FDAT=...) and return that. Putting it all together gives the function above.
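To see why the .to_numpy() matters, here is a minimal check, reusing the a and c from above (out is just an illustrative name):

# a.fillna(c) carries the (ID, LACT) MultiIndex, while df has a plain
# RangeIndex. Assigning the Series directly would align on index labels
# (which do not match df's) rather than on position; .to_numpy() strips
# the index so the assignment is purely positional, and set_index
# preserved the original row order.
out = df.assign(FDAT=a.fillna(c).to_numpy())
print(out)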

Other observations

Note that other columns, if any, are preserved. To show this, and to measure performance, let's write a generator of random dataframes:

import numpy as np

def gen(n, k=None):
    nhalf = n // 2
    k = n // 3 if k is None else k
    df = pd.DataFrame({
        'ID': np.random.randint(0, k, nhalf),
        'FDAT': [f'1/1/{y}' for y in np.random.randint(2010, 2012+k, nhalf)],
        'LACT': np.random.randint(0, k, nhalf),
    })
    })
    df = pd.concat([
        df,
        df.assign(FDAT=np.nan),
    ]).sample(frac=1).reset_index(drop=True).assign(
        other=np.random.uniform(size=2*nhalf)
    )
    return df

Small example:

np.random.seed(0)  # reproducible example
df = gen(10)

&gt;&gt;&gt; df
   ID      FDAT  LACT     other
0   0  1/1/2010     2  0.957155
1   1  1/1/2014     0  0.140351
2   1  1/1/2010     2  0.870087
3   1       NaN     1  0.473608
4   0       NaN     2  0.800911
5   0  1/1/2012     2  0.520477
6   1       NaN     2  0.678880
7   1       NaN     0  0.720633
8   0       NaN     2  0.582020
9   1  1/1/2014     1  0.537373

&gt;&gt;&gt; fillna_fdat(df)
   ID      FDAT  LACT     other
0   0  1/1/2010     2  0.957155
1   1  1/1/2014     0  0.140351
2   1  1/1/2010     2  0.870087
3   1  1/1/2014     1  0.473608
4   0  1/1/2010     2  0.800911
5   0  1/1/2012     2  0.520477
6   1  1/1/2010     2  0.678880
7   1  1/1/2014     0  0.720633
8   0  1/1/2010     2  0.582020
9   1  1/1/2014     1  0.537373

Speed

np.random.seed(0)
df = gen(1_000_000)

%timeit fillna_fdat(df)
# 806 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Under a second for 1 million rows.
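A closely related alternative, if you prefer to stay with groupby, is to fill only the NaNs from each group's first non-null value. A minimal sketch (timings on your own data may differ):

# GroupBy.transform('first') broadcasts each group's first non-null
# FDAT back onto that group's rows; fillna then touches only the NaNs,
# so non-null entries are never modified.
df['FDAT'] = df['FDAT'].fillna(
    df.groupby(['ID', 'LACT'])['FDAT'].transform('first')
)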
