Efficient way to fill missing values using groupby

Question


I have a dataframe with a million rows.

The dataframe includes the columns ID, FDAT, and LACT.

For each ID there may be multiple FDAT and LACT values. The FDAT should be the same for each LACT for a given ID. Occasionally there is a missing FDAT, which I want to fill with the matching FDAT from that ID for that LACT.

Example data:

ID  FDAT      LACT
1   1/1/2020    1
1   1/1/2020    1
1   1/1/2021    2
1   NA          2
1   1/1/2021    2
1   1/1/2022    3

In this example, the NA should be filled with 1/1/2021.

I am using the following code to do this. Unfortunately, it is very slow. I only want to fill the missing values; I do not want to change any of the non-null FDAT entries.

df.sort_values(['ID', 'FDAT'], inplace=True)

df.loc[:, 'FDAT'] = df.groupby(['ID','LACT']).fillna(method="ffill")
df.loc[:, 'FDAT'] = df.groupby(['ID','LACT']).fillna(method="bfill")

I was looking for code that would do the same thing but run faster.
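
For reference, the example data above can be reproduced with the snippet below (using None for the missing FDAT); the answers that follow start from an equivalent frame:

import pandas as pd

# Example data from the question; None marks the missing FDAT.
df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1],
    'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
    'LACT': [1, 1, 2, 2, 2, 3],
})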

Answer 1

Score: 1


As you see below, I give you a much faster alternative, together with your original code and the computation times:

import pandas as pd
import time

data = {'ID': [1, 1, 1, 1, 1, 1],
        'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
        'LACT': [1, 1, 2, 2, 2, 3]}

df = pd.DataFrame(data)

start_time = time.time()

# Sort, then forward-fill FDAT within each (ID, LACT) group,
# touching only the FDAT column.
df.sort_values(["ID", "FDAT", "LACT"], inplace=True)
df["FDAT"] = df.groupby(["ID", "LACT"])["FDAT"].transform(lambda x: x.fillna(method="ffill"))

print(df)

end_time = time.time()
print("Execution time:", end_time - start_time, "seconds")


returning:

   ID      FDAT  LACT
0   1  1/1/2020     1
1   1  1/1/2020     1
2   1  1/1/2021     2
4   1  1/1/2021     2
5   1  1/1/2022     3
3   1  1/1/2021     2
Execution time: 0.008013486862182617 seconds

while your solution:

import pandas as pd
import time

data = {'ID': [1, 1, 1, 1, 1, 1],
        'FDAT': ['1/1/2020', '1/1/2020', '1/1/2021', None, '1/1/2021', '1/1/2022'],
        'LACT': [1, 1, 2, 2, 2, 3]}

df = pd.DataFrame(data)

start_time = time.time()

df.loc[:, 'FDAT'] = df.groupby(['ID', 'LACT']).fillna(method="ffill")
df.loc[:, 'FDAT'] = df.groupby(['ID', 'LACT']).fillna(method="bfill")
print(df)

end_time = time.time()
print("Execution time:", end_time - start_time, "seconds")

returns:

   ID      FDAT  LACT
0   1  1/1/2020     1
1   1  1/1/2020     1
2   1  1/1/2021     2
3   1  1/1/2021     2
4   1  1/1/2021     2
5   1  1/1/2022     3
Execution time: 0.011833429336547852 seconds

So, using transform together with ffill is approximately 1.5 times faster. Note that sort_values() is excluded from the timing in your code example, so I'd reckon the method I suggest could be up to 2.5 times faster overall.
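
One further option worth trying (a sketch, not part of the timed comparison above): select the FDAT column before filling and use the built-in GroupBy.ffill/GroupBy.bfill, which avoids the Python-level lambda and skips the columns you don't care about. Since all non-null FDAT values within an (ID, LACT) group should be identical, no pre-sorting is needed at all:

# Sketch: vectorized group-wise fill of a single column.
# ffill covers NaNs that come after a non-null row in each group;
# bfill then covers NaNs that come before the first non-null row.
g = df.groupby(['ID', 'LACT'])['FDAT']
df['FDAT'] = g.ffill().fillna(g.bfill())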

Answer 2

Score: 1


Here is some vectorized code to handle this. It processes 1 million rows in under a second.

def fillna_fdat(df):
    a = df.set_index(['ID', 'LACT'])['FDAT']
    b = a.dropna()
    return df.assign(
        FDAT=a.fillna(b[~b.index.duplicated(keep='first')]).to_numpy()
    )

Applied to your example input data:

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 1, 1],
    'FDAT': [
        '1/1/2020', '1/1/2020', '1/1/2021', float('nan'),
        '1/1/2021', '1/1/2022'],
    'LACT': [1, 1, 2, 2, 2, 3],
})

>>> fillna_fdat(df)
   ID      FDAT  LACT
0   1  1/1/2020     1
1   1  1/1/2020     1
2   1  1/1/2021     2
3   1  1/1/2021     2
4   1  1/1/2021     2
5   1  1/1/2022     3

Explanation

The basic idea is to make a clean mapping of (ID, LACT): FDAT. To do that efficiently, we use a version of df where the index is made of [ID, LACT]:

a = df.set_index(['ID', 'LACT'])['FDAT']
>>> a
ID  LACT
1   1       1/1/2020
    1       1/1/2020
    2       1/1/2021
    2            NaN
    2       1/1/2021
    3       1/1/2022

We drop NaN values, and duplicated indices:

b = a.dropna()
c = b[~b.index.duplicated(keep='first')]
>>> c
ID  LACT
1   1       1/1/2020
    2       1/1/2021
    3       1/1/2022

Now, we can replace all NaNs in a by the values from c for the same index ['ID', 'LACT']:

d = a.fillna(c)
>>> d
ID  LACT
1   1       1/1/2020
    1       1/1/2020
    2       1/1/2021
    2       1/1/2021  <-- this was filled from c.loc[(1, 2)]
    2       1/1/2021
    3       1/1/2022

At this point, we just want to get those values, which are in the same order as in the original df, and ignore the index as we replace df['FDAT'] with them (hence the .to_numpy() part). In order to leave the original df unmodified (I strongly resent any code that changes my inputs unless explicitly stated so), we derive a new df using the idiom df.assign(FDAT=...), and return that. Putting it all together gives the function above.
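
Since the question stresses that non-null FDAT entries must not change, a quick sanity check along these lines can confirm it (a sketch, using the fillna_fdat and df defined above):

# The fill must not alter any entry that was already non-null.
out = fillna_fdat(df)
mask = df['FDAT'].notna()
assert out.loc[mask, 'FDAT'].equals(df.loc[mask, 'FDAT'])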

Other observations

Note that other columns, if any, are preserved. To show this, and to measure performance, let's write a generator of random dataframes:

import numpy as np

def gen(n, k=None):
    nhalf = n // 2
    k = n // 3 if k is None else k
    df = pd.DataFrame({
        'ID': np.random.randint(0, k, nhalf),
        'FDAT': [f'1/1/{y}' for y in np.random.randint(2010, 2012+k, nhalf)],
        'LACT': np.random.randint(0, k, nhalf),
    })
    df = pd.concat([
        df,
        df.assign(FDAT=np.nan),
    ]).sample(frac=1).reset_index(drop=True).assign(
        other=np.random.uniform(size=2*nhalf)
    )
    return df

Small example:

np.random.seed(0)  # reproducible example
df = gen(10)

>>> df
   ID      FDAT  LACT     other
0   0  1/1/2010     2  0.957155
1   1  1/1/2014     0  0.140351
2   1  1/1/2010     2  0.870087
3   1       NaN     1  0.473608
4   0       NaN     2  0.800911
5   0  1/1/2012     2  0.520477
6   1       NaN     2  0.678880
7   1       NaN     0  0.720633
8   0       NaN     2  0.582020
9   1  1/1/2014     1  0.537373

>>> fillna_fdat(df)
   ID      FDAT  LACT     other
0   0  1/1/2010     2  0.957155
1   1  1/1/2014     0  0.140351
2   1  1/1/2010     2  0.870087
3   1  1/1/2014     1  0.473608
4   0  1/1/2010     2  0.800911
5   0  1/1/2012     2  0.520477
6   1  1/1/2010     2  0.678880
7   1  1/1/2014     0  0.720633
8   0  1/1/2010     2  0.582020
9   1  1/1/2014     1  0.537373

Speed

np.random.seed(0)
df = gen(1_000_000)

%timeit fillna_fdat(df)
# 806 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Under a second for 1 million rows.
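
For comparison, the same mapping idea can also be expressed with groupby(...).transform('first'): GroupBy 'first' skips NaNs, so each group's first non-null FDAT is broadcast back to every row, and fillna then touches only the missing entries. This is a sketch I have not benchmarked against fillna_fdat, so timings on real data may differ:

# Alternative sketch: broadcast each (ID, LACT) group's first
# non-null FDAT to all rows of the group, then fill only the gaps.
df['FDAT'] = df['FDAT'].fillna(
    df.groupby(['ID', 'LACT'])['FDAT'].transform('first')
)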

huangapple
  • Published on 2023-02-18 18:52:16
  • When reposting, please keep the original link: https://go.coder-hub.com/75492822.html