创建动态分箱的Python函数

huangapple go评论59阅读模式
英文:

Function to create dynamic bins in Python

问题

I have pandas DataFrame of numeric columns. This data does not contain any zeros(0).

我有一个包含数值列的pandas DataFrame。这个数据不包含任何零(0)。

I have defined a function that creates integer bins as follows

我已经定义了一个函数,它创建整数分箱,如下所示:

  1. -1 for all negative values
    对于所有负值设定为-1

  2. Bin width (e.g. 1000000) upto threshold (e.g. 20000000)
    划分的宽度(例如 1000000)直到阈值(例如 20000000)

  3. After the threshold (e.g. 20000000), it will dynamically size bins to contain 1% of the total count or NOBS(No. of Observations)
    i.e. if NOBS is going below 1% of total NOBS then it will dynamically adjust the bin width such that it will be approx 1% of total NOBS.
    在阈值之后(例如 20000000),它将动态调整分箱的大小,以包含总计数的1%或NOBS(观测次数)。
    即如果NOBS低于总NOBS的1%,它将动态调整分箱宽度,以使其大致等于总NOBS的1%。

Here is my code

这是我的代码:

import pandas as pd
import numpy as np

# Reproducible example
rng = np.random.default_rng(42)
a = rng.integers(-1687500000, 0, 5000)
b = rng.integers(0, 20000000, 25000)
c = rng.integers(20000000, 56956420100000, 4149)
df = pd.DataFrame({'Col1': np.concatenate([a, b, c])})

def dynamic_binning(df_, col_, first_bin, first_bin_label, bin_width, threshold):

    # Create bins
    m = df_[col_] > threshold
    q = np.linspace(0, 1, 100 * sum(m) // len(df_))

    bins = [-np.inf] + np.arange(first_bin, threshold+1, bin_width).tolist()
    bins += df_.loc[m, col_].quantile(q).tolist()[1:]

#     df_.loc[df_[col_]<=first_bin] = first_bin

    df_bins_ = (pd.cut(df_[col_].dropna(), bins, labels=False, right=False).sub(abs(first_bin_label)+1)
                    .value_counts().rename('NOBS')
                    .sort_index().rename_axis('Bins').reset_index()
                    .assign(Percentage=lambda x: x['NOBS'] / len(df_) * 108))
    return df_bins_

Calling the function

调用函数

```python
dynamic_binning(df, 'Col1', 0, 0, 1000000, 20000000)

The output of the function

函数的输出:

	Bins	NOBS	Percentage
0	-1.0	5000	15.813054555038216
1	0.0     1274	4.0291663006237375
2	1.0     1169	3.6970921549679345
3	2.0 	1287	4.070280242466836
4	3.0 	1241	3.924800140560485
5	4.0 	1296	4.098743740665905
6	5.0 	1246	3.940613195115523
7	6.0 	1288	4.073442853377844
8	7.0 	1201	3.798295704120179
9	8.0 	1289	4.076605464288852
10	9.0 	1174	3.712905209522973
11	10.0	1273	4.02600368971273
12	11.0	1246	3.940613195115523
13	12.0	1277	4.0386541333567605
14	13.0	1292	4.086093297021875
15	14.0	1259	3.9817271369586225
16	15.0	1276	4.035491522445753
17	16.0	1200	3.7951330932091714
18	17.0	1242	3.9279627514714925
19	18.0	1204	3.807783536853202
20	19.0	1266	4.003865413335676
21	20.0	378 	1.195466924360889
22	21.0	377 	1.1923043134498814
23	22.0	377 	1.1923043134498814
24	23.0	377 	1.1923043134498814
25	24.0	377 	1.1923043134498814
26	25.0	377 	1.1923043134498814
27	26.0	377 	1.1923043134498814
28	27.0	377 	1.1923043134498814
29	28.0	377 	1.1923043134498814
30	29.0	377 	1.1923043134498814
31	30.0	377 	1.1923043134498814

In the above example, 20 Bins -1, 0, 1, 2,3,......, 19 are till threshold i.e. 20000000. Bins 20, 21, 22,...30 are dynamic bins as per the third condition mentioned above.

在上面的例子中,20个分箱-1, 0, 1, 2,3,......, 19在阈值之前,即20000000。分箱20, 21, 22,...30是根据上述第三个条件而动态调整的分箱。

I'm facing issues while using this function for the following cases

我在以下情况下使用这个函数时遇到问题:

  1. first_bin = -10000000, bin_width = 500000, threshold = 5000000

    调用 dynamic_binning 函数:

    dynamic_binning(df, 'Col2', -10000000, -20, 500000, 10000000)

    In this case, 20 Bins -20, -19, -18,.....-1, 0, 1, 2, .....9 are till threshold i.e. 10000000. Bins `10

英文:

I have pandas DataFrame of numeric columns. This data does not contain any zeros(0).

I have defined a function that creates integer bins as follows

 1. -1 for all negative values 
 2. Bin width (e.g. 1000000) upto threshold (e.g. 20000000)
 3. After the threshold (e.g. 20000000), it will dynamically size bins to contain 1% of the total count or NOBS(No. of Observations) 
i.e. if NOBS is going below 1% of total NOBS then it will dynamically adjust the bin width such that it will be approx 1% of total NOBS.

Here is my code

import pandas as pd
import numpy as np

# Reproducible example
rng = np.random.default_rng(42)
a = rng.integers(-1687500000, 0, 5000)
b = rng.integers(0, 20000000, 25000)
c = rng.integers(20000000, 56956420100000, 4149)
df = pd.DataFrame({&#39;Col1&#39;: np.concatenate([a, b, c])})

def dynamic_binning(df_, col_, first_bin, first_bin_label, bin_width, threshold):

    # Create bins
    m = df_[col_] &gt; threshold
    q = np.linspace(0, 1, 100 * sum(m) // len(df_))

    bins = [-np.inf] + np.arange(first_bin, threshold+1, bin_width).tolist()
    bins += df_.loc[m, col_].quantile(q).tolist()[1:]

#     df_.loc[df_[col_]&lt;=first_bin] = first_bin

    df_bins_ = (pd.cut(df_[col_].dropna(), bins, labels=False, right=False).sub(abs(first_bin_label)+1)
                    .value_counts().rename(&#39;NOBS&#39;)
                    .sort_index().rename_axis(&#39;Bins&#39;).reset_index()
                    .assign(Percentage=lambda x: x[&#39;NOBS&#39;] / len(df_) * 108))
    return df_bins_

Calling the function

dynamic_binning(df, &#39;Col1&#39;, 0, 0, 1000000, 20000000)

The output of the function

	Bins	NOBS	Percentage
0	-1.0	5000	15.813054555038216
1	0.0     1274	4.0291663006237375
2	1.0     1169	3.6970921549679345
3	2.0 	1287	4.070280242466836
4	3.0 	1241	3.924800140560485
5	4.0 	1296	4.098743740665905
6	5.0 	1246	3.940613195115523
7	6.0 	1288	4.073442853377844
8	7.0 	1201	3.798295704120179
9	8.0 	1289	4.076605464288852
10	9.0 	1174	3.712905209522973
11	10.0	1273	4.02600368971273
12	11.0	1246	3.940613195115523
13	12.0	1277	4.0386541333567605
14	13.0	1292	4.086093297021875
15	14.0	1259	3.9817271369586225
16	15.0	1276	4.035491522445753
17	16.0	1200	3.7951330932091714
18	17.0	1242	3.9279627514714925
19	18.0	1204	3.807783536853202
20	19.0	1266	4.003865413335676
21	20.0	378 	1.195466924360889
22	21.0	377 	1.1923043134498814
23	22.0	377 	1.1923043134498814
24	23.0	377 	1.1923043134498814
25	24.0	377 	1.1923043134498814
26	25.0	377 	1.1923043134498814
27	26.0	377 	1.1923043134498814
28	27.0	377 	1.1923043134498814
29	28.0	377 	1.1923043134498814
30	29.0	377 	1.1923043134498814
31	30.0	377 	1.1923043134498814

In the above example, 20 Bins -1, 0, 1, 2,3,......, 19 are till threshold i.e. 20000000. Bins 20, 21, 22,...30 are dynamic bins as per the third condition mentioned above.

I'm facing issues while using this function for the following cases

  1. first_bin = -10000000, bin_width = 500000, threshold = 5000000

Calling dynamic_binning function

dynamic_binning(df, &#39;Col2&#39;, -10000000, -20, 500000, 10000000)

In this case, 20 Bins -20, -19, -18,.....-1, 0, 1, 2, .....9 are till threshold i.e. 10000000. Bins 10, 11, 12,... are dynamic bins as per the third condition mentioned above. But I need to use this code df_.loc[df_[col_]&lt;=first_bin] = first_bin before df_bins_ = (pd.cut(.... otherwise it creates two bins at the start i.e. -21 and -20 instead of -20 which includes values &lt;=10000000

  1. If I use first_bin = 0, bin_width = 0.1, threshold = 3
    Calling dynamic_binning function
    dynamic_binning(df, &#39;Col3&#39;, 0, 0, 0.1, 3)
    In this case, df[&#39;Col3&#39;] values range from -0.281 to 1.922

I get the following error

ValueError: bins must increase monotonically
  1. Getting upper and lower limits of each bin (including dynamic bins) in a new column (bin_labels) e.g. -inf - 0, 0 - 1, 1 - 2, 2 - 3.........

答案1

得分: 1

也许你可以尝试这个版本:

def dynamic_binning(df_, col_, first_bin, first_bin_label, bin_width, threshold):

    # 创建分箱
    m = df_[col_] > threshold
    q = np.linspace(0, 1, 100 * sum(m) // len(df_))

    bins = [-np.inf] + np.arange(first_bin, threshold+1, bin_width).tolist()
    bins += df_.loc[m, col_].quantile(q).tolist()[1:]
    
    df_bins_ = (pd.cut(df_[col_].dropna(), bins, labels=False, right=False)
                  .value_counts().rename('NOBS'))

    # 修复缺失的分箱
    idx = pd.RangeIndex(df_bins_.index.min(), df_bins_.index.max(), name='Bins')

    df_bins_ = (df_bins_.reindex(idx, fill_value=0).reset_index()
                        .assign(Percentage=lambda x: x['NOBS'] / len(df_) * 108,
                                Bins=lambda x: x['Bins'].astype(int) + first_bin_label))

    return df_bins_

然而,你的第一个调用应该是dynamic_binning(df, 'Col1', 0, -1, 1000000, 20000000)以保持一致。

>>> dynamic_binning(df, 'Col1', 0, 0, 1000000, 20000000)
    Bins  NOBS  Percentage
0      0  5000   15.813055
1      1  1274    4.029166
2      2  1169    3.697092
3      3  1287    4.070280
4      4  1241    3.924800
5      5  1296    4.098744
6      6  1246    3.940613
7      7  1288    4.073443
8      8  1201    3.798296
9      9  1289    4.076605
10    10  1174    3.712905
11    11  1273    4.026004
12    12  1246    3.940613
13    13  1277    4.038654
14    14  1292    4.086093
15    15  1259    3.981727
16    16  1276    4.035492
17    17  1200    3.795133
18    18  1242    3.927963
19    19  1204    3.807784
20    20  1266    4.003865
21    21   378    1.195467
22    22   377    1.192304
23    23   377    1.192304
24    24   377    1.192304
25    25   377    1.192304
26    26   377    1.192304
27    27   377    1.192304
28    28   377    1.192304
29    29   377    1.192304
30    30   377    1.192304

>>> dynamic_binning(df, 'Col1', -10000000, -20, 500000, 10000000)
    Bins  NOBS  Percentage
0    -20  4969   15.715014
1    -19     2    0.006325
2    -18     1    0.003163
3    -17     2    0.006325
4    -16     2    0.006325
..   ...   ...         ...
82    62   355    1.122727
83    63   355    1.122727
84    64   355    1.122727
85    65   355    1.122727
86    66   355    1.122727

[87 rows x 3 columns]

>>> dynamic_binning(df, 'Col1', 0, 0, 0.1, 3)
     Bins  NOBS  Percentage
0       0  5000   15.813055
1       1     0    0.000000
2       2     0    0.000000
3       3     0    0.000000
4       4     0    0.000000
..    ...   ...         ...
118   118   347    1.097426
119   119   347    1.097426
120   120   347    1.097426
121   121   347    1.097426
122   122   347    1.097426

[123 rows x 3 columns]
英文:

Maybe you can try this version:

def dynamic_binning(df_, col_, first_bin, first_bin_label, bin_width, threshold):
# Create bins
m = df_[col_] &gt; threshold
q = np.linspace(0, 1, 100 * sum(m) // len(df_))
bins = [-np.inf] + np.arange(first_bin, threshold+1, bin_width).tolist()
bins += df_.loc[m, col_].quantile(q).tolist()[1:]
df_bins_ = (pd.cut(df_[col_].dropna(), bins, labels=False, right=False)
.value_counts().rename(&#39;NOBS&#39;))
# Fix missing bins
idx = pd.RangeIndex(df_bins_.index.min(), df_bins_.index.max(), name=&#39;Bins&#39;)
df_bins_ = (df_bins_.reindex(idx, fill_value=0).reset_index()
.assign(Percentage=lambda x: x[&#39;NOBS&#39;] / len(df_) * 108,
Bins=lambda x: x[&#39;Bins&#39;].astype(int) + first_bin_label))
return df_bins_

However, your first call should be dynamic_binning(df, &#39;Col1&#39;, 0, -1, 1000000, 20000000) to be consistent.

&gt;&gt;&gt; dynamic_binning(df, &#39;Col1&#39;, 0, 0, 1000000, 20000000)
Bins  NOBS  Percentage
0      0  5000   15.813055
1      1  1274    4.029166
2      2  1169    3.697092
3      3  1287    4.070280
4      4  1241    3.924800
5      5  1296    4.098744
6      6  1246    3.940613
7      7  1288    4.073443
8      8  1201    3.798296
9      9  1289    4.076605
10    10  1174    3.712905
11    11  1273    4.026004
12    12  1246    3.940613
13    13  1277    4.038654
14    14  1292    4.086093
15    15  1259    3.981727
16    16  1276    4.035492
17    17  1200    3.795133
18    18  1242    3.927963
19    19  1204    3.807784
20    20  1266    4.003865
21    21   378    1.195467
22    22   377    1.192304
23    23   377    1.192304
24    24   377    1.192304
25    25   377    1.192304
26    26   377    1.192304
27    27   377    1.192304
28    28   377    1.192304
29    29   377    1.192304
30    30   377    1.192304
&gt;&gt;&gt; dynamic_binning(df, &#39;Col1&#39;, -10000000, -20, 500000, 10000000)
Bins  NOBS  Percentage
0    -20  4969   15.715014
1    -19     2    0.006325
2    -18     1    0.003163
3    -17     2    0.006325
4    -16     2    0.006325
..   ...   ...         ...
82    62   355    1.122727
83    63   355    1.122727
84    64   355    1.122727
85    65   355    1.122727
86    66   355    1.122727
[87 rows x 3 columns]
&gt;&gt;&gt; dynamic_binning(df, &#39;Col1&#39;, 0, 0, 0.1, 3)
Bins  NOBS  Percentage
0       0  5000   15.813055
1       1     0    0.000000
2       2     0    0.000000
3       3     0    0.000000
4       4     0    0.000000
..    ...   ...         ...
118   118   347    1.097426
119   119   347    1.097426
120   120   347    1.097426
121   121   347    1.097426
122   122   347    1.097426
[123 rows x 3 columns]

huangapple
  • 本文由 发表于 2023年6月16日 05:15:55
  • 转载请务必保留本文链接:https://go.coder-hub.com/76485569.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定