创建动态分箱的Python函数

huangapple go评论88阅读模式
英文:

Function to create dynamic bins in Python

问题

I have pandas DataFrame of numeric columns. This data does not contain any zeros(0).

我有一个包含数值列的pandas DataFrame。这个数据不包含任何零(0)。

I have defined a function that creates integer bins as follows

我已经定义了一个函数,它创建整数分箱,如下所示:

  1. -1 for all negative values
    对于所有负值设定为-1

  2. Bin width (e.g. 1000000) upto threshold (e.g. 20000000)
    划分的宽度(例如 1000000)直到阈值(例如 20000000)

  3. After the threshold (e.g. 20000000), it will dynamically size bins to contain 1% of the total count or NOBS(No. of Observations)
    i.e. if NOBS is going below 1% of total NOBS then it will dynamically adjust the bin width such that it will be approx 1% of total NOBS.
    在阈值之后(例如 20000000),它将动态调整分箱的大小,以包含总计数的1%或NOBS(观测次数)。
    即如果NOBS低于总NOBS的1%,它将动态调整分箱宽度,以使其大致等于总NOBS的1%。

Here is my code

这是我的代码:

  1. import pandas as pd
  2. import numpy as np
  3. # Reproducible example
  4. rng = np.random.default_rng(42)
  5. a = rng.integers(-1687500000, 0, 5000)
  6. b = rng.integers(0, 20000000, 25000)
  7. c = rng.integers(20000000, 56956420100000, 4149)
  8. df = pd.DataFrame({'Col1': np.concatenate([a, b, c])})
  9. def dynamic_binning(df_, col_, first_bin, first_bin_label, bin_width, threshold):
  10. # Create bins
  11. m = df_[col_] > threshold
  12. q = np.linspace(0, 1, 100 * sum(m) // len(df_))
  13. bins = [-np.inf] + np.arange(first_bin, threshold+1, bin_width).tolist()
  14. bins += df_.loc[m, col_].quantile(q).tolist()[1:]
  15. # df_.loc[df_[col_]<=first_bin] = first_bin
  16. df_bins_ = (pd.cut(df_[col_].dropna(), bins, labels=False, right=False).sub(abs(first_bin_label)+1)
  17. .value_counts().rename('NOBS')
  18. .sort_index().rename_axis('Bins').reset_index()
  19. .assign(Percentage=lambda x: x['NOBS'] / len(df_) * 108))
  20. return df_bins_
  21. Calling the function
  22. 调用函数
  23. ```python
  24. dynamic_binning(df, 'Col1', 0, 0, 1000000, 20000000)

The output of the function

函数的输出:

  1. Bins NOBS Percentage
  2. 0 -1.0 5000 15.813054555038216
  3. 1 0.0 1274 4.0291663006237375
  4. 2 1.0 1169 3.6970921549679345
  5. 3 2.0 1287 4.070280242466836
  6. 4 3.0 1241 3.924800140560485
  7. 5 4.0 1296 4.098743740665905
  8. 6 5.0 1246 3.940613195115523
  9. 7 6.0 1288 4.073442853377844
  10. 8 7.0 1201 3.798295704120179
  11. 9 8.0 1289 4.076605464288852
  12. 10 9.0 1174 3.712905209522973
  13. 11 10.0 1273 4.02600368971273
  14. 12 11.0 1246 3.940613195115523
  15. 13 12.0 1277 4.0386541333567605
  16. 14 13.0 1292 4.086093297021875
  17. 15 14.0 1259 3.9817271369586225
  18. 16 15.0 1276 4.035491522445753
  19. 17 16.0 1200 3.7951330932091714
  20. 18 17.0 1242 3.9279627514714925
  21. 19 18.0 1204 3.807783536853202
  22. 20 19.0 1266 4.003865413335676
  23. 21 20.0 378 1.195466924360889
  24. 22 21.0 377 1.1923043134498814
  25. 23 22.0 377 1.1923043134498814
  26. 24 23.0 377 1.1923043134498814
  27. 25 24.0 377 1.1923043134498814
  28. 26 25.0 377 1.1923043134498814
  29. 27 26.0 377 1.1923043134498814
  30. 28 27.0 377 1.1923043134498814
  31. 29 28.0 377 1.1923043134498814
  32. 30 29.0 377 1.1923043134498814
  33. 31 30.0 377 1.1923043134498814

In the above example, 20 Bins -1, 0, 1, 2,3,......, 19 are till threshold i.e. 20000000. Bins 20, 21, 22,...30 are dynamic bins as per the third condition mentioned above.

在上面的例子中,20个分箱-1, 0, 1, 2,3,......, 19在阈值之前,即20000000。分箱20, 21, 22,...30是根据上述第三个条件而动态调整的分箱。

I'm facing issues while using this function for the following cases

我在以下情况下使用这个函数时遇到问题:

  1. first_bin = -10000000, bin_width = 500000, threshold = 5000000

    调用 dynamic_binning 函数:

    dynamic_binning(df, 'Col2', -10000000, -20, 500000, 10000000)

    In this case, 20 Bins -20, -19, -18,.....-1, 0, 1, 2, .....9 are till threshold i.e. 10000000. Bins `10

英文:

I have pandas DataFrame of numeric columns. This data does not contain any zeros(0).

I have defined a function that creates integer bins as follows

  1. 1. -1 for all negative values
  2. 2. Bin width (e.g. 1000000) upto threshold (e.g. 20000000)
  3. 3. After the threshold (e.g. 20000000), it will dynamically size bins to contain 1% of the total count or NOBS(No. of Observations)
  4. i.e. if NOBS is going below 1% of total NOBS then it will dynamically adjust the bin width such that it will be approx 1% of total NOBS.

Here is my code

  1. import pandas as pd
  2. import numpy as np
  3. # Reproducible example
  4. rng = np.random.default_rng(42)
  5. a = rng.integers(-1687500000, 0, 5000)
  6. b = rng.integers(0, 20000000, 25000)
  7. c = rng.integers(20000000, 56956420100000, 4149)
  8. df = pd.DataFrame({&#39;Col1&#39;: np.concatenate([a, b, c])})
  9. def dynamic_binning(df_, col_, first_bin, first_bin_label, bin_width, threshold):
  10. # Create bins
  11. m = df_[col_] &gt; threshold
  12. q = np.linspace(0, 1, 100 * sum(m) // len(df_))
  13. bins = [-np.inf] + np.arange(first_bin, threshold+1, bin_width).tolist()
  14. bins += df_.loc[m, col_].quantile(q).tolist()[1:]
  15. # df_.loc[df_[col_]&lt;=first_bin] = first_bin
  16. df_bins_ = (pd.cut(df_[col_].dropna(), bins, labels=False, right=False).sub(abs(first_bin_label)+1)
  17. .value_counts().rename(&#39;NOBS&#39;)
  18. .sort_index().rename_axis(&#39;Bins&#39;).reset_index()
  19. .assign(Percentage=lambda x: x[&#39;NOBS&#39;] / len(df_) * 108))
  20. return df_bins_

Calling the function

  1. dynamic_binning(df, &#39;Col1&#39;, 0, 0, 1000000, 20000000)

The output of the function

  1. Bins NOBS Percentage
  2. 0 -1.0 5000 15.813054555038216
  3. 1 0.0 1274 4.0291663006237375
  4. 2 1.0 1169 3.6970921549679345
  5. 3 2.0 1287 4.070280242466836
  6. 4 3.0 1241 3.924800140560485
  7. 5 4.0 1296 4.098743740665905
  8. 6 5.0 1246 3.940613195115523
  9. 7 6.0 1288 4.073442853377844
  10. 8 7.0 1201 3.798295704120179
  11. 9 8.0 1289 4.076605464288852
  12. 10 9.0 1174 3.712905209522973
  13. 11 10.0 1273 4.02600368971273
  14. 12 11.0 1246 3.940613195115523
  15. 13 12.0 1277 4.0386541333567605
  16. 14 13.0 1292 4.086093297021875
  17. 15 14.0 1259 3.9817271369586225
  18. 16 15.0 1276 4.035491522445753
  19. 17 16.0 1200 3.7951330932091714
  20. 18 17.0 1242 3.9279627514714925
  21. 19 18.0 1204 3.807783536853202
  22. 20 19.0 1266 4.003865413335676
  23. 21 20.0 378 1.195466924360889
  24. 22 21.0 377 1.1923043134498814
  25. 23 22.0 377 1.1923043134498814
  26. 24 23.0 377 1.1923043134498814
  27. 25 24.0 377 1.1923043134498814
  28. 26 25.0 377 1.1923043134498814
  29. 27 26.0 377 1.1923043134498814
  30. 28 27.0 377 1.1923043134498814
  31. 29 28.0 377 1.1923043134498814
  32. 30 29.0 377 1.1923043134498814
  33. 31 30.0 377 1.1923043134498814

In the above example, 20 Bins -1, 0, 1, 2,3,......, 19 are till threshold i.e. 20000000. Bins 20, 21, 22,...30 are dynamic bins as per the third condition mentioned above.

I'm facing issues while using this function for the following cases

  1. first_bin = -10000000, bin_width = 500000, threshold = 5000000

Calling dynamic_binning function

dynamic_binning(df, &#39;Col2&#39;, -10000000, -20, 500000, 10000000)

In this case, 20 Bins -20, -19, -18,.....-1, 0, 1, 2, .....9 are till threshold i.e. 10000000. Bins 10, 11, 12,... are dynamic bins as per the third condition mentioned above. But I need to use this code df_.loc[df_[col_]&lt;=first_bin] = first_bin before df_bins_ = (pd.cut(.... otherwise it creates two bins at the start i.e. -21 and -20 instead of -20 which includes values &lt;=10000000

  1. If I use first_bin = 0, bin_width = 0.1, threshold = 3
    Calling dynamic_binning function
    dynamic_binning(df, &#39;Col3&#39;, 0, 0, 0.1, 3)
    In this case, df[&#39;Col3&#39;] values range from -0.281 to 1.922

I get the following error

  1. ValueError: bins must increase monotonically
  1. Getting upper and lower limits of each bin (including dynamic bins) in a new column (bin_labels) e.g. -inf - 0, 0 - 1, 1 - 2, 2 - 3.........

答案1

得分: 1

也许你可以尝试这个版本:

  1. def dynamic_binning(df_, col_, first_bin, first_bin_label, bin_width, threshold):
  2. # 创建分箱
  3. m = df_[col_] > threshold
  4. q = np.linspace(0, 1, 100 * sum(m) // len(df_))
  5. bins = [-np.inf] + np.arange(first_bin, threshold+1, bin_width).tolist()
  6. bins += df_.loc[m, col_].quantile(q).tolist()[1:]
  7. df_bins_ = (pd.cut(df_[col_].dropna(), bins, labels=False, right=False)
  8. .value_counts().rename('NOBS'))
  9. # 修复缺失的分箱
  10. idx = pd.RangeIndex(df_bins_.index.min(), df_bins_.index.max(), name='Bins')
  11. df_bins_ = (df_bins_.reindex(idx, fill_value=0).reset_index()
  12. .assign(Percentage=lambda x: x['NOBS'] / len(df_) * 108,
  13. Bins=lambda x: x['Bins'].astype(int) + first_bin_label))
  14. return df_bins_

然而,你的第一个调用应该是dynamic_binning(df, 'Col1', 0, -1, 1000000, 20000000)以保持一致。

  1. >>> dynamic_binning(df, 'Col1', 0, 0, 1000000, 20000000)
  2. Bins NOBS Percentage
  3. 0 0 5000 15.813055
  4. 1 1 1274 4.029166
  5. 2 2 1169 3.697092
  6. 3 3 1287 4.070280
  7. 4 4 1241 3.924800
  8. 5 5 1296 4.098744
  9. 6 6 1246 3.940613
  10. 7 7 1288 4.073443
  11. 8 8 1201 3.798296
  12. 9 9 1289 4.076605
  13. 10 10 1174 3.712905
  14. 11 11 1273 4.026004
  15. 12 12 1246 3.940613
  16. 13 13 1277 4.038654
  17. 14 14 1292 4.086093
  18. 15 15 1259 3.981727
  19. 16 16 1276 4.035492
  20. 17 17 1200 3.795133
  21. 18 18 1242 3.927963
  22. 19 19 1204 3.807784
  23. 20 20 1266 4.003865
  24. 21 21 378 1.195467
  25. 22 22 377 1.192304
  26. 23 23 377 1.192304
  27. 24 24 377 1.192304
  28. 25 25 377 1.192304
  29. 26 26 377 1.192304
  30. 27 27 377 1.192304
  31. 28 28 377 1.192304
  32. 29 29 377 1.192304
  33. 30 30 377 1.192304
  34. >>> dynamic_binning(df, 'Col1', -10000000, -20, 500000, 10000000)
  35. Bins NOBS Percentage
  36. 0 -20 4969 15.715014
  37. 1 -19 2 0.006325
  38. 2 -18 1 0.003163
  39. 3 -17 2 0.006325
  40. 4 -16 2 0.006325
  41. .. ... ... ...
  42. 82 62 355 1.122727
  43. 83 63 355 1.122727
  44. 84 64 355 1.122727
  45. 85 65 355 1.122727
  46. 86 66 355 1.122727
  47. [87 rows x 3 columns]
  48. >>> dynamic_binning(df, 'Col1', 0, 0, 0.1, 3)
  49. Bins NOBS Percentage
  50. 0 0 5000 15.813055
  51. 1 1 0 0.000000
  52. 2 2 0 0.000000
  53. 3 3 0 0.000000
  54. 4 4 0 0.000000
  55. .. ... ... ...
  56. 118 118 347 1.097426
  57. 119 119 347 1.097426
  58. 120 120 347 1.097426
  59. 121 121 347 1.097426
  60. 122 122 347 1.097426
  61. [123 rows x 3 columns]
英文:

Maybe you can try this version:

  1. def dynamic_binning(df_, col_, first_bin, first_bin_label, bin_width, threshold):
  2. # Create bins
  3. m = df_[col_] &gt; threshold
  4. q = np.linspace(0, 1, 100 * sum(m) // len(df_))
  5. bins = [-np.inf] + np.arange(first_bin, threshold+1, bin_width).tolist()
  6. bins += df_.loc[m, col_].quantile(q).tolist()[1:]
  7. df_bins_ = (pd.cut(df_[col_].dropna(), bins, labels=False, right=False)
  8. .value_counts().rename(&#39;NOBS&#39;))
  9. # Fix missing bins
  10. idx = pd.RangeIndex(df_bins_.index.min(), df_bins_.index.max(), name=&#39;Bins&#39;)
  11. df_bins_ = (df_bins_.reindex(idx, fill_value=0).reset_index()
  12. .assign(Percentage=lambda x: x[&#39;NOBS&#39;] / len(df_) * 108,
  13. Bins=lambda x: x[&#39;Bins&#39;].astype(int) + first_bin_label))
  14. return df_bins_

However, your first call should be dynamic_binning(df, &#39;Col1&#39;, 0, -1, 1000000, 20000000) to be consistent.

  1. &gt;&gt;&gt; dynamic_binning(df, &#39;Col1&#39;, 0, 0, 1000000, 20000000)
  2. Bins NOBS Percentage
  3. 0 0 5000 15.813055
  4. 1 1 1274 4.029166
  5. 2 2 1169 3.697092
  6. 3 3 1287 4.070280
  7. 4 4 1241 3.924800
  8. 5 5 1296 4.098744
  9. 6 6 1246 3.940613
  10. 7 7 1288 4.073443
  11. 8 8 1201 3.798296
  12. 9 9 1289 4.076605
  13. 10 10 1174 3.712905
  14. 11 11 1273 4.026004
  15. 12 12 1246 3.940613
  16. 13 13 1277 4.038654
  17. 14 14 1292 4.086093
  18. 15 15 1259 3.981727
  19. 16 16 1276 4.035492
  20. 17 17 1200 3.795133
  21. 18 18 1242 3.927963
  22. 19 19 1204 3.807784
  23. 20 20 1266 4.003865
  24. 21 21 378 1.195467
  25. 22 22 377 1.192304
  26. 23 23 377 1.192304
  27. 24 24 377 1.192304
  28. 25 25 377 1.192304
  29. 26 26 377 1.192304
  30. 27 27 377 1.192304
  31. 28 28 377 1.192304
  32. 29 29 377 1.192304
  33. 30 30 377 1.192304
  34. &gt;&gt;&gt; dynamic_binning(df, &#39;Col1&#39;, -10000000, -20, 500000, 10000000)
  35. Bins NOBS Percentage
  36. 0 -20 4969 15.715014
  37. 1 -19 2 0.006325
  38. 2 -18 1 0.003163
  39. 3 -17 2 0.006325
  40. 4 -16 2 0.006325
  41. .. ... ... ...
  42. 82 62 355 1.122727
  43. 83 63 355 1.122727
  44. 84 64 355 1.122727
  45. 85 65 355 1.122727
  46. 86 66 355 1.122727
  47. [87 rows x 3 columns]
  48. &gt;&gt;&gt; dynamic_binning(df, &#39;Col1&#39;, 0, 0, 0.1, 3)
  49. Bins NOBS Percentage
  50. 0 0 5000 15.813055
  51. 1 1 0 0.000000
  52. 2 2 0 0.000000
  53. 3 3 0 0.000000
  54. 4 4 0 0.000000
  55. .. ... ... ...
  56. 118 118 347 1.097426
  57. 119 119 347 1.097426
  58. 120 120 347 1.097426
  59. 121 121 347 1.097426
  60. 122 122 347 1.097426
  61. [123 rows x 3 columns]

huangapple
  • 本文由 发表于 2023年6月16日 05:15:55
  • 转载请务必保留本文链接:https://go.coder-hub.com/76485569.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定