Generate a dataframe of random numbers, but each column is restricted by lower an upper bound, and sum of row to be 100

huangapple go评论54阅读模式
英文:

Generate a dataframe of random numbers, but each column is restricted by lower an upper bound, and sum of row to be 100

问题

我正在寻找生成一个DataFrame,其中填充了随机浮点值,每列的值受元组列表的约束,这些元组定义了下限和上限。此外,DataFrame中的每一行的值应该总和为100。我尝试添加一个容差,使总和为99.99。但我的逻辑似乎有问题。问题在于有些行的总和不大于或等于99.99。

import pandas as pd
import numpy as np

num_rows = 150
num_cols = 15

bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]

tolerance = 0.1

df = pd.DataFrame(np.zeros((num_rows, num_cols)))

num_generated_rows = 0
while num_generated_rows < num_rows:
    for row in range(num_generated_rows, num_rows):
        row_sum = 0
        for col in range(num_cols):
            lower, upper = bounds[col]
            remaining_cols = num_cols - col - 1
            remaining_sum = 100 - row_sum
            if remaining_cols > 0:
                min_possible_value = max(0, remaining_sum - remaining_cols * upper)
                max_possible_value = min(upper, remaining_sum - remaining_cols * lower)
            else:
                min_possible_value = max(0, 100 - row_sum - upper)
                max_possible_value = min(upper, 100 - row_sum - lower)
            value = np.random.uniform(min_possible_value, max_possible_value)
            df.iloc[row, col] = value
            row_sum += value
        
        # 检查行总和是否在定义的容差范围内
        if abs(row_sum - 100) <= tolerance:
            num_generated_rows += 1
            break
        
        # 如果已达到所需数量,则停止生成行
        if num_generated_rows == num_rows:
            break

df = df.sample(frac=1).reset_index(drop=True)

请注意,这段代码生成一个DataFrame,其中包含符合定义的约束条件的随机浮点值,并且每行的总和在容差范围内。

英文:

I am looking to generate a DataFrame, filled with random float values, where the values in each column are bound by a list of tuples, defining lower and upper bounds. Additionally, each row in the dataframe should sum to 100. I tried adding a tolerance also to achieve 99.99. But my logic is somehow failing. The issue is that there are rows that don't sum greater or equal to 99.99.

import pandas as pd
import numpy as np

num_rows = 150
num_cols = 15

bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]

tolerance = 0.1

df = pd.DataFrame(np.zeros((num_rows, num_cols)))

num_generated_rows = 0
while num_generated_rows &lt; num_rows:
    for row in range(num_generated_rows, num_rows):
        row_sum = 0
        for col in range(num_cols):
            lower, upper = bounds[col]
            remaining_cols = num_cols - col - 1
            remaining_sum = 100 - row_sum
            if remaining_cols &gt; 0:
                min_possible_value = max(0, remaining_sum - remaining_cols * upper)
                max_possible_value = min(upper, remaining_sum - remaining_cols * lower)
            else:
                min_possible_value = max(0, 100 - row_sum - upper)
                max_possible_value = min(upper, 100 - row_sum - lower)
            value = np.random.uniform(min_possible_value, max_possible_value)
            df.iloc[row, col] = value
            row_sum += value
        
        # Check if the row sum is within the defined tolerance
        if abs(row_sum - 100) &lt;= tolerance:
            num_generated_rows += 1
            break
        
        # Stop generating rows if we&#39;ve reached the desired number
        if num_generated_rows == num_rows:
            break

df = df.sample(frac=1).reset_index(drop=True)

答案1

得分: 1

One (somewhat heavy-handed) approach is to use linear programming.

Advantages:

  • your bounds are guaranteed to be respected
  • your sum constraint is guaranteed to be respected and will be accurate
  • it is vectorised, so you don't need for-loops
  • the results are random-ish

Disadvantages:

  • there are probably faster methods, though this runs "quickly enough".
  • if you care about guarantees of random distribution uniformity: there are none here (though I'm pretty sure given your constraints that true uniformity is impossible)
import numpy as np
import pandas as pd
import scipy.optimize
from numpy.random import default_rng
from scipy.optimize import Bounds, LinearConstraint

num_rows = 150
num_cols = 15

rand = default_rng(seed=0)

bounds = np.array((
    (0, 20), (0, 10), (0, 15), (0,  5), (0, 10), (0,  5), (0, 20), (0, 10),
    (0, 10), (0,  5), (0,  5), (0, 10), (0,  5), (0, 10), (0, 10),
))
cell_bounds = np.tile bounds, (num_rows, 1))
perturbance = (cell_bounds[:, 1].min() - cell_bounds[:, 0].max()) / 2
lower = cell_bounds[:, 0] + rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)
upper = cell_bounds[:, 1] - rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)

A = np.repeat(a=np.eye(num_rows), repeats=num_cols, axis=1)

c_bound = np.full(shape=num_rows, fill_value=100)

result = scipy.optimize.milp(
    c=rand.uniform(low=-1, high=1, size=num_cols*num_rows),
    bounds=Bounds(lb=lower, ub=upper),
    constraints=LinearConstraint(A=A, lb=c_bound, ub=c_bound),)
print(result.message)
df = pd.DataFrame(result.x.reshape((num_rows, num_cols)))
print(df)
Optimization terminated successfully. (HiGHS Status 7: Optimal)
            0         1          2   ...        12        13        14
0    19.695780  8.378342   0.102434  ...  4.444407  8.091462  9.073666
1    17.113129  9.108588  14.591218  ...  2.452088  8.309021  9.861998
2    18.569490  8.074654  13.583485  ...  3.475941  0.844778  7.541122
3    19.393683  8.766053   1.557968  ...  1.990811  8.882646  0.775763
4    19.836319  8.456286   0.226883  ...  4.442556  8.904374  2.308569
..         ...       ...        ...  ...       ...       ...       ...
145  19.226648  3.675740  13.277363  ...  1.632323  0.432656  1.095681
146  17.893662  8.960568  13.408240  ...  2.419953  1.104904  8.556052
147   1.376055  8.251378  12.722397  ...  3.709209  9.451243  8.980960
148  17.538430  8.255690  12.762953  ...  3.606618  1.019980  9.087719
149   0.817833  9.246465  13.594034  ...  4.184895  8.385792  1.716741

[150 rows x 15 columns]
英文:

One (somewhat heavy-handed) approach is to use linear programming.

Advantages:

  • your bounds are guaranteed to be respected
  • your sum constraint is guaranteed to be respected and will be accurate
  • it is vectorised, so you don't need for-loops
  • the results are random-ish

Disadvantages:

  • there are probably faster methods, though this runs "quickly enough".
  • if you care about guarantees of random distribution uniformity: there are none here (though I'm pretty sure given your constraints that true uniformity is impossible)
import numpy as np
import pandas as pd
import scipy.optimize
from numpy.random import default_rng
from scipy.optimize import Bounds, LinearConstraint

num_rows = 150
num_cols = 15

rand = default_rng(seed=0)

bounds = np.array((
    (0, 20), (0, 10), (0, 15), (0,  5), (0, 10), (0,  5), (0, 20), (0, 10),
    (0, 10), (0,  5), (0,  5), (0, 10), (0,  5), (0, 10), (0, 10),
))
cell_bounds = np.tile(bounds, (num_rows, 1))
perturbance = (cell_bounds[:, 1].min() - cell_bounds[:, 0].max()) / 2
lower = cell_bounds[:, 0] + rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)
upper = cell_bounds[:, 1] - rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)

A = np.repeat(a=np.eye(num_rows), repeats=num_cols, axis=1)

c_bound = np.full(shape=num_rows, fill_value=100)

result = scipy.optimize.milp(
    c=rand.uniform(low=-1, high=1, size=num_cols*num_rows),
    bounds=Bounds(lb=lower, ub=upper),
    constraints=LinearConstraint(A=A, lb=c_bound, ub=c_bound),)
print(result.message)
df = pd.DataFrame(result.x.reshape((num_rows, num_cols)))
print(df)
Optimization terminated successfully. (HiGHS Status 7: Optimal)
            0         1          2   ...        12        13        14
0    19.695780  8.378342   0.102434  ...  4.444407  8.091462  9.073666
1    17.113129  9.108588  14.591218  ...  2.452088  8.309021  9.861998
2    18.569490  8.074654  13.583485  ...  3.475941  0.844778  7.541122
3    19.393683  8.766053   1.557968  ...  1.990811  8.882646  0.775763
4    19.836319  8.456286   0.226883  ...  4.442556  8.904374  2.308569
..         ...       ...        ...  ...       ...       ...       ...
145  19.226648  3.675740  13.277363  ...  1.632323  0.432656  1.095681
146  17.893662  8.960568  13.408240  ...  2.419953  1.104904  8.556052
147   1.376055  8.251378  12.722397  ...  3.709209  9.451243  8.980960
148  17.538430  8.255690  12.762953  ...  3.606618  1.019980  9.087719
149   0.817833  9.246465  13.594034  ...  4.184895  8.385792  1.716741

[150 rows x 15 columns]

答案2

得分: 1

不需要翻译的代码部分:

Rather than to draw numbers between [lb, ub], you can draw numbers between [0, 1] that acts as coefficient. 0.25 for an interval of [0, 30] is 7.5 and rescale this coefficient to the right value after fixing new bounds (check comments).

You can use the following optimized code:

import numpy as np

bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]
num_cols = len(bounds)
num_rows = 150

TARGET = 100

1st-pass, random generation

lb = np.array(bounds, dtype=float)[:, 0]
ub = np.array(bounds, dtype=float)[:, 1]

Check feasibility

assert lb.sum() <= TARGET, f'Infeasible problem: sum(lb) > {TARGET}'
assert ub.sum() >= TARGET, f'Infeasible problem: sum(ub) < {TARGET}'

x is a coefficient between 0 and 1

y is a value between lb and ub

np.random.seed(42) # for demo only
x = np.random.random(size=(num_rows, num_cols))
y = x*ub + (1-x)*lb

Fix bounds

m = np.c_[np.sum(y, axis=1) < TARGET] # below or above TARGET?
lb = np.where(m, y, lb) # Change lower bounds to x if below TARGET
ub = np.where(~m, y, ub) # Change upper bounds to x if above TARGET

2nd-pass, fix each values

LB, UB = lb.sum(axis=1), ub.sum(axis=1)
x = (TARGET-LB) / (UB-LB)
y = x[:, None]*ub + (1-x[:, None])*lb

Note: don't forget to remove np.random.seed(42)

英文:

Rather than to draw numbers between [lb, ub], you can draw numbers between [0, 1] that acts as coefficient. 0.25 for an interval of [0, 30] is 7.5 and rescale this coefficient to the right value after fixing new bounds (check comments).

You can use the following optimized code:

import numpy as np

bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]
num_cols = len(bounds)
num_rows = 150

TARGET = 100

# 1st-pass, random generation
lb = np.array(bounds, dtype=float)[:, 0]
ub = np.array(bounds, dtype=float)[:, 1]

# Check feasibility
assert lb.sum() &lt;= TARGET, f&#39;Infeasible problem: sum(lb) &gt; {TARGET}&#39;
assert ub.sum() &gt;= TARGET, f&#39;Infeasible problem: sum(ub) &lt; {TARGET}&#39;

# x is a coefficient between 0 and 1
# y is a value between lb and ub
np.random.seed(42)  # for demo only
x = np.random.random(size=(num_rows, num_cols))
y = x*ub + (1-x)*lb

# Fix bounds
m = np.c_[np.sum(y, axis=1) &lt; TARGET]  # below or above TARGET?
lb = np.where(m, y, lb)  # Change lower bounds to x if below TARGET
ub = np.where(~m, y, ub)  # Change upper bounds to x if above TARGET

# 2nd-pass, fix each values
LB, UB = lb.sum(axis=1), ub.sum(axis=1)
x = (TARGET-LB) / (UB-LB)
y = x[:, None]*ub + (1-x[:, None])*lb

Note: don't forget to remove np.random.seed(42)

Output:

&gt;&gt;&gt; y
array([[12.12345246,  9.68966746, 12.4687076 , ...,  4.47247875,
         5.04041056,  4.84827506],
       [11.89476371,  6.54707803, 11.46217608, ...,  3.79461597,
         7.97722602,  5.26770304],
       [15.28289786,  5.01507058,  6.57181308, ...,  2.09846591,
         9.45503973,  5.54546481],
       ...,
       [18.39384868,  2.94238081, 12.9170368 , ...,  1.47846341,
         8.75263679,  1.3008411 ],
       [16.51397403,  7.29824288,  7.02829416, ...,  4.71439879,
         9.13096929,  8.04592187],
       [15.17896817,  9.36179865,  7.75855936, ...,  4.50037072,
         7.62359895,  4.75520839]])

&gt;&gt;&gt; y.shape
(150, 15)

&gt;&gt;&gt; y.sum(axis=1)
array([100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100.])

huangapple
  • 本文由 发表于 2023年3月9日 14:52:48
  • 转载请务必保留本文链接:https://go.coder-hub.com/75681296.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定