2023年3月9日 14:52:48go评论94阅读模式

英文:

Generate a dataframe of random numbers, but each column is restricted by lower an upper bound, and sum of row to be 100

问题

我正在寻找生成一个DataFrame，其中填充了随机浮点值，每列的值受元组列表的约束，这些元组定义了下限和上限。此外，DataFrame中的每一行的值应该总和为100。我尝试添加一个容差，使总和为99.99。但我的逻辑似乎有问题。问题在于有些行的总和不大于或等于99.99。

import pandas as pd
import numpy as np
num_rows = 150
num_cols = 15
bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]
tolerance = 0.1
df = pd.DataFrame(np.zeros((num_rows, num_cols)))
num_generated_rows = 0
while num_generated_rows < num_rows:
    for row in range(num_generated_rows, num_rows):
        row_sum = 0
        for col in range(num_cols):
            lower, upper = bounds[col]
            remaining_cols = num_cols - col - 1
            remaining_sum = 100 - row_sum
            if remaining_cols > 0:
                min_possible_value = max(0, remaining_sum - remaining_cols * upper)
                max_possible_value = min(upper, remaining_sum - remaining_cols * lower)
            else:
                min_possible_value = max(0, 100 - row_sum - upper)
                max_possible_value = min(upper, 100 - row_sum - lower)
            value = np.random.uniform(min_possible_value, max_possible_value)
            df.iloc[row, col] = value
            row_sum += value
        
        # 检查行总和是否在定义的容差范围内
        if abs(row_sum - 100) <= tolerance:
            num_generated_rows += 1
            break
        
        # 如果已达到所需数量，则停止生成行
        if num_generated_rows == num_rows:
            break
df = df.sample(frac=1).reset_index(drop=True)

请注意，这段代码生成一个DataFrame，其中包含符合定义的约束条件的随机浮点值，并且每行的总和在容差范围内。

英文:

I am looking to generate a DataFrame, filled with random float values, where the values in each column are bound by a list of tuples, defining lower and upper bounds. Additionally, each row in the dataframe should sum to 100. I tried adding a tolerance also to achieve 99.99. But my logic is somehow failing. The issue is that there are rows that don't sum greater or equal to 99.99.

import pandas as pd
import numpy as np
num_rows = 150
num_cols = 15
bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]
tolerance = 0.1
df = pd.DataFrame(np.zeros((num_rows, num_cols)))
num_generated_rows = 0
while num_generated_rows &lt; num_rows:
    for row in range(num_generated_rows, num_rows):
        row_sum = 0
        for col in range(num_cols):
            lower, upper = bounds[col]
            remaining_cols = num_cols - col - 1
            remaining_sum = 100 - row_sum
            if remaining_cols &gt; 0:
                min_possible_value = max(0, remaining_sum - remaining_cols * upper)
                max_possible_value = min(upper, remaining_sum - remaining_cols * lower)
            else:
                min_possible_value = max(0, 100 - row_sum - upper)
                max_possible_value = min(upper, 100 - row_sum - lower)
            value = np.random.uniform(min_possible_value, max_possible_value)
            df.iloc[row, col] = value
            row_sum += value
        
        # Check if the row sum is within the defined tolerance
        if abs(row_sum - 100) &lt;= tolerance:
            num_generated_rows += 1
            break
        
        # Stop generating rows if we&#39;ve reached the desired number
        if num_generated_rows == num_rows:
            break
df = df.sample(frac=1).reset_index(drop=True)

答案1

得分: 1

One (somewhat heavy-handed) approach is to use linear programming.

Advantages:

your bounds are guaranteed to be respected
your sum constraint is guaranteed to be respected and will be accurate
it is vectorised, so you don't need for-loops
the results are random-ish

Disadvantages:

there are probably faster methods, though this runs "quickly enough".
if you care about guarantees of random distribution uniformity: there are none here (though I'm pretty sure given your constraints that true uniformity is impossible)

import numpy as np
import pandas as pd
import scipy.optimize
from numpy.random import default_rng
from scipy.optimize import Bounds, LinearConstraint
num_rows = 150
num_cols = 15
rand = default_rng(seed=0)
bounds = np.array((
    (0, 20), (0, 10), (0, 15), (0,  5), (0, 10), (0,  5), (0, 20), (0, 10),
    (0, 10), (0,  5), (0,  5), (0, 10), (0,  5), (0, 10), (0, 10),
))
cell_bounds = np.tile bounds, (num_rows, 1))
perturbance = (cell_bounds[:, 1].min() - cell_bounds[:, 0].max()) / 2
lower = cell_bounds[:, 0] + rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)
upper = cell_bounds[:, 1] - rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)
A = np.repeat(a=np.eye(num_rows), repeats=num_cols, axis=1)
c_bound = np.full(shape=num_rows, fill_value=100)
result = scipy.optimize.milp(
    c=rand.uniform(low=-1, high=1, size=num_cols*num_rows),
    bounds=Bounds(lb=lower, ub=upper),
    constraints=LinearConstraint(A=A, lb=c_bound, ub=c_bound),)
print(result.message)
df = pd.DataFrame(result.x.reshape((num_rows, num_cols)))
print(df)

Optimization terminated successfully. (HiGHS Status 7: Optimal)
            0         1          2   ...        12        13        14
0    19.695780  8.378342   0.102434  ...  4.444407  8.091462  9.073666
1    17.113129  9.108588  14.591218  ...  2.452088  8.309021  9.861998
2    18.569490  8.074654  13.583485  ...  3.475941  0.844778  7.541122
3    19.393683  8.766053   1.557968  ...  1.990811  8.882646  0.775763
4    19.836319  8.456286   0.226883  ...  4.442556  8.904374  2.308569
..         ...       ...        ...  ...       ...       ...       ...
145  19.226648  3.675740  13.277363  ...  1.632323  0.432656  1.095681
146  17.893662  8.960568  13.408240  ...  2.419953  1.104904  8.556052
147   1.376055  8.251378  12.722397  ...  3.709209  9.451243  8.980960
148  17.538430  8.255690  12.762953  ...  3.606618  1.019980  9.087719
149   0.817833  9.246465  13.594034  ...  4.184895  8.385792  1.716741
[150 rows x 15 columns]

英文:

One (somewhat heavy-handed) approach is to use linear programming.

Advantages:

your bounds are guaranteed to be respected
your sum constraint is guaranteed to be respected and will be accurate
it is vectorised, so you don't need for-loops
the results are random-ish

Disadvantages:

there are probably faster methods, though this runs "quickly enough".
if you care about guarantees of random distribution uniformity: there are none here (though I'm pretty sure given your constraints that true uniformity is impossible)

import numpy as np
import pandas as pd
import scipy.optimize
from numpy.random import default_rng
from scipy.optimize import Bounds, LinearConstraint
num_rows = 150
num_cols = 15
rand = default_rng(seed=0)
bounds = np.array((
    (0, 20), (0, 10), (0, 15), (0,  5), (0, 10), (0,  5), (0, 20), (0, 10),
    (0, 10), (0,  5), (0,  5), (0, 10), (0,  5), (0, 10), (0, 10),
))
cell_bounds = np.tile(bounds, (num_rows, 1))
perturbance = (cell_bounds[:, 1].min() - cell_bounds[:, 0].max()) / 2
lower = cell_bounds[:, 0] + rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)
upper = cell_bounds[:, 1] - rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)
A = np.repeat(a=np.eye(num_rows), repeats=num_cols, axis=1)
c_bound = np.full(shape=num_rows, fill_value=100)
result = scipy.optimize.milp(
    c=rand.uniform(low=-1, high=1, size=num_cols*num_rows),
    bounds=Bounds(lb=lower, ub=upper),
    constraints=LinearConstraint(A=A, lb=c_bound, ub=c_bound),)
print(result.message)
df = pd.DataFrame(result.x.reshape((num_rows, num_cols)))
print(df)

Optimization terminated successfully. (HiGHS Status 7: Optimal)
            0         1          2   ...        12        13        14
0    19.695780  8.378342   0.102434  ...  4.444407  8.091462  9.073666
1    17.113129  9.108588  14.591218  ...  2.452088  8.309021  9.861998
2    18.569490  8.074654  13.583485  ...  3.475941  0.844778  7.541122
3    19.393683  8.766053   1.557968  ...  1.990811  8.882646  0.775763
4    19.836319  8.456286   0.226883  ...  4.442556  8.904374  2.308569
..         ...       ...        ...  ...       ...       ...       ...
145  19.226648  3.675740  13.277363  ...  1.632323  0.432656  1.095681
146  17.893662  8.960568  13.408240  ...  2.419953  1.104904  8.556052
147   1.376055  8.251378  12.722397  ...  3.709209  9.451243  8.980960
148  17.538430  8.255690  12.762953  ...  3.606618  1.019980  9.087719
149   0.817833  9.246465  13.594034  ...  4.184895  8.385792  1.716741
[150 rows x 15 columns]

答案2

得分: 1

不需要翻译的代码部分：

Rather than to draw numbers between [lb, ub], you can draw numbers between [0, 1] that acts as coefficient. 0.25 for an interval of [0, 30] is 7.5 and rescale this coefficient to the right value after fixing new bounds (check comments).

You can use the following optimized code:

import numpy as np

bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]
num_cols = len(bounds)
num_rows = 150

TARGET = 100

1st-pass, random generation

lb = np.array(bounds, dtype=float)[:, 0]
ub = np.array(bounds, dtype=float)[:, 1]

Check feasibility

assert lb.sum() <= TARGET, f'Infeasible problem: sum(lb) > {TARGET}'
assert ub.sum() >= TARGET, f'Infeasible problem: sum(ub) < {TARGET}'

x is a coefficient between 0 and 1

y is a value between lb and ub

np.random.seed(42) # for demo only
x = np.random.random(size=(num_rows, num_cols))
y = x*ub + (1-x)*lb

Fix bounds

m = np.c_[np.sum(y, axis=1) < TARGET] # below or above TARGET?
lb = np.where(m, y, lb) # Change lower bounds to x if below TARGET
ub = np.where(~m, y, ub) # Change upper bounds to x if above TARGET

2nd-pass, fix each values

LB, UB = lb.sum(axis=1), ub.sum(axis=1)
x = (TARGET-LB) / (UB-LB)
y = x[:, None]*ub + (1-x[:, None])*lb

Note: don't forget to remove np.random.seed(42)

英文:

You can use the following optimized code:

import numpy as np
bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]
num_cols = len(bounds)
num_rows = 150
TARGET = 100
# 1st-pass, random generation
lb = np.array(bounds, dtype=float)[:, 0]
ub = np.array(bounds, dtype=float)[:, 1]
# Check feasibility
assert lb.sum() &lt;= TARGET, f&#39;Infeasible problem: sum(lb) &gt; {TARGET}&#39;
assert ub.sum() &gt;= TARGET, f&#39;Infeasible problem: sum(ub) &lt; {TARGET}&#39;
# x is a coefficient between 0 and 1
# y is a value between lb and ub
np.random.seed(42)  # for demo only
x = np.random.random(size=(num_rows, num_cols))
y = x*ub + (1-x)*lb
# Fix bounds
m = np.c_[np.sum(y, axis=1) &lt; TARGET]  # below or above TARGET?
lb = np.where(m, y, lb)  # Change lower bounds to x if below TARGET
ub = np.where(~m, y, ub)  # Change upper bounds to x if above TARGET
# 2nd-pass, fix each values
LB, UB = lb.sum(axis=1), ub.sum(axis=1)
x = (TARGET-LB) / (UB-LB)
y = x[:, None]*ub + (1-x[:, None])*lb

Note: don't forget to remove np.random.seed(42)

Output:

&gt;&gt;&gt; y
array([[12.12345246,  9.68966746, 12.4687076 , ...,  4.47247875,
         5.04041056,  4.84827506],
       [11.89476371,  6.54707803, 11.46217608, ...,  3.79461597,
         7.97722602,  5.26770304],
       [15.28289786,  5.01507058,  6.57181308, ...,  2.09846591,
         9.45503973,  5.54546481],
       ...,
       [18.39384868,  2.94238081, 12.9170368 , ...,  1.47846341,
         8.75263679,  1.3008411 ],
       [16.51397403,  7.29824288,  7.02829416, ...,  4.71439879,
         9.13096929,  8.04592187],
       [15.17896817,  9.36179865,  7.75855936, ...,  4.50037072,
         7.62359895,  4.75520839]])
&gt;&gt;&gt; y.shape
(150, 15)
&gt;&gt;&gt; y.sum(axis=1)
array([100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
       100., 100., 100., 100., 100., 100., 100.])

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

Generate a dataframe of random numbers, but each column is restricted by lower an upper bound, and sum of row to be 100

问题

答案1

答案2

1st-pass, random generation

Check feasibility

x is a coefficient between 0 and 1

y is a value between lb and ub

Fix bounds

2nd-pass, fix each values

需要输入搜索行的Excel

在特定列上匹配索引值连接numpy数组。

数据框架：根据条件替换值及其周围的值。

requests.exceptions.InvalidSchema: 找不到连接适配器。我正在尝试遍历一个列表

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。