英文:
Generate a dataframe of random numbers, but each column is restricted by lower an upper bound, and sum of row to be 100
问题
我正在寻找生成一个DataFrame,其中填充了随机浮点值,每列的值受元组列表的约束,这些元组定义了下限和上限。此外,DataFrame中的每一行的值应该总和为100。我尝试添加一个容差,使总和为99.99。但我的逻辑似乎有问题。问题在于有些行的总和不大于或等于99.99。
import pandas as pd
import numpy as np
num_rows = 150
num_cols = 15
bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]
tolerance = 0.1
df = pd.DataFrame(np.zeros((num_rows, num_cols)))
num_generated_rows = 0
while num_generated_rows < num_rows:
for row in range(num_generated_rows, num_rows):
row_sum = 0
for col in range(num_cols):
lower, upper = bounds[col]
remaining_cols = num_cols - col - 1
remaining_sum = 100 - row_sum
if remaining_cols > 0:
min_possible_value = max(0, remaining_sum - remaining_cols * upper)
max_possible_value = min(upper, remaining_sum - remaining_cols * lower)
else:
min_possible_value = max(0, 100 - row_sum - upper)
max_possible_value = min(upper, 100 - row_sum - lower)
value = np.random.uniform(min_possible_value, max_possible_value)
df.iloc[row, col] = value
row_sum += value
# 检查行总和是否在定义的容差范围内
if abs(row_sum - 100) <= tolerance:
num_generated_rows += 1
break
# 如果已达到所需数量,则停止生成行
if num_generated_rows == num_rows:
break
df = df.sample(frac=1).reset_index(drop=True)
请注意,这段代码生成一个DataFrame,其中包含符合定义的约束条件的随机浮点值,并且每行的总和在容差范围内。
英文:
I am looking to generate a DataFrame, filled with random float values, where the values in each column are bound by a list of tuples, defining lower and upper bounds. Additionally, each row in the dataframe should sum to 100. I tried adding a tolerance also to achieve 99.99. But my logic is somehow failing. The issue is that there are rows that don't sum greater or equal to 99.99.
import pandas as pd
import numpy as np
num_rows = 150
num_cols = 15
bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]
tolerance = 0.1
df = pd.DataFrame(np.zeros((num_rows, num_cols)))
num_generated_rows = 0
while num_generated_rows < num_rows:
for row in range(num_generated_rows, num_rows):
row_sum = 0
for col in range(num_cols):
lower, upper = bounds[col]
remaining_cols = num_cols - col - 1
remaining_sum = 100 - row_sum
if remaining_cols > 0:
min_possible_value = max(0, remaining_sum - remaining_cols * upper)
max_possible_value = min(upper, remaining_sum - remaining_cols * lower)
else:
min_possible_value = max(0, 100 - row_sum - upper)
max_possible_value = min(upper, 100 - row_sum - lower)
value = np.random.uniform(min_possible_value, max_possible_value)
df.iloc[row, col] = value
row_sum += value
# Check if the row sum is within the defined tolerance
if abs(row_sum - 100) <= tolerance:
num_generated_rows += 1
break
# Stop generating rows if we've reached the desired number
if num_generated_rows == num_rows:
break
df = df.sample(frac=1).reset_index(drop=True)
答案1
得分: 1
One (somewhat heavy-handed) approach is to use linear programming.
Advantages:
- your bounds are guaranteed to be respected
- your sum constraint is guaranteed to be respected and will be accurate
- it is vectorised, so you don't need
for
-loops - the results are random-ish
Disadvantages:
- there are probably faster methods, though this runs "quickly enough".
- if you care about guarantees of random distribution uniformity: there are none here (though I'm pretty sure given your constraints that true uniformity is impossible)
import numpy as np
import pandas as pd
import scipy.optimize
from numpy.random import default_rng
from scipy.optimize import Bounds, LinearConstraint
num_rows = 150
num_cols = 15
rand = default_rng(seed=0)
bounds = np.array((
(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10),
(0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10),
))
cell_bounds = np.tile bounds, (num_rows, 1))
perturbance = (cell_bounds[:, 1].min() - cell_bounds[:, 0].max()) / 2
lower = cell_bounds[:, 0] + rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)
upper = cell_bounds[:, 1] - rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)
A = np.repeat(a=np.eye(num_rows), repeats=num_cols, axis=1)
c_bound = np.full(shape=num_rows, fill_value=100)
result = scipy.optimize.milp(
c=rand.uniform(low=-1, high=1, size=num_cols*num_rows),
bounds=Bounds(lb=lower, ub=upper),
constraints=LinearConstraint(A=A, lb=c_bound, ub=c_bound),)
print(result.message)
df = pd.DataFrame(result.x.reshape((num_rows, num_cols)))
print(df)
Optimization terminated successfully. (HiGHS Status 7: Optimal)
0 1 2 ... 12 13 14
0 19.695780 8.378342 0.102434 ... 4.444407 8.091462 9.073666
1 17.113129 9.108588 14.591218 ... 2.452088 8.309021 9.861998
2 18.569490 8.074654 13.583485 ... 3.475941 0.844778 7.541122
3 19.393683 8.766053 1.557968 ... 1.990811 8.882646 0.775763
4 19.836319 8.456286 0.226883 ... 4.442556 8.904374 2.308569
.. ... ... ... ... ... ... ...
145 19.226648 3.675740 13.277363 ... 1.632323 0.432656 1.095681
146 17.893662 8.960568 13.408240 ... 2.419953 1.104904 8.556052
147 1.376055 8.251378 12.722397 ... 3.709209 9.451243 8.980960
148 17.538430 8.255690 12.762953 ... 3.606618 1.019980 9.087719
149 0.817833 9.246465 13.594034 ... 4.184895 8.385792 1.716741
[150 rows x 15 columns]
英文:
One (somewhat heavy-handed) approach is to use linear programming.
Advantages:
- your bounds are guaranteed to be respected
- your sum constraint is guaranteed to be respected and will be accurate
- it is vectorised, so you don't need
for
-loops - the results are random-ish
Disadvantages:
- there are probably faster methods, though this runs "quickly enough".
- if you care about guarantees of random distribution uniformity: there are none here (though I'm pretty sure given your constraints that true uniformity is impossible)
import numpy as np
import pandas as pd
import scipy.optimize
from numpy.random import default_rng
from scipy.optimize import Bounds, LinearConstraint
num_rows = 150
num_cols = 15
rand = default_rng(seed=0)
bounds = np.array((
(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10),
(0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10),
))
cell_bounds = np.tile(bounds, (num_rows, 1))
perturbance = (cell_bounds[:, 1].min() - cell_bounds[:, 0].max()) / 2
lower = cell_bounds[:, 0] + rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)
upper = cell_bounds[:, 1] - rand.uniform(low=0, high=perturbance, size=num_cols*num_rows)
A = np.repeat(a=np.eye(num_rows), repeats=num_cols, axis=1)
c_bound = np.full(shape=num_rows, fill_value=100)
result = scipy.optimize.milp(
c=rand.uniform(low=-1, high=1, size=num_cols*num_rows),
bounds=Bounds(lb=lower, ub=upper),
constraints=LinearConstraint(A=A, lb=c_bound, ub=c_bound),)
print(result.message)
df = pd.DataFrame(result.x.reshape((num_rows, num_cols)))
print(df)
Optimization terminated successfully. (HiGHS Status 7: Optimal)
0 1 2 ... 12 13 14
0 19.695780 8.378342 0.102434 ... 4.444407 8.091462 9.073666
1 17.113129 9.108588 14.591218 ... 2.452088 8.309021 9.861998
2 18.569490 8.074654 13.583485 ... 3.475941 0.844778 7.541122
3 19.393683 8.766053 1.557968 ... 1.990811 8.882646 0.775763
4 19.836319 8.456286 0.226883 ... 4.442556 8.904374 2.308569
.. ... ... ... ... ... ... ...
145 19.226648 3.675740 13.277363 ... 1.632323 0.432656 1.095681
146 17.893662 8.960568 13.408240 ... 2.419953 1.104904 8.556052
147 1.376055 8.251378 12.722397 ... 3.709209 9.451243 8.980960
148 17.538430 8.255690 12.762953 ... 3.606618 1.019980 9.087719
149 0.817833 9.246465 13.594034 ... 4.184895 8.385792 1.716741
[150 rows x 15 columns]
答案2
得分: 1
不需要翻译的代码部分:
Rather than to draw numbers between [lb, ub], you can draw numbers between [0, 1] that acts as coefficient. 0.25 for an interval of [0, 30] is 7.5 and rescale this coefficient to the right value after fixing new bounds (check comments).
You can use the following optimized code:
import numpy as np
bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]
num_cols = len(bounds)
num_rows = 150
TARGET = 100
1st-pass, random generation
lb = np.array(bounds, dtype=float)[:, 0]
ub = np.array(bounds, dtype=float)[:, 1]
Check feasibility
assert lb.sum() <= TARGET, f'Infeasible problem: sum(lb) > {TARGET}'
assert ub.sum() >= TARGET, f'Infeasible problem: sum(ub) < {TARGET}'
x is a coefficient between 0 and 1
y is a value between lb and ub
np.random.seed(42) # for demo only
x = np.random.random(size=(num_rows, num_cols))
y = x*ub + (1-x)*lb
Fix bounds
m = np.c_[np.sum(y, axis=1) < TARGET] # below or above TARGET?
lb = np.where(m, y, lb) # Change lower bounds to x if below TARGET
ub = np.where(~m, y, ub) # Change upper bounds to x if above TARGET
2nd-pass, fix each values
LB, UB = lb.sum(axis=1), ub.sum(axis=1)
x = (TARGET-LB) / (UB-LB)
y = x[:, None]*ub + (1-x[:, None])*lb
Note: don't forget to remove np.random.seed(42)
英文:
Rather than to draw numbers between [lb, ub], you can draw numbers between [0, 1] that acts as coefficient. 0.25 for an interval of [0, 30] is 7.5 and rescale this coefficient to the right value after fixing new bounds (check comments).
You can use the following optimized code:
import numpy as np
bounds = [(0, 20), (0, 10), (0, 15), (0, 5), (0, 10), (0, 5), (0, 20), (0, 10), (0, 10), (0, 5), (0, 5), (0, 10), (0, 5), (0, 10), (0, 10)]
num_cols = len(bounds)
num_rows = 150
TARGET = 100
# 1st-pass, random generation
lb = np.array(bounds, dtype=float)[:, 0]
ub = np.array(bounds, dtype=float)[:, 1]
# Check feasibility
assert lb.sum() <= TARGET, f'Infeasible problem: sum(lb) > {TARGET}'
assert ub.sum() >= TARGET, f'Infeasible problem: sum(ub) < {TARGET}'
# x is a coefficient between 0 and 1
# y is a value between lb and ub
np.random.seed(42) # for demo only
x = np.random.random(size=(num_rows, num_cols))
y = x*ub + (1-x)*lb
# Fix bounds
m = np.c_[np.sum(y, axis=1) < TARGET] # below or above TARGET?
lb = np.where(m, y, lb) # Change lower bounds to x if below TARGET
ub = np.where(~m, y, ub) # Change upper bounds to x if above TARGET
# 2nd-pass, fix each values
LB, UB = lb.sum(axis=1), ub.sum(axis=1)
x = (TARGET-LB) / (UB-LB)
y = x[:, None]*ub + (1-x[:, None])*lb
Note: don't forget to remove np.random.seed(42)
Output:
>>> y
array([[12.12345246, 9.68966746, 12.4687076 , ..., 4.47247875,
5.04041056, 4.84827506],
[11.89476371, 6.54707803, 11.46217608, ..., 3.79461597,
7.97722602, 5.26770304],
[15.28289786, 5.01507058, 6.57181308, ..., 2.09846591,
9.45503973, 5.54546481],
...,
[18.39384868, 2.94238081, 12.9170368 , ..., 1.47846341,
8.75263679, 1.3008411 ],
[16.51397403, 7.29824288, 7.02829416, ..., 4.71439879,
9.13096929, 8.04592187],
[15.17896817, 9.36179865, 7.75855936, ..., 4.50037072,
7.62359895, 4.75520839]])
>>> y.shape
(150, 15)
>>> y.sum(axis=1)
array([100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100., 100., 100., 100., 100.,
100., 100., 100., 100., 100., 100., 100.])
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论