英文:
How to subset a binary matrix to maximize “uniqueness”?
问题
I'm here to help with the translation. Here's the translated content:
我正在尝试通过子集化行来优化矩阵,然后对子集化的行执行计算。该计算是确保每列恰好出现一次,并尽量减少重复。
要明确,要优化的参数如下:
- p = 检测到的列数(越多越好)
- q = 重复行数(越少越好)
- r = 包含重复行以增加 p 的惩罚
唯一性可以定义为 p - q*r
以下是一个简单的示例,其中有一个明确的答案,将选择行 [2,3,4](舍弃行 0,因为行 2 更好):
A = [
[1,0,0,0,0],
[0,0,0,0,0],
[1,0,1,0,0],
[0,1,0,1,0],
[0,0,0,0,1],
]
p = 5(所有列都有)
q = 0(没有重复)
并不总是有完美的组合,有时需要添加惩罚(包括重复 q)以增加表示的列数(p)。这将通过 r 来权衡。
B = [
[1,1,0,1,0],
[0,0,0,0,0],
[1,0,0,0,1],
[0,0,1,0,0],
[0,0,0,0,1],
]
最佳组合是 [0,2,3]
P = 5(检测到所有列)
q = 1(1 个重复)
最后,有时会有 2 个或更多的子集具有最佳组合,因此应包含所有最佳子集:
C = [
[1,0,0,1,0],
[0,1,1,0,1],
[0,1,0,0,1],
[1,0,0,1,0],
[0,0,0,0,1],
]
在这个示例中,我发现了一些好的选项:[0,1],[1,3]
除了对所有行的组合进行蛮力尝试之外,有人能帮我理解如何开始实现这个过程,同时利用 Python 中的 Scikit-Learn、SciPy、NumPy 或 Platypus 中的算法吗?
更具体地说,我可以使用哪些算法来优化 NumPy 数组中的行,以最大化 p
,最小化 q
,并根据 r
进行加权(例如,score = p - q*r
)?
以下是我的一些测试代码:
from platypus import NSGAII, Problem, Real
A = np.asarray([
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[0, 0, 0, 0, 1],
])
def objective(x):
x = np.asarray(x)
selected_rows = A[x >= 0.5]
p = len(set(selected_rows.flatten()))
q = len(selected_rows) - len(set([tuple(row) for row in selected_rows]))
return [p, q * 0.1]
problem = Problem(5, 2)
problem.types[:] = [Real(0, 1)] * 5
problem.function = objective
problem.directions[:] = [Problem.MAXIMIZE, Problem.MINIMIZE]
algorithm = NSGAII(problem)
algorithm.run(10000)
for solution in algorithm.result:
print(f"x = {solution.variables} \tp = {solution.objectives[0]:.2f} \tq*r = {solution.objectives[1]:.2f}")
英文:
I’m trying to optimize a matrix by subsetting rows and then performing a calculation on the subsetted rows. The calculation is representing each of the columns exactly once and having as few duplicates as possible.
To be clear, the parameters to optimize are the following:
- p = Number of columns detected (More is better)
- q = Number of duplicate rows (Less is better)
- r = Penalty on including a duplicate row to increase p
Uniqueness would be defined as p - q*r
Here is a simple example where there is a clear answer where rows [2,3,4] will be chosen (row 0 is dropped because row 2 is better):
A = [
[1,0,0,0,0],
[0,0,0,0,0],
[1,0,1,0,0],
[0,1,0,1,0],
[0,0,0,0,1],
]
p = 5 (all columns are represented)
q = 0 (no duplicates)
There will not always be a perfect combination and sometimes it will need to add a penalty (include a duplicate q) to add to the number of columns represented (p). Weighting this will be important which will be done by r.
B = [
[1,1,0,1,0],
[0,0,0,0,0],
[1,0,0,0,1],
[0,0,1,0,0],
[0,0,0,0,1],
]
Best combination is [0,2,3]
P = 5 (all columns detected)
q = 1 (1 duplicate)
Lastly, there will sometimes be 2 or more subsets that have the best combination so it should include all best subsets:
C = [
[1,0,0,1,0],
[0,1,1,0,1],
[0,1,0,0,1],
[1,0,0,1,0],
[0,0,0,0,1],
]
In this one, I spot a few good options: [0,1], [1,3]
Other than doing bruteforce of all combinations of rows, can someone help me understand how to start implement this while leveraging any of the algorithms in Scikit-Learn, SciPy, NumPy, or Platypus in Python?
More specifically, what algorithms can I use to optimize the rows in a NumPy array that maximizes p
, minimize q
weighted by r
(e.g., score = p - q*r
)?
Here's some of my test code:
from platypus import NSGAII, Problem, Real
A = np.asarray([
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[0, 0, 0, 0, 1],
])
def objective(x):
x = np.asarray(x)
selected_rows = A[x >= 0.5]
p = len(set(selected_rows.flatten()))
q = len(selected_rows) - len(set([tuple(row) for row in selected_rows]))
return [p, q * 0.1]
problem = Problem(5, 2)
problem.types[:] = [Real(0, 1)] * 5
problem.function = objective
problem.directions[:] = [Problem.MAXIMIZE, Problem.MINIMIZE]
algorithm = NSGAII(problem)
algorithm.run(10000)
for solution in algorithm.result:
print(f"x = {solution.variables} \tp = {solution.objectives[0]:.2f} \tq*r = {solution.objectives[1]:.2f}")
答案1
得分: 1
以下是您提供的内容的中文翻译部分:
"Since you need all optimal solutions, the CP-SAT solver from the OR-Tools library seems like the right tool here (disclaimer: the developers of OR-Tools are my colleagues, but it’s free and open-source and can be installed as easily as pip3 install ortools
).
I’ve provided complete working Python below, though I haven’t tried it on nontrivial instances. The idea is to formulate a mathematical program with a Boolean variable x<sub>i</sub> for each row i, indicating whether row i belongs to the subset, and a Boolean variable y<sub>j</sub> for each column j, indicating whether that column is covered. I’ve rewritten your objective to the equivalent
> maximize (1/r + 1) ∑<sub>j</sub> y<sub>j</sub> − ∑<sub>i</sub> weight(i) x<sub>i</sub>,
since the CP-SAT solver prefers to deal in integers. The other idea here is that, instead of detecting duplicates explicitly, we add a bonus to the coverage reward that exactly offsets the first “duplicate”.
The constraints are simple: for each column, some row that covers it belongs to the subset, or the column is uncovered. We could strengthen this constraint to an either/or, but we don’t need to, since there’s no reason to declare a covered column as uncovered."
from collections import defaultdict
from ortools.sat.python import cp_model
class SolutionCollector(cp_model.CpSolverSolutionCallback):
def __init__(self, row_vars):
super().__init__()
self._row_vars = row_vars
self.solutions = []
def on_solution_callback(self):
self.solutions.append(
[i for (i, row_var) in enumerate(self._row_vars) if self.Value(row_var)]
)
def solve(matrix):
model = cp_model.CpModel()
objective = 0
col_index_to_row_vars = defaultdict(list)
row_index_to_var = []
for i, row in enumerate(matrix):
row_var = model.NewBoolVar(f"x{i}")
objective -= sum(row) * row_var
for j, entry in enumerate(row):
if entry:
col_index_to_row_vars[j].append(row_var)
row_index_to_var.append(row_var)
for j, row_vars in col_index_to_row_vars.items():
col_var = model.NewBoolVar(f"y{j}")
objective += 11 * col_var
model.AddBoolOr(row_vars + [col_var.Not()])
# We would love to find all optimal solutions in one go, but CP-SAT does not
# support that. Find the optimal objective value.
model.Maximize(objective)
solver = cp_model.CpSolver()
status = solver.Solve(model)
if status != cp_model.OPTIMAL:
raise Exception("did not find optimal solution")
optimal_objective_value = solver.Value(objective)
# Add the optimal objective value as a constraint and clear the objective.
model.Add(objective == optimal_objective_value)
model.Proto().ClearField("objective")
# Find all (optimal) solutions.
solver = cp_model.CpSolver()
solution_collector = SolutionCollector(row_index_to_var)
solver.parameters.enumerate_all_solutions = True
status = solver.Solve(model, solution_collector)
if status != cp_model.OPTIMAL:
raise Exception("did not find all solutions")
return solution_collector.solutions
A = [
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[0, 0, 0, 0, 1],
]
print("A", solve(A))
B = [
[1, 1, 0, 1, 0],
[0, 0, 0, 0, 0],
[1, 0, 0, 0, 1],
[0, 0, 1, 0, 0],
[0, 0, 0, 0, 1],
]
print("B", solve(B))
C = [
[1, 0, 0, 1, 0],
[0, 1, 1, 0, 1],
[0, 1, 0, 0, 1],
[1, 0, 0, 1, 0],
[0, 0, 0, 0, 1],
]
print("C", solve(C))
Output:
A [[2, 3, 4], [1, 2, 3, 4]]
B [[0, 3, 4], [0, 1, 3, 4]]
C [[0, 1], [1, 3]]
英文:
Since you need all optimal solutions, the CP-SAT solver from the OR-Tools library seems like the right tool here (disclaimer: the developers of OR-Tools are my colleagues, but it’s free and open-source and can be installed as easily as pip3 install ortools
).
I’ve provided complete working Python below, though I haven’t tried it on nontrivial instances. The idea is to formulate a mathematical program with a Boolean variable x<sub>i</sub> for each row i, indicating whether row i belongs to the subset, and a Boolean variable y<sub>j</sub> for each column j, indicating whether that column is covered. I’ve rewritten your objective to the equivalent
> maximize (1/r + 1) ∑<sub>j</sub> y<sub>j</sub> − ∑<sub>i</sub> weight(i) x<sub>i</sub>,
since the CP-SAT solver prefers to deal in integers. The other idea here is that, instead of detecting duplicates explicitly, we add a bonus to the coverage reward that exactly offsets the first “duplicate”.
The constraints are simple: for each column, some row that covers it belongs to the subset, or the column is uncovered. We could strengthen this constraint to an either/or, but we don’t need to, since there’s no reason to declare a covered column as uncovered.
from collections import defaultdict
from ortools.sat.python import cp_model
class SolutionCollector(cp_model.CpSolverSolutionCallback):
def __init__(self, row_vars):
super().__init__()
self._row_vars = row_vars
self.solutions = []
def on_solution_callback(self):
self.solutions.append(
[i for (i, row_var) in enumerate(self._row_vars) if self.Value(row_var)]
)
def solve(matrix):
model = cp_model.CpModel()
objective = 0
col_index_to_row_vars = defaultdict(list)
row_index_to_var = []
for i, row in enumerate(matrix):
row_var = model.NewBoolVar(f"x{i}")
objective -= sum(row) * row_var
for j, entry in enumerate(row):
if entry:
col_index_to_row_vars[j].append(row_var)
row_index_to_var.append(row_var)
for j, row_vars in col_index_to_row_vars.items():
col_var = model.NewBoolVar(f"y{j}")
objective += 11 * col_var
model.AddBoolOr(row_vars + [col_var.Not()])
# We would love to find all optimal solutions in one go, but CP-SAT does not
# support that. Find the optimal objective value.
model.Maximize(objective)
solver = cp_model.CpSolver()
status = solver.Solve(model)
if status != cp_model.OPTIMAL:
raise Exception("did not find optimal solution")
optimal_objective_value = solver.Value(objective)
# Add the optimal objective value as a constraint and clear the objective.
model.Add(objective == optimal_objective_value)
model.Proto().ClearField("objective")
# Find all (optimal) solutions.
solver = cp_model.CpSolver()
solution_collector = SolutionCollector(row_index_to_var)
solver.parameters.enumerate_all_solutions = True
status = solver.Solve(model, solution_collector)
if status != cp_model.OPTIMAL:
raise Exception("did not find all solutions")
return solution_collector.solutions
A = [
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[1, 0, 1, 0, 0],
[0, 1, 0, 1, 0],
[0, 0, 0, 0, 1],
]
print("A", solve(A))
B = [
[1, 1, 0, 1, 0],
[0, 0, 0, 0, 0],
[1, 0, 0, 0, 1],
[0, 0, 1, 0, 0],
[0, 0, 0, 0, 1],
]
print("B", solve(B))
C = [
[1, 0, 0, 1, 0],
[0, 1, 1, 0, 1],
[0, 1, 0, 0, 1],
[1, 0, 0, 1, 0],
[0, 0, 0, 0, 1],
]
print("C", solve(C))
Output:
A [[2, 3, 4], [1, 2, 3, 4]]
B [[0, 3, 4], [0, 1, 3, 4]]
C [[0, 1], [1, 3]]
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论