Efficient random subsample of a Pandas DataFrame based on multiple columns target requirements
# Question
I am currently working on a project where I have a large DataFrame (500'000 rows) containing polygons as rows, with each polygon representing a geographical area.
The columns of the DataFrame represent different landcover classes (34 classes), and the values in the cells represent the area covered by each landcover class in square kilometers.
My objective is to **subsample this DataFrame based on target requirements for landcover classes**. Specifically, I want to select a subset of polygons that collectively meet certain target coverage requirements for each landcover class. The target requirements are specified as the desired area coverage for each landcover class.
A colleague hinted that this could be interpreted as an **optimisation problem with an objective function**. However, I have not found a solution for it yet and tried a different, slow, iterative approach (see below).
To give you a better understanding, here is a minimum reproducible example of my DataFrame structure with only 4 polygons and 3 classes:
```python
import pandas as pd

# Create a sample DataFrame
data = {
    'Polygon': ['Polygon A', 'Polygon B', 'Polygon C', 'Polygon D'],
    'Landcover 1': [10, 5, 7, 3],
    'Landcover 2': [15, 8, 4, 6],
    'Landcover 3': [20, 12, 9, 14]
}
df = pd.DataFrame(data)
```
For instance, let's say I have the following target requirements for landcover classes:
```python
target_requirements = {
    'Landcover 1': 15,
    'Landcover 2': 20,
    'Landcover 3': 25
}
```
Based on these target requirements, I would like to subsample the DataFrame by selecting a subset of polygons that **collectively meet or closely approximate the target area coverage** for each landcover class.
In this example, polygons A and C are a good subsample, as their landcover coverages summed together come close to the requirements I set.
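A quick sanity check of that example (illustrative only, using the sample `df` above):

```python
# Sum the landcover areas of polygons A and C and compare with the targets
subset = df[df['Polygon'].isin(['Polygon A', 'Polygon C'])]
print(subset[['Landcover 1', 'Landcover 2', 'Landcover 3']].sum())
# Landcover 1    17   (target 15)
# Landcover 2    19   (target 20)
# Landcover 3    29   (target 25)
```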
# My [extended] code so far
Here is what I coded so far.
You will see some *extra steps* which are implemented here:
1. Weights: to guide the selection of polygons using deficits and surpluses.
1. Random sampling of the top 0.5%: based on the weights, I take the top 0.5% of polygons and randomly pick 1 from this selection.
1. Tolerance: I set a tolerance for the discrepancy between the cumulative areas of the current subsample and the required areas.
1. Progress bar: aesthetic.
```python
import numpy as np
import pandas as pd
from tqdm import tqdm

def select_polygons(row, cumulative_coverages, landcover_columns, target_coverages):
    selected_polygon = row[landcover_columns]
    # Add the selected polygon to the subsample
    subsample = selected_polygon.to_frame().T
    cumulative_coverages += selected_polygon.values
    return cumulative_coverages, subsample

df_data = # Your DataFrame with polygons and landcover classes
landcover_columns = # List of landcover columns in the DataFrame
target_coverages = # Dictionary of target coverages for each landcover class

total_coverages = df_data[landcover_columns].sum()
target_coverages = pd.Series(target_coverages, landcover_columns)

df_data = df_data.sample(frac=1).dropna().reset_index(drop=True)

# Set parameters for convergence
max_iterations = 30000
convergence_threshold = 0.1
top_percentage = 0.005

# Initialize variables
subsample = pd.DataFrame(columns=landcover_columns)
cumulative_coverages = pd.Series(0, index=landcover_columns)

# Initialize tqdm progress bar
progress_bar = tqdm(total=max_iterations)

# Iterate until the cumulative coverage matches or is close to the target coverage
for iteration in range(max_iterations):
    remaining_diff = target_coverages - cumulative_coverages
    deficit = remaining_diff.clip(lower=0)
    surplus = remaining_diff.clip(upper=0) * 0.1
    deficit_sum = deficit.sum()
    normalized_weights = deficit / deficit_sum

    # Calculate the combined weights for deficit and surplus for the entire dataset
    weights = df_data[landcover_columns].mul(normalized_weights) + surplus
    # Calculate the weight sum for each polygon
    weight_sum = weights.sum(axis=1)

    # Select the top 0.5% of polygons based on weight sum
    top_percentile = int(len(df_data) * top_percentage)
    top_indices = weight_sum.nlargest(top_percentile).index
    selected_polygon_index = np.random.choice(top_indices)

    selected_polygon = df_data.loc[selected_polygon_index]
    cumulative_coverages, subsample_iteration = select_polygons(
        selected_polygon, cumulative_coverages, landcover_columns, target_coverages
    )

    # Add the selected polygon to the subsample
    # (pandas >= 2.0 removed DataFrame.append, so pd.concat is used here)
    subsample = pd.concat([subsample, subsample_iteration])
    df_data = df_data.drop(selected_polygon_index)

    # Check if all polygons have been selected or the cumulative coverage matches or is close to the target coverage
    if df_data.empty or np.allclose(cumulative_coverages, target_coverages, rtol=convergence_threshold):
        break

    # Calculate the percentage of coverage achieved
    coverage_percentage = (cumulative_coverages.sum() / target_coverages.sum()) * 100

    # Update tqdm progress bar
    progress_bar.set_description(f"Iteration {iteration+1}: Coverage Percentage: {coverage_percentage:.2f}%")
    progress_bar.update(1)

progress_bar.close()
subsample.reset_index(drop=True, inplace=True)
```
# The problem
The code is slow (~10 iterations/s) and doesn't handle the tolerance well, i.e. I can end up with cumulative_coverages way above 100% while the tolerance check is still not satisfied (my "guidance for selection" is not good enough).
Plus, there must be a much better **OPTIMISATION** approach to get what I want.
Any help/idea would be appreciated.
# Answer 1
**Score**: 3
The problem description lends itself to solving with a [Mixed-Integer-Linear Program (MIP)][1].
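Written out abstractly (a sketch of what the PuLP code below encodes): with a binary variable $x_p$ per polygon, area $a_{p,c}$ of landcover class $c$ in polygon $p$, target $T_c$, and tolerance $\varepsilon$, the model is

$$
\min \sum_{p} x_p
\quad \text{subject to} \quad
(1-\varepsilon)\,T_c \;\le\; \sum_{p} a_{p,c}\,x_p \;\le\; (1+\varepsilon)\,T_c \;\; \forall c,
\qquad x_p \in \{0,1\}.
$$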
A convenient library for solving mixed-integer problems is [`PuLP`][2], which ships with the built-in [Coin-OR suite][3] and in particular the integer solver CBC.
Note that I changed the data structure from the `DataFrame` in the example, because the code gets really messy with your original `DataFrame`: lookups without an index need a lot of `.loc`, which I personally dislike :D
```python
data = {
    'Polygon A': {
        'Landcover 1': 10,
        'Landcover 2': 15,
        'Landcover 3': 20
    },
    'Polygon B': {
        'Landcover 1': 5,
        'Landcover 2': 8,
        'Landcover 3': 12
    },
    'Polygon C': {
        'Landcover 1': 7,
        'Landcover 2': 4,
        'Landcover 3': 9
    },
    'Polygon D': {
        'Landcover 1': 3,
        'Landcover 2': 6,
        'Landcover 3': 14
    }
}

target_requirements = {
    'Landcover 1': 15,
    'Landcover 2': 20,
    'Landcover 3': 25
}
```
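As an aside (not part of the original answer): if the data already lives in the wide DataFrame from the question, a nested dict of this shape can be derived directly; `landcover_columns` below is assumed to be the list of landcover class columns.

```python
# Hypothetical conversion from the question's DataFrame layout to the dict used here
landcover_columns = [c for c in df.columns if c != 'Polygon']
data = df.set_index('Polygon')[landcover_columns].to_dict(orient='index')
```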
Then we formulate the linear problem:
```python
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value, PULP_CBC_CMD

model = LpProblem("LandcoverOptimization", LpMinimize)

# Create a binary variable for each polygon to indicate whether it was selected in the solution
polygons = list(data.keys())
selected = LpVariable.dicts("Selected", polygons, cat='Binary')

# Minimize the amount of selected polygons
model += lpSum(selected[polygon] for polygon in polygons)

# We need to approximately cover the target required landcover +- tolerance
# Also: We have to select a polygon to add it to the solution, this will be minimized by the objective
tolerance = 0.1

# We create the sums only once for performance
sums = {}
for landcover in target_requirements:
    sums[landcover] = lpSum(data[polygon][landcover] * selected[polygon] for polygon in polygons)

for landcover in target_requirements:
    model += sums[landcover] >= target_requirements[landcover] * (1 - tolerance)
    model += sums[landcover] <= target_requirements[landcover] * (1 + tolerance)

model.solve(PULP_CBC_CMD(fracGap=0.9))  # Set MIP gap very permissively for performance
```
To visualize the output we extract the solution:
```python
selected_polygons = [polygon for polygon in polygons if value(selected[polygon]) == 1.0]

print("Selected Polygons:")
for polygon in selected_polygons:
    print(polygon)

achieved_landcovers = {}
for landcover in target_requirements:
    total_landcover = sum(data[polygon][landcover] for polygon in selected_polygons)
    achieved_landcovers[landcover] = total_landcover

print("Required Landcovers:")
for landcover, requirement in target_requirements.items():
    print(f"{landcover}: {requirement}")

print("Achieved Landcovers:")
for landcover, coverage in achieved_landcovers.items():
    print(f"{landcover}: {coverage}")
```
Output:
```
Selected Polygons:
Polygon A
Polygon C
Required Landcovers:
Landcover 1: 15
Landcover 2: 20
Landcover 3: 25
Achieved Landcovers:
Landcover 1: 17
Landcover 2: 19
Landcover 3: 29
```
With regard to tractability:
I've solved MIP models with ~3 million binary variables. [This question][4] says 5 million variables won't solve for them. Playing with the tolerances, as well as the [optimality gap][5], will let you dial in a useful and practical solution for your problem. The output will also contain logs with measures of optimality and/or a proof that the problem is infeasible (e.g. if you set your tolerances too tightly):
```
Result - Optimal solution found
Objective value:                2.00000000
Enumerated nodes:               0
Total iterations:               1
Time (CPU seconds):             0.00
Time (Wallclock seconds):       0.00
Option for printingOptions changed from normal to all
Total time (CPU seconds):       0.00   (Wallclock seconds):       0.01
```
If the amount of data is truly huge and the tolerances need to be very tight, CBC might no longer be able to handle the problem. Commercially available solvers such as [Gurobi][6] will handle much larger problem instances, but aren't available for free.
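For reference (not part of the original answer), switching solvers in PuLP only changes the `solve` call; this sketch assumes a local Gurobi installation and licence with the `gurobi_cl` executable on the path:

```python
from pulp import GUROBI_CMD

# Same model object as above, handed to Gurobi's command-line solver instead of CBC
model.solve(GUROBI_CMD(msg=True))
```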
In my experience, if a solver can't find a satisfying solution, there is no way you can write some python/pandas code that will.
To get into the mindset of describing problems rather than thinking about solutions, I recommend Raymond Hettinger's excellent talk at PyCon 2019 - [Modern solvers: Problems well-defined are problems solved][7]
To install `PuLP`:

```
pip install pulp
```
EDIT: For OP's 88MB problem instance on my MacBook Pro:
```
Result - Optimal solution found (within gap tolerance)
Objective value:                14178.00000000
Lower bound:                    14080.357
Gap:                            0.01
Enumerated nodes:               0
Total iterations:               0
Time (CPU seconds):             218.06
Time (Wallclock seconds):       219.21
Option for printingOptions changed from normal to all
Total time (CPU seconds):       293.59   (Wallclock seconds):       296.26
```
Tweaking the tolerance and MIP gap will improve runtime or solution quality.
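As a sketch of how that tuning looks in code (parameter names differ between PuLP versions: recent releases use `gapRel`/`timeLimit`, older ones `fracGap`/`maxSeconds`):

```python
from pulp import PULP_CBC_CMD

# Accept any solution within 1% of optimal, stop after 10 minutes, show the CBC log
solver = PULP_CBC_CMD(gapRel=0.01, timeLimit=600, msg=True)
model.solve(solver)
```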
It can prove that at least 14081 polygons need to be chosen, and it found a solution that chooses 14178 polygons (1% optimality gap) with a coverage tolerance of ±3%:
```
Landcover Comparison:
n20: Required=3303.7, Achieved=3204.7, Difference=99.0
n25: Required=20000.0, Achieved=19401.1, Difference=598.9
n5: Required=20000.0, Achieved=19400.0, Difference=600.0
n16: Required=8021.1, Achieved=7781.3, Difference=239.8
....
```
[1]: https://en.wikipedia.org/wiki/Integer_programming
[2]: https://coin-or.github.io/pulp/
[3]: https://www.google.com/search?q=coin%20or&oq=coin%20or&gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIGCAEQRRg7MgYIAhAjGCcyCQgDECMYJxiKBTIJCAQQABhDGIoFMgYIBRBFGDwyBggGEEUYPDIGCAcQRRg80gEHOTQ2ajBqN6gCALACAA&sourceid=chrome&ie=UTF-8
[4]: https://stackoverflow.com/questions/52257435/is-there-a-limit-on-the-size-of-an-lp-file-or-number-of-variables-in-cbc-glpk
[5]: https://stackoverflow.com/questions/39943236/how-to-set-optimality-gap-in-pulp-or-with-cbc-solver
[6]: https://www.gurobi.com/
[7]: https://www.youtube.com/watch?v=_GP9OpZPUYc&ab_channel=PyCon2019