How to efficiently calculate the distance between two geopandas geometries

Question

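I have two geopandas data frames, one with point geometries and one with line geometries, and I am calculating the distance between them. For each point, I calculate the distance to the relevant line geometry, whose ID is stored in a column of the point data frame for reference.
There are 321,113 point features for which the distance is calculated.

I'm trying to use a list comprehension, but it still takes a lot of time. Way too long, as I will need to do this for even bigger data sets with far more point features. My code so far is as follows: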

```python
def get_distance(point_lineID, point_FID, point_GEOM, lines_df, points_df):

    ref_line = lines_df.loc[lines_df["line_id"] == point_lineID]

    try:
        d = point_GEOM.distance(ref_line["geometry"]).values[0]

    except IndexError:
        d = -99

    # Add value to frame
    row_num = points_df[points_df["point_id"] == point_FID].index
    points_df.loc[row_num, "distance_mp"] = d


result = [
    get_distance(point_lineid, point_fid, point_geom, df_lines, df_points)
    for point_lineid, point_fid, point_geom in zip(
        points["line_id"], points["point_id"], points["geometry"]
    )
]
```

How can I make this more performant? It would be great to get some support with explanations here.



# Answer 1
**Score**: 3

There are several ways to potentially make the code more performant. Here are a few suggestions:

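Use `apply` instead of the list comprehension: rather than searching for each point's row again to write the result back, return the distance from the function and assign the whole column at once. Note that `apply` with `axis=1` still calls the function once per row, so this is tidier but not truly vectorized:

```python
def get_distance(row, lines_df):
    # Select the geometry of the line referenced by this point
    ref_line = lines_df.loc[lines_df["line_id"] == row["line_id"], "geometry"]
    if ref_line.empty:
        return -99  # sentinel for "no matching line"
    return row["geometry"].distance(ref_line.iloc[0])

points["distance_mp"] = points.apply(lambda row: get_distance(row, df_lines), axis=1)
```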
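
For a genuinely vectorized variant of the idea above, one option is to line up every point with its reference line once and let geopandas compute all distances in a single call. The following is a minimal sketch, not a definitive implementation: it assumes the column names from the question, unique `line_id` values in the lines frame, and a shared CRS. The helper name `distance_to_reference_line` and the small demo frames are hypothetical.

```python
import geopandas as gpd
import numpy as np
import pandas as pd
from shapely.geometry import LineString, Point


def distance_to_reference_line(points, lines, missing=-99.0):
    """Distance from each point to the line named in its "line_id" column."""
    # Index the line geometries by ID once (assumes unique line IDs).
    line_by_id = lines.set_index("line_id")["geometry"]
    has_line = points["line_id"].isin(line_by_id.index).to_numpy()
    matched = points.loc[has_line]
    # Reference line geometries, repeated in the row order of the matched points.
    ref = gpd.GeoSeries(
        line_by_id.loc[matched["line_id"]].to_numpy(),
        index=matched.index,
        crs=lines.crs,
    )
    # One element-wise distance call instead of one Python call per point.
    out = np.full(len(points), missing, dtype="float64")
    out[has_line] = matched.geometry.distance(ref, align=False).to_numpy()
    return pd.Series(out, index=points.index, name="distance_mp")


# Hypothetical demo data: the third point references a line that does not exist.
demo_points = gpd.GeoDataFrame(
    {"point_id": [1, 2, 3], "line_id": [1, 2, 99]},
    geometry=[Point(0, 1), Point(5, 5), Point(0, 0)],
)
demo_lines = gpd.GeoDataFrame(
    {"line_id": [1, 2]},
    geometry=[LineString([(0, 0), (10, 0)]), LineString([(0, 0), (0, 10)])],
)
result = distance_to_reference_line(demo_points, demo_lines)
```

The main saving comes from replacing the per-point boolean mask over the whole lines frame with a single indexed lookup; points without a matching line keep the `-99` sentinel used in the question.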

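Use a spatial index: if the lines data frame is very large, a spatial index (such as an R-tree) can significantly speed up nearest-neighbor searches. geopandas builds one on demand through the `sindex` attribute of the data frame, so no separate class needs to be imported, and recent versions (0.10 or later) provide `sindex.nearest`. Keep in mind that this returns the nearest line, which is not necessarily the line referenced by `line_id`:

```python
# Build the spatial index of the lines (created lazily and cached by geopandas)
index = df_lines.sindex

# Query all points at once; return_all=False keeps a single line per point on ties
(point_idx, line_idx), dist = index.nearest(
    points.geometry, return_all=False, return_distance=True
)

# Assumes every point has a valid, non-empty geometry, so dist has one entry per point
points["distance_mp"] = dist
points["nearest_line_id"] = df_lines["line_id"].to_numpy()[line_idx]
```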
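
The nearest-line lookup described above can also be expressed with the higher-level `geopandas.sjoin_nearest` function (geopandas 0.10 or later), which uses the spatial index internally and can write the distance into a column. This is a sketch under assumptions: the demo frames, the renamed column `nearest_line_id`, and the flag `is_reference_line` are hypothetical names chosen for illustration.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Hypothetical demo data: both points reference line 1,
# but the second point is actually closer to line 2.
demo_pts = gpd.GeoDataFrame(
    {"point_id": [1, 2], "line_id": [1, 1]},
    geometry=[Point(1, 2), Point(3, 9)],
)
demo_lns = gpd.GeoDataFrame(
    {"line_id": [1, 2]},
    geometry=[LineString([(0, 0), (10, 0)]), LineString([(0, 10), (10, 10)])],
)

# Rename the ID column on the right side so it does not clash with points["line_id"].
nearest = gpd.sjoin_nearest(
    demo_pts,
    demo_lns.rename(columns={"line_id": "nearest_line_id"}),
    distance_col="distance_nearest",
)

# Flag points whose stored reference line is also the geometrically nearest one.
nearest["is_reference_line"] = nearest["line_id"] == nearest["nearest_line_id"]
```

Comparing the stored `line_id` with the nearest line is a cheap sanity check on the reference column; if a point is equidistant to several lines, `sjoin_nearest` returns one row per tie.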

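Use Cython or Numba: if the distance calculation itself is the bottleneck, you can consider Cython or Numba, which compile Python code to C code or machine code, respectively. Numba's nopython mode cannot work with Shapely geometries or DataFrames, so the coordinates have to be passed in as plain NumPy arrays. Shapely's own `distance` already runs in compiled code, so it is worth measuring before expecting a gain. Here is an example using Numba that assumes 2D points and simple LineString geometries:

```python
import numba as nb
import numpy as np

@nb.njit
def point_to_line_distance(px, py, xs, ys):
    # Shortest distance from (px, py) to the polyline with vertices (xs, ys)
    min_dist = np.inf
    for i in range(len(xs) - 1):
        dx, dy = xs[i + 1] - xs[i], ys[i + 1] - ys[i]
        seg_len2 = dx * dx + dy * dy
        # Projection of the point onto the segment, clamped to its end points
        t = 0.0
        if seg_len2 > 0.0:
            t = ((px - xs[i]) * dx + (py - ys[i]) * dy) / seg_len2
            t = min(1.0, max(0.0, t))
        dist = np.hypot(px - (xs[i] + t * dx), py - (ys[i] + t * dy))
        if dist < min_dist:
            min_dist = dist
    return min_dist

# Extract plain coordinate arrays per line ID once
line_xy = {}
for lid, geom in zip(df_lines["line_id"], df_lines["geometry"]):
    xy = np.asarray(geom.coords, dtype=np.float64)
    line_xy[lid] = (xy[:, 0].copy(), xy[:, 1].copy())

distances = np.full(len(points), -99.0)  # -99 marks points without a matching line
for i, (lid, pt) in enumerate(zip(points["line_id"], points["geometry"])):
    if lid in line_xy:
        xs, ys = line_xy[lid]
        distances[i] = point_to_line_distance(pt.x, pt.y, xs, ys)
points["distance_mp"] = distances
```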

huangapple
  • Published on April 19, 2023, 18:54:15
  • Please keep this link when reposting: https://go.coder-hub.com/76053666.html