更快的算法来计算空间中点之间的相似度。

huangapple go评论69阅读模式
英文:

Faster algorithm to calculate similarity between points in space

问题

我试图计算地理空间中具有坐标的点之间的某种相似性。为了更清晰,我将使用一个示例:

import pandas as pd
import geopandas as gpd
from geopy import distance
from shapely import Point

df = pd.DataFrame({
    'Name':['a','b','c','d'],
    'Value':[1,2,3,4],
    'geometry':[Point(1,0), Point(1,2), Point(1,0), Point(3,3)]
})
gdf = gpd.GeoDataFrame(df, geometry=df.geometry)
print(gdf)

我需要一个新的数据框,其中包含每对点之间的距离,它们的相似性(在这种情况下是曼哈顿距离),以及它们的其他可能变量(在这种情况下只有name作为附加变量)。

我的解决方案如下:

import numpy as np

def calc_values_for_row(row, sourcepoint):  ## sourcepoint is a row of tdf
    sourcename = sourcepoint['Name']
    targetname = row['Name']
    manhattan = abs(sourcepoint['Value']-row['Value'])
    sourcecoord = sourcepoint['geometry']
    targetcoord = row['geometry']
    dist_meters = distance.distance(np.array(sourcecoord.coords), np.array(targetcoord.coords)).meters

    new_row = [sourcename, targetname, manhattan, sourcecoord, targetcoord, dist_meters]
    new_row = pd.Series(new_row)
    new_row.index = ['SourceName','TargetName','Manhattan','SourceCoord','TargetCoord','Distance (m)']
    return new_row

def calc_dist_df(df):
    full_df = pd.DataFrame()
    for i in df.index:
        tdf = df.loc[df.index>i]
        if tdf.empty == False:
            sliced_df = tdf.apply(lambda x: calc_values_for_row(x, df.loc[i]), axis=1)
            full_df = pd.concat([full_df, sliced_df])
    return full_df.reset_index(drop=True)

calc_dist_df(gdf)

期望的结果如下:

  SourceName TargetName  Manhattan  SourceCoord  TargetCoord   Distance (m)
0          a          b          1  POINT (1 0)  POINT (1 2)  222605.296097
1          a          c          2  POINT (1 0)  POINT (1 0)       0.000000
2          a          d          3  POINT (1 0)  POINT (3 3)  400362.335920
3          b          c          1  POINT (1 2)  POINT (1 0)  222605.296097
4          b          d          2  POINT (1 2)  POINT (3 3)  247555.571681
5          c          d          1  POINT (1 0)  POINT (3 3)  400362.335920

这个代码能按预期工作,但对于大型数据集来说非常慢。我正在一次迭代数据框的每一行,以切片gdf,然后在切片gdf上使用.apply(),但我想知道是否有一种方法可以避免第一个for循环,或者是否有其他更快的解决方案。

注意
itertools 中的 combination 可能不是解决方案,因为几何列可能包含重复的值。

英文:

I am trying to calculate some kind of similarity between points with coordinates in geographical space.
I will use an example to make things a bit more clear:

import pandas as pd
import geopandas as gpd
from geopy import distance
from shapely import Point

df = pd.DataFrame({
    'Name':['a','b','c','d'],
    'Value':[1,2,3,4],
    'geometry':[Point(1,0), Point(1,2), Point(1,0), Point(3,3)]
})
gdf = gpd.GeoDataFrame(df, geometry=df.geometry)
print(gdf)
  Name  Value                 geometry
0    a      1  POINT (1.00000 0.00000)
1    b      2  POINT (1.00000 2.00000)
2    c      3  POINT (1.00000 0.00000)
3    d      4  POINT (3.00000 3.00000)

I need a new dataframe containing the distance between each pair of points, their similarity (Manhattan distance in this case) and also their other possible variables (in this case there is only name as additional variable).

My solution is the following:

def calc_values_for_row(row, sourcepoint):  ## sourcepoint is a row of tdf
    sourcename = sourcepoint['Name']
    targetname = row['Name']
    manhattan = abs(sourcepoint['Value']-row['Value'])
    sourcecoord = sourcepoint['geometry']
    targetcoord = row['geometry']
    dist_meters = distance.distance(np.array(sourcecoord.coords), np.array(targetcoord.coords)).meters

    new_row = [sourcename, targetname, manhattan, sourcecoord, targetcoord, dist_meters]
    new_row = pd.Series(new_row)
    new_row.index = ['SourceName','TargetName','Manhattan','SourceCoord','TargetCoord','Distance (m)']
    return new_row

def calc_dist_df(df):
    full_df = pd.DataFrame()
    for i in df.index:
        tdf = df.loc[df.index>i]
        if tdf.empty == False:
            sliced_df = tdf.apply(lambda x: calc_values_for_row(x, df.loc[i]), axis=1)
            full_df = pd.concat([full_df, sliced_df])
    return full_df.reset_index(drop=True)

calc_dist_df(gdf)


### EXPECTED RESULT
  SourceName TargetName  Manhattan  SourceCoord  TargetCoord   Distance (m)
0          a          b          1  POINT (1 0)  POINT (1 2)  222605.296097
1          a          c          2  POINT (1 0)  POINT (1 0)       0.000000
2          a          d          3  POINT (1 0)  POINT (3 3)  400362.335920
3          b          c          1  POINT (1 2)  POINT (1 0)  222605.296097
4          b          d          2  POINT (1 2)  POINT (3 3)  247555.571681
5          c          d          1  POINT (1 0)  POINT (3 3)  400362.335920

It works good as expected, BUT it is extremely slow for big datasets.
I am iterating once over each row of the dataframe to slice the gdf once and then I use .apply() on the sliced gdf, but I was wondering if there is a way to avoid the first for loop or maybe another solution to make this algorithm much faster.

NOTE
combination from itertools might not be the solution because the geometry column can contain repeated values
EDIT
This is the distribution of repeated values for the 'geometry' column. As you can see most of the points are repeated and only a few are unique.
更快的算法来计算空间中点之间的相似度。

答案1

得分: 2

你可以使用 scipy.spatial.distance_matrix。使用 .x.y 属性从 Shapely Point 中提取坐标:

from scipy.spatial import distance_matrix

RADIUS = 6371.009 * 1e3  # 米

cx = gdf.add_prefix('Source').merge(gdf.add_prefix('Target'), how='cross')
coords = np.radians(np.stack([gdf['geometry'].x, gdf['geometry'].y], axis=1))
cx['Distance'] = distance_matrix(coords, coords, p=2).ravel() * RADIUS

r, c = np.triu_indices(len(gdf), k=1)
cx = cx.loc[r * len(df) + c]

输出:

>>> cx
   SourceName  SourceValue           Sourcegeometry TargetName  TargetValue           Targetgeometry       Distance
1           a            1  POINT (1.00000 0.00000)          b            2  POINT (1.00000 2.00000)  222390.167448
2           a            1  POINT (1.00000 0.00000)          c            3  POINT (1.00000 0.00000)       0.000000
3           a            1  POINT (1.00000 0.00000)          d            4  POINT (3.00000 3.00000)  400919.575947
6           b            2  POINT (1.00000 2.00000)          c            3  POINT (1.00000 0.00000)  222390.167448
7           b            2  POINT (1.00000 2.00000)          d            4  POINT (3.00000 3.00000)  248639.765971
11          c            3  POINT (1.00000 0.00000)          d            4  POINT (3.00000 3.00000)  400919.575947
英文:

You can use scipy.spatial.distance_matrix. Use .x and .y properties to extract coordinates from shapely Point:

from scipy.spatial import distance_matrix

RADIUS = 6371.009 * 1e3  # meters

cx = gdf.add_prefix('Source').merge(gdf.add_prefix('Target'), how='cross')
coords = np.radians(np.stack([gdf['geometry'].x, gdf['geometry'].y], axis=1))
cx['Distance'] = distance_matrix(coords, coords, p=2).ravel() * RADIUS

r, c = np.triu_indices(len(gdf), k=1)
cx = cx.loc[r * len(df) + c]

Output:

>>> cx
   SourceName  SourceValue           Sourcegeometry TargetName  TargetValue           Targetgeometry       Distance
1           a            1  POINT (1.00000 0.00000)          b            2  POINT (1.00000 2.00000)  222390.167448
2           a            1  POINT (1.00000 0.00000)          c            3  POINT (1.00000 0.00000)       0.000000
3           a            1  POINT (1.00000 0.00000)          d            4  POINT (3.00000 3.00000)  400919.575947
6           b            2  POINT (1.00000 2.00000)          c            3  POINT (1.00000 0.00000)  222390.167448
7           b            2  POINT (1.00000 2.00000)          d            4  POINT (3.00000 3.00000)  248639.765971
11          c            3  POINT (1.00000 0.00000)          d            4  POINT (3.00000 3.00000)  400919.575947

答案2

得分: 1

以下是您要翻译的内容:

如果点的基数足够低,您可以预先计算所有唯一点对之间的距离(因为 x.distance(y) == y.distance(x)),然后将其应用于 df

对于我的随机数生成器示例,有39,800行和19,900个点对 更快的算法来计算空间中点之间的相似度。

import math
import random
from itertools import product

import pandas as pd
from shapely import Point

# 生成示例数据框
rng = random.Random(1309510)
N = 200
df = pd.DataFrame({
    'Name': [f'n{x}' for x in range(N)],
    'Value': [math.sin(x) for x in range(N)],
    'geometry': [Point(round(rng.uniform(0, 10), 2), round(rng.uniform(0, 10), 2)) for x in range(N)],
})

# 获取数据框中的所有唯一点对
point_pairs = {frozenset((x, y)): (x, y) for (x, y) in product(df.geometry, df.geometry) if x != y}

# 计算所有点对之间的距离
point_distances = {pair_key: pair[0].distance(pair[1]) for pair_key, pair in point_pairs.items()}

# 生成包含所有点对及其相关数据的数据框
df = df.merge(df, how='cross', suffixes=('_1', '_2'))
df = df[df.Name_1 != df.Name_2]

# 从预先计算的字典中读取距离
df["distance"] = df.apply(lambda x: point_distances[frozenset((x.geometry_1, x.geometry_2))], axis=1)

print(df)
英文:

How about something like this? If the cardinality of the points is low enough, you can precalculate the distances between all unique pairs (since x.distance(y) == y.distance(x)), and just apply that to the df.

For my RNG example, there's 39,800 rows and 19,900 pairs 更快的算法来计算空间中点之间的相似度。

import math
import random
from itertools import product

import pandas as pd
from shapely import Point

# Generate example dataframe
rng = random.Random(1309510)
N = 200
df = pd.DataFrame({
    'Name': [f'n{x}' for x in range(N)],
    'Value': [math.sin(x) for x in range(N)],
    'geometry': [Point(round(rng.uniform(0, 10), 2), round(rng.uniform(0, 10), 2)) for x in range(N)],
})

# Get all unique point pairs within the dataframe
point_pairs = {frozenset((x, y)): (x, y) for (x, y) in product(df.geometry, df.geometry) if x != y}

# Calculate distances between all point pairs
point_distances = {pair_key: pair[0].distance(pair[1]) for pair_key, pair in point_pairs.items()}

# Generate dataframe with all point pairs and their associated data
df = df.merge(df, how='cross', suffixes=('_1', '_2'))
df = df[df.Name_1 != df.Name_2]

# Read distances from precalculated dictionary
df["distance"] = df.apply(lambda x: point_distances[frozenset((x.geometry_1, x.geometry_2))], axis=1)

print(df)

答案3

得分: -1

In pandas, you have the method diff() which calculates the difference between one value and its previous index in the same column. In a case like this, you need to create a row with a continuous value and apply the .diff(). Remember that you need to insert a NaN value as the first index since the diff() method will create a list of values with len-1. This is:

import pandas as pd

df = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd'],
    'Value': [1, 2, 3, 4],
    'geometry': [(1, 0), (1, 2), (1, 0), (3, 3)]
})

df['first_val'] = df.geometry.str[0]
df['second_val'] = df.geometry.str[1]

df['first_diff'] = df.first_val.diff()
df['second_diff'] = df.second_val.diff()

row_list = []
for idx, rows in df.iterrows():
    my_list = [rows.first_diff, rows.second_diff]
    row_list.append(my_list)

df['geometrical_distance'] = row_list
print(df)

In case you are trying to calculate the distance between geographical points, you can use haversine. It has a method to calculate the distance between geographical points passing their geometrical coordinate. This method is:

import haversine
from haversine import Unit

loc1 = (35.526954, 44.659832)
loc2 = (36.215489, 45.625896)
haversine.haversine(loc1, loc2, unit=Unit.METERS)

distances = []
for row_index in range(len(df)):
    distances.append(
        haversine.haversine(
            df['geometrical_point_1'].iloc[row_index],
            df['geometrical_point_2'].iloc[row_index], unit=Unit.METERS)
    )
英文:

In pandas you have the method diff() which calculates the difference between one value and it's preivous index in the same column. In a case like this you need to create a row with a continuous value and apply the .diff(). Remember that you need to insert a NaN value as the first index since the diff() method will create a list of values with len-1. This is:

import pandas as pd


df = pd.DataFrame({
    'Name':['a','b','c','d'],
    'Value':[1,2,3,4],
    'geometry':[(1,0), (1,2), (1,0), (3,3)]
})

df['first_val'] = df.geometry.str[0]
df['second_val'] = df.geometry.str[1]

df['first_diff'] = df.first_val.diff()
df['second_diff'] = df.second_val.diff()

row_list = []
for idx, rows in df.iterrows():
    my_list = [rows.first_diff, rows.second_diff]
    row_list.append(my_list)

df['geometrical_distance'] = row_list
print(df)

       Name Value geometry   first_val second_val first_diff geometrical_distance
 0  	a	  1	   (1, 0)	    1	        0	    NaN	NaN	    [nan, nan]
 1	    b  	  2	   (1, 2)	    1	        2	    0.0	2.0	    [0.0, 2.0]
 2  	c	  3	   (1, 0)	    1	        0	    0.0	-2.0	[0.0, -2.0]
 3	    d	  4	   (3, 3)	    3	        3	    2.0	3.0	    [2.0, 3.0]

In case you are tyring to calculate the distance between geographical points you can use haversine. It has a method to calculate the distance between geographical points passing their geometrical coordinate. This method is:

import haversine
from haversine import Unit

loc1=(35.526954, 44.659832)
loc2=(36.215489, 45.625896)
haversine.haversine(loc1, loc2, unit=Unit.METERS)

distances = []
for row_index in range(len(df)):
distances.append(
    haversine.haversine(
        df['geometrical_point_1'].iloc[row_index],
        df['geometrical_point_2'].iloc[row_index], unit=Unit.METERS)
    )

huangapple
  • 本文由 发表于 2023年5月29日 19:13:29
  • 转载请务必保留本文链接:https://go.coder-hub.com/76356846.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定