英文:
Faster algorithm to calculate similarity between points in space
问题
我试图计算地理空间中具有坐标的点之间的某种相似性。为了更清晰,我将使用一个示例:
import pandas as pd
import geopandas as gpd
from geopy import distance
from shapely import Point
df = pd.DataFrame({
'Name':['a','b','c','d'],
'Value':[1,2,3,4],
'geometry':[Point(1,0), Point(1,2), Point(1,0), Point(3,3)]
})
gdf = gpd.GeoDataFrame(df, geometry=df.geometry)
print(gdf)
我需要一个新的数据框,其中包含每对点之间的距离,它们的相似性(在这种情况下是曼哈顿距离),以及它们的其他可能变量(在这种情况下只有name
作为附加变量)。
我的解决方案如下:
import numpy as np
def calc_values_for_row(row, sourcepoint): ## sourcepoint is a row of tdf
sourcename = sourcepoint['Name']
targetname = row['Name']
manhattan = abs(sourcepoint['Value']-row['Value'])
sourcecoord = sourcepoint['geometry']
targetcoord = row['geometry']
dist_meters = distance.distance(np.array(sourcecoord.coords), np.array(targetcoord.coords)).meters
new_row = [sourcename, targetname, manhattan, sourcecoord, targetcoord, dist_meters]
new_row = pd.Series(new_row)
new_row.index = ['SourceName','TargetName','Manhattan','SourceCoord','TargetCoord','Distance (m)']
return new_row
def calc_dist_df(df):
full_df = pd.DataFrame()
for i in df.index:
tdf = df.loc[df.index>i]
if tdf.empty == False:
sliced_df = tdf.apply(lambda x: calc_values_for_row(x, df.loc[i]), axis=1)
full_df = pd.concat([full_df, sliced_df])
return full_df.reset_index(drop=True)
calc_dist_df(gdf)
期望的结果如下:
SourceName TargetName Manhattan SourceCoord TargetCoord Distance (m)
0 a b 1 POINT (1 0) POINT (1 2) 222605.296097
1 a c 2 POINT (1 0) POINT (1 0) 0.000000
2 a d 3 POINT (1 0) POINT (3 3) 400362.335920
3 b c 1 POINT (1 2) POINT (1 0) 222605.296097
4 b d 2 POINT (1 2) POINT (3 3) 247555.571681
5 c d 1 POINT (1 0) POINT (3 3) 400362.335920
这个代码能按预期工作,但对于大型数据集来说非常慢。我正在一次迭代数据框的每一行,以切片gdf,然后在切片gdf上使用.apply()
,但我想知道是否有一种方法可以避免第一个for
循环,或者是否有其他更快的解决方案。
注意
itertools
中的 combination
可能不是解决方案,因为几何列可能包含重复的值。
英文:
I am trying to calculate some kind of similarity between points with coordinates in geographical space.
I will use an example to make things a bit more clear:
import pandas as pd
import geopandas as gpd
from geopy import distance
from shapely import Point
df = pd.DataFrame({
'Name':['a','b','c','d'],
'Value':[1,2,3,4],
'geometry':[Point(1,0), Point(1,2), Point(1,0), Point(3,3)]
})
gdf = gpd.GeoDataFrame(df, geometry=df.geometry)
print(gdf)
Name Value geometry
0 a 1 POINT (1.00000 0.00000)
1 b 2 POINT (1.00000 2.00000)
2 c 3 POINT (1.00000 0.00000)
3 d 4 POINT (3.00000 3.00000)
I need a new dataframe containing the distance between each pair of points, their similarity (Manhattan distance in this case) and also their other possible variables (in this case there is only name
as additional variable).
My solution is the following:
def calc_values_for_row(row, sourcepoint): ## sourcepoint is a row of tdf
sourcename = sourcepoint['Name']
targetname = row['Name']
manhattan = abs(sourcepoint['Value']-row['Value'])
sourcecoord = sourcepoint['geometry']
targetcoord = row['geometry']
dist_meters = distance.distance(np.array(sourcecoord.coords), np.array(targetcoord.coords)).meters
new_row = [sourcename, targetname, manhattan, sourcecoord, targetcoord, dist_meters]
new_row = pd.Series(new_row)
new_row.index = ['SourceName','TargetName','Manhattan','SourceCoord','TargetCoord','Distance (m)']
return new_row
def calc_dist_df(df):
full_df = pd.DataFrame()
for i in df.index:
tdf = df.loc[df.index>i]
if tdf.empty == False:
sliced_df = tdf.apply(lambda x: calc_values_for_row(x, df.loc[i]), axis=1)
full_df = pd.concat([full_df, sliced_df])
return full_df.reset_index(drop=True)
calc_dist_df(gdf)
### EXPECTED RESULT
SourceName TargetName Manhattan SourceCoord TargetCoord Distance (m)
0 a b 1 POINT (1 0) POINT (1 2) 222605.296097
1 a c 2 POINT (1 0) POINT (1 0) 0.000000
2 a d 3 POINT (1 0) POINT (3 3) 400362.335920
3 b c 1 POINT (1 2) POINT (1 0) 222605.296097
4 b d 2 POINT (1 2) POINT (3 3) 247555.571681
5 c d 1 POINT (1 0) POINT (3 3) 400362.335920
It works good as expected, BUT it is extremely slow for big datasets.
I am iterating once over each row of the dataframe to slice the gdf once and then I use .apply()
on the sliced gdf, but I was wondering if there is a way to avoid the first for
loop or maybe another solution to make this algorithm much faster.
NOTE
combination
from itertools might not be the solution because the geometry column can contain repeated values
EDIT
This is the distribution of repeated values for the 'geometry' column. As you can see most of the points are repeated and only a few are unique.
答案1
得分: 2
你可以使用 scipy.spatial.distance_matrix
。使用 .x
和 .y
属性从 Shapely Point 中提取坐标:
from scipy.spatial import distance_matrix
RADIUS = 6371.009 * 1e3 # 米
cx = gdf.add_prefix('Source').merge(gdf.add_prefix('Target'), how='cross')
coords = np.radians(np.stack([gdf['geometry'].x, gdf['geometry'].y], axis=1))
cx['Distance'] = distance_matrix(coords, coords, p=2).ravel() * RADIUS
r, c = np.triu_indices(len(gdf), k=1)
cx = cx.loc[r * len(df) + c]
输出:
>>> cx
SourceName SourceValue Sourcegeometry TargetName TargetValue Targetgeometry Distance
1 a 1 POINT (1.00000 0.00000) b 2 POINT (1.00000 2.00000) 222390.167448
2 a 1 POINT (1.00000 0.00000) c 3 POINT (1.00000 0.00000) 0.000000
3 a 1 POINT (1.00000 0.00000) d 4 POINT (3.00000 3.00000) 400919.575947
6 b 2 POINT (1.00000 2.00000) c 3 POINT (1.00000 0.00000) 222390.167448
7 b 2 POINT (1.00000 2.00000) d 4 POINT (3.00000 3.00000) 248639.765971
11 c 3 POINT (1.00000 0.00000) d 4 POINT (3.00000 3.00000) 400919.575947
英文:
You can use scipy.spatial.distance_matrix
. Use .x
and .y
properties to extract coordinates from shapely Point:
from scipy.spatial import distance_matrix
RADIUS = 6371.009 * 1e3 # meters
cx = gdf.add_prefix('Source').merge(gdf.add_prefix('Target'), how='cross')
coords = np.radians(np.stack([gdf['geometry'].x, gdf['geometry'].y], axis=1))
cx['Distance'] = distance_matrix(coords, coords, p=2).ravel() * RADIUS
r, c = np.triu_indices(len(gdf), k=1)
cx = cx.loc[r * len(df) + c]
Output:
>>> cx
SourceName SourceValue Sourcegeometry TargetName TargetValue Targetgeometry Distance
1 a 1 POINT (1.00000 0.00000) b 2 POINT (1.00000 2.00000) 222390.167448
2 a 1 POINT (1.00000 0.00000) c 3 POINT (1.00000 0.00000) 0.000000
3 a 1 POINT (1.00000 0.00000) d 4 POINT (3.00000 3.00000) 400919.575947
6 b 2 POINT (1.00000 2.00000) c 3 POINT (1.00000 0.00000) 222390.167448
7 b 2 POINT (1.00000 2.00000) d 4 POINT (3.00000 3.00000) 248639.765971
11 c 3 POINT (1.00000 0.00000) d 4 POINT (3.00000 3.00000) 400919.575947
答案2
得分: 1
以下是您要翻译的内容:
如果点的基数足够低,您可以预先计算所有唯一点对之间的距离(因为 x.distance(y) == y.distance(x)
),然后将其应用于 df
。
对于我的随机数生成器示例,有39,800行和19,900个点对
import math
import random
from itertools import product
import pandas as pd
from shapely import Point
# 生成示例数据框
rng = random.Random(1309510)
N = 200
df = pd.DataFrame({
'Name': [f'n{x}' for x in range(N)],
'Value': [math.sin(x) for x in range(N)],
'geometry': [Point(round(rng.uniform(0, 10), 2), round(rng.uniform(0, 10), 2)) for x in range(N)],
})
# 获取数据框中的所有唯一点对
point_pairs = {frozenset((x, y)): (x, y) for (x, y) in product(df.geometry, df.geometry) if x != y}
# 计算所有点对之间的距离
point_distances = {pair_key: pair[0].distance(pair[1]) for pair_key, pair in point_pairs.items()}
# 生成包含所有点对及其相关数据的数据框
df = df.merge(df, how='cross', suffixes=('_1', '_2'))
df = df[df.Name_1 != df.Name_2]
# 从预先计算的字典中读取距离
df["distance"] = df.apply(lambda x: point_distances[frozenset((x.geometry_1, x.geometry_2))], axis=1)
print(df)
英文:
How about something like this? If the cardinality of the points is low enough, you can precalculate the distances between all unique pairs (since x.distance(y) == y.distance(x)
), and just apply that to the df
.
For my RNG example, there's 39,800 rows and 19,900 pairs
import math
import random
from itertools import product
import pandas as pd
from shapely import Point
# Generate example dataframe
rng = random.Random(1309510)
N = 200
df = pd.DataFrame({
'Name': [f'n{x}' for x in range(N)],
'Value': [math.sin(x) for x in range(N)],
'geometry': [Point(round(rng.uniform(0, 10), 2), round(rng.uniform(0, 10), 2)) for x in range(N)],
})
# Get all unique point pairs within the dataframe
point_pairs = {frozenset((x, y)): (x, y) for (x, y) in product(df.geometry, df.geometry) if x != y}
# Calculate distances between all point pairs
point_distances = {pair_key: pair[0].distance(pair[1]) for pair_key, pair in point_pairs.items()}
# Generate dataframe with all point pairs and their associated data
df = df.merge(df, how='cross', suffixes=('_1', '_2'))
df = df[df.Name_1 != df.Name_2]
# Read distances from precalculated dictionary
df["distance"] = df.apply(lambda x: point_distances[frozenset((x.geometry_1, x.geometry_2))], axis=1)
print(df)
答案3
得分: -1
In pandas, you have the method diff() which calculates the difference between one value and its previous index in the same column. In a case like this, you need to create a row with a continuous value and apply the .diff(). Remember that you need to insert a NaN value as the first index since the diff() method will create a list of values with len-1. This is:
import pandas as pd
df = pd.DataFrame({
'Name': ['a', 'b', 'c', 'd'],
'Value': [1, 2, 3, 4],
'geometry': [(1, 0), (1, 2), (1, 0), (3, 3)]
})
df['first_val'] = df.geometry.str[0]
df['second_val'] = df.geometry.str[1]
df['first_diff'] = df.first_val.diff()
df['second_diff'] = df.second_val.diff()
row_list = []
for idx, rows in df.iterrows():
my_list = [rows.first_diff, rows.second_diff]
row_list.append(my_list)
df['geometrical_distance'] = row_list
print(df)
In case you are trying to calculate the distance between geographical points, you can use haversine. It has a method to calculate the distance between geographical points passing their geometrical coordinate. This method is:
import haversine
from haversine import Unit
loc1 = (35.526954, 44.659832)
loc2 = (36.215489, 45.625896)
haversine.haversine(loc1, loc2, unit=Unit.METERS)
distances = []
for row_index in range(len(df)):
distances.append(
haversine.haversine(
df['geometrical_point_1'].iloc[row_index],
df['geometrical_point_2'].iloc[row_index], unit=Unit.METERS)
)
英文:
In pandas you have the method diff() which calculates the difference between one value and it's preivous index in the same column. In a case like this you need to create a row with a continuous value and apply the .diff(). Remember that you need to insert a NaN value as the first index since the diff() method will create a list of values with len-1. This is:
import pandas as pd
df = pd.DataFrame({
'Name':['a','b','c','d'],
'Value':[1,2,3,4],
'geometry':[(1,0), (1,2), (1,0), (3,3)]
})
df['first_val'] = df.geometry.str[0]
df['second_val'] = df.geometry.str[1]
df['first_diff'] = df.first_val.diff()
df['second_diff'] = df.second_val.diff()
row_list = []
for idx, rows in df.iterrows():
my_list = [rows.first_diff, rows.second_diff]
row_list.append(my_list)
df['geometrical_distance'] = row_list
print(df)
Name Value geometry first_val second_val first_diff geometrical_distance
0 a 1 (1, 0) 1 0 NaN NaN [nan, nan]
1 b 2 (1, 2) 1 2 0.0 2.0 [0.0, 2.0]
2 c 3 (1, 0) 1 0 0.0 -2.0 [0.0, -2.0]
3 d 4 (3, 3) 3 3 2.0 3.0 [2.0, 3.0]
In case you are tyring to calculate the distance between geographical points you can use haversine. It has a method to calculate the distance between geographical points passing their geometrical coordinate. This method is:
import haversine
from haversine import Unit
loc1=(35.526954, 44.659832)
loc2=(36.215489, 45.625896)
haversine.haversine(loc1, loc2, unit=Unit.METERS)
distances = []
for row_index in range(len(df)):
distances.append(
haversine.haversine(
df['geometrical_point_1'].iloc[row_index],
df['geometrical_point_2'].iloc[row_index], unit=Unit.METERS)
)
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论