How do I make the calculation for this distance matrix faster?

Question:
I am working on a clustering task with geospatial data. I want to compute my own distance matrix that combines both geographical and temporal distance. My data (np.array) contains latitude, longitude, and timestamp. A sample of my DataFrame df (dict to reproduce):
latitude longitude timestamp
412671 52.506136 6.068709 2017-01-01 00:00:23.518
412672 52.503316 6.071496 2017-01-01 00:01:30.764
412673 52.505122 6.068912 2017-01-01 00:02:30.858
412674 52.501792 6.068605 2017-01-01 00:03:38.194
412675 52.508105 6.075160 2017-01-01 00:06:41.116
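(The original "dict to reproduce" link is not available here; the sample above can be reconstructed directly from the table, assuming the index values shown are the DataFrame index:)

```python
import pandas as pd

# Reconstruction of the sample DataFrame from the table above
df = pd.DataFrame(
    {
        "latitude": [52.506136, 52.503316, 52.505122, 52.501792, 52.508105],
        "longitude": [6.068709, 6.071496, 6.068912, 6.068605, 6.075160],
        "timestamp": pd.to_datetime([
            "2017-01-01 00:00:23.518",
            "2017-01-01 00:01:30.764",
            "2017-01-01 00:02:30.858",
            "2017-01-01 00:03:38.194",
            "2017-01-01 00:06:41.116",
        ]),
    },
    index=[412671, 412672, 412673, 412674, 412675],
)
```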
I currently use the following code:
import numpy as np
from sklearn.metrics.pairwise import haversine_distances  # assuming scikit-learn's implementation

np_data = df.to_numpy()
# convert latitudes and longitudes to radians
lat_lon_rad = np.radians(np_data[:, :2].astype(float))
# compute Haversine distance matrix
haversine_matrix = haversine_distances(lat_lon_rad)
haversine_matrix /= np.max(haversine_matrix)
# compute time difference matrix
timestamps = np_data[:, 2]
time_matrix = np.abs(np.subtract.outer(timestamps, timestamps))  # This line is SLOW
time_matrix /= np.max(time_matrix)
combined_matrix = 0.5 * haversine_matrix + 0.5 * time_matrix
This produces the desired result. However, when my data set is 1000 rows, this code takes roughly 25 seconds to complete, mainly due to the calculation of the time_matrix (the Haversine matrix is very fast). The problem is: I have to work with data sets of roughly 200-500k rows. Using only the Haversine function is then still fine, but calculating my time_matrix will take way too long.

My question: how do I speed up the calculation of the time_matrix? I cannot find any way to perform the np.subtract.outer(timestamps, timestamps) calculation faster.
Answer 1 (score: 1)
How about converting the timestamps to floats and using NumPy's broadcasting feature to compute your time differences? The integer representation of time avoids the costly overhead associated with using pandas timestamps, i.e.:
timestamps_sec = np.array([(ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s") for ts in np_data[:, 2]])
timestamps_sec = timestamps_sec[:, np.newaxis]
Now you can compute the time difference matrix using broadcasting:
time_matrix = np.abs(timestamps_sec - timestamps_sec.T)
time_matrix = time_matrix.astype(float) # Convert to float to avoid integer division
time_matrix /= np.max(time_matrix)
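The per-element list comprehension above can itself be replaced by a single vectorized cast, which matters at 200-500k rows. A minimal self-contained sketch (using a few hypothetical timestamps in place of the question's df, since the epoch offset cancels in the pairwise differences):

```python
import numpy as np
import pandas as pd

# Hypothetical sample timestamps standing in for df["timestamp"]
ts = pd.to_datetime([
    "2017-01-01 00:00:23.518",
    "2017-01-01 00:01:30.764",
    "2017-01-01 00:02:30.858",
])

# Vectorized conversion: datetime64[ns] -> whole seconds -> float,
# replacing the per-element Python loop with a single NumPy cast.
timestamps_sec = ts.values.astype("datetime64[s]").astype(np.float64)

# Broadcasting computes all pairwise absolute differences at once.
time_matrix = np.abs(timestamps_sec[:, None] - timestamps_sec[None, :])
time_matrix /= time_matrix.max()
```

Note that for 200-500k rows the full n-by-n float64 matrix may not fit in memory regardless of how fast it is computed, so chunking or a sparse/neighbor-based approach may still be needed downstream.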