How do I make the calculation for this distance matrix faster?

Question:
I am working on a clustering task with geospatial data. I want to compute my own distance matrix that combines both geographical and temporal distance. My data (np.array) contains latitude, longitude, and timestamp. A sample of my DataFrame df (dict to reproduce):
latitude longitude timestamp
412671 52.506136 6.068709 2017-01-01 00:00:23.518
412672 52.503316 6.071496 2017-01-01 00:01:30.764
412673 52.505122 6.068912 2017-01-01 00:02:30.858
412674 52.501792 6.068605 2017-01-01 00:03:38.194
412675 52.508105 6.075160 2017-01-01 00:06:41.116
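(The original "dict to reproduce" link is not available here; the sample above can be reconstructed directly from the table, assuming the index values shown are the DataFrame index:)

```python
import pandas as pd

# Reconstruction of the sample DataFrame from the table above
df = pd.DataFrame(
    {
        "latitude": [52.506136, 52.503316, 52.505122, 52.501792, 52.508105],
        "longitude": [6.068709, 6.071496, 6.068912, 6.068605, 6.075160],
        "timestamp": pd.to_datetime([
            "2017-01-01 00:00:23.518",
            "2017-01-01 00:01:30.764",
            "2017-01-01 00:02:30.858",
            "2017-01-01 00:03:38.194",
            "2017-01-01 00:06:41.116",
        ]),
    },
    index=[412671, 412672, 412673, 412674, 412675],
)
```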
I currently use the following code:
import numpy as np
from sklearn.metrics.pairwise import haversine_distances  # assuming scikit-learn's implementation

np_data = df.to_numpy()
# convert latitudes and longitudes to radians
lat_lon_rad = np.radians(np_data[:, :2].astype(float))
# compute Haversine distance matrix
haversine_matrix = haversine_distances(lat_lon_rad)
haversine_matrix /= np.max(haversine_matrix)
# compute time difference matrix
timestamps = np_data[:, 2]
time_matrix = np.abs(np.subtract.outer(timestamps, timestamps))  # This line is SLOW
time_matrix /= np.max(time_matrix)
combined_matrix = 0.5 * haversine_matrix + 0.5 * time_matrix
This produces the desired result. However, when my data set is 1000 rows, this code takes roughly 25 seconds to complete, mainly due to the calculation of the time_matrix (the Haversine matrix is very fast). The problem is: I have to work with data sets of roughly 200-500k rows. Using only the Haversine function is then still fine, but calculating my time_matrix will take way too long.

My question: how do I speed up the calculation of the time_matrix? I cannot find any way to perform the np.subtract.outer(timestamps, timestamps) calculation faster.
Answer 1 (score: 1)
How about converting the timestamps to floats and using NumPy's broadcasting feature to compute your time differences? The integer representation of time avoids the costly overhead associated with using pandas timestamps, i.e.:
timestamps_sec = np.array([(ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s") for ts in np_data[:, 2]])
timestamps_sec = timestamps_sec[:, np.newaxis]
Now you can compute the time difference matrix using broadcasting:
time_matrix = np.abs(timestamps_sec - timestamps_sec.T)
time_matrix = time_matrix.astype(float) # Convert to float to avoid integer division
time_matrix /= np.max(time_matrix)
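The per-element list comprehension above can itself be replaced by a single vectorized cast, which matters at 200-500k rows. A minimal self-contained sketch (using a few hypothetical timestamps in place of the question's df, since the epoch offset cancels in the pairwise differences):

```python
import numpy as np
import pandas as pd

# Hypothetical sample timestamps standing in for df["timestamp"]
ts = pd.to_datetime([
    "2017-01-01 00:00:23.518",
    "2017-01-01 00:01:30.764",
    "2017-01-01 00:02:30.858",
])

# Vectorized conversion: datetime64[ns] -> whole seconds -> float,
# replacing the per-element Python loop with a single NumPy cast.
timestamps_sec = ts.values.astype("datetime64[s]").astype(np.float64)

# Broadcasting computes all pairwise absolute differences at once.
time_matrix = np.abs(timestamps_sec[:, None] - timestamps_sec[None, :])
time_matrix /= time_matrix.max()
```

Note that for 200-500k rows the full n-by-n float64 matrix may not fit in memory regardless of how fast it is computed, so chunking or a sparse/neighbor-based approach may still be needed downstream.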