How do I make the calculation for this distance matrix faster?


Question


I am working on a clustering task with geospatial data. I want to compute my own distance matrix that combines both geographical and temporal distance. My data (np.array) contains latitude, longitude, and timestamp. A sample of my DataFrame df (dict to reproduce):

	    latitude	longitude	timestamp
412671	52.506136	6.068709	2017-01-01 00:00:23.518
412672	52.503316	6.071496	2017-01-01 00:01:30.764
412673	52.505122	6.068912	2017-01-01 00:02:30.858
412674	52.501792	6.068605	2017-01-01 00:03:38.194
412675	52.508105	6.075160	2017-01-01 00:06:41.116
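The original question linked a `to_dict` dump to reproduce this sample; as a stand-in, the rows shown above can be reconstructed like this (index values taken from the table):

```python
import pandas as pd

# Reconstruction of the sample rows shown above (index values from the question)
df = pd.DataFrame(
    {
        "latitude": [52.506136, 52.503316, 52.505122, 52.501792, 52.508105],
        "longitude": [6.068709, 6.071496, 6.068912, 6.068605, 6.075160],
        "timestamp": pd.to_datetime([
            "2017-01-01 00:00:23.518",
            "2017-01-01 00:01:30.764",
            "2017-01-01 00:02:30.858",
            "2017-01-01 00:03:38.194",
            "2017-01-01 00:06:41.116",
        ]),
    },
    index=[412671, 412672, 412673, 412674, 412675],
)
```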

I currently use the following code:

import numpy as np
from sklearn.metrics.pairwise import haversine_distances

np_data = df.to_numpy()

# convert latitudes and longitudes to radians
lat_lon_rad = np.radians(np_data[:,:2].astype(float))

# compute Haversine distance matrix
haversine_matrix = haversine_distances(lat_lon_rad)
haversine_matrix /= np.max(haversine_matrix)

# compute time difference matrix
timestamps = np_data[:,2]
time_matrix = np.abs(np.subtract.outer(timestamps, timestamps)) # This line is SLOW
time_matrix /= np.max(time_matrix)

combined_matrix = 0.5 * haversine_matrix + 0.5 * time_matrix

This produces the desired result. However, when my data set is 1000 rows, this code takes about 25 seconds to complete, mainly due to the calculation of the time_matrix (the Haversine matrix is very fast). The problem is: I have to work with data sets of roughly 200-500k rows. Using only the Haversine function is then still fine, but calculating my time_matrix will take way too long.

My question: how do I speed up the calculation of the time_matrix? I cannot find any way to perform the np.subtract.outer(timestamps, timestamps) calculation faster.
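For context on why that line is slow: `df.to_numpy()` on a mixed float/datetime frame yields an object-dtype array, so `np.subtract.outer` falls back to per-element Python calls on pandas `Timestamp` objects rather than a native NumPy loop. A minimal sketch of the dtype difference, using two of the sample values above:

```python
import numpy as np
import pandas as pd

ts = pd.to_datetime(["2017-01-01 00:00:23.518", "2017-01-01 00:01:30.764"])

# Mixed float/datetime columns force to_numpy() into object dtype,
# so arithmetic on the timestamp column runs per element in Python
mixed = pd.DataFrame({"lat": [52.506136, 52.503316], "timestamp": ts}).to_numpy()
assert mixed[:, 1].dtype == object

# The same data as int64 epoch seconds uses a fast native NumPy loop instead
secs = ts.values.astype("datetime64[s]").astype(np.int64)
diff = np.abs(np.subtract.outer(secs, secs))
```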

Answer 1

Score: 1


How about converting the timestamps to numbers and using NumPy's broadcasting feature to compute your time differences? The integer representation of time avoids the costly overhead associated with using pandas timestamps, i.e.:

import numpy as np
import pandas as pd

# Convert each pandas Timestamp to integer seconds since the epoch
timestamps_sec = np.array([(ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s") for ts in np_data[:, 2]])
timestamps_sec = timestamps_sec[:, np.newaxis]

Now you can compute the time difference matrix using broadcasting

time_matrix = np.abs(timestamps_sec - timestamps_sec.T)
time_matrix = time_matrix.astype(float)  # Convert to float to avoid integer division
time_matrix /= np.max(time_matrix)
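The list comprehension above still iterates in Python; a fully vectorized variant of the same conversion (one NumPy cast over the whole column) might look like this. The small `df` here is a hypothetical stand-in for the asker's data, and the `datetime64[s]` cast truncates fractional seconds just like the `// pd.Timedelta("1s")` floor division:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's DataFrame
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-01-01 00:00:23.518",
        "2017-01-01 00:01:30.764",
        "2017-01-01 00:02:30.858",
    ]),
})

# One cast converts the whole datetime64[ns] column to int64 epoch seconds,
# instead of subtracting pd.Timestamp per element in a Python loop
timestamps_sec = df["timestamp"].values.astype("datetime64[s]").astype(np.int64)

# Broadcasting computes all pairwise absolute differences at once
time_matrix = np.abs(timestamps_sec[:, None] - timestamps_sec[None, :]).astype(float)
time_matrix /= time_matrix.max()
```

Note that for 200-500k rows the full N x N matrix itself becomes the limiting factor regardless of how the differences are computed, so the conversion step should no longer be the bottleneck.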

huangapple
  • Posted on 2023-04-13 22:11:42
  • Please retain this link when reposting: https://go.coder-hub.com/76006453.html