How do I make the calculation for this distance matrix faster?

Question
I am working on a clustering task with geospatial data. I want to compute my own distance matrix that combines both geographical and temporal distance. My data (np.array) contains latitude, longitude, and timestamp. A sample of my DataFrame df (dict to reproduce):

           latitude  longitude                timestamp
  412671  52.506136   6.068709  2017-01-01 00:00:23.518
  412672  52.503316   6.071496  2017-01-01 00:01:30.764
  412673  52.505122   6.068912  2017-01-01 00:02:30.858
  412674  52.501792   6.068605  2017-01-01 00:03:38.194
  412675  52.508105   6.075160  2017-01-01 00:06:41.116

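The sample rows above can be reconstructed directly (a sketch using only the values shown; the original "dict to reproduce" was not included in the post):

```python
import pandas as pd

# Rebuild the sample DataFrame from the rows shown above
df = pd.DataFrame(
    {
        "latitude": [52.506136, 52.503316, 52.505122, 52.501792, 52.508105],
        "longitude": [6.068709, 6.071496, 6.068912, 6.068605, 6.075160],
        "timestamp": pd.to_datetime(
            [
                "2017-01-01 00:00:23.518",
                "2017-01-01 00:01:30.764",
                "2017-01-01 00:02:30.858",
                "2017-01-01 00:03:38.194",
                "2017-01-01 00:06:41.116",
            ]
        ),
    },
    index=[412671, 412672, 412673, 412674, 412675],
)
```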
I currently use the following code:

  import numpy as np
  from sklearn.metrics.pairwise import haversine_distances

  np_data = df.to_numpy()
  # convert latitudes and longitudes to radians
  lat_lon_rad = np.radians(np_data[:, :2].astype(float))
  # compute Haversine distance matrix
  haversine_matrix = haversine_distances(lat_lon_rad)
  haversine_matrix /= np.max(haversine_matrix)
  # compute time difference matrix
  timestamps = np_data[:, 2]
  time_matrix = np.abs(np.subtract.outer(timestamps, timestamps))  # This line is SLOW
  time_matrix /= np.max(time_matrix)
  combined_matrix = 0.5 * haversine_matrix + 0.5 * time_matrix

This produces the desired result. However, when my data set is 1000 rows, this code takes about 25 seconds to complete, mainly due to the calculation of the time_matrix (the Haversine matrix is very fast). The problem is: I have to work with data sets of roughly 200-500k rows. Using only the Haversine function is then still fine, but calculating my time_matrix will take far too long.

My question: how do I speed up the calculation of the time_matrix? I cannot find any way to perform the np.subtract.outer(timestamps, timestamps) calculation faster.
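One way to see where the time goes: `df.to_numpy()` on a frame with mixed column types produces an object-dtype array, so `np.subtract.outer` falls back to per-element Python calls on `pd.Timestamp` objects instead of a vectorized loop. A minimal check (with a hypothetical two-row frame mirroring the sample):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame mirroring the sample data
df = pd.DataFrame(
    {
        "latitude": [52.506136, 52.503316],
        "longitude": [6.068709, 6.071496],
        "timestamp": pd.to_datetime(
            ["2017-01-01 00:00:23.518", "2017-01-01 00:01:30.764"]
        ),
    }
)

np_data = df.to_numpy()
print(np_data.dtype)        # object: mixed float/datetime columns force this dtype
timestamps = np_data[:, 2]
print(type(timestamps[0]))  # pandas Timestamp, so subtract.outer runs at Python speed
```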

Answer 1

Score: 1

How about converting the timestamps to floats and using NumPy's broadcasting feature to compute your time differences? The integer representation of time avoids the costly overhead associated with using pandas timestamps, i.e.:

  timestamps_sec = np.array([(ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s") for ts in np_data[:, 2]])
  timestamps_sec = timestamps_sec[:, np.newaxis]

Now you can compute the time difference matrix using broadcasting:

  time_matrix = np.abs(timestamps_sec - timestamps_sec.T)
  time_matrix = time_matrix.astype(float)  # convert to float to avoid integer division
  time_matrix /= np.max(time_matrix)
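The list comprehension above still loops over each timestamp in Python; for 200-500k rows the same conversion can be done fully vectorized with NumPy datetime64 casts. A sketch, using a small hypothetical sample standing in for np_data[:, 2]:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the timestamp column
timestamps = pd.to_datetime(
    ["2017-01-01 00:00:23", "2017-01-01 00:01:30", "2017-01-01 00:02:30"]
)

# Vectorized conversion: datetime64[ns] -> seconds since epoch as int64,
# with no per-element Python loop
timestamps_sec = timestamps.values.astype("datetime64[s]").astype(np.int64)
timestamps_sec = timestamps_sec[:, np.newaxis]

# Broadcasting then builds the normalized time difference matrix as before
time_matrix = np.abs(timestamps_sec - timestamps_sec.T).astype(float)
time_matrix /= np.max(time_matrix)
```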

huangapple
  • Posted on 2023-04-13 22:11:42
  • Please keep this link when reposting: https://go.coder-hub.com/76006453.html