What algorithm would be most efficient when trying to find the nearest city given a set of coordinates?
Question
I have a dataset which contains the longitude and latitude of the 1000 largest US cities. I'm designing an API which returns the user's nearest city, given an input of the user's longitude/latitude.
What is the most efficient algorithm I can use to calculate the nearest city? I know that I can use the haversine formula to calculate the distance between the user's coordinates and each city, but it seems inefficient to have to do this for all 1000 cities. I've previously used a k-d tree to solve nearest-neighbour problems on a plane. Is there a similar solution that can be used in the context of a globe?
Edit: to keep this simple, the distance I'm looking for is as the crow flies. I'm not taking roads or routes into account at this stage.
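For reference, the brute-force baseline described in the question can be sketched as below. This is only an illustration: the four-city `SAMPLE_CITIES` list, the function names, and the mean Earth radius constant are assumptions, not part of the original dataset or API.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

# Hypothetical sample data: (name, latitude, longitude) in degrees.
SAMPLE_CITIES = [
    ("New York", 40.7128, -74.0060),
    ("Los Angeles", 34.0522, -118.2437),
    ("Chicago", 41.8781, -87.6298),
    ("Houston", 29.7604, -95.3698),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def nearest_city_brute_force(lat, lon, cities=SAMPLE_CITIES):
    """Scan every city and return the (name, lat, lon) tuple closest to the query."""
    return min(cities, key=lambda c: haversine_km(lat, lon, c[1], c[2]))
```

A call such as `nearest_city_brute_force(39.95, -75.17)` evaluates one haversine distance per city, which is the cost the answers below try to reduce.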
Answer 1
Score: 2
First, https://gisgeography.com/python-libraries-gis-mapping/ lists a number of Python libraries for dealing with geography. Or you can push the problem to a database, for example with https://www.gaia-gis.it/fossil/libspatialite/index, and solve it there. I'd recommend looking at those ideas before rolling your own.
That said, what I've done in the past for US cities is to put everything into a lat/long 2-D structure, take the query location, construct a lat/long bounding box around it (using the fact that lat/long differences can be turned into approximate distances), use that box to return a handful of candidate closest cities, and then do the expensive distance calculation on the candidates only. I've personally seen a database successfully perform this calculation several hundred thousand times per hour while handling a lot of other traffic. And this was 20 years ago; I'd expect more now.
For a worldwide dataset I'd put the cities into a handful of such structures, each with a different choice of poles for lat/long. For a given location I'd pick the structure whose axis puts that location farthest from either pole, then proceed as before.
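The bounding-box idea for US cities can be sketched roughly as follows. This is not the database setup the answer describes: the list comprehension stands in for an indexed range query, the names (`bounding_box`, `nearest_city`, `SAMPLE_CITIES`) and the starting radius are hypothetical, and the sketch assumes the data does not straddle the ±180° meridian.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

# Hypothetical sample data: (name, latitude, longitude) in degrees.
SAMPLE_CITIES = [
    ("New York", 40.7128, -74.0060),
    ("Los Angeles", 34.0522, -118.2437),
    ("Chicago", 41.8781, -87.6298),
    ("Houston", 29.7604, -95.3698),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def bounding_box(lat, lon, radius_km):
    """Lat/long box (degrees) containing every point within radius_km of (lat, lon)."""
    delta = radius_km / EARTH_RADIUS_KM              # angular radius in radians
    dlat = math.degrees(delta)
    ratio = math.sin(delta) / math.cos(math.radians(lat))
    # Near a pole the box has to span all longitudes.
    dlon = math.degrees(math.asin(ratio)) if ratio < 1.0 else 180.0
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

def nearest_city(lat, lon, cities, start_km=50.0, max_km=5000.0):
    """Grow the box until the best candidate is provably the nearest city."""
    radius = start_km
    while radius <= max_km:
        lat_lo, lat_hi, lon_lo, lon_hi = bounding_box(lat, lon, radius)
        # Stand-in for an indexed range query on (lat, lon).
        candidates = [c for c in cities
                      if lat_lo <= c[1] <= lat_hi and lon_lo <= c[2] <= lon_hi]
        if candidates:
            best = min(candidates, key=lambda c: haversine_km(lat, lon, c[1], c[2]))
            # Any closer city would also lie within `radius`, hence inside the box.
            if haversine_km(lat, lon, best[1], best[2]) <= radius:
                return best
        radius *= 2
    # Fallback for very remote queries: plain linear scan.
    return min(cities, key=lambda c: haversine_km(lat, lon, c[1], c[2]))
```

The exact distance is only computed for cities inside the box, so with a real index the expensive step touches a handful of rows instead of all 1000.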
Answer 2
Score: 1
You can split the map into non-overlapping squares that cover the whole US map (i.e., a grid). Number the squares using the coordinates of their upper-left corner (so each one has a unique ID), and do a preprocessing pass in which each city is assigned the ID of the square it belongs to. At query time, find the square the user lies in and check only the cities in that square and in the squares one step away from it (9 squares in total). If these contain no cities, check the squares two steps away, and so on. This way, on average you will check far fewer cities to find the closest one.
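A minimal sketch of such a grid is shown below. Several details are assumptions rather than part of the answer: cells are keyed by floored latitude/longitude indices instead of upper-left corners, the `CityGrid` class and sample data are hypothetical, and the stopping rule is stricter than "stop at the first non-empty ring" because a city in a farther ring can still be closer (a degree of longitude is shorter than a degree of latitude). It also assumes the cities and queries do not straddle the ±180° meridian.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

SAMPLE_CITIES = [  # hypothetical (name, latitude, longitude) in degrees
    ("New York", 40.7128, -74.0060),
    ("Los Angeles", 34.0522, -118.2437),
    ("Chicago", 41.8781, -87.6298),
    ("Houston", 29.7604, -95.3698),
]

def haversine_km(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

class CityGrid:
    def __init__(self, cities, cell_deg=2.0):
        self.cell_deg = cell_deg
        self.cells = defaultdict(list)
        for city in cities:
            self.cells[self._key(city[1], city[2])].append(city)
        rows = [i for i, _ in self.cells]
        cols = [j for _, j in self.cells]
        self.bounds = (min(rows), max(rows), min(cols), max(cols))
        self.max_abs_lat = max(abs(c[1]) for c in cities)

    def _key(self, lat, lon):
        return (math.floor(lat / self.cell_deg), math.floor(lon / self.cell_deg))

    @staticmethod
    def _ring(ci, cj, k):
        """Yield the cells exactly k steps away from cell (ci, cj)."""
        if k == 0:
            yield (ci, cj)
            return
        for d in range(-k, k + 1):        # top and bottom rows of the ring
            yield (ci - k, cj + d)
            yield (ci + k, cj + d)
        for d in range(-k + 1, k):        # left and right columns, corners excluded
            yield (ci + d, cj - k)
            yield (ci + d, cj + k)

    def _lower_bound_km(self, lat, k):
        """Conservative minimum distance to any city outside rings 0..k."""
        step = math.radians(min(k * self.cell_deg, 180.0))
        cos_lat = math.cos(math.radians(max(self.max_abs_lat, abs(lat))))
        by_lat = step                                        # latitude differs by >= step
        by_lon = 2 * math.asin(cos_lat * math.sin(step / 2))  # longitude differs by >= step
        return EARTH_RADIUS_KM * min(by_lat, by_lon)

    def nearest(self, lat, lon):
        ci, cj = self._key(lat, lon)
        i_min, i_max, j_min, j_max = self.bounds
        max_ring = max(abs(ci - i_min), abs(ci - i_max), abs(cj - j_min), abs(cj - j_max))
        best, best_d = None, math.inf
        for k in range(max_ring + 1):
            for key in self._ring(ci, cj, k):
                for city in self.cells.get(key, ()):
                    d = haversine_km(lat, lon, city[1], city[2])
                    if d < best_d:
                        best, best_d = city, d
            if best_d <= self._lower_bound_km(lat, k):
                break  # no unsearched cell can hold a closer city
        return best
```

Building `CityGrid(SAMPLE_CITIES)` once and calling `nearest(lat, lon)` per request keeps the preprocessing out of the query path; the cell size is a tuning knob, not a value taken from the answer.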
Answer 3
Score: 1
This answer is very similar to that of ckc.
First, split the 1000 cities into 2 groups: a big one located between Canada and Mexico, and the few other cities located outside this rectangle (e.g. Alaska, Hawaii, ...).
When processing coordinates, check whether they belong to the small group: in that case, no optimisation is needed.
To optimise the other case, you may divide the map into rectangles (for example 5° lat x 7° lon) and associate with each rectangle the list of cities it contains.
To find the nearest city, consider the rectangle R containing the point.
Compute the distance to the cities in that rectangle.
Process the 8 rectangles adjacent to R by computing the distance from the point to each rectangle: you may then eliminate the adjacent rectangles whose distance is greater than the best distance already found.
Iterate the process at the next level, i.e. the next crown (the rectangles located just outside the 5x5 block of rectangles centred on R).
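The answer does not say how to compute the distance from a point to a rectangle. One option, sketched here as an assumption rather than as the author's method, is a conservative lower bound: it never overestimates the true distance, so a rectangle whose bound already exceeds the best distance found can be skipped safely. The function names are hypothetical, and the sketch assumes the rectangles do not cross the ±180° meridian.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

def rect_lower_bound_km(lat, lon, lat_min, lat_max, lon_min, lon_max):
    """Lower bound (km) on the distance from a point to a lat/long rectangle.

    All arguments are in degrees. The bound is 0 when the point is inside.
    """
    # Gap in each coordinate; zero when the point lies within that range.
    lat_gap = max(lat_min - lat, 0.0, lat - lat_max)
    lon_gap = max(lon_min - lon, 0.0, lon - lon_max)
    # A latitude difference alone forces at least that much angular distance.
    by_lat = math.radians(lat_gap)
    # A longitude difference forces less distance the closer we are to a pole,
    # so use the largest absolute latitude involved to stay conservative.
    max_abs_lat = max(abs(lat), abs(lat_min), abs(lat_max))
    cos_lat = math.cos(math.radians(max_abs_lat))
    by_lon = 2 * math.asin(cos_lat * math.sin(math.radians(lon_gap) / 2))
    return EARTH_RADIUS_KM * max(by_lat, by_lon)

def can_skip(rect, lat, lon, best_km):
    """True if no city inside rect = (lat_min, lat_max, lon_min, lon_max) can beat best_km."""
    return rect_lower_bound_km(lat, lon, *rect) > best_km
```

Because the bound is deliberately loose, some rectangles that could have been eliminated will still be scanned; that costs a few extra distance computations but never a wrong answer.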
Answer 4
Score: 0
There is some confusion in the discussion, so here is an example using Haversine with sklearn.
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree
cities = pd.read_json("https://gist.githubusercontent.com/Miserlou/c5cd8364bf9b2420bb29/raw/2bf258763cdddd704f8ffd3ea9a3e81d25e2c6f6/cities.json")
cities_gps = cities[['latitude','longitude']].values
cities_radians = np.radians(cities_gps)
The above code loads 1000 cities from a GitHub gist I found. Pandas is only used to parse the JSON; it is not needed for the calculation.
You can build a tree as follows (leaf_size can be tweaked a little; it has a minor effect):
tree = BallTree(cities_radians, leaf_size=15, metric='haversine')
Now say we have 10000 lat/long pairs to look up. You can generate some and query as follows:
random_geo = np.random.normal(loc=(30,-80), scale=(8,8), size=(10000,2))
random_geo_radians = np.radians(random_geo)
If you only query one point, make sure not to rebuild the tree for each query. Rebuilding probably takes more time than the query itself for a single lat/long pair, so keep the tree alive and fire queries at it.
distances, idx = tree.query(random_geo_radians, k=1)
distances contains the distance on the unit sphere (an angle in radians); convert it to miles or km by multiplying by the Earth's radius. The Earth's mean radius over the USA would be best.
You could get the names with
cities.city[idx[:, 0]]
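The conversion from unit-sphere distances to real distances is a single multiplication. The sketch below assumes the global mean Earth radius (about 6371 km), not a USA-specific value, and the helper names are hypothetical.

```python
EARTH_RADIUS_KM = 6371.0088                       # assumed global mean Earth radius
KM_PER_MILE = 1.609344                            # exact, by definition of the mile
EARTH_RADIUS_MILES = EARTH_RADIUS_KM / KM_PER_MILE

def radians_to_km(angle_rad):
    """Convert a unit-sphere (haversine) distance in radians to kilometres."""
    return angle_rad * EARTH_RADIUS_KM

def radians_to_miles(angle_rad):
    """Convert a unit-sphere (haversine) distance in radians to miles."""
    return angle_rad * EARTH_RADIUS_MILES

# Example: a haversine distance of 0.01 rad is roughly 63.7 km.
example_km = radians_to_km(0.01)
```

Multiplying the whole `distances` array by the same constant works identically, since the scaling is elementwise.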