创建、保存和加载空间索引使用GeoPandas

huangapple go评论122阅读模式
英文:

Create, save, and load spatial index using GeoPandas

问题

我想使用GeoPandas创建一个空间索引,并将其保存到文件中,而不是每次都重新创建它。我该如何做到这一点?

英文:

I want to create a spatial index using GeoPandas once and save it to a file instead of recreating it every time. How do I do this?

答案1

得分: 1

你可以使用Python的pickle库将GeoDataFrame写入磁盘。这将产生与使用Pandas方法.to_pickle()相同的结果。

  1. # 使用pickle保存GeoDataFrame
  2. with open('gdf_polygons.pkl', 'wb') as pickle_file:
  3. pickle.dump(gdf_polygons, pickle_file)
  4. # 使用pickle加载GeoDataFrame
  5. with open('gdf_polygons.pkl', 'rb') as pickle_file:
  6. gdf_polygons = pickle.load(pickle_file)

值得注意的是,对于点在多边形分析,使用空间连接比使用迭代查找匹配要快。您可以使用GeoPandas的.sjoin()方法。作为一个额外的好处,当您执行空间连接时,如果还没有空间索引,GeoPandas会自动创建一个。我改进了Tyler的示例以显示这种方法更快:

  1. import geopandas as gpd
  2. from shapely.geometry import Point
  3. import os
  4. import pandas as pd
  5. import time
  6. shapefile_path = "C:/Data/GIS/US/ZipCodes/tl_2022_us_zcta520.shp"
  7. gdf_zip_codes_file = "gdf_zip_codes.pkl"
  8. # 如果数据框已经创建,则加载数据框。
  9. if os.path.exists(gdf_zip_codes_file):
  10. gdf_zip_codes = pd.read_pickle(gdf_zip_codes_file)
  11. else:
  12. # 创建并保存数据框和空间索引。
  13. gdf_zip_codes = gpd.read_file(shapefile_path)
  14. gdf_zip_codes.sindex
  15. gdf_zip_codes.to_pickle(gdf_zip_codes_file)
  16. # 查找给定纬度和经度的邮政编码的函数。
  17. def find_zip_code(latitude, longitude):
  18. point = Point(longitude, latitude)
  19. possible_matches = list(gdf_zip_codes.sindex.intersection((point.x, point.y)))
  20. for idx in possible_matches:
  21. if gdf_zip_codes.geometry.iloc[idx].contains(point):
  22. return gdf_zip_codes['ZCTA5CE20'].iloc[idx]
  23. return None
  24. def find_zip_code_sjoin(test_coordinates: list) -> gpd.GeoDataFrame:
  25. # 创建Point几何对象的列表
  26. geometry = [Point(lon, lat) for lat, lon in test_coordinates]
  27. # 创建GeoDataFrame - 假设点位于NAD83坐标系,epsg 4269
  28. gdf_points = gpd.GeoDataFrame(geometry=geometry, columns=['lat', 'lng'], crs=gdf_zip_codes.crs)
  29. # 向DataFrame中添加'lat'和'lng'值
  30. gdf_points['lat'] = [lat for lat, lon in test_coordinates]
  31. gdf_points['lng'] = [lon for lat, lon in test_coordinates]
  32. # 执行空间连接 - 为每个点找到相交的多边形
  33. gdf_joined = gpd.sjoin(gdf_points, gdf_zip_codes, how='left', predicate='intersects', lsuffix='point',
  34. rsuffix='poly')
  35. for index, row in gdf_joined.iterrows():
  36. print(f"The zip code for latitude {row['lat']}, longitude {row['lng']} is {row['ZCTA5CE20']}.")
  37. return gdf_joined
  38. # 示例坐标(纬度,经度)。
  39. test_coordinates = [
  40. (40.7128, -74.0060), # 纽约市,纽约
  41. (34.0522, -118.2437), # 洛杉矶,加利福尼亚
  42. (41.8781, -87.6298), # 芝加哥,伊利诺伊
  43. # ... (其他坐标)
  44. ]
  45. # 使用迭代测试函数的每一对纬度和经度。
  46. # 方法#1 使用迭代
  47. start_time = time.time()
  48. for latitude, longitude in test_coordinates:
  49. zip_code = find_zip_code(latitude, longitude)
  50. if zip_code:
  51. print(f"The zip code for latitude {latitude}, longitude {longitude} is {zip_code}.")
  52. else:
  53. print(f"No zip code found for latitude {latitude}, longitude {longitude}.")
  54. end_time = time.time()
  55. total_time = end_time - start_time
  56. print(f'Time for method with iteration: {total_time:.6f} seconds')
  57. # 方法#2 使用空间连接。
  58. start_time = time.time()
  59. find_zip_code_sjoin(test_coordinates)
  60. end_time = time.time()
  61. total_time = end_time - start_time
  62. print(f'Time with spatial join: {total_time:.6f} seconds')

结果:

  1. Time for method with iteration: 0.189 seconds
  2. ...
  3. Time with spatial join: 0.036 seconds
英文:

You can use Python's pickle library to write a GeoDataFrame to disk. It will give identical results to using Pandas method .to_pickle().

  1. # Save GeoDataFrame with pickle
  2. with open('gdf_polygons.pkl', 'wb') as pickle_file:
  3. pickle.dump(gdf_polygons, pickle_file)
  4. # Load GeoDataFrame with pickle
  5. with open('gdf_polygons.pkl', 'rb') as pickle_file:
  6. gdf_polygons = pickle.load(pickle_file)

It is worth noting that for a points-in-polygons analysis, it is faster to use a spatial join, rather than using iteration to find a match. You can use GeoPandas' .sjoin() method. As a bonus, when you do a spatial join, GeoPandas will automatically create a spatial index if there is not one already. I reworked Tyler's example to show that this method is faster:

  1. import geopandas as gpd
  2. from shapely.geometry import Point
  3. import os
  4. import pandas as pd
  5. import time
  6. shapefile_path = "C:/Data/GIS/US/ZipCodes/tl_2022_us_zcta520.shp"
  7. gdf_zip_codes_file = "gdf_zip_codes.pkl"
  8. # Load the dataframe if it's already been created.
  9. if os.path.exists(gdf_zip_codes_file):
  10. gdf_zip_codes = pd.read_pickle(gdf_zip_codes_file)
  11. else:
  12. # Create and save the dataframe and spatial index.
  13. gdf_zip_codes = gpd.read_file(shapefile_path)
  14. gdf_zip_codes.sindex
  15. gdf_zip_codes.to_pickle(gdf_zip_codes_file)
  16. # Function to find the zip code at a given latitude and longitude.
  17. def find_zip_code(latitude, longitude):
  18. point = Point(longitude, latitude)
  19. possible_matches = list(gdf_zip_codes.sindex.intersection((point.x, point.y)))
  20. for idx in possible_matches:
  21. if gdf_zip_codes.geometry.iloc[idx].contains(point):
  22. return gdf_zip_codes['ZCTA5CE20'].iloc[idx]
  23. return None
  24. def find_zip_code_sjoin(test_coordinates: list) -> gpd.GeoDataFrame:
  25. # Create a list of Point geometries
  26. geometry = [Point(lon, lat) for lat, lon in test_coordinates]
  27. # Create a GeoDataFrame - assume points are in NAD83, epsg 4269
  28. gdf_points = gpd.GeoDataFrame(geometry=geometry, columns=['lat', 'lng'], crs=gdf_zip_codes.crs)
  29. # Add the 'lat' and 'lng' values to the DataFrame
  30. gdf_points['lat'] = [lat for lat, lon in test_coordinates]
  31. gdf_points['lng'] = [lon for lat, lon in test_coordinates]
  32. # Perform the spatial join - finds the intersecting polygon for each point
  33. gdf_joined = gpd.sjoin(gdf_points, gdf_zip_codes, how='left', predicate='intersects', lsuffix='point',
  34. rsuffix='poly')
  35. for index, row in gdf_joined.iterrows():
  36. print(f"The zip code for latitude {row['lat']}, longitude {row['lng']} is {row['ZCTA5CE20']}.")
  37. return gdf_joined
  38. # Example coordinates (latitude, longitude).
  39. test_coordinates = [
  40. (40.7128, -74.0060), # New York City, NY
  41. (34.0522, -118.2437), # Los Angeles, CA
  42. (41.8781, -87.6298), # Chicago, IL
  43. (29.7604, -95.3698), # Houston, TX
  44. (33.4484, -112.0740), # Phoenix, AZ
  45. (39.9526, -75.1652), # Philadelphia, PA
  46. (32.7157, -117.1611), # San Diego, CA
  47. (29.9511, -90.0715), # New Orleans, LA
  48. (37.7749, -122.4194), # San Francisco, CA
  49. (38.9072, -77.0369), # Washington, D.C.
  50. (33.7490, -84.3880), # Atlanta, GA
  51. (35.2271, -80.8431), # Charlotte, NC
  52. (42.3601, -71.0589), # Boston, MA
  53. (36.7783, -119.4179), # Fresno, CA
  54. (30.2672, -97.7431), # Austin, TX
  55. (32.7767, -96.7970), # Dallas, TX
  56. (25.7617, -80.1918), # Miami, FL
  57. (39.7684, -86.1581), # Indianapolis, IN
  58. (47.6062, -122.3321), # Seattle, WA
  59. (35.7796, -78.6382), # Raleigh, NC
  60. ]
  61. # Test the function for each pair of latitude and longitude.
  62. # Method #1 with iteration
  63. start_time = time.time()
  64. for latitude, longitude in test_coordinates:
  65. zip_code = find_zip_code(latitude, longitude)
  66. if zip_code:
  67. print(f"The zip code for latitude {latitude}, longitude {longitude} is {zip_code}.")
  68. else:
  69. print(f"No zip code found for latitude {latitude}, longitude {longitude}.")
  70. end_time = time.time()
  71. total_time = end_time - start_time
  72. print(f'Time for method with iteration: {total_time:.6f} seconds')
  73. # Method #2 with spatial joins.
  74. start_time = time.time()
  75. find_zip_code_sjoin(test_coordinates)
  76. end_time = time.time()
  77. total_time = end_time - start_time
  78. print(f'Time with spatial join: {total_time:.6f} seconds')

Results:

  1. Time for method with iteration: 0.189 seconds
  2. ...
  3. Time with spatial join: 0.036 seconds

答案2

得分: 0

您可以通过将GeoPandas数据帧进行pickling来避免每次都重新创建空间索引。

2022年美国人口普查邮政编码划分区域为例:

  1. import geopandas as gpd
  2. from shapely.geometry import Point
  3. import os
  4. import pandas as pd
  5. import datetime
  6. start_time = datetime.datetime.now().time().strftime('%H:%M:%S')
  7. shapefile_path = "tl_2022_us_zcta520.shp"
  8. gdf_zip_codes_file = "gdf_zip_codes.pkl"
  9. # 如果已经创建了数据帧,则加载数据帧。
  10. if os.path.exists(gdf_zip_codes_file):
  11. gdf_zip_codes = pd.read_pickle(gdf_zip_codes_file)
  12. # 否则,创建并保存数据帧和空间索引。
  13. else:
  14. gdf_zip_codes = gpd.read_file(shapefile_path)
  15. gdf_zip_codes.sindex
  16. gdf_zip_codes.to_pickle(gdf_zip_codes_file)
  17. # 用于查找给定纬度和经度的邮政编码的函数。
  18. def find_zip_code(latitude, longitude):
  19. point = Point(longitude, latitude)
  20. possible_matches = list(gdf_zip_codes.sindex.intersection((point.x, point.y)))
  21. for idx in possible_matches:
  22. if gdf_zip_codes.geometry.iloc[idx].contains(point):
  23. return gdf_zip_codes['ZCTA5CE20'].iloc[idx]
  24. return None
  25. # 示例坐标(纬度,经度)。
  26. test_coordinates = [
  27. (40.7128, -74.0060), # 纽约市,纽约
  28. (34.0522, -118.2437), # 洛杉矶,加利福尼亚
  29. (41.8781, -87.6298), # 芝加哥,伊利诺伊
  30. (29.7604, -95.3698), # 休斯敦,得克萨斯
  31. (33.4484, -112.0740), # 凤凰城,亚利桑那
  32. (39.9526, -75.1652), # 费城,宾夕法尼亚
  33. (32.7157, -117.1611), # 圣迭戈,加利福尼亚
  34. (29.9511, -90.0715), # 新奥尔良,路易斯安那
  35. (37.7749, -122.4194), # 旧金山,加利福尼亚
  36. (38.9072, -77.0369), # 华盛顿特区
  37. (33.7490, -84.3880), # 亚特兰大,佐治亚
  38. (35.2271, -80.8431), # 夏洛特,北卡罗来纳
  39. (42.3601, -71.0589), # 波士顿,马萨诸塞
  40. (36.7783, -119.4179), # 弗雷斯诺,加利福尼亚
  41. (30.2672, -97.7431), # 奥斯汀,得克萨斯
  42. (32.7767, -96.7970), # 达拉斯,得克萨斯
  43. (25.7617, -80.1918), # 迈阿密,佛罗里达
  44. (39.7684, -86.1581), # 印第安纳波利斯,印第安纳
  45. (47.6062, -122.3321), # 西雅图,华盛顿
  46. (35.7796, -78.6382), # 里尔,北卡罗来纳
  47. ]
  48. # 为每对纬度和经度测试函数。
  49. for latitude, longitude in test_coordinates:
  50. zip_code = find_zip_code(latitude, longitude)
  51. if zip_code:
  52. print(f"纬度{latitude},经度{longitude}的邮政编码是{zip_code}。")
  53. else:
  54. print(f"未找到纬度{latitude},经度{longitude}的邮政编码。")
  55. end_time = datetime.datetime.now().time().strftime('%H:%M:%S')
  56. total_time = (datetime.datetime.strptime(end_time, '%H:%M:%S') - datetime.datetime.strptime(start_time, '%H:%M:%S'))
  57. print('总时间:' + str(total_time))

第一次运行:

  1. 纬度40.7128,经度-74.006的邮政编码为10007
  2. 纬度34.0522,经度-118.2437的邮政编码为90012
  3. 纬度41.8781,经度-87.6298的邮政编码为60604
  4. ...
  5. 总时间:0:01:00

第二次运行:

  1. ...
  2. 总时间:0:00:03
英文:

You can avoid recreating the spatial index each time by pickling the GeoPandas dataframe.

Using 2022 US Census Zip Code Tabulation Areas as an example:

  1. import geopandas as gpd
  2. from shapely.geometry import Point
  3. import os
  4. import pandas as pd
  5. import datetime
  6. start_time = datetime.datetime.now().time().strftime('%H:%M:%S')
  7. shapefile_path = "tl_2022_us_zcta520.shp"
  8. gdf_zip_codes_file = "gdf_zip_codes.pkl"
  9. # Load the dataframe if it's already been created.
  10. if os.path.exists(gdf_zip_codes_file):
  11. gdf_zip_codes = pd.read_pickle(gdf_zip_codes_file)
  12. # Create and save the dataframe and spatial index.
  13. else:
  14. gdf_zip_codes = gpd.read_file(shapefile_path)
  15. gdf_zip_codes.sindex
  16. gdf_zip_codes.to_pickle(gdf_zip_codes_file)
  17. # Function to find the zip code at a given latitude and longitude.
  18. def find_zip_code(latitude, longitude):
  19. point = Point(longitude, latitude)
  20. possible_matches = list(gdf_zip_codes.sindex.intersection((point.x, point.y)))
  21. for idx in possible_matches:
  22. if gdf_zip_codes.geometry.iloc[idx].contains(point):
  23. return gdf_zip_codes['ZCTA5CE20'].iloc[idx]
  24. return None
  25. # Example coordinates (latitude, longitude).
  26. test_coordinates = [
  27. (40.7128, -74.0060), # New York City, NY
  28. (34.0522, -118.2437), # Los Angeles, CA
  29. (41.8781, -87.6298), # Chicago, IL
  30. (29.7604, -95.3698), # Houston, TX
  31. (33.4484, -112.0740), # Phoenix, AZ
  32. (39.9526, -75.1652), # Philadelphia, PA
  33. (32.7157, -117.1611), # San Diego, CA
  34. (29.9511, -90.0715), # New Orleans, LA
  35. (37.7749, -122.4194), # San Francisco, CA
  36. (38.9072, -77.0369), # Washington, D.C.
  37. (33.7490, -84.3880), # Atlanta, GA
  38. (35.2271, -80.8431), # Charlotte, NC
  39. (42.3601, -71.0589), # Boston, MA
  40. (36.7783, -119.4179), # Fresno, CA
  41. (30.2672, -97.7431), # Austin, TX
  42. (32.7767, -96.7970), # Dallas, TX
  43. (25.7617, -80.1918), # Miami, FL
  44. (39.7684, -86.1581), # Indianapolis, IN
  45. (47.6062, -122.3321), # Seattle, WA
  46. (35.7796, -78.6382), # Raleigh, NC
  47. ]
  48. # Test the function for each pair of latitude and longitude.
  49. for latitude, longitude in test_coordinates:
  50. zip_code = find_zip_code(latitude, longitude)
  51. if zip_code:
  52. print(f"The zip code for latitude {latitude}, longitude {longitude} is {zip_code}.")
  53. else:
  54. print(f"No zip code found for latitude {latitude}, longitude {longitude}.")
  55. end_time = datetime.datetime.now().time().strftime('%H:%M:%S')
  56. total_time = (datetime.datetime.strptime(end_time,'%H:%M:%S') - datetime.datetime.strptime(start_time,'%H:%M:%S'))
  57. print('Total time: ' + str(total_time))

First run:

  1. The zip code for latitude 40.7128, longitude -74.006 is 10007.
  2. The zip code for latitude 34.0522, longitude -118.2437 is 90012.
  3. The zip code for latitude 41.8781, longitude -87.6298 is 60604.
  4. ...
  5. Total time: 0:01:00

Second run:

  1. ...
  2. Total time: 0:00:03

huangapple
  • 本文由 发表于 2023年7月31日 23:00:20
  • 转载请务必保留本文链接:https://go.coder-hub.com/76804871.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定