Iteratively load large spatial dataset and perform intersection in R
Question
I have a 20 GB GeoPackage that I need to intersect with a much smaller GeoPackage (100 MB). If the files were smaller, this would work:
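library(sf)
library(dplyr)  # provides the %>% pipe, which sf does not export
# the large layer (stands in for the 20 GB file)
BigDataset <- st_read("https://automaticknowledge.org/gb/greenspace/Manchester_greenspace.gpkg")
# the much smaller layer
Dataset <- st_read("https://automaticknowledge.org/gb/wards/Manchester_wards.gpkg")
# intersect, then keep only the polygon parts of the result
Intersection <- Dataset %>%
  st_intersection(., BigDataset) %>%
  st_collection_extract(., "POLYGON")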
However, the 20 GB file is large enough that my R session crashes. Is there a way to load only part of the 20 GB GeoPackage at a time (say, in chunks of 1000 rows), perform the intersection on each chunk, and then combine the results into a single 'Intersection' output once all the data has been loaded and processed?
Answer 1
Score: 2
You are in luck that you are dealing with GeoPackages: a GeoPackage is a database (SQLite, to be technical) and can be queried using SQL.
Consider this piece of code; what it does is:
- first it creates a GeoPackage to work with (your original code was not quite reproducible)
- then it lists the layers of the database, so that you know how to construct your FROM clause
- finally it runs a SQL query selecting rows 5 to 17; these numbers are of course arbitrary and you will want to supply your own
To construct the SQL code for iteration, consider using the {glue} package.
Also, and this may be somewhat out of scope, it should be possible to perform the operation entirely in the database using spatial SQL functions, bypassing R, which is memory constrained.
library(sf)
# load the much loved North Carolina shapefile
shape <- st_read(system.file("shape/nc.shp", package="sf"))
# create a geopackage, any geopackage...
# (delete_dsn = TRUE overwrites the file, so the script can be re-run)
st_write(shape, "nc.gpkg", delete_dsn = TRUE)
# to know your layers & construct your from clause
st_layers("nc.gpkg")
# read only rows 5 to 17 of layer "nc" instead of the whole layer
iter_read <- st_read("nc.gpkg", query = "select * from nc where rowid between 5 and 17;")
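The single query above can be turned into a loop. What follows is only a sketch of how that might look, not a tested recipe for the 20 GB file: it reuses the North Carolina data as a stand-in for the large GeoPackage and treats the first ten counties as the small dataset. The names `gpkg`, `small`, `chunk_size` and `pieces` are made up for the illustration, and the data is reprojected to EPSG:32119 purely so that the demo intersection runs on planar coordinates. It also assumes the layer is called `nc`, as reported by `st_layers()`.

```r
library(sf)
library(glue)

# Stand-in data (assumption: your real inputs replace these two objects)
gpkg <- file.path(tempdir(), "nc_chunks.gpkg")  # hypothetical path
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc <- st_transform(nc, 32119)                   # projected CRS for the demo
st_write(nc, gpkg, layer = "nc", delete_dsn = TRUE, quiet = TRUE)
small <- nc[1:10, ]                             # plays the role of the 100 MB file

chunk_size <- 25  # the question suggests 1000 for the real data

# Ask the database for the range of row ids; no geometry is loaded here,
# so st_read() returns a plain data.frame (with a warning saying so)
ids <- st_read(gpkg, quiet = TRUE,
               query = "select min(rowid) as lo, max(rowid) as hi from nc")
first_id <- as.numeric(ids$lo)
last_id  <- as.numeric(ids$hi)

starts <- seq(first_id, last_id, by = chunk_size)
pieces <- vector("list", length(starts))

for (i in seq_along(starts)) {
  lo <- starts[i]
  hi <- min(lo + chunk_size - 1, last_id)
  # glue() fills {lo} and {hi} into the SQL text for this chunk
  sql <- as.character(glue("select * from nc where rowid between {lo} and {hi}"))
  chunk <- st_read(gpkg, query = sql, quiet = TRUE)
  if (nrow(chunk) == 0) next                    # gap in the row ids
  pieces[[i]] <- st_intersection(small, chunk)
}

# Drop skipped or empty chunks, then stack the rest into one sf object
pieces <- Filter(function(p) !is.null(p) && nrow(p) > 0, pieces)
Intersection <- do.call(rbind, pieces)
Intersection <- st_collection_extract(Intersection, "POLYGON")
```

Only one chunk of the large layer is held in memory at a time, but the intersection results still accumulate in `pieces`. If those are themselves too large, one option is to write each piece to disk with `st_write(..., append = TRUE)` inside the loop instead of collecting them in a list.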