Iteratively load large spatial dataset and perform intersection in R

Question

I have a 20 GB GeoPackage that I need to intersect with a much smaller GeoPackage (100 MB). If the files were smaller, this would work:

library(sf)
library(magrittr)  # needed for the %>% pipe used below

BigDataset <- st_read("https://automaticknowledge.org/gb/greenspace/Manchester_greenspace.gpkg")
Dataset <- st_read("https://automaticknowledge.org/gb/wards/Manchester_wards.gpkg")

Intersection <- Dataset %>%
    st_intersection(., BigDataset) %>%
    st_collection_extract(., "POLYGON")

However, because of the size of the 20 GB file, my R session crashes. Is there a way to iteratively load only part of the 20 GB GeoPackage at a time (say, chunking it into groups of 1000 rows), perform the intersection on each chunk, and then generate a single 'Intersection' output once all the data has been loaded and processed?

Answer 1

Score: 2

You are in luck that you are dealing with GeoPackages: a GeoPackage is a database (an SQLite file, to be technical) and can be queried using SQL.

Consider this piece of code; what it does is:

  • creates a GeoPackage to work with (your original code was not quite reproducible)
  • lists the layers of the database, so that you know how to construct your FROM clause
  • performs an SQL query selecting rows 5 through 17; these numbers are of course arbitrary, and you will want to supply your own

To construct the SQL code for iteration, consider using the {glue} package.
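As a rough sketch of how {glue} could build one query per chunk: the layer name, row count and chunk size below are made-up values for illustration, not taken from the question's data.

```r
library(glue)

# Hypothetical values for illustration only
layer      <- "nc"
n_rows     <- 100L
chunk_size <- 30L

# Integer arithmetic keeps large row numbers from being printed
# in scientific notation (e.g. "1e+05") inside the SQL string
starts <- seq.int(1L, n_rows, by = chunk_size)
ends   <- pmin(starts + chunk_size - 1L, n_rows)

# glue() is vectorised, so this yields one query string per chunk
queries <- glue("select * from {layer} where rowid between {starts} and {ends}")
queries
```

Each element of `queries` can then be passed to the `query` argument of `st_read()`. Note that rowid values need not be contiguous if rows were ever deleted from the table, so some chunks may return fewer rows than `chunk_size`.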

Also, and this may be somewhat out of scope, it should be possible to perform the operation entirely in the database using spatial SQL functions, bypassing R, which is memory constrained.

library(sf)

# load the much loved North Carolina shapefile
shape <- st_read(system.file("shape/nc.shp", package = "sf"))

# create a geopackage, any geopackage...
st_write(shape, "nc.gpkg")

# list the layers so you know what to put in your from clause
st_layers("nc.gpkg")

# read only rows 5 through 17
iter_read <- st_read("nc.gpkg", query = "select * from nc where rowid between 5 and 17;")
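Building on the query above, the full chunked workflow might look roughly like the following. This is only a sketch under assumptions: the North Carolina data stands in for the 20 GB file, the first five counties stand in for the small dataset, and the chunk size of 30 is arbitrary. The data is reprojected to a projected CRS (EPSG:32119) purely so the demo uses planar intersection.

```r
library(sf)
library(glue)

# Demo data only: nc stands in for the large GeoPackage
shape <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
shape <- st_transform(shape, 32119)
big_dsn <- tempfile(fileext = ".gpkg")
st_write(shape, big_dsn, layer = "nc", quiet = TRUE)

# Stand-in for the small dataset: the first five counties
small <- shape[1:5, ]

# Get the feature count from the layer metadata instead of loading the layer
layers <- st_layers(big_dsn, do_count = TRUE)
n_rows <- as.integer(layers$features[layers$name == "nc"])

chunk_size <- 30L  # arbitrary; pick what fits in memory
starts <- seq.int(1L, n_rows, by = chunk_size)

pieces <- lapply(starts, function(from) {
  to <- from + chunk_size - 1L
  sql <- glue("select * from nc where rowid between {from} and {to}")
  # Only this chunk of the big layer is held in memory
  chunk <- st_read(big_dsn, query = as.character(sql), quiet = TRUE)
  # The attribute-constancy warning from st_intersection is expected here
  suppressWarnings(st_intersection(small, chunk))
})

# Drop chunks that produced no intersections, then combine the rest
pieces   <- Filter(function(p) nrow(p) > 0, pieces)
combined <- do.call(rbind, pieces)

# Keep only polygonal results, as in the question
intersection <- st_collection_extract(combined, "POLYGON")
```

Only one chunk of the large layer is in memory at a time, but the intersection results still accumulate in `pieces`. If the combined output is itself too large, one option would be to write each piece to disk with `st_write(..., append = TRUE)` inside the loop instead of collecting it.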

huangapple
  • Posted on 2023-06-22 07:33:22
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76527758.html