R Spatstat:识别最近的邻居以供进一步使用

huangapple go评论81阅读模式
英文:

R Spatstat: Identify nearest neighbours for further use

问题

我有一个包含488个GPS点(经度和纬度)的数据框。对于这488个点,我想找到它们的两个最近邻居。

到目前为止,我已经创建了一个点模式对象,并计算了距离最近的两个点(如下所示)。但是,我想进一步,能够通过它们在原始数据集中的ID来识别这些最近的点。

目前,我的脚本工作方式如下:

# 1. 将x和y坐标存储在两个向量中
lon <- data$longitude
lat <- data$latitude

# 2. 创建两个具有包含所有点的三角形维度的向量xrange和yrange
xrange <- range(lon, na.rm=T)
yrange <- range(lat, na.rm=T)

# 3. 创建ppp
lf <- ppp(lon, lat, xrange, yrange)

plot(lf)

nndist(lf, k = 1:2)

给我返回的结果(前5行示例)是:

                 dist.1       dist.2
      [1,] 1.426925e-03 0.0017007414
      [2,] 1.017287e-03 0.0015574895
      [3,] 6.502012e-04 0.0010172867
      [4,] 6.502012e-04 0.0007202307
      [5,] 7.202307e-04 0.0010472445

但是我想要能够将这些结果链接回原始数据集中的"hhid",类似于以下方式:

  hhid         dist.1  dist.1.hhid         dist.2     dist.1.hhid
  1    1.426925e-03             7  0.0017007414                 3
  2    1.017287e-03             6  0.0015574895                 4
  3    6.502012e-04            10  0.0010172867                 5
  4    6.502012e-04             2  0.0007202307                 8
  5    7.202307e-04             1  0.0010472445                13

原始数据集的前20行如下:

    structure(list(hhid = c(2004L, 2006L, 2009L, 2012L, 2013L, 2020L, 
    2022L, 2023L, 2028L, 2029L, 2035L, 2036L, 2043L, 2046L, 2047L, 
    2059L, 2062L, 2063L, 2065L, 2066L), longitude = c(-1.478302479, 
    -1.477469802, -1.476488709, -1.476146936, -1.47547996, -1.475799441, 
    -1.475903392, -1.476232767, -1.476053953, -1.477196693, -1.476906657, 
    -1.478778243, -1.480723381, -1.433436394, -1.433033824, -1.428791046, 
    -1.431989908, -1.432058454, -1.43134892, -1.430848002), latitude = c(12.10552216, 
    12.10700512, 12.10673618, 12.10618305, 12.10645485, 12.10846806, 
    12.1080761, 12.10830975, 12.11114883, 12.11076546, 12.11197853, 
    12.11345387, 12.10725021, 12.1183548, 12.11699867, 12.11466122, 
    12.1154108, 12.11545277, 12.11554337, 12.11567497)), row.names = c(NA, 
    20L), class = "data.frame")

希望这可以帮助你将结果链接回原始数据集。

英文:

I have a dataframe with 488 GPS points (long and lat). For each 488 points I would like to find their 2 closest neighbours.

So far I have created a point pattern object and computed the distance from the nearest two points (below). However, I would like to go a step further and be able to identify these nearest points by their ID from the original dataset.

Currently, my script works like:

# 1. store x and y coords in two vectors
lon &lt;- data$longitude
lat &lt;- data$latitude

# 2. create two vectors xrange and yrange with dimensions of triangle that contain all points
xrange &lt;- range(lon, na.rm=T)
yrange &lt;- range(lat, na.rm=T)

# 3. create ppp
lf &lt;- ppp(lon, lat, xrange, yrange)

plot(lf)

nndist(lf, k = 1:2)

Giving me (example of top 5 results):

             dist.1       dist.2
  [1,] 1.426925e-03 0.0017007414
  [2,] 1.017287e-03 0.0015574895
  [3,] 6.502012e-04 0.0010172867
  [4,] 6.502012e-04 0.0007202307
  [5,] 7.202307e-04 0.0010472445

But I would like to be able to link this back to the "hhid" from the original dataset to something like this:

  hhid         dist.1  dist.1.hhid         dist.2     dist.1.hhid
  1    1.426925e-03             7  0.0017007414                 3
  2    1.017287e-03             6  0.0015574895                 4
  3    6.502012e-04            10  0.0010172867                 5
  4    6.502012e-04             2  0.0007202307                 8
  5    7.202307e-04             1  0.0010472445                13

First 20 rows of original dataset :

structure(list(hhid = c(2004L, 2006L, 2009L, 2012L, 2013L, 2020L, 
2022L, 2023L, 2028L, 2029L, 2035L, 2036L, 2043L, 2046L, 2047L, 
2059L, 2062L, 2063L, 2065L, 2066L), longitude = c(-1.478302479, 
-1.477469802, -1.476488709, -1.476146936, -1.47547996, -1.475799441, 
-1.475903392, -1.476232767, -1.476053953, -1.477196693, -1.476906657, 
-1.478778243, -1.480723381, -1.433436394, -1.433033824, -1.428791046, 
-1.431989908, -1.432058454, -1.43134892, -1.430848002), latitude = c(12.10552216, 
12.10700512, 12.10673618, 12.10618305, 12.10645485, 12.10846806, 
12.1080761, 12.10830975, 12.11114883, 12.11076546, 12.11197853, 
12.11345387, 12.10725021, 12.1183548, 12.11699867, 12.11466122, 
12.1154108, 12.11545277, 12.11554337, 12.11567497)), row.names = c(NA, 
20L), class = &quot;data.frame&quot;)

答案1

得分: 2

自从您使用了spatstat命令nndist来计算最近邻距离,您可以直接使用相应的命令nnwhich来获取最近邻的标识符(序列号)。

然而,纬度和经度坐标不能视为x和y坐标。即使在地球的一个小区域内,一单位的纬度与一单位的经度并不相等。

未来版本的spatstat将处理纬度/经度坐标。目前,您可能最好使用sprgeos包,如jpsmith在另一个答案中所描述的。

英文:

Since you used the spatstat command nndist to compute the nearest-neighbour distances, you could just use the corresponding command nnwhich to obtain the identifiers (serial numbers) of the nearest neighbours.

However, latitude and longitude coordinates cannot be treated as x and y coordinates. Even in a small region of the globe, one unit of latitude is not the same distance as one unit of longitude.

Future versions of spatstat will handle latitude/longitude coordinates. For now, you are probably better advised to use the sp and rgeos packages as described in the separate answer by jpsmith.

答案2

得分: 2

如果您计划在进一步分析中使用spatstat,您首先需要投影到平面坐标,然后可以使用nnwhich(),如@adrian-baddely所述。请注意,没有通用最佳坐标参考系统(CRS)可供使用,因此需要您进行选择。有一个名为crsuggest的R软件包可以帮助您找到相关的CRS。在这里我使用覆盖布基纳法索的UTM区域30N(srid=32630):

# 代码部分,不要翻译

spatstat中的“序列号”将是给定点在列表中的顺序的索引。下面是这些索引以及hhid

# 代码部分,不要翻译
英文:

If you plan to use spatstat for further analysis you need to project to
planar coordinates first and then you can use nnwhich() as described by
@adrian-baddely. Note that there is no universaly best coordinate reference
system (CRS) to use, so it requires a choice from you. There is an R package
crsuggest that can help you find relevant ones. Here I use UTM zone 30N
which covers all of Burkina Faso as far as I can tell (srid=32630):

library(sf)
library(spatstat)

hhid &lt;- c(2004L, 2006L, 2009L, 2012L, 2013L, 2020L, 
          2022L, 2023L, 2028L, 2029L, 2035L, 2036L, 2043L, 2046L, 2047L, 
          2059L, 2062L, 2063L, 2065L, 2066L)
longitude &lt;- c(-1.478302479, 
              -1.477469802, -1.476488709, -1.476146936, -1.47547996, -1.475799441, 
              -1.475903392, -1.476232767, -1.476053953, -1.477196693, -1.476906657, 
              -1.478778243, -1.480723381, -1.433436394, -1.433033824, -1.428791046, 
              -1.431989908, -1.432058454, -1.43134892, -1.430848002) 
latitude &lt;- c(12.10552216, 
              12.10700512, 12.10673618, 12.10618305, 12.10645485, 12.10846806, 
              12.1080761, 12.10830975, 12.11114883, 12.11076546, 12.11197853, 
              12.11345387, 12.10725021, 12.1183548, 12.11699867, 12.11466122, 
              12.1154108, 12.11545277, 12.11554337, 12.11567497)
data &lt;- data.frame(hhid, longitude, latitude)

dat_gps &lt;- st_as_sf(data, crs = 4326, coords = c(&quot;longitude&quot;, &quot;latitude&quot;))
dat_proj &lt;- st_transform(dat_gps, crs = 32630)
dat_ppp &lt;- as.ppp(dat_proj)
ids &lt;- marks(dat_ppp)

nd &lt;- nndist(dat_ppp, k = 1:2)
nw &lt;- nnwhich(dat_ppp, k = 1:2)

The “serial number” in spatstat will just be the index of the given point
in the order the points are listed. Below are these as well as hhid:

data.frame(round(nd,2), nw, hhid = ids, dist.1.hhid = ids[nw[,1]], dist.2.hhid = ids[nw[,2]])
#&gt;    dist.1 dist.2 which.1 which.2 hhid dist.1.hhid dist.2.hhid
#&gt; 1  187.42 238.78       2       3 2004        2006        2009
#&gt; 2  110.86 170.31       3       4 2006        2009        2012
#&gt; 3   71.61 110.86       4       2 2009        2012        2006
#&gt; 4   71.61  78.58       3       5 2012        2009        2013
#&gt; 5   78.58 114.13       4       3 2013        2012        2009
#&gt; 6   44.81  50.31       7       8 2020        2022        2023
#&gt; 7   44.20  44.81       8       6 2022        2023        2020
#&gt; 8   44.20  50.31       7       6 2023        2022        2020
#&gt; 9  130.53 131.42      11      10 2028        2035        2029
#&gt; 10 131.42 137.85       9      11 2029        2028        2035
#&gt; 11 130.53 137.85       9      10 2035        2028        2029
#&gt; 12 261.03 343.62      11      10 2036        2035        2029
#&gt; 13 325.55 355.20       1       2 2043        2004        2006
#&gt; 14 156.28 354.33      15      18 2046        2047        2063
#&gt; 15 156.28 201.28      14      18 2047        2046        2063
#&gt; 16 250.42 295.03      20      19 2059        2066        2065
#&gt; 17   8.79  71.30      18      19 2062        2063        2065
#&gt; 18   8.79  77.88      17      19 2063        2062        2065
#&gt; 19  56.44  71.30      20      17 2065        2066        2062
#&gt; 20  56.44 127.69      19      17 2066        2065        2062

答案3

得分: 0

这似乎是对此处提出的问题的扩展。在该问题的已接受答案的基础上,扩展到您特定的情况,考虑最接近的两个邻居,您可以执行以下操作:

library(sp)
library(rgeos)
# 问题中的数据结构赋值为 "df"

spatialDF <- df
coordinates(spatialDF) <- ~longitude + latitude
dists <- gDistance(spatialDF, byid = TRUE)
min.2dists <- apply(dists, 1, function(x) order(x, decreasing = FALSE)[2:3])

# 最近的邻居
df$hhid1 <- df[min.2dists[1,], "hhid"]
df$dist1 <- apply(dists, 1, function(x) sort(x, decreasing = FALSE)[2])

# 第二近的邻居
df$hhid2 <- df[min.2dists[2,], "hhid"]
df$dist2 <- apply(dists, 1, function(x) sort(x, decreasing = FALSE)[3])

输出:

#    hhid longitude latitude hhid1        dist1 hhid2        dist2
# 1  2004 -1.478302 12.10552  2006 1.700741e-03  2009 0.0021825687
# 2  2006 -1.477470 12.10701  2009 1.017287e-03  2012 0.0015574895
# 3  2009 -1.476489 12.10674  2012 6.502012e-04  2006 0.0010172867
# 4  2012 -1.476147 12.10618  2009 6.502012e-04  2013 0.0007202307
# 5  2013 -1.475480 12.10645  2012 7.202307e-04  2009 0.0010472445
# ...
英文:

This seems to be good an extension of the question posed here. Building off of that question's accepted answer to extend to your specific situation examining the closest two neighbors, you could do:

library(sp)
library(rgeos)
# dput structure in question assigned as &quot;df&quot;

spatialDF &lt;- df
coordinates(spatialDF) &lt;- ~longitude + latitude
dists &lt;- gDistance(spatialDF, byid = TRUE)
min.2dists &lt;- apply(dists, 1, function(x) order(x, decreasing = FALSE)[2:3])

# closest
df$hhid1 &lt;- df[min.2dists[1,],&quot;hhid&quot;]
df$dist1 &lt;- apply(dists, 1, function(x) sort(x, decreasing = FALSE)[2])

# second closest
df$hhid2 &lt;- df[min.2dists[2,],&quot;hhid&quot;]
df$dist2 &lt;- apply(dists, 1, function(x) sort(x, decreasing = FALSE)[3])

Output:

#    hhid longitude latitude hhid1        dist1 hhid2        dist2
# 1  2004 -1.478302 12.10552  2006 1.700741e-03  2009 0.0021825687
# 2  2006 -1.477470 12.10701  2009 1.017287e-03  2012 0.0015574895
# 3  2009 -1.476489 12.10674  2012 6.502012e-04  2006 0.0010172867
# 4  2012 -1.476147 12.10618  2009 6.502012e-04  2013 0.0007202307
# 5  2013 -1.475480 12.10645  2012 7.202307e-04  2009 0.0010472445
# ...

huangapple
  • 本文由 发表于 2023年6月29日 03:07:37
  • 转载请务必保留本文链接:https://go.coder-hub.com/76576057.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定