How do I create a subset of previously surveyed areas which covers all survey teams and land class types?
Question
I am currently only working with dummy data (attached) but the situation is:
- I will have a high number of different original survey areas (in dummy data, 100)
- Each original survey area will have been surveyed by a survey team (in my dummy data, I have included 5 different survey teams: a, b, c, d, e)
- Each survey area has also been allocated a land class type (in my dummy data, I have included 9 land classes: 1 - 9)
I want to write a script which will identify a certain proportion (for example's sake, let's say 25%) of survey areas for me to resurvey for quality assurance. The areas identified for resurvey must:
- Evenly (as much as possible) cover all survey teams (i.e. 5 per team)
AND as a subset of that - Evenly (as much as possible) cover all land classes
Is this possible within R? Or an alternate system? I have access to AGOL and ArcPRO too.
Dummy data:
Completed survey area | Survey team | Land Class
1 a 1
2 b 2
3 c 3
4 d 4
5 e 5
6 a 6
7 b 7
8 c 8
9 d 9
10 e 1
11 a 2
12 b 3
13 c 4
14 d 5
15 e 6
16 a 7
17 b 8
18 c 9
19 d 1
20 e 2
21 a 3
22 b 4
23 c 5
24 d 6
25 e 7
26 a 8
27 b 9
28 c 1
29 d 2
30 e 3
31 a 4
32 b 5
33 c 6
34 d 7
35 e 8
36 a 9
37 b 1
38 c 2
39 d 3
40 e 4
41 a 5
42 b 6
43 c 7
44 d 8
45 e 9
46 a 1
47 b 2
48 c 3
49 d 4
50 e 5
51 a 6
52 b 7
53 c 8
54 d 9
55 e 1
56 a 2
57 b 3
58 c 4
59 d 5
60 e 6
61 a 7
62 b 8
63 c 9
64 d 1
65 e 2
66 a 3
67 b 4
68 c 5
69 d 6
70 e 7
71 a 8
72 b 9
73 c 1
74 d 2
75 e 3
76 a 4
77 b 5
78 c 6
79 d 7
80 e 8
81 a 9
82 b 1
83 c 2
84 d 3
85 e 4
86 a 5
87 b 6
88 c 7
89 d 8
90 e 9
91 a 1
92 b 2
93 c 3
94 d 4
95 e 5
96 a 6
97 b 7
98 c 8
99 d 9
100 e 1
I have yet to try anything, as I am not sure where to begin.
Answer 1
Score: 0
Is this what you're after? I created a larger example dataset (n = 10,000) because I believe the example dataset you gave is too small to produce the result you want. Note also that slightly more than 25% of survey sites are returned (2,520 of 10,000): the sample is drawn separately within each "Survey team" / "Land Class" combination, and the count for each combination is rounded to a whole number of sites:
library(tidyr)
library(dplyr)
# set.seed() to make results below reproducible
set.seed(1)
# Example df based on values given in your example df
df <- data.frame(`Completed survey area` = 1:10000,
                 `Survey team` = rep(letters[1:5], 2000),
                 `Land Class` = c(rep(1:9, 1111), 1),
                 check.names = FALSE)
# Return ~25% sample of dataset with even distribution of teams and land classes
resurvey <- df %>%
  group_by(`Survey team`, `Land Class`) %>%
  sample_frac(size = .25) # Sample 25% of the rows within each team/land class group
resurvey
# # A tibble: 2,520 × 3
# # Groups: Survey team, Land Class [45]
# `Completed survey area` `Survey team` `Land Class`
# <int> <chr> <dbl>
# 1 3016 a 1
# 2 7471 a 1
# 3 5761 a 1
# 4 7246 a 1
# 5 9631 a 1
# 6 1891 a 1
# 7 586 a 1
# 8 9406 a 1
# 9 8371 a 1
# 10 2251 a 1
# # … with 2,510 more rows
# # ℹ Use `print(n = ...)` to see more rows
# Check sample distribution
check <- resurvey %>% group_by(`Survey team`,`Land Class`) %>% tally()
print(check, n = nrow(check))
# # A tibble: 45 × 3
# # Groups: Survey team [5]
# `Survey team` `Land Class` n
# <chr> <dbl> <int>
# 1 a 1 56
# 2 a 2 56
# 3 a 3 56
# 4 a 4 56
# 5 a 5 56
# 6 a 6 56
# 7 a 7 56
# 8 a 8 56
# 9 a 9 56
# 10 b 1 56
# 11 b 2 56
# 12 b 3 56
# 13 b 4 56
# 14 b 5 56
# 15 b 6 56
# 16 b 7 56
# 17 b 8 56
# 18 b 9 56
# 19 c 1 56
# 20 c 2 56
# 21 c 3 56
# 22 c 4 56
# 23 c 5 56
# 24 c 6 56
# 25 c 7 56
# 26 c 8 56
# 27 c 9 56
# 28 d 1 56
# 29 d 2 56
# 30 d 3 56
# 31 d 4 56
# 32 d 5 56
# 33 d 6 56
# 34 d 7 56
# 35 d 8 56
# 36 d 9 56
# 37 e 1 56
# 38 e 2 56
# 39 e 3 56
# 40 e 4 56
# 41 e 5 56
# 42 e 6 56
# 43 e 7 56
# 44 e 8 56
# 45 e 9 56
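For a dataset as small as the 100-row dummy data in the question, sampling 25% within each of the 45 team/land class groups does not work well, because each group only holds two or three areas. A minimal sketch of one alternative, not a definitive method: shuffle the rows, keep one area per team/land class pair, then take 5 areas per team. This guarantees 5 areas per team with 5 different land classes inside each team, but the balance of land classes across all 25 areas is left to chance, so it is worth checking. The column names `survey_area`, `survey_team` and `land_class` and the object names are assumptions made for this illustration.

```r
library(dplyr)

set.seed(1)  # make the shuffle reproducible

# Rebuild the 100-row dummy data with syntactically valid (illustrative) names
df_small <- data.frame(survey_area = 1:100,
                       survey_team = rep(letters[1:5], 20),
                       land_class  = c(rep(1:9, 11), 1))

resurvey_small <- df_small %>%
  slice_sample(prop = 1) %>%                               # shuffle all rows
  distinct(survey_team, land_class, .keep_all = TRUE) %>%  # keep first area per team/class pair
  group_by(survey_team) %>%
  slice_head(n = 5) %>%                                    # 5 areas per team, each a different class
  ungroup()

# Check coverage: teams are exactly even, land classes only approximately
count(resurvey_small, survey_team)
count(resurvey_small, land_class)
```

If the land class counts come out too uneven for your purposes, rerunning with a different seed and keeping the most balanced draw is a simple (if crude) way to improve it.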
One observation: I would recommend using syntactically valid column names (e.g. avoid spaces and special characters). That way you don't have to wrap a column name in backticks every time you refer to it.
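As a small illustration of that point, the columns can be renamed once, after which they can be used without backticks. The new names (`survey_area`, `survey_team`, `land_class`) and the object names are just hypothetical choices for this sketch:

```r
library(dplyr)

# Small data frame with the same non-syntactic column names as the example
df_demo <- data.frame(`Completed survey area` = 1:10,
                      `Survey team` = rep(letters[1:5], 2),
                      `Land Class` = c(1:9, 1),
                      check.names = FALSE)

# Rename once (new_name = old_name); backticks are only needed here
df_clean <- df_demo %>%
  rename(survey_area = `Completed survey area`,
         survey_team = `Survey team`,
         land_class  = `Land Class`)

# Later code can refer to the columns directly
team_counts <- df_clean %>% count(survey_team, land_class)
```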