Sequence detection in data.frame
Question
I have a data frame (tibble) and I'm looking for a way to detect specific sequences of values across its variables. There are 3 variables in the reprex, but there can be dozens. I'm showing 70 rows of data, but there could be several hundred thousand. The sequences to detect are stored as data frames in a named list. In the reprex there are 2 sequences, named A and B, but in practice there can be about 100, which is why I chose this structure to store them.
Data:
library(tidyverse)
data1 <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70), x1 = c("z", "z", "z",
"z", "z", "z", "z", "y", "y", "y", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
"a", "z", "z", "z", "z", "z", "z", "z", "z", "z", "z", "z", "z",
"z", "z", "y", "y", "y", "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "a", "z",
"z", "z"), x2 = c("z", "z", "z", "z", "z", "z", "z", "y", "y",
"y", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "a", "z", "z", "z", "z", "z",
"z", "z", "z", "z", "z", "z", "z", "z", "z", "y", "y", "y", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "a", "z", "z", "z"), x3 = c("c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "z", "z", "z",
"z", "z", "z", "z", "z", "z", "z", "f", "f", "f", "f", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c",
"c", "c", "c", "c", "c", "c", "c", "c", "z", "z", "z", "z", "z",
"z", "z", "z", "z", "z", "f", "f", "f", "f", "c", "c", "c", "c",
"c", "c", "c")), row.names = c(NA, -70L), class = c("tbl_df",
"tbl", "data.frame"))
<sup>Created on 2023-07-17 with reprex v2.0.2</sup>
Sequences to detect:
seqs <- list(A = structure(list(ID = c(1, 2, 3, 4, 5),
x1 = c("y", "y", "y", "c", "c"),
x2 = c("y", "y", "y", "c", "c"),
x3 = c("c", "c", "c", "c", "c")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L)),
B = structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
x1 = c("c", "c", "c", "c", "c", "c", "c", "a"),
x2 = c("c", "c", "c", "c", "c", "c", "c", "a"),
x3 = c("f", "f", "f", "f", "c", "c", "c", "c")),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -8L)))
I would like a result like the one below, where a new column det_seq marks the row (second) in which each sequence starts. In the reprex, the sequences I am searching for are separated by other sequences that are not relevant to me. A sequence counts as detected only if it matches on all variables at once, because sequences may differ very slightly, by as little as one value of one variable. I only need the start of each sequence, because its duration is known (the number of rows in the data frame that holds the pattern).
ID x1 x2 x3 det_seq
<dbl> <chr> <chr> <chr> <chr>
1 1 z z c NA
2 2 z z c NA
3 3 z z c NA
4 4 z z c NA
5 5 z z c NA
6 6 z z c NA
7 7 z z c NA
8 8 y y c A
9 9 y y c NA
10 10 y y c NA
11 11 c c c NA
12 12 c c c NA
13 13 c c z NA
14 14 c c z NA
15 15 c c z NA
16 16 c c z NA
17 17 c c z NA
18 18 c c z NA
19 19 c c z NA
20 20 c c z NA
21 21 c c z NA
22 22 c c z NA
23 23 c c f B
24 24 c c f NA
25 25 c c f NA
26 26 c c f NA
27 27 c c c NA
28 28 c c c NA
29 29 c c c NA
30 30 a a c NA
31 31 z z c NA
32 32 z z c NA
33 33 z z c NA
34 34 z z c NA
35 35 z z c NA
36 36 z z c NA
37 37 z z c NA
38 38 z z c NA
39 39 z z c NA
40 40 z z c NA
41 41 z z c NA
42 42 z z c NA
43 43 z z c NA
44 44 z z c NA
45 45 y y c A
46 46 y y c NA
47 47 y y c NA
48 48 c c c NA
49 49 c c c NA
50 50 c c z NA
51 51 c c z NA
52 52 c c z NA
53 53 c c z NA
54 54 c c z NA
55 55 c c z NA
56 56 c c z NA
57 57 c c z NA
58 58 c c z NA
59 59 c c z NA
60 60 c c f B
61 61 c c f NA
62 62 c c f NA
63 63 c c f NA
64 64 c c c NA
65 65 c c c NA
66 66 c c c NA
67 67 a a c NA
68 68 z z c NA
69 69 z z c NA
70 70 z z c NA
Answer 1
Score: 0
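Here's one approach:
# For each row, compare the window starting there with each pattern
# (columns 2:4 are x1, x2, x3; A has 5 rows, B has 8 rows).
data1 %>%
  mutate(det_seq = map_chr(seq_len(nrow(data1)),
    ~ case_when(identical(data1[.x:(.x + 4), 2:4], seqs$A[, 2:4]) ~ "A",
                identical(data1[.x:(.x + 7), 2:4], seqs$B[, 2:4]) ~ "B",
                TRUE ~ NA_character_)))  # a real NA, not the string "NA"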
Update: To match against a seqs list holding any number of data frames, each with any number of rows, use the following chunk of code instead:
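# For each row x, test every pattern s against the window of nrow(s) rows
# starting at x, and keep the name of the first pattern that matches.
data1 %>%
  mutate(det_seq = map_chr(seq_len(nrow(data1)),
    \(x) first(names(seqs)[map_lgl(seqs,
      \(s) identical(data1[x:(x + nrow(s) - 1), 2:4], s[, 2:4]))])))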
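With several hundred thousand rows and around 100 patterns, calling identical() once per row and per pattern may become slow. As a rough sketch only (not part of the original answer), the same exact-match idea can be vectorized in base R: collapse each row into a single string key, then compare the keys at each offset of a pattern against all candidate start rows at once. The helper name detect_starts, the toy data, and the "\r" separator (assumed never to occur in the values) are all hypothetical choices made for illustration.

```r
# Hypothetical helper: for each row of `data`, return the name of the pattern
# that starts there, or NA if none does. Matching is exact on all `cols`.
detect_starts <- function(data, patterns, cols = setdiff(names(data), "ID")) {
  n <- nrow(data)
  # One string key per row; "\r" is an assumed separator absent from the data.
  row_key <- do.call(paste, c(unname(as.list(data[cols])), sep = "\r"))
  out <- rep(NA_character_, n)
  for (nm in names(patterns)) {
    pat <- patterns[[nm]]
    pat_key <- do.call(paste, c(unname(as.list(pat[cols])), sep = "\r"))
    m <- length(pat_key)
    if (m == 0 || m > n) next
    starts <- seq_len(n - m + 1)            # only windows that fit in the data
    hit <- rep(TRUE, length(starts))
    for (k in seq_len(m)) {
      # Row k of the pattern must equal row (start + k - 1) of the data.
      hit <- hit & (row_key[starts + k - 1] == pat_key[k])
    }
    idx <- starts[hit & is.na(out[starts])] # earlier patterns win on ties
    out[idx] <- nm
  }
  out
}

# Small toy data (not the data from the question).
toy <- data.frame(
  ID = 1:8,
  x1 = c("z", "y", "y", "c", "c", "c", "a", "z"),
  x2 = c("z", "y", "y", "c", "c", "c", "a", "z"),
  x3 = c("c", "c", "c", "c", "f", "f", "c", "c")
)
toy_seqs <- list(
  A = data.frame(ID = 1:3, x1 = c("y", "y", "c"),
                 x2 = c("y", "y", "c"), x3 = c("c", "c", "c")),
  B = data.frame(ID = 1:3, x1 = c("c", "c", "a"),
                 x2 = c("c", "c", "a"), x3 = c("f", "f", "c"))
)
toy$det_seq <- detect_starts(toy, toy_seqs)
print(toy)
```

With the objects from the question, the call would presumably be data1$det_seq <- detect_starts(data1, seqs). The work grows with the number of rows times the total number of pattern rows, but each step is a vectorized comparison rather than a per-row subset, and windows that would run past the last row are never built.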