dplyr - filter rows prior to a certain condition, and expanding non-contiguous time values
Question
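First problem:
To solve the first problem, you need to keep only the `positionID`s that contain an exit code, together with the `moveType`s that follow it. You can try the following steps:
- Use the `dplyr` library for the data processing. Make sure it is loaded.
- First, group the data by `positionID`.
library(dplyr)
df <- df %>% group_by(positionID)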
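- Next, create a new logical column that flags the rows carrying an exit code.
df <- df %>% mutate(hasExitCode = !is.na(exitCode))
- Now filter each `positionID` down to its first exit-code row and the rows that follow it.
df_filtered <- df %>% filter(cumsum(hasExitCode) >= 1)
Because `df` is still grouped, the running count restarts for each `positionID`, so this keeps the row with the first exit code and every later row, and drops the rows before it. A `positionID` with no exit code is dropped entirely.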
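Second problem:
The second problem is about inserting new rows between a `moveEndDate` and the following `moveStartDate`: the new row's `moveStartDate` is the previous row's `moveEndDate` + 1 day, and its `moveEndDate` is the next `moveStartDate` for the same `positionID`. This is a more involved operation and needs some extra handling.
You can try the following steps:
- Continue with the data frame `df_filtered` from before, making sure the filter from the first problem has been applied.
- Using `dplyr`, group the data by `positionID`, then sort by `moveStartDate` in ascending order within each group. The dates are stored as day/month/year strings, so parse them first (here with `lubridate::dmy()`); sorting the raw strings would not be chronological.
df_filtered <- df_filtered %>%
  group_by(positionID) %>%
  mutate(across(c(moveStartDate, moveEndDate), lubridate::dmy)) %>%
  arrange(moveStartDate, .by_group = TRUE)
- Now loop over each `positionID`'s rows, check whether there is a gap between a `moveEndDate` and the following `moveStartDate`, and insert a new row if there is.
Note that this kind of processing has several edge cases (missing end dates, overlapping periods, the last row of each group), so the R code needs careful testing and debugging before real use.
Hopefully this guidance helps with both problems. If you need more help, please provide more details or a concrete data sample.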
I have a series of position and personnel changes in an HR data set. The data set represents a history of personnel movements within a given set of `positionID`s.
df <- structure(list(index = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39), positionID = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6), personID = c(114,
115, 113, 109, 108, 108, 100, 108, 100, 108, 101, 110, 110, 110,
110, 103, 102, 112, 112, 112, 102, 102, 117, 102, 107, 107, 104,
109, 118, 118, 105, 118, 118, 111, 106, 106, 120, 120, 120),
moveStartDate = c("1/07/2021", "4/07/2021", "28/06/2021",
"17/01/2022", "15/04/2022", "7/05/2022", "1/07/2022", "1/07/2022",
"26/07/2022", "26/07/2022", "31/12/2020", "1/10/2020", "1/01/2021",
"1/01/2021", "28/01/2021", "31/03/2021", "1/07/2021", "19/07/2021",
"30/04/2022", "3/06/2022", "1/02/2022", "1/07/2022", "29/08/2022",
"24/09/2022", "1/07/2021", "20/08/2021", "5/10/2020", "1/08/2022",
"8/08/2022", "7/09/2022", "2/10/2022", "14/10/2022", "29/10/2022",
"16/10/2020", "1/08/2020", "22/12/2020", "31/01/2022", "24/12/2022",
"3/02/2023"), moveEndDate = c("4/07/2021", "4/07/2021", "17/10/2021",
"14/04/2022", "6/05/2022", "30/06/2022", "25/07/2022", "25/07/2022",
"30/09/2022", "31/12/2022", "31/12/2020", "31/12/2020", "27/01/2021",
"27/01/2021", NA, "31/01/2022", "31/01/2022", "29/04/2022",
"3/06/2022", "3/06/2022", "30/06/2022", "23/09/2022", "25/09/2022",
"31/03/2023", "20/08/2021", "20/08/2021", "1/12/2021", "28/10/2022",
"6/09/2022", "30/09/2022", "2/10/2022", "28/10/2022", NA,
"16/10/2020", "21/12/2020", "21/12/2021", "23/12/2022", "2/02/2023",
NA), moveType = c("temporary appointment", "moved agency",
"temporary appointment", "transfer", "transfer", "transfer",
"redesignation", "transfer", "redesignation", "redesignation",
"moved agency", "relief", "relief", "transfer", "promotion",
"redesignation", "transfer", "transfer", "relief", "resignation",
"transfer", "transfer", "temporary appointment", "transfer",
"relief", "moved agency", "restructure", "relief", "relief",
"relief", "end of contract", "relief", "relief", "moved agency",
"relief", "promotion", "relief", "relief", "relief"), exitCode = c(NA,
"A", NA, NA, NA, NA, NA, NA, NA, NA, "A", NA, NA, NA, NA,
NA, NA, NA, NA, "B", NA, NA, NA, NA, NA, "A", NA, NA, NA,
NA, "E", NA, NA, "A", NA, NA, NA, NA, NA)), row.names = c(NA,
-39L), class = "data.frame")
- Each position has a unique `positionID`. Each staff member has a unique `personID`. Each position can have different staff moved in and out over time. For example, a person could resign or take extended leave and be replaced with a different `personID`.
- If the `personID` leaves the organisation they will have an exit code. Internal movements without an exit are `NA`.
- The movement dates are ordered sequentially, and if the person is still in the position, they do not have a move end date.
There are two problems I am trying to solve, as follows:
First problem:
- I would like to filter the set such that only `positionID`s where there is an exit code, and the subsequent `moveType`s of each `positionID`, are retained.
- I tried a `group_by` on `positionID` and `moveEndDate` to create a sequence of increments within each group. This created the increments, but I could not exclude the rows in each group prior to the non-NA `exitCode`. Within each group I would like to exclude any rows prior to the first `exitCode`.
- What I want is to create a new group where the `moveEndDate` starts at the point where there is an `exitCode`. For example, the `positionID` 1 group would include all rows from `personID` 115 until designated to `personID` 108 on 31/12/2022.
Second problem:
The time periods between a `moveEndDate` and a subsequent `moveStartDate` are not contiguous, for example at indexes 31 and 32. I would like to be able to insert a row whose `moveStartDate` is the immediately prior row's `moveEndDate` + 1 day, and whose `moveEndDate` is the next `moveStartDate` in the group (if there is one).
I’ll be honest. I don’t even know where to start with this one.
Any pointers are greatly appreciated.
Answer 1
Score: 4
With all credit to @jared_mamrot for the first part (see their comment above), you can `filter()` on `cumsum()` to very neatly achieve what you want:
library(dplyr)
x <-
  df %>%
  group_by(positionID) %>%
  filter(cumsum(!is.na(exitCode)) >= 1)
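To see why the running count works, here is a minimal sketch on made-up data (the `toy` data frame, its values and the `exitsSoFar` helper column are hypothetical, not taken from the question). Within each group, `cumsum(!is.na(exitCode))` stays at 0 until the first exit code appears and is at least 1 from that row onwards, so the filter keeps the exit row and everything after it.

```r
library(dplyr)

# Hypothetical toy data: two positions, each with one exit code
toy <- data.frame(
  positionID = c(1, 1, 1, 2, 2, 2),
  personID   = c(10, 11, 12, 20, 21, 22),
  exitCode   = c(NA, "A", NA, "B", NA, NA)
)

kept <- toy %>%
  group_by(positionID) %>%
  # running count of exit codes seen so far within the position
  mutate(exitsSoFar = cumsum(!is.na(exitCode))) %>%
  # keep the first exit row and every row after it
  filter(exitsSoFar >= 1) %>%
  ungroup()

kept
```

In this toy case only the first row of position 1 (personID 10) comes before an exit code, so it is the only row dropped.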
For the second part, I can suggest a solution based on my understanding of your description of the problem. That is, you want to insert a new row between each pair of rows where the number of days (the "gap") between the `moveEndDate` and the subsequent `moveStartDate` is greater than zero. (So cases having a negative gap are ignored.) Note that this operates within a given `positionID`.
We can continue from where we left off above, having created a new data.frame `x` and removed the unwanted rows. We again group by `positionID` and create a couple of new columns, using `lead()` to allow us to easily compare the current `moveEndDate` and the subsequent `moveStartDate`. (Note we are using the `lubridate` package here.)
library(lubridate)
x <-
  x %>%
  group_by(positionID) %>%
  mutate(
    nextStartDate = lead(moveStartDate),
    gapToNextStartDate = (dmy(nextStartDate) - dmy(moveEndDate)) %>% as.numeric()
  )
... which gives us (first five rows shown):
index positionID personID moveStartDate moveEndDate moveType exitCode nextStartDate gapToNextStartDate
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2 1 115 4/07/2021 4/07/2021 moved agency A 28/06/2021 -6
2 3 1 113 28/06/2021 17/10/2021 temporary appointment NA 17/01/2022 92
3 4 1 109 17/01/2022 14/04/2022 transfer NA 15/04/2022 1
4 5 1 108 15/04/2022 6/05/2022 transfer NA 7/05/2022 1
5 6 1 108 7/05/2022 30/06/2022 transfer NA 1/07/2022 1
We are interested in those rows where `gapToNextStartDate` (in days) is positive. It is after these rows that we want to insert our new rows.
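As a small illustration of how the gap is computed, the sketch below applies the same `lead()` and `dmy()` steps to three made-up day/month/year strings (the `starts` and `ends` vectors are hypothetical). It assumes, as in the code above, that a missing end date should simply produce an `NA` gap.

```r
library(dplyr)
library(lubridate)

# Hypothetical start and end dates for three consecutive moves in one position
starts <- c("1/07/2021", "20/07/2021", "21/07/2021")
ends   <- c("4/07/2021", "20/07/2021", NA)

# days between each move's end date and the next move's start date
gap <- as.numeric(dmy(lead(starts)) - dmy(ends))
gap

# rows after which a filler row would be inserted
needsFiller <- !is.na(gap) & gap > 0
needsFiller
```

The last move has no successor and no end date, so its gap is `NA` and the `!is.na()` guard keeps it out of the comparison.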
This next section is inelegant but should do the trick. Not knowing what you wish to set certain columns to in the new rows, I've made them `NA`, with the exception of `moveType`, which I've set to "*** inserted row ***" just to highlight these newly inserted rows.
# create a new data frame having the same columns but zero rows
out <- x[0, ]
for (i in seq_len(nrow(x))) {
  # add the current row from x to the output data.frame
  out <- bind_rows(out, x[i, ])
  # check if we need to insert a new row after this one,
  # based on the condition of a positive number of days between
  # this row's moveEndDate and the next row's moveStartDate
  # (within a given positionID)
  if (!is.na(x[i, ]$gapToNextStartDate) && x[i, ]$gapToNextStartDate > 0) {
    out <-
      bind_rows(
        out,
        data.frame(
          index = NA,
          positionID = x[i, ]$positionID,
          personID = NA,
          moveStartDate = (dmy(x[i, ]$moveEndDate) + 1) %>% format("%d/%m/%Y"),
          moveEndDate = x[i, ]$nextStartDate,
          moveType = "*** inserted row ***",
          exitCode = NA,
          nextStartDate = NA,
          gapToNextStartDate = NA
        )
      )
  }
}
Our original data frame `x` had 31 rows (after executing 'part one'); we have now added an additional 15 rows for cases meeting the condition described above. The first ten rows of the output `out` are shown:
index positionID personID moveStartDate moveEndDate moveType exitCode nextStartDate gapToNextStartDate
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2 1 115 4/07/2021 4/07/2021 moved agency A 28/06/2021 -6
2 3 1 113 28/06/2021 17/10/2021 temporary appointment NA 17/01/2022 92
3 NA 1 NA 18/10/2021 17/01/2022 *** inserted row *** NA NA NA
4 4 1 109 17/01/2022 14/04/2022 transfer NA 15/04/2022 1
5 NA 1 NA 15/04/2022 15/04/2022 *** inserted row *** NA NA NA
6 5 1 108 15/04/2022 6/05/2022 transfer NA 7/05/2022 1
7 NA 1 NA 07/05/2022 7/05/2022 *** inserted row *** NA NA NA
8 6 1 108 7/05/2022 30/06/2022 transfer NA 1/07/2022 1
9 NA 1 NA 01/07/2022 1/07/2022 *** inserted row *** NA NA NA
10 7 1 100 1/07/2022 25/07/2022 redesignation NA 1/07/2022 -24
The temporary columns `nextStartDate` and `gapToNextStartDate` have been retained for clarity but could be removed. Desired values for `index`, `personID`, etc. could be added if necessary.
Note: Most of the cases where gaps occur have a gap of a single day. I've assumed these fit your description of "non-contiguous", but if not they could be easily ignored by modifying `> 0` to `> 1`.
Answer 2
Score: 1
<details>
<summary>English:</summary>
First change the date columns to `Date`s:
library(dplyr) # >= v1.1.0
library(lubridate)
df <- df %>%
mutate(across(moveStartDate:moveEndDate, dmy))
For your first problem, filter using `dplyr::cumany()`, with `.by = positionID`:
df <- df %>%
filter(cumany(!is.na(exitCode)), .by = positionID)
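As a quick illustration of what `cumany()` returns, here is a minimal sketch on a made-up logical vector (`hasExit` is a hypothetical stand-in for `!is.na(exitCode)` within one position). Once the first `TRUE` is seen, every later element is `TRUE`, which is the same result as `cumsum(...) >= 1`.

```r
library(dplyr)

# Hypothetical flags for one position: TRUE marks a row with an exit code
hasExit <- c(FALSE, FALSE, TRUE, FALSE, FALSE)

# TRUE from the first exit code onwards
keepFromExit <- cumany(hasExit)
keepFromExit
# [1] FALSE FALSE  TRUE  TRUE  TRUE

# same rows selected as with the running count
identical(keepFromExit, cumsum(hasExit) >= 1)
# [1] TRUE
```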
For your second problem, bring in the next start date using `dplyr::lead()`, filter to rows with >1 day difference, manipulate the dates to fill the gap, and bind back to your original dataframe. You can then `arrange()` and re-compute the `index` column.
gaps <- df %>%
group_by(positionID) %>%
mutate(nextStartDate = lead(moveStartDate)) %>%
filter(nextStartDate - moveEndDate > days(1)) %>%
transmute(
moveStartDate = moveEndDate + days(1),
moveEndDate = nextStartDate
) %>%
ungroup()
df <- df %>%
bind_rows(gaps) %>%
arrange(positionID, moveEndDate) %>%
mutate(index = row_number())
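The same steps can be checked on a small made-up example (the `toy` data frame and its dates are hypothetical). It is a sketch of the approach above under two assumptions: a one-day step between an end date and the next start date counts as contiguous, and a missing end date never produces a filler row.

```r
library(dplyr)

# Hypothetical history for one position: a 1-day step, then a 12-day gap
toy <- data.frame(
  positionID    = c(1, 1, 1),
  personID      = c(10, 11, 12),
  moveStartDate = as.Date(c("2022-01-01", "2022-01-11", "2022-02-01")),
  moveEndDate   = as.Date(c("2022-01-10", "2022-01-20", NA))
)

fillers <- toy %>%
  group_by(positionID) %>%
  mutate(nextStartDate = lead(moveStartDate)) %>%
  # keep only rows followed by a gap of more than one day (NA gaps are dropped)
  filter(as.numeric(nextStartDate - moveEndDate) > 1) %>%
  transmute(
    moveStartDate = moveEndDate + 1,  # day after the previous end date
    moveEndDate   = nextStartDate     # runs up to the next start date
  ) %>%
  ungroup()

combined <- bind_rows(toy, fillers) %>%
  arrange(positionID, moveStartDate)

combined
```

Only the second move is followed by a real gap, so a single filler row is added, and its `personID` is `NA` because `transmute()` keeps just the grouping column and the two date columns.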
Result:
#> print(as_tibble(df), n = 35)
# A tibble: 35 × 7
index positionID personID moveStartDate moveEndDate moveType exitCode
<int> <dbl> <dbl> <date> <date> <chr> <chr>
1 1 1 115 2021-07-04 2021-07-04 moved agency A
2 2 1 113 2021-06-28 2021-10-17 temporary appoi… <NA>
3 3 1 NA 2021-10-18 2022-01-17 <NA> <NA>
4 4 1 109 2022-01-17 2022-04-14 transfer <NA>
5 5 1 108 2022-04-15 2022-05-06 transfer <NA>
6 6 1 108 2022-05-07 2022-06-30 transfer <NA>
7 7 1 100 2022-07-01 2022-07-25 redesignation <NA>
8 8 1 108 2022-07-01 2022-07-25 transfer <NA>
9 9 1 100 2022-07-26 2022-09-30 redesignation <NA>
10 10 1 108 2022-07-26 2022-12-31 redesignation <NA>
11 11 2 101 2020-12-31 2020-12-31 moved agency A
12 12 2 110 2020-10-01 2020-12-31 relief <NA>
13 13 2 110 2021-01-01 2021-01-27 relief <NA>
14 14 2 110 2021-01-01 2021-01-27 transfer <NA>
15 15 2 110 2021-01-28 NA promotion <NA>
16 16 3 112 2022-06-03 2022-06-03 resignation B
17 17 3 102 2022-02-01 2022-06-30 transfer <NA>
18 18 3 102 2022-07-01 2022-09-23 transfer <NA>
19 19 3 117 2022-08-29 2022-09-25 temporary appoi… <NA>
20 20 3 102 2022-09-24 2023-03-31 transfer <NA>
21 21 4 107 2021-08-20 2021-08-20 moved agency A
22 22 4 104 2020-10-05 2021-12-01 restructure <NA>
23 23 4 NA 2021-12-02 2022-08-01 <NA> <NA>
24 24 4 109 2022-08-01 2022-10-28 relief <NA>
25 25 5 105 2022-10-02 2022-10-02 end of contract E
26 26 5 NA 2022-10-03 2022-10-14 <NA> <NA>
27 27 5 118 2022-10-14 2022-10-28 relief <NA>
28 28 5 118 2022-10-29 NA relief <NA>
29 29 6 111 2020-10-16 2020-10-16 moved agency A
30 30 6 106 2020-08-01 2020-12-21 relief <NA>
31 31 6 106 2020-12-22 2021-12-21 promotion <NA>
32 32 6 NA 2021-12-22 2022-01-31 <NA> <NA>
33 33 6 120 2022-01-31 2022-12-23 relief <NA>
34 34 6 120 2022-12-24 2023-02-02 relief <NA>
35 35 6 120 2023-02-03 NA relief <NA>
</details>