英文:
dplyr - filter rows prior to a certain condition, and expanding non-contiguous time values
问题
First problem:
要解决第一个问题,您需要筛选出只包含有退出代码(exit code)以及随后的moveType的positionID。您可以尝试以下步骤:
-
使用
dplyr库来进行数据处理。确保您已加载该库。 -
首先,根据
positionID分组数据。
library(dplyr)
df <- df %>% group_by(positionID)
- 接下来,您可以创建一个新的逻辑列,表示每个
positionID是否包含了退出代码。
df <- df %>% mutate(hasExitCode = !is.na(exitCode))
- 现在,您可以筛选出包含退出代码的
positionID以及其随后的moveType。
df_filtered <- df %>% filter(hasExitCode | lag(hasExitCode, default = FALSE))
这将筛选出包含退出代码或其随后moveType的positionID的行,包括退出代码之前的行。
Second problem:
第二个问题涉及在moveEndDate和随后的moveStartDate日期之间插入新行,新行的moveStartDate为前一行的moveEndDate + 1天,moveEndDate为同一positionID的下一个moveStartDate。这是一个相对复杂的操作,需要一些额外的处理。
您可以尝试以下步骤:
-
继续使用之前的数据框
df_filtered,确保已应用第一个问题的筛选。 -
使用
dplyr库,首先对数据按positionID分组,然后按moveStartDate升序排序。
df_filtered <- df_filtered %>% group_by(positionID) %>% arrange(moveStartDate)
- 现在,您可以使用循环来遍历每个
positionID的数据,检查moveEndDate和随后的moveStartDate之间是否存在间隙,如果有,则插入新行。
请注意,这是一种复杂的数据处理,需要仔细考虑每个情况的边界条件和处理方法。具体的R代码会相当复杂,因此在实际应用时需要仔细测试和调试。
希望这些指导能帮助您解决您的两个问题。如果需要更多帮助,可以提供更多细节或具体的数据示例。
英文:
I have a series of position and personnel changes in a HR data set. The data set represents a history of personnel movements within a given set of positionIDs.
df <- structure(list(index = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39), positionID = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6), personID = c(114,
115, 113, 109, 108, 108, 100, 108, 100, 108, 101, 110, 110, 110,
110, 103, 102, 112, 112, 112, 102, 102, 117, 102, 107, 107, 104,
109, 118, 118, 105, 118, 118, 111, 106, 106, 120, 120, 120),
moveStartDate = c("1/07/2021", "4/07/2021", "28/06/2021",
"17/01/2022", "15/04/2022", "7/05/2022", "1/07/2022", "1/07/2022",
"26/07/2022", "26/07/2022", "31/12/2020", "1/10/2020", "1/01/2021",
"1/01/2021", "28/01/2021", "31/03/2021", "1/07/2021", "19/07/2021",
"30/04/2022", "3/06/2022", "1/02/2022", "1/07/2022", "29/08/2022",
"24/09/2022", "1/07/2021", "20/08/2021", "5/10/2020", "1/08/2022",
"8/08/2022", "7/09/2022", "2/10/2022", "14/10/2022", "29/10/2022",
"16/10/2020", "1/08/2020", "22/12/2020", "31/01/2022", "24/12/2022",
"3/02/2023"), moveEndDate = c("4/07/2021", "4/07/2021", "17/10/2021",
"14/04/2022", "6/05/2022", "30/06/2022", "25/07/2022", "25/07/2022",
"30/09/2022", "31/12/2022", "31/12/2020", "31/12/2020", "27/01/2021",
"27/01/2021", NA, "31/01/2022", "31/01/2022", "29/04/2022",
"3/06/2022", "3/06/2022", "30/06/2022", "23/09/2022", "25/09/2022",
"31/03/2023", "20/08/2021", "20/08/2021", "1/12/2021", "28/10/2022",
"6/09/2022", "30/09/2022", "2/10/2022", "28/10/2022", NA,
"16/10/2020", "21/12/2020", "21/12/2021", "23/12/2022", "2/02/2023",
NA), moveType = c("temporary appointment", "moved agency",
"temporary appointment", "transfer", "transfer", "transfer",
"redesignation", "transfer", "redesignation", "redesignation",
"moved agency", "relief", "relief", "transfer", "promotion",
"redesignation", "transfer", "transfer", "relief", "resignation",
"transfer", "transfer", "temporary appointment", "transfer",
"relief", "moved agency", "restructure", "relief", "relief",
"relief", "end of contract", "relief", "relief", "moved agency",
"relief", "promotion", "relief", "relief", "relief"), exitCode = c(NA,
"A", NA, NA, NA, NA, NA, NA, NA, NA, "A", NA, NA, NA, NA,
NA, NA, NA, NA, "B", NA, NA, NA, NA, NA, "A", NA, NA, NA,
NA, "E", NA, NA, "A", NA, NA, NA, NA, NA)), row.names = c(NA,
-39L), class = "data.frame")
- Each position has a unique
positionIDEach staff member has a
uniquepersonIDEach position can have different staff moved in and
out over time. For example a person could resign or take extended
leave and be replaced with a differentpersonID. - If the
personIDleaves the organisation they will have an exit code. Internal
movements with exiting areNA. - The movement dates are ordered
sequentially, and if the person is still in the position, they do not
have a move end date.
There are two problems I am trying to solve, as follows:
First problem:
- I would like to filter the set such that only
positionIDs where there is an exit code and the subsequentmoveTypesof eachpositionIDare retained. - I tried a
group_byforpositonIDandmoveEndDateto create a sequence of increments within each group. This created the increments, but I would not exclude the rows in each group prior to the non-NAexitCode. Within this group I would like to exclude any rows prior to the firstexitCode. - What I want to is create a new group where the
moveEndDatestarts at the point where there is anexitCode. For examplepositonID1 group would include all rows frompersonID115 until designated topersonID108 on 31/12/2022.
Second problem:
The time periods between a moveEndDate and a subsequent moveStart date are not contiguous. For example indexes 31 and 32. I would like to be able to insert a row that has a moveStartDate that is equal to the moveEndDate + 1day for immediate prior moveEndDate for that row, and a move endEndDate for the next moveStart date in the group (if there is one).
I’ll be honest. I don’t even know where to start with this one.
Any pointers are greatly appreciated.
答案1
得分: 4
以下是代码的翻译部分:
要注意的是,大多数情况下,间隔为一天。我假设这些情况符合您对“不连续”的描述,但如果不符合,可以通过将 `> 0` 修改为 `> 1` 来轻松忽略它们。
希望这有所帮助。如果您有任何其他问题,请随时提出。
英文:
With all credit to @jared_mamrot for the first part - see their comment above - you can filter() on cumsum() to very neatly achieve what you want:
x <-
df %>%
group_by(positionID) %>%
filter(cumsum(!is.na(exitCode)) >= 1)
For the second part, I can suggest a solution based on my understanding of your description of the problem. That is, you want to insert the new row between each row pair where the number of days (the "gap") between the moveEndDate and subsequent moveStartDate is greater than zero. (So cases having a negative gap are ignored.) Note that this operates within a given positionID.
We can continue from where left off above, having created a new data.frame x and removed the unwanted rows. We again group by positionID and create a couple of new columns, using lead() to allow us to easily compare the current moveEndDate and subsequent moveStartDate. (Note we are using the lubridate package here.)
library(lubridate)
x <-
x %>%
group_by(positionID) %>%
mutate(
nextStartDate = lead(moveStartDate),
gapToNextStartDate = (dmy(nextStartDate) - dmy(moveEndDate)) %>% as.numeric()
)
... which gives us (first five rows shown):
index positionID personID moveStartDate moveEndDate moveType exitCode nextStartDate gapToNextStartDate
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2 1 115 4/07/2021 4/07/2021 moved agency A 28/06/2021 -6
2 3 1 113 28/06/2021 17/10/2021 temporary appointment NA 17/01/2022 92
3 4 1 109 17/01/2022 14/04/2022 transfer NA 15/04/2022 1
4 5 1 108 15/04/2022 6/05/2022 transfer NA 7/05/2022 1
5 6 1 108 7/05/2022 30/06/2022 transfer NA 1/07/2022 1
We are interested in those rows having gapToNextStartDate (in days) being positive. It is after these rows that we want to insert our new rows.
This next section is inelegant but should do the trick. Not knowing what you wish to set certain columns in the new rows I've made them NA, with the exception of moveType which I've set to "*** inserted row ***" just to highlight these newly inserted rows.
# create a new data frame having the same columns but zero rows
out <- x[0, ]
for (i in seq_len(nrow(x))) {
# add the current row from x to the output data.frame
out <- bind_rows(out, x[i, ])
# check if we need to insert a new row after this one,
# based on the condition of a positive number of days between
# this row's moveEndDate and the next row's moveStartDate
# (within a given positionID)
if (!is.na(x[i, ]$gapToNextStartDate) & x[i, ]$gapToNextStartDate > 0) {
out <-
bind_rows(
out,
data.frame(
index = NA,
positionID = x[i, ]$positionID,
personID = NA,
moveStartDate = (dmy(x[i, ]$moveEndDate) + 1) %>% format("%d/%m/%Y"),
moveEndDate = x[i, ]$nextStartDate,
moveType = "*** inserted row ***",
exitCode = NA,
nextStartDate = NA,
gapNextStartDate = NA
)
)
}
}
Our original data frame x had 31 rows (after executing 'part one') - we have now added an additional 15 rows for cases meeting the condition as described above. The first ten rows of the output out are shown:
index positionID personID moveStartDate moveEndDate moveType exitCode nextStartDate gapToNextStar…¹ gapNe…²
<dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <lgl>
1 2 1 115 4/07/2021 4/07/2021 moved agency A 28/06/2021 -6 NA
2 3 1 113 28/06/2021 17/10/2021 temporary appointment NA 17/01/2022 92 NA
3 NA 1 NA 18/10/2021 17/01/2022 *** inserted row *** NA NA NA NA
4 4 1 109 17/01/2022 14/04/2022 transfer NA 15/04/2022 1 NA
5 NA 1 NA 15/04/2022 15/04/2022 *** inserted row *** NA NA NA NA
6 5 1 108 15/04/2022 6/05/2022 transfer NA 7/05/2022 1 NA
7 NA 1 NA 07/05/2022 7/05/2022 *** inserted row *** NA NA NA NA
8 6 1 108 7/05/2022 30/06/2022 transfer NA 1/07/2022 1 NA
9 NA 1 NA 01/07/2022 1/07/2022 *** inserted row *** NA NA NA NA
10 7 1 100 1/07/2022 25/07/2022 redesignation NA 1/07/2022 -24 NA
The temporary columns nextStartDate and gapToNextStartDate have been retained for clarity but could be removed. Desired values for index, personID, etc. could be added if necessary.
Note: Most of the cases where gaps occur have a gap of a single day. I've assumed these fit your description of "non-contiguous", but if not they could be easily ignored by modifying > 0 to > 1.
答案2
得分: 1
以下是您要翻译的代码部分:
First change the date columns to Dates:
library(dplyr) # >= v1.1.0
library(lubridate)
df <- df %>%
mutate(across(moveStartDate:moveEndDate, dmy))
For your first problem, filter using dplyr::cumany(), .by PositionID:
df <- df %>
filter(cumany(!is.na(exitCode)), .by = positionID)
For your second problem, bring in the next start date using dplyr::lead(), filter to rows with >1 day difference, manipulate the dates to fill the gap, and bind back to your original dataframe. You can then arrange() and re-compute the index column.
gaps <- df %>%
group_by(positionID) %>%
mutate(nextStartDate = lead(moveStartDate)) %>%
filter(nextStartDate - moveEndDate > days(1)) %>%
transmute(
moveStartDate = moveEndDate + days(1),
moveEndDate = nextStartDate
) %>%
ungroup()
df <- df %>
bind_rows(gaps) %>
arrange(positionID, moveEndDate) %>
mutate(index = row_number())
Result:
#> print(as_tibble(df), n = 35)
# A tibble: 35 × 7
index positionID personID moveStartDate moveEndDate moveType exitCode
<int> <dbl> <dbl> <date> <date> <chr> <chr>
1 1 1 115 2021-07-04 2021-07-04 moved agency A
2 2 1 113 2021-06-28 2021-10-17 temporary appoi… <NA>
3 3 1 NA 2021-10-18 2022-01-17 <NA> <NA>
4 4 1 109 2022-01-17 2022-04-14 transfer <NA>
5 5 1 108 2022-04-15 2022-05-06 transfer <NA>
6 6 1 108 2022-05-07 2022-06-30 transfer <NA>
7 7 1 100 2022-07-01 2022-07-25 redesignation <NA>
8 8 1 108 2022-07-01 2022-07-25 transfer <NA>
9 9 1 100 2022-07-26 2022-09-30 redesignation <NA>
10 10 1 108 2022-07-26 2022-12-31 redesignation <NA>
11 11 2 101 2020-12-31 2020-12-31 moved agency A
12 12 2 110 2020-10-01 2020-12-31 relief <NA>
13 13 2 110 2021-01-01 2021-01-27 relief <NA>
14 14 2 110 2021-01-01 2021-01-27 transfer <NA>
15 15 2 110 2021-01-28 NA promotion <NA>
16 16 3 112 2022-06-03 2022-06-03 resignation B
17 17 3 102 2022-02-01 2022-06-30 transfer <NA>
18 18 3 102 2022-07-01 2022-09-23 transfer <NA>
19 19 3 117 2022-08-29 2022-09-25 temporary appoi… <NA>
20 20 3 102 2022-09-24 2023-03-31 transfer <NA>
21 21 4 107 2021-08-20 2021-08-20 moved agency A
22 22 4 104 2020-10-05 2021-12-01 restructure <NA>
23 23 4 NA 2021-12-02 2022-08-01 <NA> <NA>
24 24 4 109 2022-08-01 2022-10-28 relief <NA>
25 25 5 105 2022-10-02 2022-10-02 end of contract E
26 26 5 NA 2022-10-03 2022-10-14 <NA> <NA>
27 27 5 118 2022-10-14 2022-10-28 relief <NA>
28 28 5 118 2022-10-29 NA relief <NA>
29 29 6 111 2020-10-16 2020-10-16 moved agency A
30 30 6 106 2020-08-01 2020-12-21 relief <NA>
31 31 6 106 2020-12-22 2021-12-21 promotion <NA>
32 32 6 NA 2021-12-22 2022-01-31 <NA> <
<details>
<summary>英文:</summary>
First change the date columns to `Date`s:
library(dplyr) # >= v1.1.0
library(lubridate)
df <- df %>%
mutate(across(moveStartDate:moveEndDate, dmy))
For your first problem, filter using `dplyr::cumany()`, `.by` `PositionID`:
df <- df %>%
filter(cumany(!is.na(exitCode)), .by = positionID)
For your second problem, bring in the next start date using `dplyr::lead()`, filter to rows with >1 day difference, manipulate the dates to fill the gap, and bind back to your original dataframe. You can then `arrange()` and re-compute the `index` column.
gaps <- df %>%
group_by(positionID) %>%
mutate(nextStartDate = lead(moveStartDate)) %>%
filter(nextStartDate - moveEndDate > days(1)) %>%
transmute(
moveStartDate = moveEndDate + days(1),
moveEndDate = nextStartDate
) %>%
ungroup()
df <- df %>%
bind_rows(gaps) %>%
arrange(positionID, moveEndDate) %>%
mutate(index = row_number())
Result:
#> print(as_tibble(df), n = 35)
A tibble: 35 × 7
index positionID personID moveStartDate moveEndDate moveType exitCode
<int> <dbl> <dbl> <date> <date> <chr> <chr>
1 1 1 115 2021-07-04 2021-07-04 moved agency A
2 2 1 113 2021-06-28 2021-10-17 temporary appoi… <NA>
3 3 1 NA 2021-10-18 2022-01-17 <NA> <NA>
4 4 1 109 2022-01-17 2022-04-14 transfer <NA>
5 5 1 108 2022-04-15 2022-05-06 transfer <NA>
6 6 1 108 2022-05-07 2022-06-30 transfer <NA>
7 7 1 100 2022-07-01 2022-07-25 redesignation <NA>
8 8 1 108 2022-07-01 2022-07-25 transfer <NA>
9 9 1 100 2022-07-26 2022-09-30 redesignation <NA>
10 10 1 108 2022-07-26 2022-12-31 redesignation <NA>
11 11 2 101 2020-12-31 2020-12-31 moved agency A
12 12 2 110 2020-10-01 2020-12-31 relief <NA>
13 13 2 110 2021-01-01 2021-01-27 relief <NA>
14 14 2 110 2021-01-01 2021-01-27 transfer <NA>
15 15 2 110 2021-01-28 NA promotion <NA>
16 16 3 112 2022-06-03 2022-06-03 resignation B
17 17 3 102 2022-02-01 2022-06-30 transfer <NA>
18 18 3 102 2022-07-01 2022-09-23 transfer <NA>
19 19 3 117 2022-08-29 2022-09-25 temporary appoi… <NA>
20 20 3 102 2022-09-24 2023-03-31 transfer <NA>
21 21 4 107 2021-08-20 2021-08-20 moved agency A
22 22 4 104 2020-10-05 2021-12-01 restructure <NA>
23 23 4 NA 2021-12-02 2022-08-01 <NA> <NA>
24 24 4 109 2022-08-01 2022-10-28 relief <NA>
25 25 5 105 2022-10-02 2022-10-02 end of contract E
26 26 5 NA 2022-10-03 2022-10-14 <NA> <NA>
27 27 5 118 2022-10-14 2022-10-28 relief <NA>
28 28 5 118 2022-10-29 NA relief <NA>
29 29 6 111 2020-10-16 2020-10-16 moved agency A
30 30 6 106 2020-08-01 2020-12-21 relief <NA>
31 31 6 106 2020-12-22 2021-12-21 promotion <NA>
32 32 6 NA 2021-12-22 2022-01-31 <NA> <NA>
33 33 6 120 2022-01-31 2022-12-23 relief <NA>
34 34 6 120 2022-12-24 2023-02-02 relief <NA>
35 35 6 120 2023-02-03 NA relief <NA>
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。


评论