dplyr – 在特定条件之前筛选行,并扩展非连续时间值。

huangapple go评论58阅读模式
英文:

dplyr - filter rows prior to a certain condition, and expanding non-contiguous time values

问题

First problem:

要解决第一个问题,您需要筛选出只包含有退出代码(exit code)以及随后的moveTypepositionID。您可以尝试以下步骤:

  1. 使用dplyr库来进行数据处理。确保您已加载该库。

  2. 首先,根据positionID分组数据。

library(dplyr)
df <- df %>% group_by(positionID)
  1. 接下来,您可以创建一个新的逻辑列,表示每个positionID是否包含了退出代码。
df <- df %>% mutate(hasExitCode = !is.na(exitCode))
  1. 现在,您可以筛选出包含退出代码的positionID以及其随后的moveType
df_filtered <- df %>% filter(hasExitCode | lag(hasExitCode, default = FALSE))

这将筛选出包含退出代码或其随后moveTypepositionID的行,包括退出代码之前的行。

Second problem:

第二个问题涉及在moveEndDate和随后的moveStartDate日期之间插入新行,新行的moveStartDate为前一行的moveEndDate + 1天,moveEndDate为同一positionID的下一个moveStartDate。这是一个相对复杂的操作,需要一些额外的处理。

您可以尝试以下步骤:

  1. 继续使用之前的数据框df_filtered,确保已应用第一个问题的筛选。

  2. 使用dplyr库,首先对数据按positionID分组,然后按moveStartDate升序排序。

df_filtered <- df_filtered %>% group_by(positionID) %>% arrange(moveStartDate)
  1. 现在,您可以使用循环来遍历每个positionID的数据,检查moveEndDate和随后的moveStartDate之间是否存在间隙,如果有,则插入新行。

请注意,这是一种复杂的数据处理,需要仔细考虑每个情况的边界条件和处理方法。具体的R代码会相当复杂,因此在实际应用时需要仔细测试和调试。

希望这些指导能帮助您解决您的两个问题。如果需要更多帮助,可以提供更多细节或具体的数据示例。

英文:

I have a series of position and personnel changes in a HR data set. The data set represents a history of personnel movements within a given set of positionIDs.

df &lt;- structure(list(index = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39), positionID = c(1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6), personID = c(114, 
115, 113, 109, 108, 108, 100, 108, 100, 108, 101, 110, 110, 110, 
110, 103, 102, 112, 112, 112, 102, 102, 117, 102, 107, 107, 104, 
109, 118, 118, 105, 118, 118, 111, 106, 106, 120, 120, 120), 
    moveStartDate = c(&quot;1/07/2021&quot;, &quot;4/07/2021&quot;, &quot;28/06/2021&quot;, 
    &quot;17/01/2022&quot;, &quot;15/04/2022&quot;, &quot;7/05/2022&quot;, &quot;1/07/2022&quot;, &quot;1/07/2022&quot;, 
    &quot;26/07/2022&quot;, &quot;26/07/2022&quot;, &quot;31/12/2020&quot;, &quot;1/10/2020&quot;, &quot;1/01/2021&quot;, 
    &quot;1/01/2021&quot;, &quot;28/01/2021&quot;, &quot;31/03/2021&quot;, &quot;1/07/2021&quot;, &quot;19/07/2021&quot;, 
    &quot;30/04/2022&quot;, &quot;3/06/2022&quot;, &quot;1/02/2022&quot;, &quot;1/07/2022&quot;, &quot;29/08/2022&quot;, 
    &quot;24/09/2022&quot;, &quot;1/07/2021&quot;, &quot;20/08/2021&quot;, &quot;5/10/2020&quot;, &quot;1/08/2022&quot;, 
    &quot;8/08/2022&quot;, &quot;7/09/2022&quot;, &quot;2/10/2022&quot;, &quot;14/10/2022&quot;, &quot;29/10/2022&quot;, 
    &quot;16/10/2020&quot;, &quot;1/08/2020&quot;, &quot;22/12/2020&quot;, &quot;31/01/2022&quot;, &quot;24/12/2022&quot;, 
    &quot;3/02/2023&quot;), moveEndDate = c(&quot;4/07/2021&quot;, &quot;4/07/2021&quot;, &quot;17/10/2021&quot;, 
    &quot;14/04/2022&quot;, &quot;6/05/2022&quot;, &quot;30/06/2022&quot;, &quot;25/07/2022&quot;, &quot;25/07/2022&quot;, 
    &quot;30/09/2022&quot;, &quot;31/12/2022&quot;, &quot;31/12/2020&quot;, &quot;31/12/2020&quot;, &quot;27/01/2021&quot;, 
    &quot;27/01/2021&quot;, NA, &quot;31/01/2022&quot;, &quot;31/01/2022&quot;, &quot;29/04/2022&quot;, 
    &quot;3/06/2022&quot;, &quot;3/06/2022&quot;, &quot;30/06/2022&quot;, &quot;23/09/2022&quot;, &quot;25/09/2022&quot;, 
    &quot;31/03/2023&quot;, &quot;20/08/2021&quot;, &quot;20/08/2021&quot;, &quot;1/12/2021&quot;, &quot;28/10/2022&quot;, 
    &quot;6/09/2022&quot;, &quot;30/09/2022&quot;, &quot;2/10/2022&quot;, &quot;28/10/2022&quot;, NA, 
    &quot;16/10/2020&quot;, &quot;21/12/2020&quot;, &quot;21/12/2021&quot;, &quot;23/12/2022&quot;, &quot;2/02/2023&quot;, 
    NA), moveType = c(&quot;temporary appointment&quot;, &quot;moved agency&quot;, 
    &quot;temporary appointment&quot;, &quot;transfer&quot;, &quot;transfer&quot;, &quot;transfer&quot;, 
    &quot;redesignation&quot;, &quot;transfer&quot;, &quot;redesignation&quot;, &quot;redesignation&quot;, 
    &quot;moved agency&quot;, &quot;relief&quot;, &quot;relief&quot;, &quot;transfer&quot;, &quot;promotion&quot;, 
    &quot;redesignation&quot;, &quot;transfer&quot;, &quot;transfer&quot;, &quot;relief&quot;, &quot;resignation&quot;, 
    &quot;transfer&quot;, &quot;transfer&quot;, &quot;temporary appointment&quot;, &quot;transfer&quot;, 
    &quot;relief&quot;, &quot;moved agency&quot;, &quot;restructure&quot;, &quot;relief&quot;, &quot;relief&quot;, 
    &quot;relief&quot;, &quot;end of contract&quot;, &quot;relief&quot;, &quot;relief&quot;, &quot;moved agency&quot;, 
    &quot;relief&quot;, &quot;promotion&quot;, &quot;relief&quot;, &quot;relief&quot;, &quot;relief&quot;), exitCode = c(NA, 
    &quot;A&quot;, NA, NA, NA, NA, NA, NA, NA, NA, &quot;A&quot;, NA, NA, NA, NA, 
    NA, NA, NA, NA, &quot;B&quot;, NA, NA, NA, NA, NA, &quot;A&quot;, NA, NA, NA, 
    NA, &quot;E&quot;, NA, NA, &quot;A&quot;, NA, NA, NA, NA, NA)), row.names = c(NA, 
-39L), class = &quot;data.frame&quot;)

  • Each position has a unique positionID Each staff member has a
    unique personID Each position can have different staff moved in and
    out over time. For example a person could resign or take extended
    leave and be replaced with a different personID.
  • If the personID leaves the organisation they will have an exit code. Internal
    movements with exiting are NA.
  • The movement dates are ordered
    sequentially, and if the person is still in the position, they do not
    have a move end date.

There are two problems I am trying to solve, as follows:

First problem:

  • I would like to filter the set such that only positionIDs where there is an exit code and the subsequent moveTypes of each positionID are retained.
  • I tried a group_by for positonID and moveEndDate to create a sequence of increments within each group. This created the increments, but I would not exclude the rows in each group prior to the non-NA exitCode. Within this group I would like to exclude any rows prior to the first exitCode.
  • What I want to is create a new group where the moveEndDate starts at the point where there is an exitCode. For example positonID 1 group would include all rows from personID 115 until designated to personID 108 on 31/12/2022.

Second problem:

The time periods between a moveEndDate and a subsequent moveStart date are not contiguous. For example indexes 31 and 32. I would like to be able to insert a row that has a moveStartDate that is equal to the moveEndDate + 1day for immediate prior moveEndDate for that row, and a move endEndDate for the next moveStart date in the group (if there is one).
I’ll be honest. I don’t even know where to start with this one.

Any pointers are greatly appreciated.

答案1

得分: 4

以下是代码的翻译部分:

要注意的是,大多数情况下,间隔为一天。我假设这些情况符合您对“不连续”的描述,但如果不符合,可以通过将 `> 0` 修改为 `> 1` 来轻松忽略它们。

希望这有所帮助。如果您有任何其他问题,请随时提出。

英文:

With all credit to @jared_mamrot for the first part - see their comment above - you can filter() on cumsum() to very neatly achieve what you want:

x &lt;-
  df %&gt;%
  group_by(positionID) %&gt;%
  filter(cumsum(!is.na(exitCode)) &gt;= 1)

For the second part, I can suggest a solution based on my understanding of your description of the problem. That is, you want to insert the new row between each row pair where the number of days (the "gap") between the moveEndDate and subsequent moveStartDate is greater than zero. (So cases having a negative gap are ignored.) Note that this operates within a given positionID.

We can continue from where left off above, having created a new data.frame x and removed the unwanted rows. We again group by positionID and create a couple of new columns, using lead() to allow us to easily compare the current moveEndDate and subsequent moveStartDate. (Note we are using the lubridate package here.)

library(lubridate)

x &lt;-
  x %&gt;%
  group_by(positionID) %&gt;%
  mutate(
    nextStartDate = lead(moveStartDate),
    gapToNextStartDate = (dmy(nextStartDate) - dmy(moveEndDate)) %&gt;% as.numeric()
  )

... which gives us (first five rows shown):

  index positionID personID moveStartDate moveEndDate moveType              exitCode nextStartDate gapToNextStartDate
  &lt;dbl&gt;      &lt;dbl&gt;    &lt;dbl&gt; &lt;chr&gt;         &lt;chr&gt;       &lt;chr&gt;                 &lt;chr&gt;    &lt;chr&gt;                      &lt;dbl&gt;
1     2          1      115 4/07/2021     4/07/2021   moved agency          A        28/06/2021                    -6
2     3          1      113 28/06/2021    17/10/2021  temporary appointment NA       17/01/2022                    92
3     4          1      109 17/01/2022    14/04/2022  transfer              NA       15/04/2022                     1
4     5          1      108 15/04/2022    6/05/2022   transfer              NA       7/05/2022                      1
5     6          1      108 7/05/2022     30/06/2022  transfer              NA       1/07/2022                      1

We are interested in those rows having gapToNextStartDate (in days) being positive. It is after these rows that we want to insert our new rows.

This next section is inelegant but should do the trick. Not knowing what you wish to set certain columns in the new rows I've made them NA, with the exception of moveType which I've set to "*** inserted row ***" just to highlight these newly inserted rows.

# create a new data frame having the same columns but zero rows
out &lt;- x[0, ]

for (i in seq_len(nrow(x))) {

  # add the current row from x to the output data.frame
  out &lt;- bind_rows(out, x[i, ])  
   
  # check if we need to insert a new row after this one,
  # based on the condition of a positive number of days between
  # this row&#39;s moveEndDate and the next row&#39;s moveStartDate
  # (within a given positionID)
  if (!is.na(x[i, ]$gapToNextStartDate) &amp; x[i, ]$gapToNextStartDate &gt; 0) {
    out &lt;-
      bind_rows(
        out,
        data.frame(
          index = NA,
          positionID = x[i, ]$positionID,
          personID = NA,
          moveStartDate = (dmy(x[i, ]$moveEndDate) + 1) %&gt;% format(&quot;%d/%m/%Y&quot;),
          moveEndDate = x[i, ]$nextStartDate,
          moveType = &quot;*** inserted row ***&quot;,
          exitCode = NA,
          nextStartDate = NA,
          gapNextStartDate = NA
        )
      )
  }
}

Our original data frame x had 31 rows (after executing 'part one') - we have now added an additional 15 rows for cases meeting the condition as described above. The first ten rows of the output out are shown:

   index positionID personID moveStartDate moveEndDate moveType              exitCode nextStartDate gapToNextStar…&#185; gapNe…&#178;
   &lt;dbl&gt;      &lt;dbl&gt;    &lt;dbl&gt; &lt;chr&gt;         &lt;chr&gt;       &lt;chr&gt;                 &lt;chr&gt;    &lt;chr&gt;                   &lt;dbl&gt; &lt;lgl&gt;  
 1     2          1      115 4/07/2021     4/07/2021   moved agency          A        28/06/2021                 -6 NA     
 2     3          1      113 28/06/2021    17/10/2021  temporary appointment NA       17/01/2022                 92 NA     
 3    NA          1       NA 18/10/2021    17/01/2022  *** inserted row ***  NA       NA                         NA NA     
 4     4          1      109 17/01/2022    14/04/2022  transfer              NA       15/04/2022                  1 NA     
 5    NA          1       NA 15/04/2022    15/04/2022  *** inserted row ***  NA       NA                         NA NA     
 6     5          1      108 15/04/2022    6/05/2022   transfer              NA       7/05/2022                   1 NA     
 7    NA          1       NA 07/05/2022    7/05/2022   *** inserted row ***  NA       NA                         NA NA     
 8     6          1      108 7/05/2022     30/06/2022  transfer              NA       1/07/2022                   1 NA     
 9    NA          1       NA 01/07/2022    1/07/2022   *** inserted row ***  NA       NA                         NA NA     
10     7          1      100 1/07/2022     25/07/2022  redesignation         NA       1/07/2022                 -24 NA

The temporary columns nextStartDate and gapToNextStartDate have been retained for clarity but could be removed. Desired values for index, personID, etc. could be added if necessary.

Note: Most of the cases where gaps occur have a gap of a single day. I've assumed these fit your description of "non-contiguous", but if not they could be easily ignored by modifying &gt; 0 to &gt; 1.

答案2

得分: 1

以下是您要翻译的代码部分:

First change the date columns to Dates:

library(dplyr) # &gt;= v1.1.0
library(lubridate)

df &lt;- df %&gt;% 
  mutate(across(moveStartDate:moveEndDate, dmy)) 

For your first problem, filter using dplyr::cumany(), .by PositionID:

df &lt;- df %&gt;
  filter(cumany(!is.na(exitCode)), .by = positionID) 

For your second problem, bring in the next start date using dplyr::lead(), filter to rows with >1 day difference, manipulate the dates to fill the gap, and bind back to your original dataframe. You can then arrange() and re-compute the index column.

gaps &lt;- df %&gt;%
  group_by(positionID) %&gt;%
  mutate(nextStartDate = lead(moveStartDate)) %&gt;%
  filter(nextStartDate - moveEndDate &gt; days(1)) %&gt;%
  transmute(
    moveStartDate = moveEndDate + days(1),
    moveEndDate = nextStartDate
  ) %&gt;%
  ungroup()
  
df &lt;- df %&gt;
  bind_rows(gaps) %&gt;
  arrange(positionID, moveEndDate) %&gt;
  mutate(index = row_number())

Result:

#&gt; print(as_tibble(df), n = 35)
# A tibble: 35 &#215; 7
   index positionID personID moveStartDate moveEndDate moveType         exitCode
   &lt;int&gt;      &lt;dbl&gt;    &lt;dbl&gt; &lt;date&gt;        &lt;date&gt;      &lt;chr&gt;            &lt;chr&gt;   
 1     1          1      115 2021-07-04    2021-07-04  moved agency     A       
 2     2          1      113 2021-06-28    2021-10-17  temporary appoi… &lt;NA&gt;    
 3     3          1       NA 2021-10-18    2022-01-17  &lt;NA&gt;             &lt;NA&gt;    
 4     4          1      109 2022-01-17    2022-04-14  transfer         &lt;NA&gt;    
 5     5          1      108 2022-04-15    2022-05-06  transfer         &lt;NA&gt;    
 6     6          1      108 2022-05-07    2022-06-30  transfer         &lt;NA&gt;    
 7     7          1      100 2022-07-01    2022-07-25  redesignation    &lt;NA&gt;    
 8     8          1      108 2022-07-01    2022-07-25  transfer         &lt;NA&gt;    
 9     9          1      100 2022-07-26    2022-09-30  redesignation    &lt;NA&gt;    
10    10          1      108 2022-07-26    2022-12-31  redesignation    &lt;NA&gt;    
11    11          2      101 2020-12-31    2020-12-31  moved agency     A       
12    12          2      110 2020-10-01    2020-12-31  relief           &lt;NA&gt;    
13    13          2      110 2021-01-01    2021-01-27  relief           &lt;NA&gt;    
14    14          2      110 2021-01-01    2021-01-27  transfer         &lt;NA&gt;    
15    15          2      110 2021-01-28    NA          promotion        &lt;NA&gt;    
16    16          3      112 2022-06-03    2022-06-03  resignation      B       
17    17          3      102 2022-02-01    2022-06-30  transfer         &lt;NA&gt;    
18    18          3      102 2022-07-01    2022-09-23  transfer         &lt;NA&gt;    
19    19          3      117 2022-08-29    2022-09-25  temporary appoi… &lt;NA&gt;    
20    20          3      102 2022-09-24    2023-03-31  transfer         &lt;NA&gt;    
21    21          4      107 2021-08-20    2021-08-20  moved agency     A       
22    22          4      104 2020-10-05    2021-12-01  restructure      &lt;NA&gt;    
23    23          4       NA 2021-12-02    2022-08-01  &lt;NA&gt;             &lt;NA&gt;    
24    24          4      109 2022-08-01    2022-10-28  relief           &lt;NA&gt;    
25    25          5      105 2022-10-02    2022-10-02  end of contract  E       
26    26          5       NA 2022-10-03    2022-10-14  &lt;NA&gt;             &lt;NA&gt;    
27    27          5      118 2022-10-14    2022-10-28  relief           &lt;NA&gt;    
28    28          5      118 2022-10-29    NA          relief           &lt;NA&gt;    
29    29          6      111 2020-10-16    2020-10-16  moved agency     A       
30    30          6      106 2020-08-01    2020-12-21  relief           &lt;NA&gt;    
31    31          6      106 2020-12-22    2021-12-21  promotion        &lt;NA&gt;    
32    32          6       NA 2021-12-22    2022-01-31  &lt;NA&gt;             &lt;

<details>
<summary>英文:</summary>

First change the date columns to `Date`s:

library(dplyr) # >= v1.1.0
library(lubridate)

df <- df %>%
mutate(across(moveStartDate:moveEndDate, dmy))

For your first problem, filter using `dplyr::cumany()`, `.by` `PositionID`:

df <- df %>%
filter(cumany(!is.na(exitCode)), .by = positionID)

For your second problem, bring in the next start date using `dplyr::lead()`, filter to rows with &gt;1 day difference, manipulate the dates to fill the gap, and bind back to your original dataframe. You can then `arrange()` and re-compute the `index` column.

gaps <- df %>%
group_by(positionID) %>%
mutate(nextStartDate = lead(moveStartDate)) %>%
filter(nextStartDate - moveEndDate > days(1)) %>%
transmute(
moveStartDate = moveEndDate + days(1),
moveEndDate = nextStartDate
) %>%
ungroup()

df <- df %>%
bind_rows(gaps) %>%
arrange(positionID, moveEndDate) %>%
mutate(index = row_number())

Result:

#> print(as_tibble(df), n = 35)

A tibble: 35 × 7

index positionID personID moveStartDate moveEndDate moveType exitCode
<int> <dbl> <dbl> <date> <date> <chr> <chr>
1 1 1 115 2021-07-04 2021-07-04 moved agency A
2 2 1 113 2021-06-28 2021-10-17 temporary appoi… <NA>
3 3 1 NA 2021-10-18 2022-01-17 <NA> <NA>
4 4 1 109 2022-01-17 2022-04-14 transfer <NA>
5 5 1 108 2022-04-15 2022-05-06 transfer <NA>
6 6 1 108 2022-05-07 2022-06-30 transfer <NA>
7 7 1 100 2022-07-01 2022-07-25 redesignation <NA>
8 8 1 108 2022-07-01 2022-07-25 transfer <NA>
9 9 1 100 2022-07-26 2022-09-30 redesignation <NA>
10 10 1 108 2022-07-26 2022-12-31 redesignation <NA>
11 11 2 101 2020-12-31 2020-12-31 moved agency A
12 12 2 110 2020-10-01 2020-12-31 relief <NA>
13 13 2 110 2021-01-01 2021-01-27 relief <NA>
14 14 2 110 2021-01-01 2021-01-27 transfer <NA>
15 15 2 110 2021-01-28 NA promotion <NA>
16 16 3 112 2022-06-03 2022-06-03 resignation B
17 17 3 102 2022-02-01 2022-06-30 transfer <NA>
18 18 3 102 2022-07-01 2022-09-23 transfer <NA>
19 19 3 117 2022-08-29 2022-09-25 temporary appoi… <NA>
20 20 3 102 2022-09-24 2023-03-31 transfer <NA>
21 21 4 107 2021-08-20 2021-08-20 moved agency A
22 22 4 104 2020-10-05 2021-12-01 restructure <NA>
23 23 4 NA 2021-12-02 2022-08-01 <NA> <NA>
24 24 4 109 2022-08-01 2022-10-28 relief <NA>
25 25 5 105 2022-10-02 2022-10-02 end of contract E
26 26 5 NA 2022-10-03 2022-10-14 <NA> <NA>
27 27 5 118 2022-10-14 2022-10-28 relief <NA>
28 28 5 118 2022-10-29 NA relief <NA>
29 29 6 111 2020-10-16 2020-10-16 moved agency A
30 30 6 106 2020-08-01 2020-12-21 relief <NA>
31 31 6 106 2020-12-22 2021-12-21 promotion <NA>
32 32 6 NA 2021-12-22 2022-01-31 <NA> <NA>
33 33 6 120 2022-01-31 2022-12-23 relief <NA>
34 34 6 120 2022-12-24 2023-02-02 relief <NA>
35 35 6 120 2023-02-03 NA relief <NA>



</details>



huangapple
  • 本文由 发表于 2023年4月4日 17:11:16
  • 转载请务必保留本文链接:https://go.coder-hub.com/75927533.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定