如何将数据重塑为长格式,同时删除NA值?

huangapple go评论64阅读模式
英文:

How to reshape to long format, removing NAs cells?

问题

我有一个包含社会研究员和再访者的表格,这些地方在第一次尝试时没有得到服务,我需要将我的数据框重新整理成长格式,没有空白或NA值,这对我来说似乎很容易,但我在尝试中没有成功。

zone situation Researcher1 Researcher2 Researcher3 Researcher4 revisit1 revisit2 revisit3
1    3  Answered          NA          NA           4          NA       NA       10       NA
2    3   Refusal          NA          NA          15          NA       NA        5       NA
3    1  Answered          10          NA          NA          NA        2       NA       NA
4    1   Refusal           7          NA          NA          NA        5       NA       NA
5    2  Answered          NA          15          NA          NA        3       NA       NA
6    2   Refusal          NA           3          NA          NA        0       NA       NA
7    4  Answered          NA          NA          NA          13       NA       NA        4
8    4   Refusal          NA          NA          NA           8       NA       NA        4

我期望得到类似这样的结果:

如何将数据重塑为长格式,同时删除NA值?

注意: 上面的代码部分已被排除,只提供翻译的内容。

英文:

I have a table with social researchers and revisiters in places that didn't get service on the first try, I need to reshape my dataframe to long format without whitespace or NAs, that seemed easy to me, but I'm not getting success in my attempts.

  zone situation Researcher1 Researcher2 Researcher3 Researcher4 revisit1 revisit2 revisit3
1    3  Answered          NA          NA           4          NA       NA       10       NA
2    3   Refusal          NA          NA          15          NA       NA        5       NA
3    1  Answered          10          NA          NA          NA        2       NA       NA
4    1   Refusal           7          NA          NA          NA        5       NA       NA
5    2  Answered          NA          15          NA          NA        3       NA       NA
6    2   Refusal          NA           3          NA          NA        0       NA       NA
7    4  Answered          NA          NA          NA          13       NA       NA        4
8    4   Refusal          NA          NA          NA           8       NA       NA        4


long_rsch <- reshape(survey, direction="long", timevar="Situation", 
                     idvar="zone", 
                     varying=list(c("Researcher 1", "Researcher 2", 
                                    "Researcher 3", "Researcher 4"), 
                                  c("revisit 1", "revisit 2", "revisit3")), 
                     v.names=c("first team", "Revisit"))

Gives me:

<!-- language: lang-none -->

Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying,  : 
  &#39;varying&#39; arguments must be the same length

I expected something like this

如何将数据重塑为长格式,同时删除NA值?

答案1

得分: 4

reshape 实际上很简单,只需花点时间理解它是如何工作的。

reshape(survey, direction="long", varying=3:6, sep='', v.names='Result1', timevar='Researcher') |>
  reshape(direction="long", varying=3:5, sep='', idvar=c('id', 'Researcher'), 
          v.names=c('Result2'), timevar='Revisit') |>
  na.omit()

说明:

zone 不能用作 idvar,因为它不是唯一的。如果我们不指定 idvarreshape 会自动处理一个唯一的 idvar(只要没有名为 "id" 的变量),新的 id 列将添加到输出中。

需要进行两次 reshape,出现错误是因为 Researchersrevisits 引用不同的 "times"(否则 Researcher1 会得到 revisit1Researcher2 会得到 revisit2,依此类推),因此第一个 reshape 是在 Researcher* 上进行的;如果我们将 varying 指定为 Researcher* 列的范围,并使用 sep=''(与默认值 sep='.' 相反),reshape 可以正确地猜测名称,我们不需要手动输入它们。

相应地,第二个 reshape 是在 revisit* 上进行的,但现在使用 idResearcher 作为 idvar

最后,我们可以使用简单的 na.omit() 去除不需要的 NA 行。

英文:

reshape is actually straightforward if you take a second and understand how it works.

reshape(survey, direction=&quot;long&quot;, varying=3:6, sep=&#39;&#39;, v.names=&#39;Result1&#39;, timevar=&#39;Researcher&#39;) |&gt;
  reshape(direction=&quot;long&quot;, varying=3:5, sep=&#39;&#39;, idvar=c(&#39;id&#39;, &#39;Researcher&#39;), 
          v.names=c(&#39;Result2&#39;), timevar=&#39;Revisit&#39;) |&gt;
  na.omit()
#       zone situation Researcher Result1 id Revisit Result2
# 3.1.1    1  Answered          1      10  3       1       2
# 4.1.1    1   Refusal          1       7  4       1       5
# 5.2.1    2  Answered          2      15  5       1       3
# 6.2.1    2   Refusal          2       3  6       1       0
# 1.3.2    3  Answered          3       4  1       2      10
# 2.3.2    3   Refusal          3      15  2       2       5
# 7.4.3    4  Answered          4      13  7       3       4
# 8.4.3    4   Refusal          4       8  8       3       4

Explanation:

zone can't be used as idvar because it's not unique. reshape automatically takes care of a unique idvar if we don't specify one (and as long there is no variable called &quot;id&quot;), the new id column will be added in the output.

There are two reshapes needed, you get the error because Researchers and revisits refer to different "times" (else Researcher1 gets revisit1, Researcher2 gets revisit2, etc.), so the first reshape is on Researcher*; if we specify varying as a range of the Researcher* columns and use sep=&#39;&#39; (in contrast to e.g. sep=&#39;.&#39; which is the default), reshape can guess the names correctly, and we don't need to type them.

Correspondingly the second reshape is on revisit*, but now with id and Researcher as idvar.

Finally, we can omit the unwanted NA lines with a simple na.omit().


Data:

survey &lt;- structure(list(zone = c(3, 3, 1, 1, 2, 2, 4, 4), situation = c(&quot;Answered&quot;, 
&quot;Refusal&quot;, &quot;Answered&quot;, &quot;Refusal&quot;, &quot;Answered&quot;, &quot;Refusal&quot;, &quot;Answered&quot;, 
&quot;Refusal&quot;), Researcher1 = c(NA, NA, 10, 7, NA, NA, NA, NA), Researcher2 = c(NA, 
NA, NA, NA, 15, 3, NA, NA), Researcher3 = c(4, 15, NA, NA, NA, 
NA, NA, NA), Researcher4 = c(NA, NA, NA, NA, NA, NA, 13, 8), 
    revisit1 = c(NA, NA, 2, 5, 3, 0, NA, NA), revisit2 = c(10, 
    5, NA, NA, NA, NA, NA, NA), revisit3 = c(NA, NA, NA, NA, 
    NA, NA, 4, 4)), class = &quot;data.frame&quot;, row.names = c(NA, -8L
))

答案2

得分: 2

library(tidyverse)

df <- structure(list(zone = c(3, 3, 1, 1, 2, 2, 4, 4), situation = c("Answer", "Refusal", "Answer", "Refusal", "Answer", "Refusal", "Answer", "Refusal"), `Researcher 1` = c(NA, NA, 10, 7, NA, NA, NA, NA), `Researcher 2` = c(NA, NA, NA, NA, 15, 3, NA, NA), `Researcher 3` = c(4, 15, NA, NA, NA, NA, NA, NA), `Researcher 4` = c(NA, NA, NA, NA, NA, NA, 13, 8), `revisit 1` = c(NA, NA, 2, 5, 3, 0, NA, NA), `revisit 2` = c(10, 5, NA, NA, NA, NA, NA, NA), `revisit 3` = c(NA, NA, NA, NA, NA, NA, 4, 4)), class = "data.frame", row.names = c(NA, -8L))

df <- df %>%
  pivot_longer(cols = contains("Researcher"), names_to = "Researchers", values_to = "Result 1") %>%
  pivot_longer(cols = contains("revisit"), names_to = "revisit", values_to = "Result 2", values_drop_na = TRUE) %>%
  arrange(zone)

# A tibble: 8 × 6
   zone situation Researchers  `Result 1` revisit   `Result 2`
  <dbl> <chr>     <chr>             <dbl> <chr>          <dbl>
1     1 Answer    Researcher 1         10 revisit 1          2
2     1 Refusal   Researcher 1          7 revisit 1          5
3     2 Answer    Researcher 2         15 revisit 1          3
4     2 Refusal   Researcher 2          3 revisit 1          0
5     3 Answer    Researcher 3          4 revisit 2         10
6     3 Refusal   Researcher 3         15 revisit 2          5
7     4 Answer    Researcher 4         13 revisit 3          4
8     4 Refusal   Researcher 4          8 revisit 3          4

You could also then do a bit more cleaning to get to this:

mutate(df, situation = as.factor(situation), 
           Researchers = as.numeric(str_remove(Researchers, "Researcher ")),
              revisit = as.numeric(str_remove(revisit, "revisit ")))

# A tibble: 8 × 6
   zone situation Researchers `Result 1` revisit `Result 2`
  <dbl> <fct>           <dbl>      <dbl>   <dbl>      <dbl>
1     1 Answer              1         10       1          2
2     1 Refusal             1          7       1          5
3     2 Answer              2         15       1          3
4     2 Refusal             2          3       1          0
5     3 Answer              3          4       2         10
6     3 Refusal             3         15       2          5
7     4 Answer              4         13       3          4
8     4 Refusal             4          8       3          4

<details>
<summary>英文:</summary>

library(tidyverse)

df <- structure(list(zone = c(3, 3, 1, 1, 2, 2, 4, 4), situation = c("Answer",
"Refusal", "Answer", "Refusal", "Answer", "Refusal", "Answer",
"Refusal"), Researcher 1 = c(NA, NA, 10, 7, NA, NA, NA, NA),
Researcher 2 = c(NA, NA, NA, NA, 15, 3, NA, NA), Researcher 3 = c(4,
15, NA, NA, NA, NA, NA, NA), Researcher 4 = c(NA, NA, NA,
NA, NA, NA, 13, 8), revisit 1 = c(NA, NA, 2, 5, 3, 0, NA,
NA), revisit 2 = c(10, 5, NA, NA, NA, NA, NA, NA), revisit 3 = c(NA,
NA, NA, NA, NA, NA, 4, 4)), class = "data.frame", row.names = c(NA,
-8L))

df <- df %>%
pivot_longer(cols = contains("Researcher"), names_to = "Researchers", values_to = "Result 1") %>%
pivot_longer(cols = contains("revisit"), names_to = "revisit", values_to = "Result 2", values_drop_na = TRUE) %>%
arrange(zone)

A tibble: 8 × 6

zone situation Researchers Result 1 revisit Result 2
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 Answer Researcher 1 10 revisit 1 2
2 1 Refusal Researcher 1 7 revisit 1 5
3 2 Answer Researcher 2 15 revisit 1 3
4 2 Refusal Researcher 2 3 revisit 1 0
5 3 Answer Researcher 3 4 revisit 2 10
6 3 Refusal Researcher 3 15 revisit 2 5
7 4 Answer Researcher 4 13 revisit 3 4
8 4 Refusal Researcher 4 8 revisit 3 4

You could also then do a bit more cleaning to get to this:

mutate(df, situation = as.factor(situation),
Researchers = as.numeric(str_remove(Researchers, "Researcher ")),
revisit = as.numeric(str_remove(revisit, "revisit ")))

A tibble: 8 × 6

zone situation Researchers Result 1 revisit Result 2
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 Answer 1 10 1 2
2 1 Refusal 1 7 1 5
3 2 Answer 2 15 1 3
4 2 Refusal 2 3 1 0
5 3 Answer 3 4 2 10
6 3 Refusal 3 15 2 5
7 4 Answer 4 13 3 4
8 4 Refusal 4 8 3 4


</details>
# 答案3
**得分**: 1
考虑再次使用基本的R `reshape`。由于*Researcher*和*revisit*列不是相关的列,只需使用通用方法两次`reshape`,然后进行最终的`merge`:
```r
id_cols = c("zone", "situation")
researcher_cols <- names(df)[grep("Researcher", names(df))]
revisit_cols <- names(df)[grep("revisit", names(df))]
special_reshape <- function(data, id_cols, focused_cols, drop_cols, var_name, value_name) {
reshape(
data,
idvar = id_cols,
varying = focused_cols,
times = focused_cols,
drop = drop_cols,
v.names = value_name,
timevar = var_name,
new.row.names = 1:1E5,
direction = "long"
) |> subset(
!is.na(get(value_name))
)
}
df_long <- merge(
special_reshape(df, id_cols, researcher_cols, revisit_cols, "researcher", "result 1"), 
special_reshape(df, id_cols, revisit_cols, researcher_cols, "revisit", "result 2"),
by = id_cols
)
df_long
#   zone situation   researcher result 1   revisit result 2
# 1    1    Answer Researcher 1       10 revisit 1        2
# 2    1   Refusal Researcher 1        7 revisit 1        5
# 3    2    Answer Researcher 2       15 revisit 1        3
# 4    2   Refusal Researcher 2        3 revisit 1        0
# 5    3    Answer Researcher 3        4 revisit 2       10
# 6    3   Refusal Researcher 3       15 revisit 2        5
# 7    4    Answer Researcher 4       13 revisit 3        4
# 8    4   Refusal Researcher 4        8 revisit 3        4
英文:

Consider again base R's reshape. Since Researcher and revisit columns are not related columns, simply reshape twice via a generalized method with final merge:

id_cols = c(&quot;zone&quot;, &quot;situation&quot;)
researcher_cols &lt;- names(df)[grep(&quot;Researcher&quot;, names(df))]
revisit_cols &lt;- names(df)[grep(&quot;revisit&quot;, names(df))]

special_reshape &lt;- function(data, id_cols, focused_cols, drop_cols, var_name, value_name) {
  reshape(
    data,
    idvar = id_cols,
    varying = focused_cols,
    times = focused_cols,
    drop = drop_cols,
    v.names = value_name,
    timevar = var_name,
    new.row.names = 1:1E5,
    direction = &quot;long&quot;
  ) |&gt; subset(
    !is.na(get(value_name))
  )
}

df_long &lt;- merge(
  special_reshape(df, id_cols, researcher_cols, revisit_cols, &quot;researcher&quot;, &quot;result 1&quot;), 
  special_reshape(df, id_cols, revisit_cols, researcher_cols, &quot;revisit&quot;, &quot;result 2&quot;),
  by = id_cols
)

df_long
#   zone situation   researcher result 1   revisit result 2
# 1    1    Answer Researcher 1       10 revisit 1        2
# 2    1   Refusal Researcher 1        7 revisit 1        5
# 3    2    Answer Researcher 2       15 revisit 1        3
# 4    2   Refusal Researcher 2        3 revisit 1        0
# 5    3    Answer Researcher 3        4 revisit 2       10
# 6    3   Refusal Researcher 3       15 revisit 2        5
# 7    4    Answer Researcher 4       13 revisit 3        4
# 8    4   Refusal Researcher 4        8 revisit 3        4


</details>



huangapple
  • 本文由 发表于 2023年7月23日 19:34:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/76748004.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定