2023年7月23日 19:34:10go评论188阅读模式

英文:

How to reshape to long format, removing NAs cells?

问题

我有一个包含社会研究员和再访者的表格，这些地方在第一次尝试时没有得到服务，我需要将我的数据框重新整理成长格式，没有空白或NA值，这对我来说似乎很容易，但我在尝试中没有成功。

zone situation Researcher1 Researcher2 Researcher3 Researcher4 revisit1 revisit2 revisit3
1    3  Answered          NA          NA           4          NA       NA       10       NA
2    3   Refusal          NA          NA          15          NA       NA        5       NA
3    1  Answered          10          NA          NA          NA        2       NA       NA
4    1   Refusal           7          NA          NA          NA        5       NA       NA
5    2  Answered          NA          15          NA          NA        3       NA       NA
6    2   Refusal          NA           3          NA          NA        0       NA       NA
7    4  Answered          NA          NA          NA          13       NA       NA        4
8    4   Refusal          NA          NA          NA           8       NA       NA        4

我期望得到类似这样的结果：

注意： 上面的代码部分已被排除，只提供翻译的内容。

英文:

I have a table with social researchers and revisiters in places that didn't get service on the first try, I need to reshape my dataframe to long format without whitespace or NAs, that seemed easy to me, but I'm not getting success in my attempts.

  zone situation Researcher1 Researcher2 Researcher3 Researcher4 revisit1 revisit2 revisit3
1    3  Answered          NA          NA           4          NA       NA       10       NA
2    3   Refusal          NA          NA          15          NA       NA        5       NA
3    1  Answered          10          NA          NA          NA        2       NA       NA
4    1   Refusal           7          NA          NA          NA        5       NA       NA
5    2  Answered          NA          15          NA          NA        3       NA       NA
6    2   Refusal          NA           3          NA          NA        0       NA       NA
7    4  Answered          NA          NA          NA          13       NA       NA        4
8    4   Refusal          NA          NA          NA           8       NA       NA        4


long_rsch &lt;- reshape(survey, direction=&quot;long&quot;, timevar=&quot;Situation&quot;, 
                     idvar=&quot;zone&quot;, 
                     varying=list(c(&quot;Researcher 1&quot;, &quot;Researcher 2&quot;, 
                                    &quot;Researcher 3&quot;, &quot;Researcher 4&quot;), 
                                  c(&quot;revisit 1&quot;, &quot;revisit 2&quot;, &quot;revisit3&quot;)), 
                     v.names=c(&quot;first team&quot;, &quot;Revisit&quot;))

Gives me:

Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying,  : 
  &#39;varying&#39; arguments must be the same length

I expected something like this

答案1

得分: 4

reshape 实际上很简单，只需花点时间理解它是如何工作的。

reshape(survey, direction="long", varying=3:6, sep='', v.names='Result1', timevar='Researcher') |>
  reshape(direction="long", varying=3:5, sep='', idvar=c('id', 'Researcher'), 
          v.names=c('Result2'), timevar='Revisit') |>
  na.omit()

说明：

zone 不能用作 idvar，因为它不是唯一的。如果我们不指定 idvar，reshape 会自动处理一个唯一的 idvar（只要没有名为 "id" 的变量），新的 id 列将添加到输出中。

需要进行两次 reshape，出现错误是因为 Researchers 和 revisits 引用不同的 "times"（否则 Researcher1 会得到 revisit1，Researcher2 会得到 revisit2，依此类推），因此第一个 reshape 是在 Researcher* 上进行的；如果我们将 varying 指定为 Researcher* 列的范围，并使用 sep=''（与默认值 sep='.' 相反），reshape 可以正确地猜测名称，我们不需要手动输入它们。

相应地，第二个 reshape 是在 revisit* 上进行的，但现在使用 id 和 Researcher 作为 idvar。

最后，我们可以使用简单的 na.omit() 去除不需要的 NA 行。

英文:

reshape is actually straightforward if you take a second and understand how it works.

reshape(survey, direction=&quot;long&quot;, varying=3:6, sep=&#39;&#39;, v.names=&#39;Result1&#39;, timevar=&#39;Researcher&#39;) |&gt;
  reshape(direction=&quot;long&quot;, varying=3:5, sep=&#39;&#39;, idvar=c(&#39;id&#39;, &#39;Researcher&#39;), 
          v.names=c(&#39;Result2&#39;), timevar=&#39;Revisit&#39;) |&gt;
  na.omit()
#       zone situation Researcher Result1 id Revisit Result2
# 3.1.1    1  Answered          1      10  3       1       2
# 4.1.1    1   Refusal          1       7  4       1       5
# 5.2.1    2  Answered          2      15  5       1       3
# 6.2.1    2   Refusal          2       3  6       1       0
# 1.3.2    3  Answered          3       4  1       2      10
# 2.3.2    3   Refusal          3      15  2       2       5
# 7.4.3    4  Answered          4      13  7       3       4
# 8.4.3    4   Refusal          4       8  8       3       4

Explanation:

zone can't be used as idvar because it's not unique. reshape automatically takes care of a unique idvar if we don't specify one (and as long there is no variable called "id"), the new id column will be added in the output.

There are two reshapes needed, you get the error because Researchers and revisits refer to different "times" (else Researcher1 gets revisit1, Researcher2 gets revisit2, etc.), so the first reshape is on Researcher*; if we specify varying as a range of the Researcher* columns and use sep='' (in contrast to e.g. sep='.' which is the default), reshape can guess the names correctly, and we don't need to type them.

Correspondingly the second reshape is on revisit*, but now with id and Researcher as idvar.

Finally, we can omit the unwanted NA lines with a simple na.omit().

Data:

survey &lt;- structure(list(zone = c(3, 3, 1, 1, 2, 2, 4, 4), situation = c(&quot;Answered&quot;, 
&quot;Refusal&quot;, &quot;Answered&quot;, &quot;Refusal&quot;, &quot;Answered&quot;, &quot;Refusal&quot;, &quot;Answered&quot;, 
&quot;Refusal&quot;), Researcher1 = c(NA, NA, 10, 7, NA, NA, NA, NA), Researcher2 = c(NA, 
NA, NA, NA, 15, 3, NA, NA), Researcher3 = c(4, 15, NA, NA, NA, 
NA, NA, NA), Researcher4 = c(NA, NA, NA, NA, NA, NA, 13, 8), 
    revisit1 = c(NA, NA, 2, 5, 3, 0, NA, NA), revisit2 = c(10, 
    5, NA, NA, NA, NA, NA, NA), revisit3 = c(NA, NA, NA, NA, 
    NA, NA, 4, 4)), class = &quot;data.frame&quot;, row.names = c(NA, -8L
))

答案2

得分: 2

library(tidyverse)

df <- structure(list(zone = c(3, 3, 1, 1, 2, 2, 4, 4), situation = c("Answer", "Refusal", "Answer", "Refusal", "Answer", "Refusal", "Answer", "Refusal"), `Researcher 1` = c(NA, NA, 10, 7, NA, NA, NA, NA), `Researcher 2` = c(NA, NA, NA, NA, 15, 3, NA, NA), `Researcher 3` = c(4, 15, NA, NA, NA, NA, NA, NA), `Researcher 4` = c(NA, NA, NA, NA, NA, NA, 13, 8), `revisit 1` = c(NA, NA, 2, 5, 3, 0, NA, NA), `revisit 2` = c(10, 5, NA, NA, NA, NA, NA, NA), `revisit 3` = c(NA, NA, NA, NA, NA, NA, 4, 4)), class = "data.frame", row.names = c(NA, -8L))

df <- df %>%
  pivot_longer(cols = contains("Researcher"), names_to = "Researchers", values_to = "Result 1") %>%
  pivot_longer(cols = contains("revisit"), names_to = "revisit", values_to = "Result 2", values_drop_na = TRUE) %>%
  arrange(zone)

# A tibble: 8 × 6
   zone situation Researchers  `Result 1` revisit   `Result 2`
  <dbl> <chr>     <chr>             <dbl> <chr>          <dbl>
1     1 Answer    Researcher 1         10 revisit 1          2
2     1 Refusal   Researcher 1          7 revisit 1          5
3     2 Answer    Researcher 2         15 revisit 1          3
4     2 Refusal   Researcher 2          3 revisit 1          0
5     3 Answer    Researcher 3          4 revisit 2         10
6     3 Refusal   Researcher 3         15 revisit 2          5
7     4 Answer    Researcher 4         13 revisit 3          4
8     4 Refusal   Researcher 4          8 revisit 3          4

You could also then do a bit more cleaning to get to this:

mutate(df, situation = as.factor(situation), 
           Researchers = as.numeric(str_remove(Researchers, "Researcher ")),
              revisit = as.numeric(str_remove(revisit, "revisit ")))

# A tibble: 8 × 6
   zone situation Researchers `Result 1` revisit `Result 2`
  <dbl> <fct>           <dbl>      <dbl>   <dbl>      <dbl>
1     1 Answer              1         10       1          2
2     1 Refusal             1          7       1          5
3     2 Answer              2         15       1          3
4     2 Refusal             2          3       1          0
5     3 Answer              3          4       2         10
6     3 Refusal             3         15       2          5
7     4 Answer              4         13       3          4
8     4 Refusal             4          8       3          4


<details>
<summary>英文:</summary>

library(tidyverse)

df <- structure(list(zone = c(3, 3, 1, 1, 2, 2, 4, 4), situation = c("Answer",
"Refusal", "Answer", "Refusal", "Answer", "Refusal", "Answer",
"Refusal"), Researcher 1 = c(NA, NA, 10, 7, NA, NA, NA, NA),
Researcher 2 = c(NA, NA, NA, NA, 15, 3, NA, NA), Researcher 3 = c(4,
15, NA, NA, NA, NA, NA, NA), Researcher 4 = c(NA, NA, NA,
NA, NA, NA, 13, 8), revisit 1 = c(NA, NA, 2, 5, 3, 0, NA,
NA), revisit 2 = c(10, 5, NA, NA, NA, NA, NA, NA), revisit 3 = c(NA,
NA, NA, NA, NA, NA, 4, 4)), class = "data.frame", row.names = c(NA,
-8L))

df <- df %>%
pivot_longer(cols = contains("Researcher"), names_to = "Researchers", values_to = "Result 1") %>%
pivot_longer(cols = contains("revisit"), names_to = "revisit", values_to = "Result 2", values_drop_na = TRUE) %>%
arrange(zone)

A tibble: 8 × 6

zone situation Researchers Result 1 revisit Result 2
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 Answer Researcher 1 10 revisit 1 2
2 1 Refusal Researcher 1 7 revisit 1 5
3 2 Answer Researcher 2 15 revisit 1 3
4 2 Refusal Researcher 2 3 revisit 1 0
5 3 Answer Researcher 3 4 revisit 2 10
6 3 Refusal Researcher 3 15 revisit 2 5
7 4 Answer Researcher 4 13 revisit 3 4
8 4 Refusal Researcher 4 8 revisit 3 4

You could also then do a bit more cleaning to get to this:

mutate(df, situation = as.factor(situation),
Researchers = as.numeric(str_remove(Researchers, "Researcher ")),
revisit = as.numeric(str_remove(revisit, "revisit ")))

A tibble: 8 × 6

zone situation Researchers Result 1 revisit Result 2
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 Answer 1 10 1 2
2 1 Refusal 1 7 1 5
3 2 Answer 2 15 1 3
4 2 Refusal 2 3 1 0
5 3 Answer 3 4 2 10
6 3 Refusal 3 15 2 5
7 4 Answer 4 13 3 4
8 4 Refusal 4 8 3 4


</details>



# 答案3
**得分**: 1

考虑再次使用基本的R `reshape`。由于*Researcher*和*revisit*列不是相关的列，只需使用通用方法两次`reshape`，然后进行最终的`merge`：

```r
id_cols = c("zone", "situation")
researcher_cols <- names(df)[grep("Researcher", names(df))]
revisit_cols <- names(df)[grep("revisit", names(df))]

special_reshape <- function(data, id_cols, focused_cols, drop_cols, var_name, value_name) {
  reshape(
    data,
    idvar = id_cols,
    varying = focused_cols,
    times = focused_cols,
    drop = drop_cols,
    v.names = value_name,
    timevar = var_name,
    new.row.names = 1:1E5,
    direction = "long"
  ) |> subset(
    !is.na(get(value_name))
  )
}

df_long <- merge(
  special_reshape(df, id_cols, researcher_cols, revisit_cols, "researcher", "result 1"), 
  special_reshape(df, id_cols, revisit_cols, researcher_cols, "revisit", "result 2"),
  by = id_cols
)

df_long
#   zone situation   researcher result 1   revisit result 2
# 1    1    Answer Researcher 1       10 revisit 1        2
# 2    1   Refusal Researcher 1        7 revisit 1        5
# 3    2    Answer Researcher 2       15 revisit 1        3
# 4    2   Refusal Researcher 2        3 revisit 1        0
# 5    3    Answer Researcher 3        4 revisit 2       10
# 6    3   Refusal Researcher 3       15 revisit 2        5
# 7    4    Answer Researcher 4       13 revisit 3        4
# 8    4   Refusal Researcher 4        8 revisit 3        4

英文:

Consider again base R's reshape. Since Researcher and revisit columns are not related columns, simply reshape twice via a generalized method with final merge:

id_cols = c(&quot;zone&quot;, &quot;situation&quot;)
researcher_cols &lt;- names(df)[grep(&quot;Researcher&quot;, names(df))]
revisit_cols &lt;- names(df)[grep(&quot;revisit&quot;, names(df))]

special_reshape &lt;- function(data, id_cols, focused_cols, drop_cols, var_name, value_name) {
  reshape(
    data,
    idvar = id_cols,
    varying = focused_cols,
    times = focused_cols,
    drop = drop_cols,
    v.names = value_name,
    timevar = var_name,
    new.row.names = 1:1E5,
    direction = &quot;long&quot;
  ) |&gt; subset(
    !is.na(get(value_name))
  )
}

df_long &lt;- merge(
  special_reshape(df, id_cols, researcher_cols, revisit_cols, &quot;researcher&quot;, &quot;result 1&quot;), 
  special_reshape(df, id_cols, revisit_cols, researcher_cols, &quot;revisit&quot;, &quot;result 2&quot;),
  by = id_cols
)

df_long
#   zone situation   researcher result 1   revisit result 2
# 1    1    Answer Researcher 1       10 revisit 1        2
# 2    1   Refusal Researcher 1        7 revisit 1        5
# 3    2    Answer Researcher 2       15 revisit 1        3
# 4    2   Refusal Researcher 2        3 revisit 1        0
# 5    3    Answer Researcher 3        4 revisit 2       10
# 6    3   Refusal Researcher 3       15 revisit 2        5
# 7    4    Answer Researcher 4       13 revisit 3        4
# 8    4   Refusal Researcher 4        8 revisit 3        4


</details>

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

如何将数据重塑为长格式，同时删除NA值？

问题

答案1

答案2

A tibble: 8 × 6

A tibble: 8 × 6

R根据另一列中的值在多列之间进行更改。

使用R Shiny中的反应式’value’来更新自身

根据参考表填写数值。

Environmental problems while predicting from gaulss-gams with a custom variance function inside a package

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论