英文:
How to reshape to long format, removing NAs cells?
问题
我有一个包含社会研究员和再访者的表格,这些地方在第一次尝试时没有得到服务,我需要将我的数据框重新整理成长格式,没有空白或NA值,这对我来说似乎很容易,但我在尝试中没有成功。
zone situation Researcher1 Researcher2 Researcher3 Researcher4 revisit1 revisit2 revisit3
1 3 Answered NA NA 4 NA NA 10 NA
2 3 Refusal NA NA 15 NA NA 5 NA
3 1 Answered 10 NA NA NA 2 NA NA
4 1 Refusal 7 NA NA NA 5 NA NA
5 2 Answered NA 15 NA NA 3 NA NA
6 2 Refusal NA 3 NA NA 0 NA NA
7 4 Answered NA NA NA 13 NA NA 4
8 4 Refusal NA NA NA 8 NA NA 4
我期望得到类似这样的结果:
注意: 上面的代码部分已被排除,只提供翻译的内容。
英文:
I have a table with social researchers and revisiters in places that didn't get service on the first try, I need to reshape my dataframe to long format without whitespace or NAs, that seemed easy to me, but I'm not getting success in my attempts.
zone situation Researcher1 Researcher2 Researcher3 Researcher4 revisit1 revisit2 revisit3
1 3 Answered NA NA 4 NA NA 10 NA
2 3 Refusal NA NA 15 NA NA 5 NA
3 1 Answered 10 NA NA NA 2 NA NA
4 1 Refusal 7 NA NA NA 5 NA NA
5 2 Answered NA 15 NA NA 3 NA NA
6 2 Refusal NA 3 NA NA 0 NA NA
7 4 Answered NA NA NA 13 NA NA 4
8 4 Refusal NA NA NA 8 NA NA 4
long_rsch <- reshape(survey, direction="long", timevar="Situation",
idvar="zone",
varying=list(c("Researcher 1", "Researcher 2",
"Researcher 3", "Researcher 4"),
c("revisit 1", "revisit 2", "revisit3")),
v.names=c("first team", "Revisit"))
Gives me:
<!-- language: lang-none -->
Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, :
'varying' arguments must be the same length
I expected something like this
答案1
得分: 4
reshape
实际上很简单,只需花点时间理解它是如何工作的。
reshape(survey, direction="long", varying=3:6, sep='', v.names='Result1', timevar='Researcher') |>
reshape(direction="long", varying=3:5, sep='', idvar=c('id', 'Researcher'),
v.names=c('Result2'), timevar='Revisit') |>
na.omit()
说明:
zone 不能用作 idvar
,因为它不是唯一的。如果我们不指定 idvar
,reshape
会自动处理一个唯一的 idvar
(只要没有名为 "id"
的变量),新的 id 列将添加到输出中。
需要进行两次 reshape
,出现错误是因为 Researchers 和 revisits 引用不同的 "times"(否则 Researcher1 会得到 revisit1,Researcher2 会得到 revisit2,依此类推),因此第一个 reshape
是在 Researcher*
上进行的;如果我们将 varying
指定为 Researcher*
列的范围,并使用 sep=''
(与默认值 sep='.'
相反),reshape
可以正确地猜测名称,我们不需要手动输入它们。
相应地,第二个 reshape
是在 revisit*
上进行的,但现在使用 id 和 Researcher 作为 idvar
。
最后,我们可以使用简单的 na.omit()
去除不需要的 NA 行。
英文:
reshape
is actually straightforward if you take a second and understand how it works.
reshape(survey, direction="long", varying=3:6, sep='', v.names='Result1', timevar='Researcher') |>
reshape(direction="long", varying=3:5, sep='', idvar=c('id', 'Researcher'),
v.names=c('Result2'), timevar='Revisit') |>
na.omit()
# zone situation Researcher Result1 id Revisit Result2
# 3.1.1 1 Answered 1 10 3 1 2
# 4.1.1 1 Refusal 1 7 4 1 5
# 5.2.1 2 Answered 2 15 5 1 3
# 6.2.1 2 Refusal 2 3 6 1 0
# 1.3.2 3 Answered 3 4 1 2 10
# 2.3.2 3 Refusal 3 15 2 2 5
# 7.4.3 4 Answered 4 13 7 3 4
# 8.4.3 4 Refusal 4 8 8 3 4
Explanation:
zone can't be used as idvar
because it's not unique. reshape
automatically takes care of a unique idvar
if we don't specify one (and as long there is no variable called "id"
), the new id column will be added in the output.
There are two reshape
s needed, you get the error because Researchers and revisits refer to different "times" (else Researcher1 gets revisit1, Researcher2 gets revisit2, etc.), so the first reshape
is on Researcher*
; if we specify varying
as a range of the Researcher*
columns and use sep=''
(in contrast to e.g. sep='.'
which is the default), reshape
can guess the names correctly, and we don't need to type them.
Correspondingly the second reshape
is on revisit*
, but now with id and Researcher as idvar
.
Finally, we can omit the unwanted NA lines with a simple na.omit()
.
Data:
survey <- structure(list(zone = c(3, 3, 1, 1, 2, 2, 4, 4), situation = c("Answered",
"Refusal", "Answered", "Refusal", "Answered", "Refusal", "Answered",
"Refusal"), Researcher1 = c(NA, NA, 10, 7, NA, NA, NA, NA), Researcher2 = c(NA,
NA, NA, NA, 15, 3, NA, NA), Researcher3 = c(4, 15, NA, NA, NA,
NA, NA, NA), Researcher4 = c(NA, NA, NA, NA, NA, NA, 13, 8),
revisit1 = c(NA, NA, 2, 5, 3, 0, NA, NA), revisit2 = c(10,
5, NA, NA, NA, NA, NA, NA), revisit3 = c(NA, NA, NA, NA,
NA, NA, 4, 4)), class = "data.frame", row.names = c(NA, -8L
))
答案2
得分: 2
library(tidyverse)
df <- structure(list(zone = c(3, 3, 1, 1, 2, 2, 4, 4), situation = c("Answer", "Refusal", "Answer", "Refusal", "Answer", "Refusal", "Answer", "Refusal"), `Researcher 1` = c(NA, NA, 10, 7, NA, NA, NA, NA), `Researcher 2` = c(NA, NA, NA, NA, 15, 3, NA, NA), `Researcher 3` = c(4, 15, NA, NA, NA, NA, NA, NA), `Researcher 4` = c(NA, NA, NA, NA, NA, NA, 13, 8), `revisit 1` = c(NA, NA, 2, 5, 3, 0, NA, NA), `revisit 2` = c(10, 5, NA, NA, NA, NA, NA, NA), `revisit 3` = c(NA, NA, NA, NA, NA, NA, 4, 4)), class = "data.frame", row.names = c(NA, -8L))
df <- df %>%
pivot_longer(cols = contains("Researcher"), names_to = "Researchers", values_to = "Result 1") %>%
pivot_longer(cols = contains("revisit"), names_to = "revisit", values_to = "Result 2", values_drop_na = TRUE) %>%
arrange(zone)
# A tibble: 8 × 6
zone situation Researchers `Result 1` revisit `Result 2`
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 Answer Researcher 1 10 revisit 1 2
2 1 Refusal Researcher 1 7 revisit 1 5
3 2 Answer Researcher 2 15 revisit 1 3
4 2 Refusal Researcher 2 3 revisit 1 0
5 3 Answer Researcher 3 4 revisit 2 10
6 3 Refusal Researcher 3 15 revisit 2 5
7 4 Answer Researcher 4 13 revisit 3 4
8 4 Refusal Researcher 4 8 revisit 3 4
You could also then do a bit more cleaning to get to this:
mutate(df, situation = as.factor(situation),
Researchers = as.numeric(str_remove(Researchers, "Researcher ")),
revisit = as.numeric(str_remove(revisit, "revisit ")))
# A tibble: 8 × 6
zone situation Researchers `Result 1` revisit `Result 2`
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 Answer 1 10 1 2
2 1 Refusal 1 7 1 5
3 2 Answer 2 15 1 3
4 2 Refusal 2 3 1 0
5 3 Answer 3 4 2 10
6 3 Refusal 3 15 2 5
7 4 Answer 4 13 3 4
8 4 Refusal 4 8 3 4
<details>
<summary>英文:</summary>
library(tidyverse)
df <- structure(list(zone = c(3, 3, 1, 1, 2, 2, 4, 4), situation = c("Answer",
"Refusal", "Answer", "Refusal", "Answer", "Refusal", "Answer",
"Refusal"), Researcher 1
= c(NA, NA, 10, 7, NA, NA, NA, NA),
Researcher 2
= c(NA, NA, NA, NA, 15, 3, NA, NA), Researcher 3
= c(4,
15, NA, NA, NA, NA, NA, NA), Researcher 4
= c(NA, NA, NA,
NA, NA, NA, 13, 8), revisit 1
= c(NA, NA, 2, 5, 3, 0, NA,
NA), revisit 2
= c(10, 5, NA, NA, NA, NA, NA, NA), revisit 3
= c(NA,
NA, NA, NA, NA, NA, 4, 4)), class = "data.frame", row.names = c(NA,
-8L))
df <- df %>%
pivot_longer(cols = contains("Researcher"), names_to = "Researchers", values_to = "Result 1") %>%
pivot_longer(cols = contains("revisit"), names_to = "revisit", values_to = "Result 2", values_drop_na = TRUE) %>%
arrange(zone)
A tibble: 8 × 6
zone situation Researchers Result 1
revisit Result 2
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 Answer Researcher 1 10 revisit 1 2
2 1 Refusal Researcher 1 7 revisit 1 5
3 2 Answer Researcher 2 15 revisit 1 3
4 2 Refusal Researcher 2 3 revisit 1 0
5 3 Answer Researcher 3 4 revisit 2 10
6 3 Refusal Researcher 3 15 revisit 2 5
7 4 Answer Researcher 4 13 revisit 3 4
8 4 Refusal Researcher 4 8 revisit 3 4
You could also then do a bit more cleaning to get to this:
mutate(df, situation = as.factor(situation),
Researchers = as.numeric(str_remove(Researchers, "Researcher ")),
revisit = as.numeric(str_remove(revisit, "revisit ")))
A tibble: 8 × 6
zone situation Researchers Result 1
revisit Result 2
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 Answer 1 10 1 2
2 1 Refusal 1 7 1 5
3 2 Answer 2 15 1 3
4 2 Refusal 2 3 1 0
5 3 Answer 3 4 2 10
6 3 Refusal 3 15 2 5
7 4 Answer 4 13 3 4
8 4 Refusal 4 8 3 4
</details>
# 答案3
**得分**: 1
考虑再次使用基本的R `reshape`。由于*Researcher*和*revisit*列不是相关的列,只需使用通用方法两次`reshape`,然后进行最终的`merge`:
```r
id_cols = c("zone", "situation")
researcher_cols <- names(df)[grep("Researcher", names(df))]
revisit_cols <- names(df)[grep("revisit", names(df))]
special_reshape <- function(data, id_cols, focused_cols, drop_cols, var_name, value_name) {
reshape(
data,
idvar = id_cols,
varying = focused_cols,
times = focused_cols,
drop = drop_cols,
v.names = value_name,
timevar = var_name,
new.row.names = 1:1E5,
direction = "long"
) |> subset(
!is.na(get(value_name))
)
}
df_long <- merge(
special_reshape(df, id_cols, researcher_cols, revisit_cols, "researcher", "result 1"),
special_reshape(df, id_cols, revisit_cols, researcher_cols, "revisit", "result 2"),
by = id_cols
)
df_long
# zone situation researcher result 1 revisit result 2
# 1 1 Answer Researcher 1 10 revisit 1 2
# 2 1 Refusal Researcher 1 7 revisit 1 5
# 3 2 Answer Researcher 2 15 revisit 1 3
# 4 2 Refusal Researcher 2 3 revisit 1 0
# 5 3 Answer Researcher 3 4 revisit 2 10
# 6 3 Refusal Researcher 3 15 revisit 2 5
# 7 4 Answer Researcher 4 13 revisit 3 4
# 8 4 Refusal Researcher 4 8 revisit 3 4
英文:
Consider again base R's reshape
. Since Researcher and revisit columns are not related columns, simply reshape
twice via a generalized method with final merge
:
id_cols = c("zone", "situation")
researcher_cols <- names(df)[grep("Researcher", names(df))]
revisit_cols <- names(df)[grep("revisit", names(df))]
special_reshape <- function(data, id_cols, focused_cols, drop_cols, var_name, value_name) {
reshape(
data,
idvar = id_cols,
varying = focused_cols,
times = focused_cols,
drop = drop_cols,
v.names = value_name,
timevar = var_name,
new.row.names = 1:1E5,
direction = "long"
) |> subset(
!is.na(get(value_name))
)
}
df_long <- merge(
special_reshape(df, id_cols, researcher_cols, revisit_cols, "researcher", "result 1"),
special_reshape(df, id_cols, revisit_cols, researcher_cols, "revisit", "result 2"),
by = id_cols
)
df_long
# zone situation researcher result 1 revisit result 2
# 1 1 Answer Researcher 1 10 revisit 1 2
# 2 1 Refusal Researcher 1 7 revisit 1 5
# 3 2 Answer Researcher 2 15 revisit 1 3
# 4 2 Refusal Researcher 2 3 revisit 1 0
# 5 3 Answer Researcher 3 4 revisit 2 10
# 6 3 Refusal Researcher 3 15 revisit 2 5
# 7 4 Answer Researcher 4 13 revisit 3 4
# 8 4 Refusal Researcher 4 8 revisit 3 4
</details>
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论