英文:
Data wrangling problem with labelled sound files
问题
以下是您要翻译的内容:
"Let's say I have a large dataframe with a column for 'soundfile' and then 'start' and 'end' columns for when a particular bird is vocalising. Each vocalisation can vary significantly in length. An example of the dataframe is sound_df below. Each row in the sound_df represents one vocalisation - each sound file is the same length (300 seconds in the real data, 15 seconds in this example).
## setup example dataframe
id <- c("soundfile_1","soundfile_2","soundfile_3")
sound_df<-data.frame(rep(id, each = 2), c("0","8.0","3.3","11.7","4.6","13.1"), c("3.2","14.1","3.8","12.8","5.9","14.8"))
names(sound_df)[1] <- "soundfile"
names(sound_df)[2] <- "sound_start"
names(sound_df)[3] <- "sound_end"
print(sound_df)
soundfile sound_start sound_end
1 soundfile_1 0 3.2
2 soundfile_1 8.0 14.1
3 soundfile_2 3.3 3.8
4 soundfile_2 11.7 12.8
5 soundfile_3 4.6 5.9
6 soundfile_3 13.1 14.8
I then want to create a new dataframe for which each soundfile is divided into 3 second chunks as below, and the presence or absence of vocalisation in this period is extracted from sound_df and recorded in the column 'present'.
The results produced for sound_df should be as follows:
soundfile start end present
1 soundfile_1 0 3 yes
2 soundfile_1 3 6 yes
3 soundfile_1 6 9 yes
4 soundfile_1 9 12 yes
5 soundfile_1 12 15 yes
6 soundfile_2 0 3 no
7 soundfile_2 3 6 yes
8 soundfile_2 6 9 no
9 soundfile_2 9 12 yes
10 soundfile_2 12 15 yes
11 soundfile_3 0 3 no
12 soundfile_3 3 6 yes
13 soundfile_3 6 9 no
14 soundfile_3 9 12 no
15 soundfile_3 12 15 yes
"
英文:
Let's say I have a large dataframe with a column for 'soundfile' and then 'start and 'end' columns for when a particular bird is vocalising. Each vocalisation can vary significantly in length. An example of the dataframe is sound_df below. Each row in the sound_df represents one vocalisation - each sound file is the same length (300 seconds in the real data, 15 seconds in this example).
## setup example dataframe
id <- c("soundfile_1","soundfile_2","soundfile_3")
sound_df<-data.frame(rep(id, each = 2), c("0","8.0","3.3","11.7","4.6","13.1"), c("3.2","14.1","3.8","12.8","5.9","14.8"))
names(sound_df)[1] <- "soundfile"
names(sound_df)[2] <- "sound_start"
names(sound_df)[3] <- "sound_end"
print(sound_df)
soundfile sound_start sound_end
1 soundfile_1 0 3.2
2 soundfile_1 8.0 14.1
3 soundfile_2 3.3 3.8
4 soundfile_2 11.7 12.8
5 soundfile_3 4.6 5.9
6 soundfile_3 13.1 14.8
I then want to create a new dataframe for which each soundfile is divided into 3 second chunks as below, and the presence or absence of vocalisation in this period is extracted from sound_df and recorded in the column 'present'.
The results produced for sound_df should be as follows:
soundfile start end present
1 soundfile_1 0 3 yes
2 soundfile_1 3 6 yes
3 soundfile_1 6 9 yes
4 soundfile_1 9 12 yes
5 soundfile_1 12 15 yes
6 soundfile_2 0 3 no
7 soundfile_2 3 6 yes
8 soundfile_2 6 9 no
9 soundfile_2 9 12 yes
10 soundfile_2 12 15 yes
11 soundfile_3 0 3 no
12 soundfile_3 3 6 yes
13 soundfile_3 6 9 no
14 soundfile_3 9 12 no
15 soundfile_3 12 15 yes
答案1
得分: 1
以下是您要翻译的内容:
"Sounds like a case for a data.table
non-equi join.
Following the advice in this blog post by David Selby, I create some duplicate columns, because I too can never remember which ones are merged when you join:
library(data.table)
setDT(sound_df)
sound_df[, `:=`(
sound_start_for_join = as.numeric(sound_start),
sound_end_for_join = as.numeric(sound_end)
)]
Then we can simply create a data.table
of the required time periods (again with duplicate columns for the join):
CHUNK_SECONDS <- 3
FILE_SECONDS <- 15
time_windows <- CJ(
soundfile = sound_df$soundfile,
start = seq(from = 0, to = FILE_SECONDS - CHUNK_SECONDS, by = CHUNK_SECONDS),
unique = TRUE
)[, end := start + CHUNK_SECONDS][
,
`:=`(
start_for_join = start,
end_for_join = end
)
]
time_windows
# soundfile start end start_for_join end_for_join
# <char> <num> <num> <num> <num>
# 1: soundfile_1 0 3 0 3
# 2: soundfile_1 3 6 3 6
# 3: soundfile_1 6 9 6 9
# 4: soundfile_1 9 12 9 12
# 5: soundfile_1 12 15 12 15
# <etc>
Finally we join in those cases where the vocalization overlaps with the time period, and remove the extra columns:
sound_out <- sound_df[
time_windows,
on = .(
soundfile,
sound_start_for_join < end_for_join,
sound_end_for_join > start_for_join
)
][, .(
soundfile, start, end,
present = !is.na(sound_start)
)]
Which produces the desired output:
soundfile start end present
<char> <num> <num> <lgcl>
1: soundfile_1 0 3 TRUE
2: soundfile_1 3 6 TRUE
3: soundfile_1 6 9 TRUE
4: soundfile_1 9 12 TRUE
5: soundfile_1 12 15 TRUE
6: soundfile_2 0 3 FALSE
7: soundfile_2 3 6 TRUE
8: soundfile_2 6 9 FALSE
9: soundfile_2 9 12 TRUE
10: soundfile_2 12 15 TRUE
11: soundfile_3 0 3 FALSE
12: soundfile_3 3 6 TRUE
13: soundfile_3 6 9 FALSE
14: soundfile_3 9 12 FALSE
15: soundfile_3 12 15 TRUE
Note: I made the present
column a logical
vector because they're easier to work with. If you really want a character
vector of "yes"
and "no"
, you can change present = !is.na(sound_start)
to present = fifelse(is.na(sound_start), "yes", "no")
.
英文:
Sounds like a case for a data.table
non-equi join.
Following the advice in this blog post by David Selby, I create some duplicate columns, because I too can never remember which ones are merged when you join:
library(data.table)
setDT(sound_df)
sound_df[, `:=`(
sound_start_for_join = as.numeric(sound_start),
sound_end_for_join = as.numeric(sound_end)
)]
Then we can simply create a data.table
of the required time periods (again with duplicate columns for the join):
CHUNK_SECONDS <- 3
FILE_SECONDS <- 15
time_windows <- CJ(
soundfile = sound_df$soundfile,
start = seq(from = 0, to = FILE_SECONDS - CHUNK_SECONDS, by = CHUNK_SECONDS),
unique = TRUE
)[, end := start + CHUNK_SECONDS][
,
`:=`(
start_for_join = start,
end_for_join = end
)
]
time_windows
# soundfile start end start_for_join end_for_join
# <char> <num> <num> <num> <num>
# 1: soundfile_1 0 3 0 3
# 2: soundfile_1 3 6 3 6
# 3: soundfile_1 6 9 6 9
# 4: soundfile_1 9 12 9 12
# 5: soundfile_1 12 15 12 15
# <etc>
Finally we join in those cases where the vocalization overlaps with the time period, and remove the extra columns:
sound_out <- sound_df[
time_windows,
on = .(
soundfile,
sound_start_for_join < end_for_join,
sound_end_for_join > start_for_join
)
][, .(
soundfile, start, end,
present = !is.na(sound_start)
)]
Which produces the desired output:
soundfile start end present
<char> <num> <num> <lgcl>
1: soundfile_1 0 3 TRUE
2: soundfile_1 3 6 TRUE
3: soundfile_1 6 9 TRUE
4: soundfile_1 9 12 TRUE
5: soundfile_1 12 15 TRUE
6: soundfile_2 0 3 FALSE
7: soundfile_2 3 6 TRUE
8: soundfile_2 6 9 FALSE
9: soundfile_2 9 12 TRUE
10: soundfile_2 12 15 TRUE
11: soundfile_3 0 3 FALSE
12: soundfile_3 3 6 TRUE
13: soundfile_3 6 9 FALSE
14: soundfile_3 9 12 FALSE
15: soundfile_3 12 15 TRUE
Note: I made the present
column a logical
vector because they're easier to work with. If you really want a character
vector of "yes"
and "no"
you can change present = !is.na(sound_start)
to present = fifelse(is.na(sound_start), "yes", "no")
.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论