数据整理问题,带有标记的声音文件

huangapple go评论58阅读模式
英文:

Data wrangling problem with labelled sound files

问题

以下是您要翻译的内容:

"Let's say I have a large dataframe with a column for 'soundfile' and then 'start' and 'end' columns for when a particular bird is vocalising. Each vocalisation can vary significantly in length. An example of the dataframe is sound_df below. Each row in the sound_df represents one vocalisation - each sound file is the same length (300 seconds in the real data, 15 seconds in this example).

## setup example dataframe
id <- c("soundfile_1","soundfile_2","soundfile_3")
sound_df<-data.frame(rep(id, each = 2), c("0","8.0","3.3","11.7","4.6","13.1"), c("3.2","14.1","3.8","12.8","5.9","14.8"))
names(sound_df)[1] <- "soundfile"
names(sound_df)[2] <- "sound_start" 
names(sound_df)[3] <- "sound_end"
print(sound_df)

    soundfile sound_start sound_end
1 soundfile_1           0       3.2
2 soundfile_1         8.0      14.1
3 soundfile_2         3.3       3.8
4 soundfile_2        11.7      12.8
5 soundfile_3         4.6       5.9
6 soundfile_3        13.1      14.8

I then want to create a new dataframe for which each soundfile is divided into 3 second chunks as below, and the presence or absence of vocalisation in this period is extracted from sound_df and recorded in the column 'present'.

The results produced for sound_df should be as follows:

     soundfile start end present
1  soundfile_1     0   3     yes
2  soundfile_1     3   6     yes
3  soundfile_1     6   9     yes
4  soundfile_1     9  12     yes
5  soundfile_1    12  15     yes
6  soundfile_2     0   3      no
7  soundfile_2     3   6     yes
8  soundfile_2     6   9      no
9  soundfile_2     9  12     yes
10 soundfile_2    12  15     yes
11 soundfile_3     0   3      no
12 soundfile_3     3   6     yes
13 soundfile_3     6   9      no
14 soundfile_3     9  12      no
15 soundfile_3    12  15     yes

"

英文:

Let's say I have a large dataframe with a column for 'soundfile' and then 'start and 'end' columns for when a particular bird is vocalising. Each vocalisation can vary significantly in length. An example of the dataframe is sound_df below. Each row in the sound_df represents one vocalisation - each sound file is the same length (300 seconds in the real data, 15 seconds in this example).

## setup example dataframe
id &lt;- c(&quot;soundfile_1&quot;,&quot;soundfile_2&quot;,&quot;soundfile_3&quot;)
sound_df&lt;-data.frame(rep(id, each = 2), c(&quot;0&quot;,&quot;8.0&quot;,&quot;3.3&quot;,&quot;11.7&quot;,&quot;4.6&quot;,&quot;13.1&quot;), c(&quot;3.2&quot;,&quot;14.1&quot;,&quot;3.8&quot;,&quot;12.8&quot;,&quot;5.9&quot;,&quot;14.8&quot;))
names(sound_df)[1] &lt;- &quot;soundfile&quot;
names(sound_df)[2] &lt;- &quot;sound_start&quot; 
names(sound_df)[3] &lt;- &quot;sound_end&quot;
print(sound_df)

    soundfile sound_start sound_end
1 soundfile_1           0       3.2
2 soundfile_1         8.0      14.1
3 soundfile_2         3.3       3.8
4 soundfile_2        11.7      12.8
5 soundfile_3         4.6       5.9
6 soundfile_3        13.1      14.8

I then want to create a new dataframe for which each soundfile is divided into 3 second chunks as below, and the presence or absence of vocalisation in this period is extracted from sound_df and recorded in the column 'present'.

The results produced for sound_df should be as follows:

     soundfile start end present
1  soundfile_1     0   3     yes
2  soundfile_1     3   6     yes
3  soundfile_1     6   9     yes
4  soundfile_1     9  12     yes
5  soundfile_1    12  15     yes
6  soundfile_2     0   3      no
7  soundfile_2     3   6     yes
8  soundfile_2     6   9      no
9  soundfile_2     9  12     yes
10 soundfile_2    12  15     yes
11 soundfile_3     0   3      no
12 soundfile_3     3   6     yes
13 soundfile_3     6   9      no
14 soundfile_3     9  12      no
15 soundfile_3    12  15     yes

答案1

得分: 1

以下是您要翻译的内容:

"Sounds like a case for a data.table non-equi join.

Following the advice in this blog post by David Selby, I create some duplicate columns, because I too can never remember which ones are merged when you join:

library(data.table)
setDT(sound_df)
sound_df[, `:=`(
    sound_start_for_join = as.numeric(sound_start),
    sound_end_for_join = as.numeric(sound_end)
)]

Then we can simply create a data.table of the required time periods (again with duplicate columns for the join):

CHUNK_SECONDS &lt;- 3
FILE_SECONDS &lt;- 15
time_windows &lt;- CJ(
    soundfile = sound_df$soundfile,
    start = seq(from = 0, to = FILE_SECONDS - CHUNK_SECONDS, by = CHUNK_SECONDS),
    unique = TRUE
)[, end := start + CHUNK_SECONDS][
    ,
    `:=`(
        start_for_join = start,
        end_for_join = end
    )
]

time_windows

#       soundfile start   end start_for_join end_for_join
#          &lt;char&gt; &lt;num&gt; &lt;num&gt;          &lt;num&gt;        &lt;num&gt;
#  1: soundfile_1     0     3              0            3
#  2: soundfile_1     3     6              3            6
#  3: soundfile_1     6     9              6            9
#  4: soundfile_1     9    12              9           12
#  5: soundfile_1    12    15             12           15
# &lt;etc&gt;

Finally we join in those cases where the vocalization overlaps with the time period, and remove the extra columns:

sound_out &lt;- sound_df[
    time_windows,
    on = .(
        soundfile,
        sound_start_for_join &lt; end_for_join,
        sound_end_for_join &gt; start_for_join
    )
][, .(
    soundfile, start, end,
    present = !is.na(sound_start)
)]

Which produces the desired output:

      soundfile start   end present
         &lt;char&gt; &lt;num&gt; &lt;num&gt;  &lt;lgcl&gt;
 1: soundfile_1     0     3    TRUE
 2: soundfile_1     3     6    TRUE
 3: soundfile_1     6     9    TRUE
 4: soundfile_1     9    12    TRUE
 5: soundfile_1    12    15    TRUE
 6: soundfile_2     0     3   FALSE
 7: soundfile_2     3     6    TRUE
 8: soundfile_2     6     9   FALSE
 9: soundfile_2     9    12    TRUE
10: soundfile_2    12    15    TRUE
11: soundfile_3     0     3   FALSE
12: soundfile_3     3     6    TRUE
13: soundfile_3     6     9   FALSE
14: soundfile_3     9    12   FALSE
15: soundfile_3    12    15    TRUE

Note: I made the present column a logical vector because they're easier to work with. If you really want a character vector of "yes" and "no", you can change present = !is.na(sound_start) to present = fifelse(is.na(sound_start), "yes", "no").

英文:

Sounds like a case for a data.table non-equi join.

Following the advice in this blog post by David Selby, I create some duplicate columns, because I too can never remember which ones are merged when you join:

library(data.table)
setDT(sound_df)
sound_df[, `:=`(
    sound_start_for_join = as.numeric(sound_start),
    sound_end_for_join = as.numeric(sound_end)
)]

Then we can simply create a data.table of the required time periods (again with duplicate columns for the join):

CHUNK_SECONDS &lt;- 3
FILE_SECONDS &lt;- 15
time_windows &lt;- CJ(
    soundfile = sound_df$soundfile,
    start = seq(from = 0, to = FILE_SECONDS - CHUNK_SECONDS, by = CHUNK_SECONDS),
    unique = TRUE
)[, end := start + CHUNK_SECONDS][
    ,
    `:=`(
        start_for_join = start,
        end_for_join = end
    )
]

time_windows

#       soundfile start   end start_for_join end_for_join
#          &lt;char&gt; &lt;num&gt; &lt;num&gt;          &lt;num&gt;        &lt;num&gt;
#  1: soundfile_1     0     3              0            3
#  2: soundfile_1     3     6              3            6
#  3: soundfile_1     6     9              6            9
#  4: soundfile_1     9    12              9           12
#  5: soundfile_1    12    15             12           15
# &lt;etc&gt;

Finally we join in those cases where the vocalization overlaps with the time period, and remove the extra columns:

sound_out &lt;- sound_df[
    time_windows,
    on = .(
        soundfile,
        sound_start_for_join &lt; end_for_join,
        sound_end_for_join &gt; start_for_join
    )
][, .(
    soundfile, start, end,
    present = !is.na(sound_start)
)]

Which produces the desired output:

      soundfile start   end present
         &lt;char&gt; &lt;num&gt; &lt;num&gt;  &lt;lgcl&gt;
 1: soundfile_1     0     3    TRUE
 2: soundfile_1     3     6    TRUE
 3: soundfile_1     6     9    TRUE
 4: soundfile_1     9    12    TRUE
 5: soundfile_1    12    15    TRUE
 6: soundfile_2     0     3   FALSE
 7: soundfile_2     3     6    TRUE
 8: soundfile_2     6     9   FALSE
 9: soundfile_2     9    12    TRUE
10: soundfile_2    12    15    TRUE
11: soundfile_3     0     3   FALSE
12: soundfile_3     3     6    TRUE
13: soundfile_3     6     9   FALSE
14: soundfile_3     9    12   FALSE
15: soundfile_3    12    15    TRUE

Note: I made the present column a logical vector because they're easier to work with. If you really want a character vector of &quot;yes&quot; and &quot;no&quot; you can change present = !is.na(sound_start) to present = fifelse(is.na(sound_start), &quot;yes&quot;, &quot;no&quot;).

huangapple
  • 本文由 发表于 2023年5月29日 17:35:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/76356207.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定