2023年5月29日 17:35:45go评论98阅读模式

英文:

Data wrangling problem with labelled sound files

问题

以下是您要翻译的内容：

"Let's say I have a large dataframe with a column for 'soundfile' and then 'start' and 'end' columns for when a particular bird is vocalising. Each vocalisation can vary significantly in length. An example of the dataframe is sound_df below. Each row in the sound_df represents one vocalisation - each sound file is the same length (300 seconds in the real data, 15 seconds in this example).

## setup example dataframe
id <- c("soundfile_1","soundfile_2","soundfile_3")
sound_df<-data.frame(rep(id, each = 2), c("0","8.0","3.3","11.7","4.6","13.1"), c("3.2","14.1","3.8","12.8","5.9","14.8"))
names(sound_df)[1] <- "soundfile"
names(sound_df)[2] <- "sound_start" 
names(sound_df)[3] <- "sound_end"
print(sound_df)
    soundfile sound_start sound_end
1 soundfile_1           0       3.2
2 soundfile_1         8.0      14.1
3 soundfile_2         3.3       3.8
4 soundfile_2        11.7      12.8
5 soundfile_3         4.6       5.9
6 soundfile_3        13.1      14.8

I then want to create a new dataframe for which each soundfile is divided into 3 second chunks as below, and the presence or absence of vocalisation in this period is extracted from sound_df and recorded in the column 'present'.

The results produced for sound_df should be as follows:

     soundfile start end present
1  soundfile_1     0   3     yes
2  soundfile_1     3   6     yes
3  soundfile_1     6   9     yes
4  soundfile_1     9  12     yes
5  soundfile_1    12  15     yes
6  soundfile_2     0   3      no
7  soundfile_2     3   6     yes
8  soundfile_2     6   9      no
9  soundfile_2     9  12     yes
10 soundfile_2    12  15     yes
11 soundfile_3     0   3      no
12 soundfile_3     3   6     yes
13 soundfile_3     6   9      no
14 soundfile_3     9  12      no
15 soundfile_3    12  15     yes

英文:

Let's say I have a large dataframe with a column for 'soundfile' and then 'start and 'end' columns for when a particular bird is vocalising. Each vocalisation can vary significantly in length. An example of the dataframe is sound_df below. Each row in the sound_df represents one vocalisation - each sound file is the same length (300 seconds in the real data, 15 seconds in this example).

## setup example dataframe
id &lt;- c(&quot;soundfile_1&quot;,&quot;soundfile_2&quot;,&quot;soundfile_3&quot;)
sound_df&lt;-data.frame(rep(id, each = 2), c(&quot;0&quot;,&quot;8.0&quot;,&quot;3.3&quot;,&quot;11.7&quot;,&quot;4.6&quot;,&quot;13.1&quot;), c(&quot;3.2&quot;,&quot;14.1&quot;,&quot;3.8&quot;,&quot;12.8&quot;,&quot;5.9&quot;,&quot;14.8&quot;))
names(sound_df)[1] &lt;- &quot;soundfile&quot;
names(sound_df)[2] &lt;- &quot;sound_start&quot; 
names(sound_df)[3] &lt;- &quot;sound_end&quot;
print(sound_df)
    soundfile sound_start sound_end
1 soundfile_1           0       3.2
2 soundfile_1         8.0      14.1
3 soundfile_2         3.3       3.8
4 soundfile_2        11.7      12.8
5 soundfile_3         4.6       5.9
6 soundfile_3        13.1      14.8

The results produced for sound_df should be as follows:

     soundfile start end present
1  soundfile_1     0   3     yes
2  soundfile_1     3   6     yes
3  soundfile_1     6   9     yes
4  soundfile_1     9  12     yes
5  soundfile_1    12  15     yes
6  soundfile_2     0   3      no
7  soundfile_2     3   6     yes
8  soundfile_2     6   9      no
9  soundfile_2     9  12     yes
10 soundfile_2    12  15     yes
11 soundfile_3     0   3      no
12 soundfile_3     3   6     yes
13 soundfile_3     6   9      no
14 soundfile_3     9  12      no
15 soundfile_3    12  15     yes

答案1

得分: 1

以下是您要翻译的内容：

"Sounds like a case for a data.table non-equi join.

Following the advice in this blog post by David Selby, I create some duplicate columns, because I too can never remember which ones are merged when you join:

library(data.table)
setDT(sound_df)
sound_df[, `:=`(
    sound_start_for_join = as.numeric(sound_start),
    sound_end_for_join = as.numeric(sound_end)
)]

Then we can simply create a data.table of the required time periods (again with duplicate columns for the join):

CHUNK_SECONDS &lt;- 3
FILE_SECONDS &lt;- 15
time_windows &lt;- CJ(
    soundfile = sound_df$soundfile,
    start = seq(from = 0, to = FILE_SECONDS - CHUNK_SECONDS, by = CHUNK_SECONDS),
    unique = TRUE
)[, end := start + CHUNK_SECONDS][
    ,
    `:=`(
        start_for_join = start,
        end_for_join = end
    )
]
time_windows
#       soundfile start   end start_for_join end_for_join
#          &lt;char&gt; &lt;num&gt; &lt;num&gt;          &lt;num&gt;        &lt;num&gt;
#  1: soundfile_1     0     3              0            3
#  2: soundfile_1     3     6              3            6
#  3: soundfile_1     6     9              6            9
#  4: soundfile_1     9    12              9           12
#  5: soundfile_1    12    15             12           15
# &lt;etc&gt;

Finally we join in those cases where the vocalization overlaps with the time period, and remove the extra columns:

sound_out &lt;- sound_df[
    time_windows,
    on = .(
        soundfile,
        sound_start_for_join &lt; end_for_join,
        sound_end_for_join &gt; start_for_join
    )
][, .(
    soundfile, start, end,
    present = !is.na(sound_start)
)]

Which produces the desired output:

      soundfile start   end present
         &lt;char&gt; &lt;num&gt; &lt;num&gt;  &lt;lgcl&gt;
 1: soundfile_1     0     3    TRUE
 2: soundfile_1     3     6    TRUE
 3: soundfile_1     6     9    TRUE
 4: soundfile_1     9    12    TRUE
 5: soundfile_1    12    15    TRUE
 6: soundfile_2     0     3   FALSE
 7: soundfile_2     3     6    TRUE
 8: soundfile_2     6     9   FALSE
 9: soundfile_2     9    12    TRUE
10: soundfile_2    12    15    TRUE
11: soundfile_3     0     3   FALSE
12: soundfile_3     3     6    TRUE
13: soundfile_3     6     9   FALSE
14: soundfile_3     9    12   FALSE
15: soundfile_3    12    15    TRUE

Note: I made the present column a logical vector because they're easier to work with. If you really want a character vector of "yes" and "no", you can change present = !is.na(sound_start) to present = fifelse(is.na(sound_start), "yes", "no").

英文:

Sounds like a case for a data.table non-equi join.

Following the advice in this blog post by David Selby, I create some duplicate columns, because I too can never remember which ones are merged when you join:

library(data.table)
setDT(sound_df)
sound_df[, `:=`(
    sound_start_for_join = as.numeric(sound_start),
    sound_end_for_join = as.numeric(sound_end)
)]

Then we can simply create a data.table of the required time periods (again with duplicate columns for the join):

CHUNK_SECONDS &lt;- 3
FILE_SECONDS &lt;- 15
time_windows &lt;- CJ(
    soundfile = sound_df$soundfile,
    start = seq(from = 0, to = FILE_SECONDS - CHUNK_SECONDS, by = CHUNK_SECONDS),
    unique = TRUE
)[, end := start + CHUNK_SECONDS][
    ,
    `:=`(
        start_for_join = start,
        end_for_join = end
    )
]
time_windows
#       soundfile start   end start_for_join end_for_join
#          &lt;char&gt; &lt;num&gt; &lt;num&gt;          &lt;num&gt;        &lt;num&gt;
#  1: soundfile_1     0     3              0            3
#  2: soundfile_1     3     6              3            6
#  3: soundfile_1     6     9              6            9
#  4: soundfile_1     9    12              9           12
#  5: soundfile_1    12    15             12           15
# &lt;etc&gt;

Finally we join in those cases where the vocalization overlaps with the time period, and remove the extra columns:

sound_out &lt;- sound_df[
    time_windows,
    on = .(
        soundfile,
        sound_start_for_join &lt; end_for_join,
        sound_end_for_join &gt; start_for_join
    )
][, .(
    soundfile, start, end,
    present = !is.na(sound_start)
)]

Which produces the desired output:

      soundfile start   end present
         &lt;char&gt; &lt;num&gt; &lt;num&gt;  &lt;lgcl&gt;
 1: soundfile_1     0     3    TRUE
 2: soundfile_1     3     6    TRUE
 3: soundfile_1     6     9    TRUE
 4: soundfile_1     9    12    TRUE
 5: soundfile_1    12    15    TRUE
 6: soundfile_2     0     3   FALSE
 7: soundfile_2     3     6    TRUE
 8: soundfile_2     6     9   FALSE
 9: soundfile_2     9    12    TRUE
10: soundfile_2    12    15    TRUE
11: soundfile_3     0     3   FALSE
12: soundfile_3     3     6    TRUE
13: soundfile_3     6     9   FALSE
14: soundfile_3     9    12   FALSE
15: soundfile_3    12    15    TRUE

Note: I made the present column a logical vector because they're easier to work with. If you really want a character vector of "yes" and "no" you can change present = !is.na(sound_start) to present = fifelse(is.na(sound_start), "yes", "no").

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

数据整理问题，带有标记的声音文件

问题

答案1

如何从两个不同的数据框中对列进行子集化？

如何循环遍历重叠的多边形并从每个多边形中随机抽样点在 R 中。

How can I efficiently create a new column in a pandas DataFrame based on another column's rolling mean over a period of 30 days?

编写一个手动的BFS搜索算法

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。