数据整理问题,带有标记的声音文件

huangapple go评论98阅读模式
英文:

Data wrangling problem with labelled sound files

问题

以下是您要翻译的内容:

"Let's say I have a large dataframe with a column for 'soundfile' and then 'start' and 'end' columns for when a particular bird is vocalising. Each vocalisation can vary significantly in length. An example of the dataframe is sound_df below. Each row in the sound_df represents one vocalisation - each sound file is the same length (300 seconds in the real data, 15 seconds in this example).

  1. ## setup example dataframe
  2. id <- c("soundfile_1","soundfile_2","soundfile_3")
  3. sound_df<-data.frame(rep(id, each = 2), c("0","8.0","3.3","11.7","4.6","13.1"), c("3.2","14.1","3.8","12.8","5.9","14.8"))
  4. names(sound_df)[1] <- "soundfile"
  5. names(sound_df)[2] <- "sound_start"
  6. names(sound_df)[3] <- "sound_end"
  7. print(sound_df)
  8. soundfile sound_start sound_end
  9. 1 soundfile_1 0 3.2
  10. 2 soundfile_1 8.0 14.1
  11. 3 soundfile_2 3.3 3.8
  12. 4 soundfile_2 11.7 12.8
  13. 5 soundfile_3 4.6 5.9
  14. 6 soundfile_3 13.1 14.8

I then want to create a new dataframe for which each soundfile is divided into 3 second chunks as below, and the presence or absence of vocalisation in this period is extracted from sound_df and recorded in the column 'present'.

The results produced for sound_df should be as follows:

  1. soundfile start end present
  2. 1 soundfile_1 0 3 yes
  3. 2 soundfile_1 3 6 yes
  4. 3 soundfile_1 6 9 yes
  5. 4 soundfile_1 9 12 yes
  6. 5 soundfile_1 12 15 yes
  7. 6 soundfile_2 0 3 no
  8. 7 soundfile_2 3 6 yes
  9. 8 soundfile_2 6 9 no
  10. 9 soundfile_2 9 12 yes
  11. 10 soundfile_2 12 15 yes
  12. 11 soundfile_3 0 3 no
  13. 12 soundfile_3 3 6 yes
  14. 13 soundfile_3 6 9 no
  15. 14 soundfile_3 9 12 no
  16. 15 soundfile_3 12 15 yes

"

英文:

Let's say I have a large dataframe with a column for 'soundfile' and then 'start and 'end' columns for when a particular bird is vocalising. Each vocalisation can vary significantly in length. An example of the dataframe is sound_df below. Each row in the sound_df represents one vocalisation - each sound file is the same length (300 seconds in the real data, 15 seconds in this example).

  1. ## setup example dataframe
  2. id &lt;- c(&quot;soundfile_1&quot;,&quot;soundfile_2&quot;,&quot;soundfile_3&quot;)
  3. sound_df&lt;-data.frame(rep(id, each = 2), c(&quot;0&quot;,&quot;8.0&quot;,&quot;3.3&quot;,&quot;11.7&quot;,&quot;4.6&quot;,&quot;13.1&quot;), c(&quot;3.2&quot;,&quot;14.1&quot;,&quot;3.8&quot;,&quot;12.8&quot;,&quot;5.9&quot;,&quot;14.8&quot;))
  4. names(sound_df)[1] &lt;- &quot;soundfile&quot;
  5. names(sound_df)[2] &lt;- &quot;sound_start&quot;
  6. names(sound_df)[3] &lt;- &quot;sound_end&quot;
  7. print(sound_df)
  8. soundfile sound_start sound_end
  9. 1 soundfile_1 0 3.2
  10. 2 soundfile_1 8.0 14.1
  11. 3 soundfile_2 3.3 3.8
  12. 4 soundfile_2 11.7 12.8
  13. 5 soundfile_3 4.6 5.9
  14. 6 soundfile_3 13.1 14.8

I then want to create a new dataframe for which each soundfile is divided into 3 second chunks as below, and the presence or absence of vocalisation in this period is extracted from sound_df and recorded in the column 'present'.

The results produced for sound_df should be as follows:

  1. soundfile start end present
  2. 1 soundfile_1 0 3 yes
  3. 2 soundfile_1 3 6 yes
  4. 3 soundfile_1 6 9 yes
  5. 4 soundfile_1 9 12 yes
  6. 5 soundfile_1 12 15 yes
  7. 6 soundfile_2 0 3 no
  8. 7 soundfile_2 3 6 yes
  9. 8 soundfile_2 6 9 no
  10. 9 soundfile_2 9 12 yes
  11. 10 soundfile_2 12 15 yes
  12. 11 soundfile_3 0 3 no
  13. 12 soundfile_3 3 6 yes
  14. 13 soundfile_3 6 9 no
  15. 14 soundfile_3 9 12 no
  16. 15 soundfile_3 12 15 yes

答案1

得分: 1

以下是您要翻译的内容:

"Sounds like a case for a data.table non-equi join.

Following the advice in this blog post by David Selby, I create some duplicate columns, because I too can never remember which ones are merged when you join:

  1. library(data.table)
  2. setDT(sound_df)
  3. sound_df[, `:=`(
  4. sound_start_for_join = as.numeric(sound_start),
  5. sound_end_for_join = as.numeric(sound_end)
  6. )]

Then we can simply create a data.table of the required time periods (again with duplicate columns for the join):

  1. CHUNK_SECONDS &lt;- 3
  2. FILE_SECONDS &lt;- 15
  3. time_windows &lt;- CJ(
  4. soundfile = sound_df$soundfile,
  5. start = seq(from = 0, to = FILE_SECONDS - CHUNK_SECONDS, by = CHUNK_SECONDS),
  6. unique = TRUE
  7. )[, end := start + CHUNK_SECONDS][
  8. ,
  9. `:=`(
  10. start_for_join = start,
  11. end_for_join = end
  12. )
  13. ]
  14. time_windows
  15. # soundfile start end start_for_join end_for_join
  16. # &lt;char&gt; &lt;num&gt; &lt;num&gt; &lt;num&gt; &lt;num&gt;
  17. # 1: soundfile_1 0 3 0 3
  18. # 2: soundfile_1 3 6 3 6
  19. # 3: soundfile_1 6 9 6 9
  20. # 4: soundfile_1 9 12 9 12
  21. # 5: soundfile_1 12 15 12 15
  22. # &lt;etc&gt;

Finally we join in those cases where the vocalization overlaps with the time period, and remove the extra columns:

  1. sound_out &lt;- sound_df[
  2. time_windows,
  3. on = .(
  4. soundfile,
  5. sound_start_for_join &lt; end_for_join,
  6. sound_end_for_join &gt; start_for_join
  7. )
  8. ][, .(
  9. soundfile, start, end,
  10. present = !is.na(sound_start)
  11. )]

Which produces the desired output:

  1. soundfile start end present
  2. &lt;char&gt; &lt;num&gt; &lt;num&gt; &lt;lgcl&gt;
  3. 1: soundfile_1 0 3 TRUE
  4. 2: soundfile_1 3 6 TRUE
  5. 3: soundfile_1 6 9 TRUE
  6. 4: soundfile_1 9 12 TRUE
  7. 5: soundfile_1 12 15 TRUE
  8. 6: soundfile_2 0 3 FALSE
  9. 7: soundfile_2 3 6 TRUE
  10. 8: soundfile_2 6 9 FALSE
  11. 9: soundfile_2 9 12 TRUE
  12. 10: soundfile_2 12 15 TRUE
  13. 11: soundfile_3 0 3 FALSE
  14. 12: soundfile_3 3 6 TRUE
  15. 13: soundfile_3 6 9 FALSE
  16. 14: soundfile_3 9 12 FALSE
  17. 15: soundfile_3 12 15 TRUE

Note: I made the present column a logical vector because they're easier to work with. If you really want a character vector of "yes" and "no", you can change present = !is.na(sound_start) to present = fifelse(is.na(sound_start), "yes", "no").

英文:

Sounds like a case for a data.table non-equi join.

Following the advice in this blog post by David Selby, I create some duplicate columns, because I too can never remember which ones are merged when you join:

  1. library(data.table)
  2. setDT(sound_df)
  3. sound_df[, `:=`(
  4. sound_start_for_join = as.numeric(sound_start),
  5. sound_end_for_join = as.numeric(sound_end)
  6. )]

Then we can simply create a data.table of the required time periods (again with duplicate columns for the join):

  1. CHUNK_SECONDS &lt;- 3
  2. FILE_SECONDS &lt;- 15
  3. time_windows &lt;- CJ(
  4. soundfile = sound_df$soundfile,
  5. start = seq(from = 0, to = FILE_SECONDS - CHUNK_SECONDS, by = CHUNK_SECONDS),
  6. unique = TRUE
  7. )[, end := start + CHUNK_SECONDS][
  8. ,
  9. `:=`(
  10. start_for_join = start,
  11. end_for_join = end
  12. )
  13. ]
  14. time_windows
  15. # soundfile start end start_for_join end_for_join
  16. # &lt;char&gt; &lt;num&gt; &lt;num&gt; &lt;num&gt; &lt;num&gt;
  17. # 1: soundfile_1 0 3 0 3
  18. # 2: soundfile_1 3 6 3 6
  19. # 3: soundfile_1 6 9 6 9
  20. # 4: soundfile_1 9 12 9 12
  21. # 5: soundfile_1 12 15 12 15
  22. # &lt;etc&gt;

Finally we join in those cases where the vocalization overlaps with the time period, and remove the extra columns:

  1. sound_out &lt;- sound_df[
  2. time_windows,
  3. on = .(
  4. soundfile,
  5. sound_start_for_join &lt; end_for_join,
  6. sound_end_for_join &gt; start_for_join
  7. )
  8. ][, .(
  9. soundfile, start, end,
  10. present = !is.na(sound_start)
  11. )]

Which produces the desired output:

  1. soundfile start end present
  2. &lt;char&gt; &lt;num&gt; &lt;num&gt; &lt;lgcl&gt;
  3. 1: soundfile_1 0 3 TRUE
  4. 2: soundfile_1 3 6 TRUE
  5. 3: soundfile_1 6 9 TRUE
  6. 4: soundfile_1 9 12 TRUE
  7. 5: soundfile_1 12 15 TRUE
  8. 6: soundfile_2 0 3 FALSE
  9. 7: soundfile_2 3 6 TRUE
  10. 8: soundfile_2 6 9 FALSE
  11. 9: soundfile_2 9 12 TRUE
  12. 10: soundfile_2 12 15 TRUE
  13. 11: soundfile_3 0 3 FALSE
  14. 12: soundfile_3 3 6 TRUE
  15. 13: soundfile_3 6 9 FALSE
  16. 14: soundfile_3 9 12 FALSE
  17. 15: soundfile_3 12 15 TRUE

Note: I made the present column a logical vector because they're easier to work with. If you really want a character vector of &quot;yes&quot; and &quot;no&quot; you can change present = !is.na(sound_start) to present = fifelse(is.na(sound_start), &quot;yes&quot;, &quot;no&quot;).

huangapple
  • 本文由 发表于 2023年5月29日 17:35:45
  • 转载请务必保留本文链接:https://go.coder-hub.com/76356207.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定