April 4, 2023, 15:31:16

Normalize time stamp data

Question

I have a large data set stored as a numeric type that encodes time of day in 24-hour HHMM format.

Since the data type is numeric, the leading zeros are absent. A sample of the data can be found here:

> dput(sleepDiary_1[1:100,3:4])
structure(list(`What time did you get into bed? (hhmm) (e.g., 11pm = 2300, 1.35am = 0135)
*please make sure its 4 digits (2 for hours, 2 for minutes)` = c(2330, 
100, 9, 110, 10, 0, 209, 330, 2330, 50, 330, 800, 30, 100, 0, 
2345, 130, 135, 400, 330, 100, 400, 100, 100, 315, 2305, 250, 
215, 2300, 3, 356, 2306, 500, 0, 200, 10, 230, 2230, 100, 2200, 
1230, 1128, 100, 430, 200, 5, 300, 145, 1, 100, 2330, 300, 2314, 
1130, 0, 30, 1230, 15, 2300, 300, 200, 315, 2300, 105, 2310, 
300, 1248, 30, 30, 2315, 2300, 35, 2300, 211, 1330, 115, 45, 
130, 1200, 200, 300, 1220, 200, 230, 100, 300, 300, 145, 1100, 
544, 300, 300, 2238, 0, 100, 133, 30, 5, 205, 300), `What time did you try and go to sleep? (hhmm) (e.g., 11pm = 2300, 1.35am = 0135)` = c(2330, 
115, 34, 130, 20, 0, 257, 330, 15, 110, 430, 800, 40, 130, 0, 
2345, 200, 150, 445, 330, 105, 400, 100, 100, 315, 2305, 330, 
220, 100, 3, 430, 2306, 500, 5, 200, 0, 400, 2240, 130, 2200, 
1230, 200, 130, 430, 215, 15, 320, 200, 30, 130, 2330, 300, 2314, 
1132, 15, 30, 1230, 40, 2345, 300, 200, 315, 200, 110, 2310, 
300, 1248, 125, 30, 2310, 0, 20, 0, 211, 1345, 45, 0, 155, 100, 
330, 400, 1230, 200, 300, 115, 300, 300, 200, 1152, 530, 330, 
300, 2230, 45, 130, 130, 25, 20, 230, 320)), row.names = c(NA, 
-100L), class = c("tbl_df", "tbl", "data.frame"))

I wish to normalise the columns so I can perform further analysis, but I'm not sure which normalisation would work best. I looked at the various options for non-normal data, but none of them addresses cyclic data, which wraps around after a fixed period: after 2400 the time goes back to 0000, so the values don't keep increasing.

To add, the data are the sleep and wake-up times of different participants recorded in a study. We wish to normalise the data and remove any outliers that may be present.

Cheers!

Answer 1

Score: 1

I think this gets you closer to what you want. I started with renaming the columns.

library(ggplot2)
# Assumption: df holds the two sample columns from the question
df <- sleepDiary_1[1:100, 3:4]
names(df) <- c("bed_try", "sleep_try")
ggplot(df, aes(bed_try, sleep_try)) + geom_point()

To convert from hhmm to decimal hours, with the minutes expressed as a fraction of an hour:

convert_hhmm <- function(hhmm) {
  # whole hours plus the minutes part divided by 60
  floor(hhmm / 100) +
    (hhmm - floor(hhmm / 100) * 100) / 60
}
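
As a quick sanity check, the sketch below applies the same conversion to a few values from the sample. The `invalid` flag is an illustrative addition, not part of the original answer: it assumes that any entry with a minutes part of 60 or more, or an hours part of 24 or more, is a typing mistake worth inspecting.

```r
convert_hhmm <- function(hhmm) {
  floor(hhmm / 100) + (hhmm - floor(hhmm / 100) * 100) / 60
}

x <- c(2330, 100, 9, 0)
hours <- convert_hhmm(x)   # 23.5, 1, 0.15, 0

# Hypothetical validity check: minutes must be below 60, hours below 24
invalid <- (x %% 100 >= 60) | (x %/% 100 >= 24)
```

Rows where `invalid` is `TRUE` could be reviewed by hand before any further conversion.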

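Pick an arbitrary start for the sleep period; 2000 (8 pm) looks good. Then re-express every time as hhmm elapsed since this "pivot time": for times at or after the pivot, subtract the pivot; for the rest, which fall after midnight, add 2400 - pivot time. Because the arithmetic is done directly on hhmm values, it only works when the pivot is a whole hour.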

pivot_time <- 2000

Convert bed_try to new column, bed_plus

df$bed_plus <- df$bed_try - pivot_time
df$bed_plus[df$bed_plus < 0] <- df$bed_plus[df$bed_plus < 0] + 
                                 pivot_time + # back to bed_try
                                 (2400 - pivot_time)
df$bed_plus <- convert_hhmm(df$bed_plus)

Convert sleep_try to new column, sleep_plus

df$sleep_plus <- df$sleep_try - pivot_time
df$sleep_plus[df$sleep_plus < 0] <- df$sleep_plus[df$sleep_plus < 0] + 
  pivot_time +
  (2400 - pivot_time)
df$sleep_plus <- convert_hhmm(df$sleep_plus)
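
The two conversions repeat the same steps, so they could be wrapped in one helper. The following is only a sketch: `hhmm_to_minutes` and `hours_after_pivot` are hypothetical names, and the helper works in minutes with the modulo operator instead of hhmm arithmetic, which is a slightly different route to the same result and also copes with a pivot that is not a whole hour.

```r
# Convert hhmm (e.g. 2330) to minutes since midnight
hhmm_to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

# Hours elapsed since the pivot time, wrapping around midnight
hours_after_pivot <- function(hhmm, pivot = 2000) {
  elapsed <- (hhmm_to_minutes(hhmm) - hhmm_to_minutes(pivot)) %% (24 * 60)
  elapsed / 60
}

res <- hours_after_pivot(c(2330, 100, 1230))   # 3.5, 5, 16.5
```

With this in place, `df$bed_plus <- hours_after_pivot(df$bed_try)` and `df$sleep_plus <- hours_after_pivot(df$sleep_try)` would give the same columns as the step-by-step code.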

Exploratory plot

ggplot(df, aes(bed_plus, sleep_plus)) + geom_jitter()
ggplot(df, aes(sleep_plus - bed_plus)) + geom_histogram()

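Remove rows with a negative difference, i.e. where the sleep attempt comes before getting into bed; alternatively, figure out how to correct them.

# Logical indexing stays safe when no row is negative (-which() would then drop every row)
df <- df[(df$sleep_plus - df$bed_plus) >= 0, ]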

Do the results look more like what you want?

ggplot(df, aes(bed_plus, sleep_plus)) + geom_jitter()
ggplot(df, aes(sleep_plus - bed_plus)) + geom_histogram()
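
If you also need a single summary, such as a typical bedtime, that does not depend on the choice of pivot, one option from circular statistics is to place each time on a 24-hour circle and average the sine and cosine components. This is a different technique from the pivot approach above, sketched here in base R under the assumption that the input is already in decimal hours; `circular_mean_hours` is a made-up name.

```r
# Circular mean of clock times given in decimal hours (0 to 24)
circular_mean_hours <- function(hours) {
  angle <- hours / 24 * 2 * pi                       # hours -> radians
  mean_angle <- atan2(mean(sin(angle)), mean(cos(angle)))
  (mean_angle / (2 * pi) * 24) %% 24                 # radians -> hours
}

# 23:00 and 01:00 average to midnight on the circle, whereas the
# arithmetic mean of 23 and 1 is 12 (noon). Floating-point error can
# put the result a hair below 24 instead of at 0.
m <- circular_mean_hours(c(23, 1))
```

The same idea extends to spread and outlier detection, for example by measuring each time's angular distance from the circular mean.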
