Normalize time stamp data

Question

I have a large data set in which times are stored as numeric values in 24-hour HHMM format.

Since the data type is numeric, the leading zeroes are absent. Here is a sample of the data:

> dput(sleepDiary_1[1:100,3:4])

structure(list(`What time did you get into bed? (hhmm) (e.g., 11pm = 2300, 1.35am = 0135)
*please make sure its 4 digits (2 for hours, 2 for minutes)` = c(2330, 
100, 9, 110, 10, 0, 209, 330, 2330, 50, 330, 800, 30, 100, 0, 
2345, 130, 135, 400, 330, 100, 400, 100, 100, 315, 2305, 250, 
215, 2300, 3, 356, 2306, 500, 0, 200, 10, 230, 2230, 100, 2200, 
1230, 1128, 100, 430, 200, 5, 300, 145, 1, 100, 2330, 300, 2314, 
1130, 0, 30, 1230, 15, 2300, 300, 200, 315, 2300, 105, 2310, 
300, 1248, 30, 30, 2315, 2300, 35, 2300, 211, 1330, 115, 45, 
130, 1200, 200, 300, 1220, 200, 230, 100, 300, 300, 145, 1100, 
544, 300, 300, 2238, 0, 100, 133, 30, 5, 205, 300), `What time did you try and go to sleep? (hhmm) (e.g., 11pm = 2300, 1.35am = 0135)` = c(2330, 
115, 34, 130, 20, 0, 257, 330, 15, 110, 430, 800, 40, 130, 0, 
2345, 200, 150, 445, 330, 105, 400, 100, 100, 315, 2305, 330, 
220, 100, 3, 430, 2306, 500, 5, 200, 0, 400, 2240, 130, 2200, 
1230, 200, 130, 430, 215, 15, 320, 200, 30, 130, 2330, 300, 2314, 
1132, 15, 30, 1230, 40, 2345, 300, 200, 315, 200, 110, 2310, 
300, 1248, 125, 30, 2310, 0, 20, 0, 211, 1345, 45, 0, 155, 100, 
330, 400, 1230, 200, 300, 115, 300, 300, 200, 1152, 530, 330, 
300, 2230, 45, 130, 130, 25, 20, 230, 320)), row.names = c(NA, 
-100L), class = c("tbl_df", "tbl", "data.frame"))

I wish to normalise the columns so I can perform further analysis, but I'm not sure which normalisation will work best. I looked at the various options for non-normal data, but none of them deals with cyclic data, which wraps around after a fixed period: after 2400 the time goes back to 0000, so the values do not keep increasing.

In addition, the data are the sleep and wake-up times of different participants recorded in a study. We wish to normalise the data and remove any outliers that may be present.

Cheers!

Answer 1

Score: 1

I think this gets you closer to what you want. I started by renaming the columns (df is the two-column sample from the question).

library(ggplot2)
names(df) <- c("bed_try", "sleep_try")
ggplot(df, aes(bed_try, sleep_try)) + geom_point()

To convert from hhmm to decimal hours, where the part after the decimal point is a fraction of an hour:

convert_hhmm <- function(hhmm) {
  floor(hhmm / 100) +                        # whole hours (the HH part)
    (hhmm - floor(hhmm / 100) * 100) / 60    # minutes (the MM part) as a fraction of an hour
}
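As a quick sanity check, here is a minimal sketch of the same conversion written with R's integer-division and remainder operators, applied to a few made-up values that have lost their leading zeroes. The is_valid_hhmm helper and the sample vector are my own additions for illustration, not part of the answer above.

```r
# Same conversion as above: HHMM number -> decimal hours.
convert_hhmm <- function(hhmm) {
  hours   <- hhmm %/% 100   # integer division keeps the HH part
  minutes <- hhmm %% 100    # remainder keeps the MM part
  hours + minutes / 60
}

# Hypothetical helper: TRUE when the entry is a possible clock time.
is_valid_hhmm <- function(hhmm) {
  hhmm >= 0 & hhmm %/% 100 < 24 & hhmm %% 100 < 60
}

# Made-up entries: 9 means 00:09 and 135 means 01:35; 1275 is not a valid time.
x <- c(2330, 135, 9, 0, 1275)

is_valid_hhmm(x)
convert_hhmm(x[is_valid_hhmm(x)])
```

For example, 2330 becomes 23.5 and 9 becomes 0.15. Checking validity first might be worthwhile because a typo such as 1275 would otherwise be converted silently.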

Pick an arbitrary start for the sleep period; 2000 looks good. Then re-express every time as hhmm elapsed since this "pivot time": subtract the pivot time from times at or after it, and add 2400 - pivot time to the rest.

pivot_time <- 2000

Convert bed_try to new column, bed_plus

df$bed_plus <- df$bed_try - pivot_time
df$bed_plus[df$bed_plus < 0] <- df$bed_plus[df$bed_plus < 0] + 
                                 pivot_time + # back to bed_try
                                 (2400 - pivot_time)
df$bed_plus <- convert_hhmm(df$bed_plus)

Convert sleep_try to new column, sleep_plus

df$sleep_plus <- df$sleep_try - pivot_time
df$sleep_plus[df$sleep_plus < 0] <- df$sleep_plus[df$sleep_plus < 0] + 
  pivot_time +
  (2400 - pivot_time)
df$sleep_plus <- convert_hhmm(df$sleep_plus)
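One caveat: the subtraction above is done in hhmm units, so it appears to rely on the pivot being a whole hour. With a pivot of 2030, for instance, 2100 - 2030 gives 70, which convert_hhmm reads as 70 minutes rather than 30. A possible alternative, sketched below, is to convert to decimal hours first and wrap with the modulo operator. The hours_after_pivot name and its arguments are hypothetical, not from the answer.

```r
# HHMM number -> decimal hours.
convert_hhmm <- function(hhmm) hhmm %/% 100 + (hhmm %% 100) / 60

# Hypothetical helper: hours elapsed since the pivot, wrapping around at 24.
# In R, %% returns a result with the sign of the divisor, so negative
# differences are mapped back into [0, 24).
hours_after_pivot <- function(hhmm, pivot_hhmm = 2000) {
  (convert_hhmm(hhmm) - convert_hhmm(pivot_hhmm)) %% 24
}

# Made-up bedtimes: 23:30, 01:00 and 20:00 with the default pivot of 20:00.
hours_after_pivot(c(2330, 100, 2000))

# A pivot that is not on the hour also works.
hours_after_pivot(2100, pivot_hhmm = 2030)
```

With the default pivot, 2330 maps to 3.5 and 100 maps to 5, the same values the two-step version produces. The same call could be applied to both columns, e.g. df$bed_plus <- hours_after_pivot(df$bed_try).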

Exploratory plot

ggplot(df, aes(bed_plus, sleep_plus)) + geom_jitter()

ggplot(df, aes(sleep_plus - bed_plus)) + geom_histogram()

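Remove the rows where the sleep time comes before the bed time (negative differences); alternatively, figure out how to correct them.

# Keep rows by logical index; -which() would drop every row if no difference were negative
df <- df[(df$sleep_plus - df$bed_plus) >= 0, ]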

Do the results look more like what you want?

ggplot(df, aes(bed_plus, sleep_plus)) + geom_jitter()

ggplot(df, aes(sleep_plus - bed_plus)) + geom_histogram()
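For the outlier part of the question, one option once the times are on the "hours after pivot" scale is Tukey's 1.5 x IQR rule. This is a technique I am swapping in, not something the steps above do, and the is_outlier_iqr helper and the sample values are made up for illustration. It also assumes the pivot was chosen so that typical bedtimes do not straddle the wrap-around point.

```r
# Hypothetical helper: TRUE where a value lies outside the Tukey fences.
is_outlier_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])   # k times the interquartile range
  x < q[1] - fence | x > q[2] + fence
}

# Made-up hours-after-pivot values; 15.5 stands in for a daytime entry.
bed_plus <- c(3.5, 5, 4.15, 5.17, 4.17, 4, 6.15, 7.5, 15.5)

flagged <- is_outlier_iqr(bed_plus)
bed_plus[flagged]    # values that would be dropped
bed_plus[!flagged]   # values that would be kept
```

In this toy vector only the 15.5 entry is flagged. Whether such entries are true outliers or 12-hour-clock mistakes (e.g. 1130 meant as 2330) is a judgment call that the data alone cannot settle.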

Posted by huangapple on April 4, 2023 at 15:31:16
Link to this article: https://go.coder-hub.com/75926627.html