Normalize time stamp data
Question
I have a large data set stored as a numeric type that encodes time of day in 24-hour HHMM form.
Since the data type is numeric, the leading zeros are absent. A sample of the data can be found here:
> dput(sleepDiary_1[1:100,3:4])
structure(list(`What time did you get into bed? (hhmm) (e.g., 11pm = 2300, 1.35am = 0135)
*please make sure its 4 digits (2 for hours, 2 for minutes)` = c(2330,
100, 9, 110, 10, 0, 209, 330, 2330, 50, 330, 800, 30, 100, 0,
2345, 130, 135, 400, 330, 100, 400, 100, 100, 315, 2305, 250,
215, 2300, 3, 356, 2306, 500, 0, 200, 10, 230, 2230, 100, 2200,
1230, 1128, 100, 430, 200, 5, 300, 145, 1, 100, 2330, 300, 2314,
1130, 0, 30, 1230, 15, 2300, 300, 200, 315, 2300, 105, 2310,
300, 1248, 30, 30, 2315, 2300, 35, 2300, 211, 1330, 115, 45,
130, 1200, 200, 300, 1220, 200, 230, 100, 300, 300, 145, 1100,
544, 300, 300, 2238, 0, 100, 133, 30, 5, 205, 300), `What time did you try and go to sleep? (hhmm) (e.g., 11pm = 2300, 1.35am = 0135)` = c(2330,
115, 34, 130, 20, 0, 257, 330, 15, 110, 430, 800, 40, 130, 0,
2345, 200, 150, 445, 330, 105, 400, 100, 100, 315, 2305, 330,
220, 100, 3, 430, 2306, 500, 5, 200, 0, 400, 2240, 130, 2200,
1230, 200, 130, 430, 215, 15, 320, 200, 30, 130, 2330, 300, 2314,
1132, 15, 30, 1230, 40, 2345, 300, 200, 315, 200, 110, 2310,
300, 1248, 125, 30, 2310, 0, 20, 0, 211, 1345, 45, 0, 155, 100,
330, 400, 1230, 200, 300, 115, 300, 300, 200, 1152, 530, 330,
300, 2230, 45, 130, 130, 25, 20, 230, 320)), row.names = c(NA,
-100L), class = c("tbl_df", "tbl", "data.frame"))
I wish to normalise the columns so I can perform further analysis, but I'm not sure which normalisation would work best. I looked at the various options for non-normal data, but none of them addresses cyclic data that wraps around after a certain period: after 2400 the time goes back to 0000, so the values don't keep increasing but cycle.
To add, the data are the sleep and wake-up times of different participants recorded in a study. We wish to normalise the data and remove any outliers that may be present.
Cheers!
Answer 1
Score: 1
I think this gets you closer to what you want. I started by renaming the columns.
library(ggplot2)
df <- sleepDiary_1[1:100, 3:4]  # assumption: df holds the two columns shown in the question
names(df) <- c("bed_try", "sleep_try")
ggplot(df, aes(bed_try, sleep_try)) + geom_point()
To convert from hhmm to decimal hours, where the part after the decimal point is the fraction of an hour:
convert_hhmm <- function(hhmm) {
  floor(hhmm / 100) +                        # whole hours
    (hhmm - floor(hhmm / 100) * 100) / 60    # minutes as a fraction of an hour
}
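As a quick sanity check of this conversion, here is a minimal sketch of an equivalent form using R's integer-division and modulo operators. The name `hhmm_to_hours` is made up for illustration and is not part of the original answer.

```r
# Hypothetical helper (not from the original answer): same conversion,
# written with %/% (integer division) and %% (modulo).
hhmm_to_hours <- function(hhmm) {
  hours   <- hhmm %/% 100   # leading digits are the hour
  minutes <- hhmm %% 100    # last two digits are the minutes
  hours + minutes / 60
}

# A few values taken from the sample: 0000, 0009, 0135 and 2330
hhmm_to_hours(c(0, 9, 135, 2330))
# approximately 0, 0.15, 1.583 and 23.5 hours
```

Because the missing leading zeros only affect how the number is printed, 9 is still read as 00:09 and 135 as 01:35.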
Pick an arbitrary start to the sleep period; 2000 looks good.
Change all times to hhmm after the "pivot time".
Since we want hours after the pivot time, we can subtract it from the times that are later than it and add 2400 - pivot time to the rest.
pivot_time <- 2000  # a whole hour, so plain hhmm subtraction stays valid
Convert bed_try to new column, bed_plus
df$bed_plus <- df$bed_try - pivot_time
df$bed_plus[df$bed_plus < 0] <- df$bed_plus[df$bed_plus < 0] +
  pivot_time + # back to bed_try
  (2400 - pivot_time)
df$bed_plus <- convert_hhmm(df$bed_plus)
Convert sleep_try to new column, sleep_plus
df$sleep_plus <- df$sleep_try - pivot_time
df$sleep_plus[df$sleep_plus < 0] <- df$sleep_plus[df$sleep_plus < 0] +
  pivot_time +
  (2400 - pivot_time)
df$sleep_plus <- convert_hhmm(df$sleep_plus)
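The subtraction above is done directly on hhmm values, which gives the right answer only while the pivot falls on a whole hour. With a pivot of 2030, for example, 2100 - 2030 = 70 would be read as 70 minutes instead of 30. As a hedged alternative, the sketch below does the same shift in minutes and lets the modulo operator handle the wrap past midnight. `hours_after_pivot` and its `pivot` argument are hypothetical names, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): hours elapsed since a
# pivot clock time, computed in minutes so the pivot need not be a whole hour.
hours_after_pivot <- function(hhmm, pivot = 2000) {
  to_minutes <- function(x) (x %/% 100) * 60 + x %% 100
  # In R, %% takes the sign of the divisor, so negative differences
  # wrap back into the range [0, 1440)
  ((to_minutes(hhmm) - to_minutes(pivot)) %% 1440) / 60
}

hours_after_pivot(c(2330, 100, 0))    # 3.5, 5 and 4 hours after 20:00
hours_after_pivot(2100, pivot = 2030) # 0.5
```

For the pivot of 2000 used here, this should give the same values as the two-step version, e.g. `df$bed_plus <- hours_after_pivot(df$bed_try)`.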
Exploratory plot
ggplot(df, aes(bed_plus, sleep_plus)) + geom_jitter()
ggplot(df, aes(sleep_plus - bed_plus)) + geom_histogram()
Remove negatives, or alternatively figure out how to correct them.
df <- df[(df$sleep_plus - df$bed_plus) >= 0, ]  # logical index; keeps all rows when nothing is negative
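One caveat when dropping rows by negative index: if no row matches the condition, `which()` returns an empty vector, and indexing with its negation selects no rows at all rather than all of them. The toy data frame below is made up purely to illustrate this; a logical index avoids the problem.

```r
# Toy data frame, made up for illustration: no negative gaps at all
toy <- data.frame(gap = c(0.5, 1.0, 0.25))

# -which() on an empty match is still an empty index, so every row is dropped
nrow(toy[-which(toy$gap < 0), , drop = FALSE])  # 0

# A logical index keeps exactly the rows that satisfy the condition
nrow(toy[toy$gap >= 0, , drop = FALSE])         # 3
```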
Do the results look more like what you want?
ggplot(df, aes(bed_plus, sleep_plus)) + geom_jitter()
ggplot(df, aes(sleep_plus - bed_plus)) + geom_histogram()