How does data.table define "nearest" in a rolling join?
Question
The data.table documentation states that roll = "nearest" means:
> When i is a data.table and its row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then... "nearest" rolls the nearest value
The documentation for the rollends parameter states:
> If rollends[1]=TRUE, it will roll the first value backward. If rollends[2]=TRUE, it will roll the last value forward.
However, although rollends = c(TRUE, TRUE) is the default for roll = "nearest", when the row to join is before the first observation, the output is not what I expect.
Minimal reproducible example
Sample data
d is a data.table with two rows, each containing an id and an IDate. lookup is a data.table with a value recorded for each id on a specific date. The date in d for id 1 is after the last recorded date for that id in lookup. The date in d for id 2 is before the first recorded date for that id in lookup.
library(data.table)
lookup <- data.table(
    id = c(1, 1, 1, 2, 2, 2),
    date = seq(from = as.IDate("2020-01-01"), by = 1, length.out = 6),
    value = c(1:6)
)
d <- data.table(
    id = c(1, 2),
    date = as.IDate(c("2020-01-10", "2020-01-01"))
)
d
# id date
# <num> <IDat>
# 1: 1 2020-01-10
# 2: 2 2020-01-01
lookup
# id date value
# <num> <IDat> <int>
# 1: 1 2020-01-01 1
# 2: 1 2020-01-02 2
# 3: 1 2020-01-03 3
# 4: 2 2020-01-04 4
# 5: 2 2020-01-05 5
# 6: 2 2020-01-06 6
Join operation
I wish to join each id in d to the value on the nearest recorded date in lookup:
d[lookup,
    on = c("id", "date"),
    roll = "nearest",
    value := i.value,
    rollends = c(TRUE, TRUE)
]
Expected output
id 1 has the date "2020-01-10", which is after all the dates for id 1 in lookup, so I expect it to match the value on the final date, i.e. 3. Similarly, id 2 has the date "2020-01-01", which is before all the dates for id 2 in lookup, so I expect it to match the value on the first date, i.e. 4.
      id       date value
   <num>     <IDat> <int>
1:     1 2020-01-10     3
2:     2 2020-01-01     4
Actual output
id 1 matches as expected. However, id 2 matches the value 6, which is recorded on "2020-01-06", i.e. the final date for group 2, rather than the first date. My understanding of "nearest" is that, of the dates c("2020-01-04", "2020-01-05", "2020-01-06"), the nearest to "2020-01-01" is "2020-01-04".
      id       date value
   <num>     <IDat> <int>
1:     1 2020-01-10     3
2:     2 2020-01-01     6
This is the output regardless of whether rollends = c(TRUE, TRUE) or any of the other three possible options.
Session info
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
packageVersion("data.table") #'1.14.7'
Related questions
One related question does not have an answer. Another has an answer which suggests doing the join the other way around, i.e.:
lookup[d,
    on = c("id", "date"),
    roll = "nearest",
    nearest_date := i.date
]
d[lookup,
    on = c("id", "date" = "nearest_date"),
    value := i.value
]
This works, but it is two joins, which is more expensive, and more joins make code harder to maintain. In any case the fact that it works confirms I do not understand what "nearest" means.
Why does "nearest" act as expected if joining in one direction, but not the other? What is the definition of "nearest"? Is there a way to get the expected output with only one join?
Answer 1
Score: 2
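"nearest" works as expected, but d should be the inner table, i.e. the i argument, as in lookup[d, ...].
What happens here is that every row of lookup is rolled to its nearest row in d, so := is applied for all lookup values, and d ends up with the last lookup value for each group: 3 and 6.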
d[lookup,
    on = c("id", "date"),
    roll = "nearest",
    .(i.value, i.date, x.date)
]
   i.value     i.date     x.date
     <int>     <IDat>     <IDat>
1:       1 2020-01-01 2020-01-10
2:       2 2020-01-02 2020-01-10
3:       3 2020-01-03 2020-01-10 # Last value overwrites previous ones for id=1
4:       4 2020-01-04 2020-01-01
5:       5 2020-01-05 2020-01-01
6:       6 2020-01-06 2020-01-01 # Last value overwrites previous ones for id=2
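The direction of the roll can also be checked without any assignment. The sketch below rebuilds the question's sample data and uses which = TRUE, which returns, for each row of i, the row number of x it matched. The variable names rows_in_d and rows_in_lookup are only illustrative.

```r
library(data.table)

# Sample data from the question
lookup <- data.table(
    id = c(1, 1, 1, 2, 2, 2),
    date = seq(from = as.IDate("2020-01-01"), by = 1, length.out = 6),
    value = 1:6
)
d <- data.table(
    id = c(1, 2),
    date = as.IDate(c("2020-01-10", "2020-01-01"))
)

# d is x, lookup is i: one entry per row of lookup,
# giving the row of d that each lookup row rolled to
rows_in_d <- d[lookup, on = c("id", "date"), roll = "nearest", which = TRUE]

# lookup is x, d is i: one entry per row of d,
# giving the row of lookup that each d row rolled to
rows_in_lookup <- lookup[d, on = c("id", "date"), roll = "nearest", which = TRUE]

print(rows_in_d)
print(rows_in_lookup)
```

On this data the first call should give 1 1 1 2 2 2 (all three lookup rows of a group land on the single row of d for that id, so a later := overwrites an earlier one), while the second should give 3 4, the rows of lookup nearest to each row of d.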
Try:
lookup[d,
    on = c("id", "date"),
    roll = "nearest",
    .(id, i.date, x.value)
][]
      id     i.date x.value
   <num>     <IDat>   <int>
1:     1 2020-01-10       3
2:     2 2020-01-01       4
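If the goal is to add value to d by reference with a single join, one possible sketch is to run the lookup[d, ...] join inside the j of d and assign its x.value column. This is not from the original answer. It assumes that each row of d matches at most one row of lookup (no duplicated id/date pairs in lookup), so that the join returns exactly one value per row of d, in d's row order.

```r
library(data.table)

# Sample data from the question
lookup <- data.table(
    id = c(1, 1, 1, 2, 2, 2),
    date = seq(from = as.IDate("2020-01-01"), by = 1, length.out = 6),
    value = 1:6
)
d <- data.table(
    id = c(1, 2),
    date = as.IDate(c("2020-01-10", "2020-01-01"))
)

# A single rolling join with lookup as x and d as i; the join returns one
# x.value per row of d (under the assumption above), which is then
# assigned by reference as a new column of d
d[, value := lookup[d, on = c("id", "date"), roll = "nearest", x.value]]

print(d)
```

With the sample data this should leave d with value 3 for id 1 and value 4 for id 2, matching the expected output in the question.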
</details>