data.table如何在滚动连接中定义“nearest”?

huangapple go评论62阅读模式
英文:

How does data.table define "nearest" in a rolling join?

问题

"nearest" 意味着在 data.table 中的滚动连接操作中,如果 i 是一个 data.table,且它的行与除了最后一个 x 连接列之外的所有连接列匹配,并且它在最后一个 i 连接列中的值处于一个间隙中(包括在该组中 x 的最后观察之后),那么 "nearest" 会滚动到最近的值。

rollends 参数表示:

  • 如果 rollends[1]=TRUE,则会向后滚动第一个值。
  • 如果 rollends[2]=TRUE,则会向前滚动最后一个值。

然而,尽管 "nearest" 的默认 rollends 值为 c(TRUE, TRUE),当要连接的行在第一个观察之前时,输出不符合预期。

希望连接的示例数据如下:

d 是一个 data.table,包含两行,每行都包含一个 id 和一个 IDatelookup 是一个 data.table,对于每个 id 记录了特定日期的 valued 中的 id 1 的日期在 lookup 中的该 id 的最后记录日期之后。 d 中的 id 2 的日期在 lookup 中的该 id 的第一个记录日期之前。

连接操作如下:

希望将 d 中的每个 idlookup 中最近记录的日期上的 value 进行连接:

d[lookup,
    on = c("id", "date"),
    roll = "nearest",
    value := i.value,
    rollends = c(TRUE, TRUE)
]

预期输出:

id 1 具有日期 "2020-01-10",这在所有 lookupid 1 的日期之后,因此预期它与最后一个日期上的 value 匹配,即 3。类似地,id 2 具有日期 "2020-01-01",这在所有 lookupid 2 的日期之前,因此预期它与第一个日期上的 value 匹配。

实际输出:

id 1 的匹配结果如预期。然而,id 2 的匹配结果是 value 为 6,它在 "2020-01-06" 记录的,即组2的最后一个日期上,而不是第一个日期上。根据 "nearest" 的理解,对于日期 c("2020-01-04", "2020-01-05", "2020-01-06"),距离 "2020-01-01" 最近的是 "2020-01-04"。

这是无论 rollends 是否为 c(TRUE, TRUE) 或其他三个可能选项的输出。

希望的输出在一个方向上连接时表现正常,但在另一个方向上连接时表现不正常。"nearest" 的定义是什么?是否有一种方法可以通过一次连接获得预期的输出?

英文:

The data.table documentation states roll = "nearest" means:

> When i is a data.table and its row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then... "nearest" rolls the nearest value

The rollends parameter states:

> If rollends[1]=TRUE, it will roll the first value backward. If rollends[2]=TRUE, it will roll the last value forward.

However, although rollends = c(TRUE, TRUE) is the default for roll = "nearest", when the row to join is before the first observation, the output is not as I expect.

Minimal reproducible example

Sample data

d is a data.table, with two rows, which each contain an id and an IDate. lookup is a data.table with a value recorded for each id on a specific date. The date in d for id 1 is after the last recorded date for that id in lookup. The date in d for id 2 is before the first recorded date for that id in lookup.

library(data.table)
lookup <- data.table(
    id = c(1, 1, 1, 2, 2, 2),
    date = seq(from = as.IDate("2020-01-01"), by = 1, length.out = 6),
    value = c(1:6)
)

d <- data.table(
    id = c(1, 2),
    date = as.IDate(c("2020-01-10", "2020-01-01"))
)

d

#       id       date
#    <num>     <IDat>
# 1:     1 2020-01-10
# 2:     2 2020-01-01

lookup

#       id       date value
#    <num>     <IDat> <int>
# 1:     1 2020-01-01     1
# 2:     1 2020-01-02     2
# 3:     1 2020-01-03     3
# 4:     2 2020-01-04     4
# 5:     2 2020-01-05     5
# 6:     2 2020-01-06     6

Join operation

I wish to join each id in d to the value on the nearest recorded date in lookup:

d[lookup,
    on = c("id", "date"),
    roll = "nearest",
    value := i.value,
    rollends = c(TRUE, TRUE)
]

Expected output

id 1 has the date "2020-01-10", which is after all the dates for id 1 in lookup, so I expect it to the value on the final date, i.e. 3. Similarly, id 2 has the date "2020-01-01", which is before all the dates for id 2 in lookup, so I expect it to match to the value on the first date.

      id       date value
   <num>     <IDat> <int>
1:     1 2020-01-10     3
2:     2 2020-01-01     4

Actual output

id 1 matches as expected. However, id 2 matches to the value 6, which is recorded on "2020-01-06", i.e. the final date for group 2, rather than the first date. My understanding of "nearest" is that of the dates c("2020-01-04", "2020-01-05", "2020-01-06"), the nearest to to "2020-01-01" is "2020-01-04".

      id       date value
   <num>     <IDat> <int>
1:     1 2020-01-10     3
2:     2 2020-01-01     6

This is the output regardless of whether rollends = c(TRUE, TRUE) or any of the other three possible options.

Session info

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
packageVersion("data.table") #'1.14.7'

This is related but does not have an answer. This has an answer which suggests doing the join the other way around, i.e.:

lookup[d,
    on = c("id", "date"),
    roll = "nearest",
    nearest_date := i.date
]

d[lookup,
    on = c("id", "date" = "nearest_date"),
    value := i.value
]

This works, but it is two joins, which is more expensive, and more joins make code harder to maintain. In any case the fact that it works confirms I do not understand what "nearest" means.

Why does "nearest" act as expected if joining in one direction, but not the other? What is the definition of "nearest"? Is there a way to get the expected output with only one join?

答案1

得分: 2

"nearest"正常运作,但内部表应为"d"。

这里发生的情况是,:= 应用于所有 lookup 值,以便在 d 中获取每个组的最后一个查找值:36

d[lookup,
  on = c("id", "date"),
  roll = "nearest",
  .(i.value,i.date,x.date)
]
   i.value     i.date     x.date
     <int>     <IDat>     <IDat>
1:       1 2020-01-01 2020-01-10
2:       2 2020-01-02 2020-01-10
3:       3 2020-01-03 2020-01-10 # 对于id=1,最后一个值覆盖了之前的值
4:       4 2020-01-04 2020-01-01
5:       5 2020-01-05 2020-01-01
6:       6 2020-01-06 2020-01-01 # 对于id=2,最后一个值覆盖了之前的值

尝试:

lookup[d,
  on = c("id", "date"),
  roll = "nearest",
  .(id,i.date,x.value)
][]

      id     i.date x.value
   <num>     <IDat>   <int>
1:     1 2020-01-10       3
2:     2 2020-01-01       4
英文:

nearest works as expected, but the inner table should be d.

What happens here is that := is applied to all lookup values so that you get in d the last lookup value for each group : 3 and 6

d[lookup,
  on = c("id", "date"),
  roll = "nearest",
  .(i.value,i.date,x.date)
]
   i.value     i.date     x.date
     <int>     <IDat>     <IDat>
1:       1 2020-01-01 2020-01-10
2:       2 2020-01-02 2020-01-10
3:       3 2020-01-03 2020-01-10 # Last value overwrites previous ones for id=1
4:       4 2020-01-04 2020-01-01
5:       5 2020-01-05 2020-01-01
6:       6 2020-01-06 2020-01-01 # Last value overwrites previous ones for id=2

Try :

lookup[d,
  on = c("id", "date"),
  roll = "nearest",
  .(id,i.date,x.value)
][]

      id     i.date x.value
   <num>     <IDat>   <int>
1:     1 2020-01-10       3
2:     2 2020-01-01       4


</details>



huangapple
  • 本文由 发表于 2023年2月23日 19:11:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/75544014.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定