data.table如何在滚动连接中定义“nearest”?

huangapple go评论94阅读模式
英文:

How does data.table define "nearest" in a rolling join?

问题

"nearest" 意味着在 data.table 中的滚动连接操作中,如果 i 是一个 data.table,且它的行与除了最后一个 x 连接列之外的所有连接列匹配,并且它在最后一个 i 连接列中的值处于一个间隙中(包括在该组中 x 的最后观察之后),那么 "nearest" 会滚动到最近的值。

rollends 参数表示:

  • 如果 rollends[1]=TRUE,则会向后滚动第一个值。
  • 如果 rollends[2]=TRUE,则会向前滚动最后一个值。

然而,尽管 "nearest" 的默认 rollends 值为 c(TRUE, TRUE),当要连接的行在第一个观察之前时,输出不符合预期。

希望连接的示例数据如下:

d 是一个 data.table,包含两行,每行都包含一个 id 和一个 IDatelookup 是一个 data.table,对于每个 id 记录了特定日期的 valued 中的 id 1 的日期在 lookup 中的该 id 的最后记录日期之后。 d 中的 id 2 的日期在 lookup 中的该 id 的第一个记录日期之前。

连接操作如下:

希望将 d 中的每个 idlookup 中最近记录的日期上的 value 进行连接:

  1. d[lookup,
  2. on = c("id", "date"),
  3. roll = "nearest",
  4. value := i.value,
  5. rollends = c(TRUE, TRUE)
  6. ]

预期输出:

id 1 具有日期 "2020-01-10",这在所有 lookupid 1 的日期之后,因此预期它与最后一个日期上的 value 匹配,即 3。类似地,id 2 具有日期 "2020-01-01",这在所有 lookupid 2 的日期之前,因此预期它与第一个日期上的 value 匹配。

实际输出:

id 1 的匹配结果如预期。然而,id 2 的匹配结果是 value 为 6,它在 "2020-01-06" 记录的,即组2的最后一个日期上,而不是第一个日期上。根据 "nearest" 的理解,对于日期 c("2020-01-04", "2020-01-05", "2020-01-06"),距离 "2020-01-01" 最近的是 "2020-01-04"。

这是无论 rollends 是否为 c(TRUE, TRUE) 或其他三个可能选项的输出。

希望的输出在一个方向上连接时表现正常,但在另一个方向上连接时表现不正常。"nearest" 的定义是什么?是否有一种方法可以通过一次连接获得预期的输出?

英文:

The data.table documentation states roll = "nearest" means:

> When i is a data.table and its row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then... "nearest" rolls the nearest value

The rollends parameter states:

> If rollends[1]=TRUE, it will roll the first value backward. If rollends[2]=TRUE, it will roll the last value forward.

However, although rollends = c(TRUE, TRUE) is the default for roll = "nearest", when the row to join is before the first observation, the output is not as I expect.

Minimal reproducible example

Sample data

d is a data.table, with two rows, which each contain an id and an IDate. lookup is a data.table with a value recorded for each id on a specific date. The date in d for id 1 is after the last recorded date for that id in lookup. The date in d for id 2 is before the first recorded date for that id in lookup.

  1. library(data.table)
  2. lookup <- data.table(
  3. id = c(1, 1, 1, 2, 2, 2),
  4. date = seq(from = as.IDate("2020-01-01"), by = 1, length.out = 6),
  5. value = c(1:6)
  6. )
  7. d <- data.table(
  8. id = c(1, 2),
  9. date = as.IDate(c("2020-01-10", "2020-01-01"))
  10. )
  11. d
  12. # id date
  13. # <num> <IDat>
  14. # 1: 1 2020-01-10
  15. # 2: 2 2020-01-01
  16. lookup
  17. # id date value
  18. # <num> <IDat> <int>
  19. # 1: 1 2020-01-01 1
  20. # 2: 1 2020-01-02 2
  21. # 3: 1 2020-01-03 3
  22. # 4: 2 2020-01-04 4
  23. # 5: 2 2020-01-05 5
  24. # 6: 2 2020-01-06 6

Join operation

I wish to join each id in d to the value on the nearest recorded date in lookup:

  1. d[lookup,
  2. on = c("id", "date"),
  3. roll = "nearest",
  4. value := i.value,
  5. rollends = c(TRUE, TRUE)
  6. ]

Expected output

id 1 has the date "2020-01-10", which is after all the dates for id 1 in lookup, so I expect it to the value on the final date, i.e. 3. Similarly, id 2 has the date "2020-01-01", which is before all the dates for id 2 in lookup, so I expect it to match to the value on the first date.

  1. id date value
  2. <num> <IDat> <int>
  3. 1: 1 2020-01-10 3
  4. 2: 2 2020-01-01 4

Actual output

id 1 matches as expected. However, id 2 matches to the value 6, which is recorded on "2020-01-06", i.e. the final date for group 2, rather than the first date. My understanding of "nearest" is that of the dates c("2020-01-04", "2020-01-05", "2020-01-06"), the nearest to to "2020-01-01" is "2020-01-04".

  1. id date value
  2. <num> <IDat> <int>
  3. 1: 1 2020-01-10 3
  4. 2: 2 2020-01-01 6

This is the output regardless of whether rollends = c(TRUE, TRUE) or any of the other three possible options.

Session info

  1. R version 4.1.3 (2022-03-10)
  2. Platform: x86_64-w64-mingw32/x64 (64-bit)
  3. packageVersion("data.table") #'1.14.7'

This is related but does not have an answer. This has an answer which suggests doing the join the other way around, i.e.:

  1. lookup[d,
  2. on = c("id", "date"),
  3. roll = "nearest",
  4. nearest_date := i.date
  5. ]
  6. d[lookup,
  7. on = c("id", "date" = "nearest_date"),
  8. value := i.value
  9. ]

This works, but it is two joins, which is more expensive, and more joins make code harder to maintain. In any case the fact that it works confirms I do not understand what "nearest" means.

Why does "nearest" act as expected if joining in one direction, but not the other? What is the definition of "nearest"? Is there a way to get the expected output with only one join?

答案1

得分: 2

"nearest"正常运作,但内部表应为"d"。

这里发生的情况是,:= 应用于所有 lookup 值,以便在 d 中获取每个组的最后一个查找值:36

  1. d[lookup,
  2. on = c("id", "date"),
  3. roll = "nearest",
  4. .(i.value,i.date,x.date)
  5. ]
  6. i.value i.date x.date
  7. <int> <IDat> <IDat>
  8. 1: 1 2020-01-01 2020-01-10
  9. 2: 2 2020-01-02 2020-01-10
  10. 3: 3 2020-01-03 2020-01-10 # 对于id=1,最后一个值覆盖了之前的值
  11. 4: 4 2020-01-04 2020-01-01
  12. 5: 5 2020-01-05 2020-01-01
  13. 6: 6 2020-01-06 2020-01-01 # 对于id=2,最后一个值覆盖了之前的值

尝试:

  1. lookup[d,
  2. on = c("id", "date"),
  3. roll = "nearest",
  4. .(id,i.date,x.value)
  5. ][]
  6. id i.date x.value
  7. <num> <IDat> <int>
  8. 1: 1 2020-01-10 3
  9. 2: 2 2020-01-01 4
英文:

nearest works as expected, but the inner table should be d.

What happens here is that := is applied to all lookup values so that you get in d the last lookup value for each group : 3 and 6

  1. d[lookup,
  2. on = c("id", "date"),
  3. roll = "nearest",
  4. .(i.value,i.date,x.date)
  5. ]
  6. i.value i.date x.date
  7. <int> <IDat> <IDat>
  8. 1: 1 2020-01-01 2020-01-10
  9. 2: 2 2020-01-02 2020-01-10
  10. 3: 3 2020-01-03 2020-01-10 # Last value overwrites previous ones for id=1
  11. 4: 4 2020-01-04 2020-01-01
  12. 5: 5 2020-01-05 2020-01-01
  13. 6: 6 2020-01-06 2020-01-01 # Last value overwrites previous ones for id=2

Try :

  1. lookup[d,
  2. on = c("id", "date"),
  3. roll = "nearest",
  4. .(id,i.date,x.value)
  5. ][]
  6. id i.date x.value
  7. <num> <IDat> <int>
  8. 1: 1 2020-01-10 3
  9. 2: 2 2020-01-01 4
  10. </details>

huangapple
  • 本文由 发表于 2023年2月23日 19:11:38
  • 转载请务必保留本文链接:https://go.coder-hub.com/75544014.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定