英文:
match values in both row and columns from another dataframe
问题
I understand your request, but let's first clarify what you want to do in code. You'd like to create a new column in the "d" dataframe. This new column should be populated with values from the "L" dataframe based on the values in the "quantity" column of "d" and the corresponding "fli" values. Is that correct?
英文:
L <- c(0, 500, 1000, 2000, 3000, 5000, 10000, 20000, 50000);
fli.1 <- c(0, 0.1, 0.2, 0.4, 0.8, 0.9, 1, 1.2, 1.8);
fli.2 <- c(0, 0.11, 0.21, 0.42, 0.84, 0.95, 1.05, 1.26, 1.89);
fli.3 <- c(0, 0.11, 0.22, 0.44, 0.88, 0.99, 1.1, 1.32, 1.98);
fli.4 <- c(0, 0.12, 0.23, 0.46, 0.93, 1.04, 1.16, 1.39, 2.08);
fli.5 <- c(0, 0.12, 0.24, 0.49, 0.97, 1.09, 1.22, 1.46, 2.19);
data <- data.frame(L, fli.1, fli.2, fli.3, fli.4, fli.5);
d <- data.frame(quantity = c(300, 368, 568, 20, 1000, 37659, 45000, 2500, 4500, 78453, 1200, 1589), fli = c("fli.1", "fli.1", "fli.4", "fli.5", "fli.2", "fli.2", "fli.5", "fli.1", "fli.2", "fli.2", "fli.3", "fli.4"));
i need to create another column in the dataframe d such that for each of its entry it takes value from the table L.
it should select row which is less than the quantity.
it should select column based on the fli.
for e.g. 37659 it would be 8th row and 2nd column which is 1.26.
I have tried using matrix, but it takes too much time. note that it is sample data i need to apply it to a very large dataset.
答案1
得分: 1
以下是您要翻译的内容:
"如 L 在 data 中已排序,您可以使用 findInterval
获取行,使用 match
获取列,然后使用 cbind
组合这些索引并用它们来子集 data。
d$value <-
data[cbind(findInterval(d$quantity, data$L), match(d$fli, names(data)))]
d
# quantity fli value
#1 300 fli.1 0.00
#2 368 fli.1 0.00
#3 568 fli.4 0.12
#4 20 fli.5 0.00
#5 1000 fli.2 0.21
#6 37659 fli.2 1.26
#7 45000 fli.5 1.46
#8 2500 fli.1 0.40
#9 4500 fli.2 0.84
#10 78453 fli.2 1.89
#11 1200 fli.3 0.22
#12 1589 fli.4 0.23
```"
---
"性能测试
```R
library(dplyr)
bench::mark(check=FALSE,
"Jon Spring" = {d |>
left_join(tidyr::pivot_longer(data, -L, names_to = "fli"),
join_by(fli, closest(quantity >= L)))},
GKi = cbind(d, value=data[cbind(findInterval(d$quantity, data$L), match(d$fli, names(data)))]))
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#1 Jon Spring 5.22ms 5.29ms 188. 21.4KB 8.45 89 4
#2 GKi 198.77µs 207.86µs 4742. 480B 10.3 2313 5
```"
在这种情况下,GKi 比 Jon Spring 快约 25 倍,并且分配的内存较少。
<details>
<summary>英文:</summary>
As *L* is sorted in *data* you can use `findInterval` to get the row and `match` for the column, `cbind` the indices and use them to subset *data*.
d$value <-
data[cbind(findInterval(d$quantity, data$L), match(d$fli, names(data)))]
d
quantity fli value
#1 300 fli.1 0.00
#2 368 fli.1 0.00
#3 568 fli.4 0.12
#4 20 fli.5 0.00
#5 1000 fli.2 0.21
#6 37659 fli.2 1.26
#7 45000 fli.5 1.46
#8 2500 fli.1 0.40
#9 4500 fli.2 0.84
#10 78453 fli.2 1.89
#11 1200 fli.3 0.22
#12 1589 fli.4 0.23
---
Benchmark
library(dplyr)
bench::mark(check=FALSE,
"Jon Spring" = {d |>
left_join(tidyr::pivot_longer(data, -L, names_to = "fli"),
join_by(fli, closest(quantity >= L)))},
GKi = cbind(d, value=data[cbind(findInterval(d$quantity, data$L), match(d$fli, names(data)))])
)
expression min median itr/sec
mem_alloc gc/sec
n_itr n_gc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#1 Jon Spring 5.22ms 5.29ms 188. 21.4KB 8.45 89 4
#2 GKi 198.77µs 207.86µs 4742. 480B 10.3 2313 5
In this case GKi is about 25 times faster than Jon Spring and allocates less memory.
</details>
# 答案2
**得分**: 0
从我之前回答你的问题中进行修改,这似乎是一个小的修订。(如果你意识到问题不清楚或不是你想要问的内容,及时编辑你的问题是最佳实践。)
```R
library(dplyr) # v1.1.0+
d |>
left_join(pivot_longer(data, -L, names_to = "fli"),
join_by(fli, closest(quantity >= L)))
结果
quantity fli L value
1 300 fli.1 0 0.00
2 368 fli.1 0 0.00
3 568 fli.4 500 0.12
4 20 fli.5 0 0.00
5 1000 fli.2 1000 0.21
6 37659 fli.2 20000 1.26
7 45000 fli.5 20000 1.46
8 2500 fli.1 2000 0.40
9 4500 fli.2 3000 0.84
10 78453 fli.2 50000 1.89
11 1200 fli.3 1000 0.22
12 1589 fli.4 1000 0.23
英文:
Modifying from my answer to your prior question to which this seems like a minor revision. (It's best practice to promptly edit your question if you realize it's unclear or not what you meant to ask.)
library(dplyr) # v1.1.0+
d |>
left_join(pivot_longer(data, -L, names_to = "fli"),
join_by(fli, closest(quantity >= L)))
Result
quantity fli L value
1 300 fli.1 0 0.00
2 368 fli.1 0 0.00
3 568 fli.4 500 0.12
4 20 fli.5 0 0.00
5 1000 fli.2 1000 0.21
6 37659 fli.2 20000 1.26
7 45000 fli.5 20000 1.46
8 2500 fli.1 2000 0.40
9 4500 fli.2 3000 0.84
10 78453 fli.2 50000 1.89
11 1200 fli.3 1000 0.22
12 1589 fli.4 1000 0.23
答案3
得分: 0
以下是翻译好的部分:
这里,我提供了一个稍微不同的方法。
首先,我将“data”从宽格式转换为长格式,并重命名列以便后续合并表格时使用。
library(reshape2)
data_DT <- melt(data, id = "L")
names(data_DT) <- c("L", "fli", "value")
然后,我将“quantity”分成不同组,使用基于“data”中的“L”值的分割点。这些组将类似于“(0,500]”、“(500,1000]”等。使用简单的正则表达式匹配,我可以获得下限值,这将用于与第一个表格合并。
library(data.table)
d_DT <- data.table(d)
d_DT[, quantity_group := cut(quantity, c(data[, "L"], Inf))]
d_DT[, L := as.numeric(gsub("^.","",gsub(",.*","",quantity_group)))]
d_DT <- merge(d_DT, data_DT, by = c("L", "fli"))
英文:
Here, I provide a slightly different approach.
First, I changed the data
from wide format to long format, and renamed the columns for merging the tables later.
library(reshape2)
data_DT <- melt(data, id = "L")
names(data_DT) <- c("L", "fli", "value")
Then, I divide the quantity
into groups, using breakpoints based on the values of L
in data
. The groups will be something like (0,500]
, (500,1000]
, and so on. Using simple regex matching, I can then obtain the value of the lower bound, this will be used to merge with the first table.
library(data.table)
d_DT <- data.table(d)
d_DT[, quantity_group := cut(quantity, c(data[, "L"], Inf))]
d_DT[, L := as.numeric(gsub("^.", "", gsub(",.*", "", quantity_group)))]
d_DT <- merge(d_DT, data_DT, by = c("L", "fli"))
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论