在另一个数据框中匹配行和列中的数值。

huangapple go评论72阅读模式
英文:

match values in both row and columns from another dataframe

问题

I understand your request, but let's first clarify what you want to do in code. You'd like to create a new column in the "d" dataframe. This new column should be populated with values from the "L" dataframe based on the values in the "quantity" column of "d" and the corresponding "fli" values. Is that correct?

英文:
    L <- c(0, 500, 1000, 2000, 3000, 5000, 10000, 20000, 50000);
fli.1 <- c(0, 0.1, 0.2, 0.4, 0.8, 0.9, 1, 1.2, 1.8);
fli.2 <- c(0, 0.11, 0.21, 0.42, 0.84, 0.95, 1.05, 1.26, 1.89);
fli.3 <- c(0, 0.11, 0.22, 0.44, 0.88, 0.99, 1.1, 1.32, 1.98);
fli.4 <- c(0, 0.12, 0.23, 0.46, 0.93, 1.04, 1.16, 1.39, 2.08);
fli.5 <- c(0, 0.12, 0.24, 0.49, 0.97, 1.09, 1.22, 1.46, 2.19);
data <- data.frame(L, fli.1, fli.2, fli.3, fli.4, fli.5);


d <- data.frame(quantity = c(300, 368, 568, 20, 1000, 37659, 45000, 2500, 4500, 78453, 1200, 1589), fli = c("fli.1", "fli.1", "fli.4", "fli.5", "fli.2", "fli.2", "fli.5", "fli.1", "fli.2", "fli.2", "fli.3", "fli.4"));

i need to create another column in the dataframe d such that for each of its entry it takes value from the table L.
it should select row which is less than the quantity.
it should select column based on the fli.
for e.g. 37659 it would be 8th row and 2nd column which is 1.26.

I have tried using matrix, but it takes too much time. note that it is sample data i need to apply it to a very large dataset.

答案1

得分: 1

以下是您要翻译的内容:

"如 Ldata 中已排序,您可以使用 findInterval 获取行,使用 match 获取列,然后使用 cbind 组合这些索引并用它们来子集 data

d$value <-
  data[cbind(findInterval(d$quantity, data$L), match(d$fli, names(data)))]

d
#   quantity   fli value
#1       300 fli.1  0.00
#2       368 fli.1  0.00
#3       568 fli.4  0.12
#4        20 fli.5  0.00
#5      1000 fli.2  0.21
#6     37659 fli.2  1.26
#7     45000 fli.5  1.46
#8      2500 fli.1  0.40
#9      4500 fli.2  0.84
#10    78453 fli.2  1.89
#11     1200 fli.3  0.22
#12     1589 fli.4  0.23
```"
---
"性能测试
```R
library(dplyr)

bench::mark(check=FALSE,
"Jon Spring" = {d |>
  left_join(tidyr::pivot_longer(data, -L, names_to = "fli"), 
            join_by(fli,  closest(quantity >= L)))},
GKi = cbind(d, value=data[cbind(findInterval(d$quantity, data$L), match(d$fli, names(data)))]))
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 Jon Spring   5.22ms   5.29ms      188.    21.4KB     8.45    89     4
#2 GKi        198.77µs 207.86µs     4742.      480B    10.3   2313     5
```"
在这种情况下,GKi 比 Jon Spring 快约 25 倍,并且分配的内存较少。

<details>
<summary>英文:</summary>

As *L* is sorted in *data* you can use `findInterval` to get the row and `match` for the column, `cbind` the indices and use them to subset *data*.

d$value <-
data[cbind(findInterval(d$quantity, data$L), match(d$fli, names(data)))]

d

quantity fli value

#1 300 fli.1 0.00
#2 368 fli.1 0.00
#3 568 fli.4 0.12
#4 20 fli.5 0.00
#5 1000 fli.2 0.21
#6 37659 fli.2 1.26
#7 45000 fli.5 1.46
#8 2500 fli.1 0.40
#9 4500 fli.2 0.84
#10 78453 fli.2 1.89
#11 1200 fli.3 0.22
#12 1589 fli.4 0.23

---
Benchmark

library(dplyr)

bench::mark(check=FALSE,
"Jon Spring" = {d |>
left_join(tidyr::pivot_longer(data, -L, names_to = "fli"),
join_by(fli, closest(quantity >= L)))},
GKi = cbind(d, value=data[cbind(findInterval(d$quantity, data$L), match(d$fli, names(data)))])
)

expression min median itr/sec mem_alloc gc/sec n_itr n_gc

<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>

#1 Jon Spring 5.22ms 5.29ms 188. 21.4KB 8.45 89 4
#2 GKi 198.77µs 207.86µs 4742. 480B 10.3 2313 5

In this case GKi is about 25 times faster than Jon Spring and allocates less memory.

</details>



# 答案2
**得分**: 0

从我之前回答你的问题中进行修改,这似乎是一个小的修订。(如果你意识到问题不清楚或不是你想要问的内容,及时编辑你的问题是最佳实践。)

```R
library(dplyr) # v1.1.0+
d |>
  left_join(pivot_longer(data, -L, names_to = "fli"), 
            join_by(fli,  closest(quantity >= L)))

结果

   quantity   fli     L value
1       300 fli.1     0  0.00
2       368 fli.1     0  0.00
3       568 fli.4   500  0.12
4        20 fli.5     0  0.00
5      1000 fli.2  1000  0.21
6     37659 fli.2 20000  1.26
7     45000 fli.5 20000  1.46
8      2500 fli.1  2000  0.40
9      4500 fli.2  3000  0.84
10    78453 fli.2 50000  1.89
11     1200 fli.3  1000  0.22
12     1589 fli.4  1000  0.23
英文:

Modifying from my answer to your prior question to which this seems like a minor revision. (It's best practice to promptly edit your question if you realize it's unclear or not what you meant to ask.)

library(dplyr) # v1.1.0+
d |&gt;
  left_join(pivot_longer(data, -L, names_to = &quot;fli&quot;), 
            join_by(fli,  closest(quantity &gt;= L)))

Result

   quantity   fli     L value
1       300 fli.1     0  0.00
2       368 fli.1     0  0.00
3       568 fli.4   500  0.12
4        20 fli.5     0  0.00
5      1000 fli.2  1000  0.21
6     37659 fli.2 20000  1.26
7     45000 fli.5 20000  1.46
8      2500 fli.1  2000  0.40
9      4500 fli.2  3000  0.84
10    78453 fli.2 50000  1.89
11     1200 fli.3  1000  0.22
12     1589 fli.4  1000  0.23

答案3

得分: 0

以下是翻译好的部分:

这里,我提供了一个稍微不同的方法。

首先,我将“data”从宽格式转换为长格式,并重命名列以便后续合并表格时使用。

library(reshape2)

data_DT <- melt(data, id = "L")
names(data_DT) <- c("L", "fli", "value")

然后,我将“quantity”分成不同组,使用基于“data”中的“L”值的分割点。这些组将类似于“(0,500]”、“(500,1000]”等。使用简单的正则表达式匹配,我可以获得下限值,这将用于与第一个表格合并。

library(data.table)

d_DT <- data.table(d)
d_DT[, quantity_group := cut(quantity, c(data[, "L"], Inf))]
d_DT[, L := as.numeric(gsub("^.","",gsub(",.*","",quantity_group)))]

d_DT <- merge(d_DT, data_DT, by = c("L", "fli"))
英文:

Here, I provide a slightly different approach.

First, I changed the data from wide format to long format, and renamed the columns for merging the tables later.

library(reshape2)

data_DT &lt;- melt(data, id = &quot;L&quot;)
names(data_DT) &lt;- c(&quot;L&quot;, &quot;fli&quot;, &quot;value&quot;)

Then, I divide the quantity into groups, using breakpoints based on the values of L in data. The groups will be something like (0,500], (500,1000], and so on. Using simple regex matching, I can then obtain the value of the lower bound, this will be used to merge with the first table.

library(data.table)

d_DT &lt;- data.table(d)
d_DT[, quantity_group := cut(quantity, c(data[, &quot;L&quot;], Inf))]
d_DT[, L := as.numeric(gsub(&quot;^.&quot;, &quot;&quot;, gsub(&quot;,.*&quot;, &quot;&quot;, quantity_group)))]

d_DT &lt;- merge(d_DT, data_DT, by = c(&quot;L&quot;, &quot;fli&quot;))

huangapple
  • 本文由 发表于 2023年6月13日 12:47:33
  • 转载请务必保留本文链接:https://go.coder-hub.com/76461765.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定