使用if/then循环和amatch或match来查找相似值并匹配两个数据框列?

huangapple go评论68阅读模式
英文:

Use if/then for loop and amatch or match to find similar values and match two dataframe columns?

问题

我有两个数据框,一个包含原始数据标签,另一个包含数据需要匹配到的正确调整值。标签是数值型的,但可以相差最多+/- 2。我正试图想出如何编写一个带有if/then循环的代码,因为amatch对于数值类型的数据不适用。目标是编写一个循环,对于原始数据中的每个值,它将与校正数据中的值进行比较,并匹配到最接近的值(如果存在的话),其中校正值 - 原始值在-2和2之间。我已经粘贴了我的尝试但非常不起作用的尝试如下。

我的想法是可能可以使用amatch在找到匹配时选择最佳/最接近的匹配,因为在某些情况下,有几个数据标签值很接近,而在其他情况下,数字会有最多2的偏移。

是否有办法编写这样的代码,或者是否有更好的方法来实现这个目标?目标是将校正列与原始数据标签匹配,然后可以将其与原始数据和附加到校正标签的其他元数据合并,但在我的标签完整列表中,只有大约60%的标签需要进行此调整匹配(例如,您可以在示例数据中看到,1910应该匹配到1911,而2056需要匹配到2057)。由于数据的性质,差异不是一致的,我希望这个功能可以应用于不仅仅是单个数据实例,以便我不必手动匹配每个数据标签。

raw <- c(1419, 1444, 1460, 1485, 1501, 1542, 1581, 1590, 
         1606, 1622, 1647, 1663, 1688, 1704, 1743, 1791, 
         1793, 1809, 1850, 1866, 1891, 1905, 1910, 1954, 
         1956, 1976, 1996, 2012, 2028, 2041, 2053, 2056, 
         2067, 2100, 2102, 2122)

corrected <- c(1419, 1444, 1460, 1485, 1501, 1542, 1562, 
               1581, 1590, 1606, 1622, 1630, 1647, 1663, 
               1688, 1704, 1743, 1792, 1793, 1809, 1825, 
               1834, 1850, 1866, 1891, 1905, 1911, 1914, 
               1938, 1954, 1955, 1971, 1976, 1996, 2012, 
               2019, 2028, 2053, 2057, 2100, 2101, 2122)

labelmatch <- function(x, y) {
  data.frame(glycan = x, glycan_name = sapply(x, function(xi) {
    closest_match <- y[which.min(abs(y - xi))]
    if (abs(closest_match - xi) <= 2) {
      return(closest_match)
    } else {
      return(NA)
    }
  }))
}

labelmatch(raw, corrected)

上述代码定义了一个函数labelmatch,它将原始数据raw与校正数据corrected进行匹配,然后创建一个包含匹配结果的数据框。希望这对你有所帮助。

英文:

I have two dataframes, one with raw data labels and one with the correct adjusted values the data needs to be matched to. The labels are numeric but can differ up to +/- 2. I am trying to figure out how to write a coded if/then loop since amatch does not work well for numerics. The goal is to have a loop where for every value in the raw data, it will check against the values in the corrected data and match to the closest match if one is present where corrected - raw is between -2 and 2. I have pasted my attempted but very much nonfunctional attempt below.

My thought is that it may be possible to use amatch to select the best/closest match when one is found, since in some cases there are several data label values close together while in others there is a shift in the number up to 2.

Is there a way to write such a code, or another better way to accomplish this? The goal is to have a corrected column matched to the raw data labels that I can then use to merge with the raw data and the additional metadata attached to the corrected labels, but for my full list of labels only about 60% match without needing this adjustment (you can see in the sample data, for example, 1910 should match to 1911 and 2056 needs to match to 2057). Because of the nature of the data, the differences are not consistent and I want this to be a function that I can apply to more than just a single instance of data so that I do not have to go through and match every data label by hand.

raw &lt;- c(1419, 1444, 1460, 1485, 1501, 1542, 1581, 1590, 
         1606, 1622, 1647, 1663, 1688, 1704, 1743, 1791, 
         1793, 1809, 1850, 1866, 1891, 1905, 1910, 1954, 
         1956, 1976, 1996, 2012, 2028, 2041, 2053, 2056, 
         2067, 2100, 2102, 2122)

corrected &lt;- c(1419, 1444, 1460, 1485, 1501, 1542, 1562, 
               1581, 1590, 1606, 1622, 1630, 1647, 1663, 
               1688, 1704, 1743, 1792, 1793, 1809, 1825, 
               1834, 1850, 1866, 1891, 1905, 1911, 1914, 
               1938, 1954, 1955, 1971, 1976, 1996, 2012, 
               2019, 2028, 2053, 2057, 2100, 2101, 2122)


labelmatch &lt;- function(x, y) {data.frame(glycan=x, glycan_name=
                                            (for(i in 1:length(x)) {
                                              for(n in 1:length(y)) {
                                                if (n-i &lt;= 2 &amp; n-i &gt;=-2) {
                                                  match(x, y)} else{
                                                    if (n-i &gt;= 2 | n-i &lt;=-2){
                                                  next}}}}))
}

labelmatch(raw, corrected)

答案1

得分: 1

自从您的校正数据已排序,我们可以利用这一事实快速搜索向量。受到 np.searchsorted 的启发:

searchsorted <- function(findIn, vec, isSorted = TRUE){
  if(!isSorted) findIn <- sort(findIn)
  idx <- rank(c(vec, findIn, -Inf),, 'first')[seq_along(vec)] - rank(vec)
  right_vals <- findIn[idx]
  left_vals <- findIn[(idx2<-idx - 1) + !idx2]
  right_vals[na_idx] <- left_vals[na_idx<- is.na(right_vals)]
  right_vals[idx2] <- left_vals[idx2<- abs(right_vals - vec) > abs(left_vals - vec)]
  is.na(right_vals) <- abs(right_vals - vec) > 2
  right_vals
}

searchsorted(corrected, raw)
[1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704 1743
[16] 1792 1793 1809 1850 1866 1891 1905 1911 1954 1954 1976 1996 2012 2028   NA
[31] 2053 2057   NA 2100 2100 2122

编辑:

R提供了函数 findInterval,可以用来简化任务:

searchsorted <- function(x, vec){
  idx <- findInterval(x, vec, all.inside = TRUE)
  vals <- vec[idx]
  idx2 <- abs(vals - x) > 2
  vals2 <- vec[idx[idx2] + 1]
  is.na(vals2) <- vals2 - x[idx2] > 2
  replace(vals, idx2, vals2)
}

searchsorted(raw, corrected)
 [1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704
[15] 1743 1792 1793 1809 1850 1866 1891 1905 1911 1954 1955 1976 1996 2012
[29] 2028   NA 2053 2057   NA 2100 2101 2122
英文:

Since your corrected data is sorted, we can use that fact to quickly search through the vector. Inspired by np.searchsorted

searchsorted &lt;- function(findIn, vec, isSorted = TRUE){
  if(!isSorted) findIn &lt;- sort(findIn)
  idx &lt;- rank(c(vec, findIn, -Inf),, &#39;first&#39;)[seq_along(vec)] - rank(vec)
  right_vals &lt;- findIn[idx]
  left_vals &lt;- findIn[(idx2&lt;-idx - 1) + !idx2]
  right_vals[na_idx] &lt;- left_vals[na_idx&lt;- is.na(right_vals)]
  right_vals[idx2] &lt;- left_vals[idx2&lt;- abs(right_vals - vec) &gt; abs(left_vals - vec)]
  is.na(right_vals) &lt;- abs(right_vals - vec) &gt; 2
  right_vals
}
searchsorted(corrected, raw)
[1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704 1743
[16] 1792 1793 1809 1850 1866 1891 1905 1911 1954 1954 1976 1996 2012 2028   NA
[31] 2053 2057   NA 2100 2100 2122

--

Edit:

R does provide the function findInterval which could be used to simplify the task:

searchsorted &lt;- function(x, vec){
  idx &lt;- findInterval(x, vec, all.inside = TRUE)
  vals &lt;- vec[idx]
  idx2 &lt;- abs(vals - x) &gt; 2
  vals2 &lt;- vec[idx[idx2] + 1]
  is.na(vals2) &lt;- vals2 - x[idx2] &gt; 2
  replace(vals, idx2, vals2)
}

searchsorted(raw, corrected)
 [1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704
[15] 1743 1792 1793 1809 1850 1866 1891 1905 1911 1954 1955 1976 1996 2012
[29] 2028   NA 2053 2057   NA 2100 2101 2122

答案2

得分: 0

以下是您提供的代码的翻译部分:

要完成这个任务的一种方法是使用`expand.grid()`来扩展原始和校正选项的所有组合,然后保留差值在-2到2之间的匹配项。我还创建了一个向量,其中包含原始值没有匹配的情况,这样您可以手动查看这些值。

```R
library(dplyr)
labelmatch <- function(x, y) {
  expand.grid(x, y) %>%
    mutate(diff = Var2 - Var1) %>%
    filter(between(diff, -2, 2))
}

labels2 <- labelmatch(raw, corrected)

由于expand.grid可能计算成本较高,这个解决方案可能更有效率一些。首先,我创建了一个矩阵,计算了原始值和校正值之间的差值,然后找到了它们在-2和2之间的索引。从这里,我创建了一个数据框,其中数值在-2和2之间。我使用dplyr来保留匹配最接近原始值的唯一对。

x1 <- sapply(corrected, function(x) {
  x - raw
})
row.names(x1) <- raw
colnames(x1) <- corrected

s1 <- which(x1 <= 2 & x1 >= -2, arr.ind = TRUE)
x2 <- data.frame(rows = rownames(x1)[s1[, 1]], cols = colnames(x1)[s1[, 2]],
                 values = x1[s1])

rawunlabeld <- setdiff(raw, x2$rows)

x3 <- x2 %>%
  dplyr::group_by(rows) %>%
  dplyr::filter(abs(values) == min(abs(values))) %>%
  dplyr::distinct()

请注意,这只是代码的翻译部分,不包括任何额外的内容或回答。

英文:

One way to do this would be to expand all options of raw and corrected using expand.grid() and keeping the matches where the difference is between -2 and 2. I also create a vector where the raw values do not have a match so you can look at those values manually.

library(dplyr)
labelmatch &lt;- function(x,y){
expand.grid(x,y) %&gt;% 
        mutate(diff = Var2-Var1) %&gt;% 
        filter(between(diff,-2,2))
}

labels2 &lt;- labelmatch(raw,corrected)

With expand grid being to computationally expensive maybe this solution is a bit more efficient. first I make a matrix subtracting the raw and corrected values, then I find the indices where they are between -2, and 2, from there I make a data frame of the pairs where the values are between -2, and 2, I used dplyr to keep unique pairs where the match is closest to the raw value.

x1 &lt;- sapply(corrected, function(x){
  x - raw
})
row.names(x1) &lt;- raw
colnames(x1) &lt;- corrected

s1 &lt;- which(x1 &lt;= 2 &amp; x1 &gt;= -2, arr.ind = TRUE)
x2 &lt;- data.frame(rows =  rownames(x1)[s1[,1]], cols =  colnames(x1)[s1[,2]], 
           values = x1[s1])

rawunlabeld &lt;- setdiff(raw, x2$rows)

x3 &lt;- x2 %&gt;% 
      dplyr::group_by(rows) %&gt;% 
      dplyr::filter(abs(values) == min(abs(values))) %&gt;% 
      dplyr::distinct()

答案3

得分: 0

可能你可以尝试
&gt; raw * NA^(!colSums(abs(outer(corrected, raw, `-`)) &lt;= 2))
 [1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704 1743
[16] 1791 1793 1809 1850 1866 1891 1905 1910 1954 1956 1976 1996 2012 2028   NA
[31] 2053 2056   NA 2100 2102 2122
英文:

Probably you can try

&gt; raw * NA^(!colSums(abs(outer(corrected, raw, `-`)) &lt;= 2))
 [1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704 1743
[16] 1791 1793 1809 1850 1866 1891 1905 1910 1954 1956 1976 1996 2012 2028   NA
[31] 2053 2056   NA 2100 2102 2122

huangapple
  • 本文由 发表于 2023年7月7日 01:25:34
  • 转载请务必保留本文链接:https://go.coder-hub.com/76631219.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定