How to delete rows in R whose names look like another row's name

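Question

I want to drop rows whose name closely resembles another row's name, keeping only the row with the largest time in each group of similar names.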

    df1 <- data.frame(
      name = c(
        "A. MAHJUM-61365",
        "A. MAHJUM-61365. MAHJUM-61365",
        "A. RIZAL. AD-11002795",
        "A. RIZAL. AD-11002795. RIZAL. AD-11002795",
        "ABD. KADIR-60447",
        "ABD. KADIR-60447ABD. KADIR-60447",
        "ABD. KAHAR-62551",
        "ABD. RASYID DS-11002082",
        "ABDREAS APUNG @SANY",
        "ABDUL AZIS @HYUNDAI",
        "ABDUL AZIZ @HYUNDAY",
        "ABDUL AZIZ@HYUNDAI"
      ),
      time = c(100, 5, 40, 6, 55, 7, 90, 29, 100, 20, 100, 6)
    )

The desired data frame df2 is:

    df2 <- data.frame(
      name = c(
        "A. MAHJUM-61365",
        "A. RIZAL. AD-11002795",
        "ABD. KADIR-60447",
        "ABD. KAHAR-62551",
        "ABD. RASYID DS-11002082",
        "ABDREAS APUNG @SANY",
        "ABDUL AZIS @HYUNDAY"
      ),
      time = c(100, 40, 55, 90, 29, 100, 100)
    )

The data frame I end up with should look like df2.

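Answer 1

Score: 3

You can compute pairwise string distances with adist, group similar names with hclust and cutree, and then use ave to find the maximum time within each group.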

    # Partial edit distance between every pair of names
    x <- adist(df1$name, partial=TRUE)
    # Symmetrize, cluster, and cut the tree at height 2 to get group ids
    i <- cutree(hclust(as.dist(pmin(x, t(x)))), h=2)
    # Keep the rows whose time equals the maximum of their group
    df1[df1$time == ave(df1$time, i, FUN=max),]
    #                      name time
    #1          A. MAHJUM-61365  100
    #3    A. RIZAL. AD-11002795   40
    #5         ABD. KADIR-60447   55
    #7         ABD. KAHAR-62551   90
    #8  ABD. RASYID DS-11002082   29
    #9      ABDREAS APUNG @SANY  100
    #11     ABDUL AZIZ @HYUNDAY  100
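
To see why the answer uses partial=TRUE and pmin(x, t(x)), here is a minimal sketch on three of the names from df1. It is only an illustration, not part of the original answer, and the variable names (nm, full, part, d, d_sym, grp) are made up for this example.

```r
# Three names from df1: the second repeats the first, the third is unrelated.
nm <- c("A. MAHJUM-61365",
        "A. MAHJUM-61365. MAHJUM-61365",
        "ABD. KAHAR-62551")

# Full edit distance: every extra character in the longer name costs an edit.
full <- adist(nm[1], nm[2])[1, 1]

# Partial distance: nm[1] occurs verbatim inside nm[2], so the cost is 0.
part <- adist(nm[1], nm[2], partial = TRUE)[1, 1]

# With partial = TRUE the matrix is not symmetric (short-in-long is cheap,
# long-in-short is expensive), so take the element-wise minimum of the
# matrix and its transpose before converting it to a dist object.
d <- adist(nm, partial = TRUE)
d_sym <- pmin(d, t(d))

# Complete-linkage clustering (the hclust default), cut at height 2.
grp <- cutree(hclust(as.dist(d_sym)), h = 2)
grp
```

Here the first two names should land in the same group and the third in its own. The cut height of 2 is the tolerance taken from the answer; it may need tuning for other data. Note also that the `==` filter in the answer keeps every row that ties for a group's maximum time.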

Posted on April 4, 2023 at 15:47:40
Source: https://go.coder-hub.com/75926762.html