英文:
Better way to find duplicate entire rows and flag minor differences within single R dataframe?
问题
我可以帮你翻译这段内容:
我有一个大型数据集,近20万行和20列(数字和字符串混合数据)。每行都有一个唯一的标识符。少于100行具有重复的标识符。我试图确定两件事:
- 如果具有重复标识符的每行在所有20列中都有相同的重复值,还是只有一些列有相同的值(标识符至少有1列)。
- 对于具有不在所有列中具有相同值的重复标识符的每组行,确定哪些列具有不同的值。
我已经看过一些其他的Stack Overflow帖子,但它们通常讨论的是列的重复,而不是行,并且通常是在数据帧之间比较,而不是在它们内部比较。
这是一个小数据示例。
请注意,有org_id值(a、b、c),其中a和b是重复的。对于org_id a,所有列中的值都是重复的,但对于org_id b,情况不同。
首先,我需要另一个数据帧来告诉我哪些org_id具有重复的行值,比如:
exact_dup | dup_orgs |
---|---|
TRUE | a |
FALSE | b |
到目前为止,我将行数据合并为一个长字符串进行比较。下面的代码有效,但似乎有些笨拙。您对如何改进这一点有什么建议吗?
创建每行的长字符串:
df$x <- apply(df, 1, paste0, collapse="|")
用于识别数据中具有重复标识符的精确重复行的占位数据帧:
review_dups <- data.frame(exact_dup = NA, dup_orgs = df |>
filter(duplicated(org_id) == TRUE) |>
distinct(org_id) |>
pull(org_id))
循环查找差异:
for(i in 1:nrow(review_dups)){
n <- df |>
filter(org_id == review_dups$dup_orgs[i]) |>
select(x) |>
count(x) |>
pull(n) |>
max()
dup_rows <- df |>
filter(org_id == review_dups$dup_orgs[i]) |>
nrow()
review_dups[i,1] <- n==dup_rows
rm(n, dup_rows)
}
rm(i)
查看结果:
review_dups
其次,我需要一种报告那些对于具有重复org_id的行不匹配其余行的数据列的方法。
因此,输出应告诉我org_id b的score列和name列不同。
最好,报告值可以出现为上面示例中的重复检查数据帧的第三个结果列,但我也可以接受不同的报告选项。
我尚没有这部分的代码解决方案。
谢谢!
英文:
I have a large dataset of nearly 200,000 rows and 20 columns (mix of numeric and string data). Each row has a unique identifier. Less than 100 rows have duplicate identifiers. I am trying to determine 2 things:
- If each row with a duplicate identifier has the exact same duplicate values across all 20 columns or just some columns (min 1 column for the identifier).
- For each set of rows with duplicate identifiers that do not have the same values across all columns, identify which columns have different values.
I've seen several other SO posts, but they usually discuss duplicates by columns, not rows and/or compare across data frames instead of within them.
Here is a small data example.<br/>
Note there are org_id values (a,b,c), of which a & b are duplicates. All values across the columns are duplicate for org_id a, but not org_id b.
# load toy data
df <- data.frame(org_id=c("a","a","b","b","b","c"),
thing=c("1","1","1","1","2","1"),
name=c("really_long_A_name_here", "really_long_A_name_here", "really_long_B_name_here", "really_long_B2_name_here", "really_long_B_name_here", "really_long_C_name_here"),
start=c("2020-10-31", "2020-10-31", "2022-09-17", "2022-09-17", "2022-09-17", "2023-05-11") )
df
org_id | score | name | start |
---|---|---|---|
a | 1 | really_long_A_name_here | 2020-10-31 |
a | 1 | really_long_A_name_here | 2020-10-31 |
b | 1 | really_long_B_name_here | 2022-09-17 |
b | 1 | really_long_B2_name_here | 2022-09-17 |
b | 2 | really_long_B_name_here | 2022-09-17 |
c | 1 | really_long_C_name_here | 2023-05-11 |
Here is an example of what I need:
First, another data frame to tell me which org_id have duplicate row values, such as:
exact_dup | dup_orgs |
---|---|
TRUE | a |
FALSE | b |
So far, I combined row data into a long string for comparison. The code below works but seems clunky. Any suggestions on how to improve this?
# create a long string for each row
df$x <- apply(df, 1, paste0, collapse="|")
# placeholder dataframe to identify which uplicate rows in the data are exact duplicates across the entire row
review_dups <- data.frame(exact_dup = NA, dup_orgs = df |> filter(duplicated(org_id)==TRUE) |> distinct(org_id)|> pull(org_id) )
# loop to find differences
for(i in 1:nrow(review_dups)){
n <- df |> filter(org_id == review_dups$dup_orgs[i]) |> select(x) |> count(x) |> pull(n) |> max()
dup_rows <- df |> filter(org_id == review_dups$dup_orgs[i]) |> nrow()
review_dups[i,1] <- n==dup_rows
rm(n, dup_rows)
}
rm(i)
# view results
review_dups
Second, I need a way to report those data columns that don't match the rest of the rows for rows with duplicate org_id.<br/>
So, the output should tell me the org_id b score column and name column differ.<br/>
Preferably, the report values could appear as a third results column on the duplicate check data frame example above, but I'm open to different reporting options.<br/>
I don't yet have a code solution for this part.
Thanks!
答案1
得分: 1
我推测你只关心将第二行及其后的行与第一行进行比较,而不是进行完整的成对差异比较。
这会在原始数据框中添加一列,该列包含逗号分隔的列名。
df$dupe_differences <- unlist(by(df, df$org_id, function(dat) {
if (nrow(dat) == 1) return(NA)
c("", sapply(2:nrow(dat), function(i) {
same <- mapply(Negate(`%in%`), dat[1,], dat[i,])
paste(names(same[same]), collapse = ",")
}))
}))
df
# org_id thing name start dupe_differences
# 1 a 1 really_long_A_name_here 2020-10-31
# 2 a 1 really_long_A_name_here 2020-10-31
# 3 b 1 really_long_B_name_here 2022-09-17
# 4 b 1 really_long_B2_name_here 2022-09-17 name
# 5 b 2 really_long_B_name_here 2022-09-17 thing
# 6 c 1 really_long_C_name_here 2023-05-11 <NA>
区分是明确的:
NA
表示没有重复项,该行是唯一的""
(空字符串)表示其内容与该重复集的第一行相同- 其他任何内容都列出了与该重复集的第一行不同的列名(逗号分隔)
从这里,你可以使用 is.na(.)
(对于没有重复项),!is.na(.) & !nzchar(.)
(对于与第一行相同的行,包括第一行),以及 !is.na(.) & nzchar(.)
(对于具有差异的重复行)轻松地筛选出你想要的特定行。
英文:
I'm inferring that you're only concerned with comparing the 2nd and subsequent rows with the 1st row, not a complete pairwise set of differences.
This adds a column to the original frame that gives a comma-separated list of column names.
df$dupe_differences <- unlist(by(df, df$org_id, function(dat) {
if (nrow(dat) == 1) return(NA)
c("", sapply(2:nrow(dat), function(i) {
same <- mapply(Negate(`%in%`), dat[1,], dat[i,])
paste(names(same[same]), collapse = ",")
}))
}))
df
# org_id thing name start dupe_differences
# 1 a 1 really_long_A_name_here 2020-10-31
# 2 a 1 really_long_A_name_here 2020-10-31
# 3 b 1 really_long_B_name_here 2022-09-17
# 4 b 1 really_long_B2_name_here 2022-09-17 name
# 5 b 2 really_long_B_name_here 2022-09-17 thing
# 6 c 1 really_long_C_name_here 2023-05-11 <NA>
The distinction is unambiguous:
NA
means there are no duplicates, that row is unique""
(empty string) means that its contents are identical with the first row of that dupe-set- anything else lists the column names (comma-separated) that are different from the first row of that dupe-set
From here, you can easily filter out the specific rows you want using is.na(.)
(for no-dupes), !is.na(.) & !nzchar(.)
for rows identical to the first row (including the first row), and !is.na(.) & nzchar(.)
for rows that are duplicates with differences.
通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库,让每个人都能够通过互相帮助和分享经验来进步。
评论