AWK command to get distinct values in a column based on the values in another column that are the same
Question
I want to find rows that have the same values in col1 and col2, and collect the distinct values of their third column.
Say I have this data:
City,Sunset,Anna
City,Sunset,Anna
City,Sunset,Ben
City,Sunset,Ben
My desired output is:
City,Sunset,"Anna,Ben"
So far, I have this code:
cat file.txt | sed 's/,/|/2' | awk -F"," '{if (!($1 in a)) a[$1] = $2; else a[$1]=a[$1] "," $2 } END { for (key in a) print key, a[key] }' OFS=,
But I'm only getting:
City,Sunset,Anna,Anna,Ben,Ben
when the output should be:
City,Sunset,"Anna,Ben"
Could someone help me with this one?
Answer 1
Score: 2
With GNU awk (for multi-dimensional arrays; output prefixed with -|):
awk -F, '{ arr[$1 "," $2][$3] } END { for(i in arr) {
s=""; for(j in arr[i]) s = s (s?",":"") j; print i ",\"" s "\""}}' Input_file
-| City,Sunset,"Ben,Anna"
- Declare the comma as the input field separator (-F,).
- For each line, add an entry in array arr with keys $1,$2 and $3.
- At the END, loop over the 2 indexes, concatenate the second (j) index into string s, separated with commas, then print the first index (i), a comma, and string s between double quotes.
With any POSIX-compliant awk, the simplest is probably to sort the input file first, such that all lines with the same first and second fields are consecutive:
sort Input_file | awk -F, '
# print the previous key: when foo() runs on a key change, key already holds the new one
function foo() {s=""; for(i in arr) s=s (s?",":"") i
print prv ",\"" s "\""; delete arr}
{key=$1 "," $2} NR>1 && key!=prv {foo()} {prv=key; arr[$3]} END {foo()}'
We use the same principle but with a one-dimensional array, and print when the first two fields change (key!=prv) or at the END, instead of printing everything at the END. Note the use of a function (foo) to factor out the printing code.
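A possible variation on the sort-first approach, offered as a sketch under stated assumptions rather than as the answer's method: since sort -u already drops duplicate lines, the awk part only has to join consecutive values and needs no array at all. The function name group_sorted is hypothetical, LC_ALL=C is assumed here to force byte-wise ordering, and the input is assumed to contain no quoted or embedded commas.

```shell
# group_sorted: hypothetical helper; takes file names, or reads stdin if none are given.
group_sorted() {
  LC_ALL=C sort -u "$@" | awk -F, '
    { key = $1 FS $2 }
    key != prev {                        # a new (col1, col2) group starts
      if (NR > 1) print prev ",\"" vals "\""
      vals = $3; prev = key; next
    }
    { vals = vals "," $3 }               # same group: append the next distinct value
    END { if (NR > 0) print prev ",\"" vals "\"" }
  '
}
```

The trade-off is that the values come out in sorted order rather than in input order, and the whole input is sorted before awk sees the first line.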
Answer 2
Score: 2
awk -F, -v OFS=, '
!a[$0]++{
b[$1","$2]=(b[$1","$2] ? b[$1","$2]","$3 : $3)
}
END{
for(i in b) print i,"\""b[i]"\""
}
' file
Answer 3
Score: 1
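1st solution: With your shown samples only, please try the following code. It uses a uniq + awk combination; note that uniq only removes adjacent duplicate lines, so this relies on the duplicates being consecutive, as they are in the samples.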
uniq Input_file |
awk '
BEGIN{
FS=OFS=","
s1="\""
}
{
arr[$1 FS $2]=(arr[$1 FS $2]?arr[$1 FS $2] OFS:"") $NF
}
END{
for(i in arr){
print i,s1 arr[i] s1
}
}
'
2nd solution: An awk-only solution. A first awk removes the duplicate lines, and its output is passed as input to the main awk (the <(...) process substitution needs a shell such as bash).
awk '
BEGIN{
FS=OFS=","
s1="\""
}
{
arr[$1 FS $2]=(arr[$1 FS $2]?arr[$1 FS $2] OFS:"") $NF
}
END{
for(i in arr){
print i,s1 arr[i] s1
}
}
' <(awk '!arr[$0]++' Input_file)
Answer 4
Score: 1
Using any awk:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
key = $1 FS $2
val = $3
}
key != prev {
if ( NR > 1 ) {
print prev, "\"" vals "\""
}
vals = sep = ""
prev = key
delete seen
}
!seen[val]++ {
vals = vals sep val
sep = OFS
}
END {
print prev, "\"" vals "\""
}
$ awk -f tst.awk file
City,Sunset,"Anna,Ben"
That will work if you have multiple different values in the first 2 input fields, and regardless of the order of the 3rd field values.
It will output values in the order they appear in the input and only stores the values for one key pair at a time rather than storing the whole file.
For example, given this input:
$ cat file
City,Sunset,Anna
City,Sunset,Ben
City,Sunset,Sue
City,Sunset,Ben
Town,Sunrise,Ben
Town,Sunrise,Phil
it will produce what I assume is the expected output:
$ awk -f tst.awk file
City,Sunset,"Anna,Ben,Sue"
Town,Sunrise,"Ben,Phil"
The above script assumes your input is grouped by the values of the first 2 fields. If it isn't, run sort -t, -k1,2 on it first.
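One detail worth noting when sorting first: a plain sort breaks ties by comparing whole lines, so the third-field values no longer come out in input order. A minimal sketch that keeps input order within each group, assuming a sort that supports the -s (stable) option, as GNU and BSD sort do; the wrapper name group_any_order is hypothetical, and the awk body is a condensed restatement of the script above with split("", seen) used to empty the array:

```shell
# group_any_order: hypothetical wrapper; takes file names, or reads stdin if none are given.
group_any_order() {
  LC_ALL=C sort -s -t, -k1,2 "$@" | awk '
    BEGIN { FS = OFS = "," }
    { key = $1 FS $2 }
    key != prev {                      # a new (col1, col2) group starts
      if (NR > 1) print prev, "\"" vals "\""
      vals = sep = ""; prev = key
      split("", seen)                  # empty the per-group set of seen values
    }
    !seen[$3]++ { vals = vals sep $3; sep = OFS }
    END { if (NR > 0) print prev, "\"" vals "\"" }
  '
}
```

With -s, lines that share the same first two fields keep their relative input order, so each group lists its values in the order they first appeared.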
Answer 5
Score: 0
Given this example:
cat file
City,Sunset,Anna
City,Sunrise,Bob
City,Midday,Ellen
City,Sunset,Anna
City,Sunset,Ben
City,Sunset,Ben
City,Sunrise,Ben
Here is a Ruby one-liner that does that:
ruby -rset -F, -lane 'BEGIN{h=Hash.new { |hash, key| hash[key] = Set.new() }}
h[$F[0..1].join(",")]<<$F[2]
END{puts h.map{|k,v| "#{k},\"#{v.join(",")}\""}}' file
Prints:
City,Sunset,"Anna,Ben"
City,Sunrise,"Bob,Ben"
City,Midday,"Ellen"