AWK command to get distinct values in a column based on the values in another column that are the same

Question

I want to find rows that have the same values in col1 and col2, and collect the distinct values of their third column.

Say I have this data:

City,Sunset,Anna
City,Sunset,Anna
City,Sunset,Ben
City,Sunset,Ben

My desired output is:

City,Sunset,"Anna,Ben"

So far, I have this code:

cat file.txt | sed 's/,/|/2' | awk -F"," '{if (!($1 in a)) a[$1] = $2; else a[$1]=a[$1] "," $2 } END { for (key in a) print key, a[key] }' OFS=,

But I'm only getting:

City,Sunset,Anna,Anna,Ben,Ben

when the output should be:

City,Sunset,"Anna,Ben"

Could someone help me with this one?

Answer 1

Score: 2

With GNU awk (for multi-dimensional arrays, output prefixed with -| ):

awk -F, '{ arr[$1 "," $2][$3] } END { for(i in arr) {
  s=""; for(j in arr[i]) s = s (s?",":"") j; print i ",\"" s "\""}}' Input_file
-| City,Sunset,"Ben,Anna"
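  • Declare the comma as the input field separator (-F,).
  • For each line, add an entry to array arr with keys $1 "," $2 and $3. Duplicates collapse because array keys are unique.
  • In the END block, loop over both indexes, concatenate the second-level indexes (j) into string s, separated by commas, then print the first index (i), a comma, and s between double quotes. The for (j in ...) loop visits keys in an unspecified order, which is why the sample output shows Ben before Anna.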
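If the input order should be preserved and GNU awk is not available, the same idea can be sketched with one-dimensional arrays only. This is a minimal illustration rather than the answer's own method: the file name sample.csv and the array names order, vals and seen are assumptions made up for the example.

```shell
# Hypothetical sample input; the two groups are deliberately interleaved.
printf '%s\n' \
  'City,Sunset,Anna' \
  'Town,Sunrise,Ben' \
  'City,Sunset,Anna' \
  'City,Sunset,Ben' \
  'Town,Sunrise,Phil' > sample.csv

out=$(awk -F, '
  {
    key = $1 FS $2
    # Remember the order in which each key first appears.
    if (!(key in vals)) order[++n] = key
    # Append $3 only the first time this (key, value) pair is seen.
    if (!seen[key, $3]++)
      vals[key] = (vals[key] == "" ? "" : vals[key] ",") $3
  }
  END {
    # Walk the keys by their numeric position, not with for-in.
    for (i = 1; i <= n; i++)
      printf "%s,\"%s\"\n", order[i], vals[order[i]]
  }' sample.csv)
printf '%s\n' "$out"
```

Because the END loop walks a numerically indexed array instead of using for (i in arr), both the groups and the values inside each group come out in first-seen order, and no prior sort is needed.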

With any POSIX-compliant awk, the simplest approach is probably to sort the input file first, so that all lines with the same first and second fields are consecutive:

sort Input_file | awk -F, '
  function foo() {s=""; for(i in arr) s=s (s?",":"") i
    print key ",\"" s "\""; delete arr}
  {key=$1 "," $2} NR>1 && key!=prv {foo()} {prv=key; arr[$3]} END {foo()}'

We use the same principle but with a one-dimensional array, and print whenever the first two fields change (key!=prv) or at the END, instead of printing everything at the END. Note the use of a function (foo) to factor out the printing code.
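A variation on the sorted approach, offered as a sketch under stated assumptions: if duplicates are removed by sort -u, awk needs no array at all, because each remaining line contributes a new value to its group. The file name sample.csv is hypothetical, and LC_ALL=C is set so that sorting is byte-wise and lines sharing the first two fields are guaranteed to end up adjacent.

```shell
# Hypothetical unsorted sample input with a repeated line.
printf '%s\n' \
  'Town,Sunrise,Phil' \
  'City,Sunset,Ben' \
  'City,Sunset,Anna' \
  'Town,Sunrise,Ben' \
  'City,Sunset,Anna' > sample.csv

out=$(LC_ALL=C sort -u sample.csv | awk -F, '
  { key = $1 "," $2 }
  # Key changed: print the finished group and start a new one.
  NR > 1 && key != prv { print prv ",\"" s "\""; s = "" }
  # Every line is unique after sort -u, so just append its third field.
  { s = s (s == "" ? "" : ",") $3; prv = key }
  # Flush the last group, unless the input was empty.
  END { if (NR > 0) print prv ",\"" s "\"" }')
printf '%s\n' "$out"
```

The trade-off is that values appear in sorted order rather than input order, but the output is deterministic, which the for (i in arr) loop does not guarantee.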

Answer 2

Score: 2

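awk -F, -v OFS=, '
    # Process each distinct input line only once
    !a[$0]++{
        # Append $3 to the list kept for the ($1,$2) pair
        b[$1","$2]=(b[$1","$2] ? b[$1","$2]","$3 : $3)
    }
    END{
        # Print each pair followed by its quoted list (group order is unspecified)
        for(i in b) print i,"\""b[i]"\""
    }
' file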

Answer 3

Score: 1


1st solution: With your shown samples only, please try the following code. It uses a combination of uniq and awk.

uniq Input_file |
awk '
BEGIN{
  FS=OFS=","
  s1="\""
}
{
  arr[$1 FS $2]=(arr[$1 FS $2]?arr[$1 FS $2] OFS:"") $NF
}
END{
  for(i in arr){
    print i,s1 arr[i] s1
  }
}
'
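The "shown samples only" caveat matters because uniq removes only adjacent duplicate lines. A small sketch of the difference follows; the file name sample.csv and its contents are made up for illustration.

```shell
# Hypothetical sample where the duplicate Anna lines are not adjacent.
printf '%s\n' 'City,Sunset,Anna' 'City,Sunset,Ben' 'City,Sunset,Anna' > sample.csv

# uniq compares each line only with the one before it, so nothing is dropped here.
after_uniq=$(uniq sample.csv | awk 'END { print NR }')

# An awk "seen" filter drops repeats wherever they occur.
after_seen=$(awk '!seen[$0]++' sample.csv | awk 'END { print NR }')

printf 'uniq kept %s lines, awk filter kept %s lines\n' "$after_uniq" "$after_seen"
```

If the real input may contain non-adjacent duplicates, the awk filter (or sorting first) is the safer choice.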

2nd solution: An awk-only solution. A first awk removes the duplicates and passes its output to the main awk as input (the <(...) process substitution requires a shell such as bash).

awk '
BEGIN{
  FS=OFS=","
  s1="\""
}
{
  arr[$1 FS $2]=(arr[$1 FS $2]?arr[$1 FS $2] OFS:"") $NF
}
END{
  for(i in arr){
    print i,s1 arr[i] s1
  }
}
' <(awk '!arr[$0]++' Input_file)

Answer 4

Score: 1

Using any awk:

$ cat tst.awk
BEGIN { FS=OFS="," }
{
    key = $1 FS $2
    val = $3
}
key != prev {
    if ( NR > 1 ) {
        print prev, "\"" vals "\""
    }
    vals = sep = ""
    prev = key
    delete seen
}
!seen[val]++ {
    vals = vals sep val
    sep = OFS
}
END {
    print prev, "\"" vals "\""
}

$ awk -f tst.awk file
City,Sunset,"Anna,Ben"

That will work if you have multiple different values in the first 2 input fields, and regardless of the order of the 3rd field values.

It will output values in the order they appear in the input and only stores the values for one key pair at a time rather than storing the whole file.

For example, given this input:

$ cat file
City,Sunset,Anna
City,Sunset,Ben
City,Sunset,Sue
City,Sunset,Ben
Town,Sunrise,Ben
Town,Sunrise,Phil

it will produce what I assume is the expected output:

$ awk -f tst.awk file
City,Sunset,"Anna,Ben,Sue"
Town,Sunrise,"Ben,Phil"

The above script assumes your input is grouped by the values of the first 2 fields. If it isn't, run sort -t, -k1,2 on it first.
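A sketch of that sort-first pipeline on ungrouped input. The file name ungrouped.csv is hypothetical, and the -s (stable) flag is an addition not mentioned in the answer: it is supported by GNU and BSD sort and keeps lines with equal keys in their input order, so the third-field values still appear in the order they were read.

```shell
# Hypothetical ungrouped sample input.
printf '%s\n' \
  'Town,Sunrise,Phil' \
  'City,Sunset,Ben' \
  'Town,Sunrise,Ben' \
  'City,Sunset,Anna' \
  'City,Sunset,Ben' > ungrouped.csv

# The tst.awk script from this answer, written out so the example is self-contained.
cat > tst.awk <<'EOF'
BEGIN { FS=OFS="," }
{ key = $1 FS $2; val = $3 }
key != prev {
    if ( NR > 1 ) print prev, "\"" vals "\""
    vals = sep = ""
    prev = key
    delete seen
}
!seen[val]++ { vals = vals sep val; sep = OFS }
END { print prev, "\"" vals "\"" }
EOF

# -t, sets the field separator; -k1,2 sorts on fields 1 through 2 only.
out=$(LC_ALL=C sort -s -t, -k1,2 ungrouped.csv | awk -f tst.awk)
printf '%s\n' "$out"
```

Without -s, sort falls back to comparing whole lines when the keys tie, so the values within each group would come out in sorted order instead.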

Answer 5

Score: 0


Given this example:

cat file
City,Sunset,Anna
City,Sunrise,Bob
City,Midday,Ellen
City,Sunset,Anna
City,Sunset,Ben
City,Sunset,Ben
City,Sunrise,Ben

Here is a Ruby one-liner to do that (-rset loads Set, which older Ruby versions do not load by default):

ruby -rset -F, -lane 'BEGIN{h=Hash.new { |hash, key| hash[key] = Set.new() }}
h[$F[0..1].join(",")]<<$F[2]
END{puts h.map{|k,v| "#{k},\"#{v.join(",")}\""}}' file

Prints:

City,Sunset,"Anna,Ben"
City,Sunrise,"Bob,Ben"
City,Midday,"Ellen"

huangapple
  • Posted on 2023-08-05 16:09:07
  • Source: https://go.coder-hub.com/76840695.html