AWK: print ALL rows with MAX value in one field Per the other field including Identical Rows with Max value AND multiple columns

huangapple go评论61阅读模式
英文:

AWK: print ALL rows with MAX value in one field Per the other field including Identical Rows with Max value AND multiple columns

问题

I appreciate that I was touched with many solutions promptly from many contributors!!! (https://stackoverflow.com/questions/75392860/awk-print-all-rows-with-max-value-in-one-field-per-the-other-field-including-id/75401868#75401868)

This question include data with one more column and I'd like to keep the rows with highest value in column 2 per column 1 including identical rows with max value from the data containing multiple columns, and print all columns.

Data

a	130	data1
a	55	data2
a	66	data3
b	88	data4
b	99	data5
b	99	data6
c	110	data7
c	130	data8
c	130	data9

Desired output

a	130	data1
b	99	data5
b	99	data6
c	130	data8
c	130	data9

Code from @jared_mamrot works perfectly and print out all columns.

Code @Andre Wildberg provided also works perfectly and print out all columns.

       arr[$1] == $2{n[$1,arr[$1]]++; line[$1,arr[$1],n[$1,arr[$1]]] = $0}
       END{for(i in arr){
             j=0; do{j++; print line[i,arr[i],j]} while(j < n[i,arr[i]])}}' file

The awk script below by @Ed Morton also works perfectly for my previous data with 2 columns. It prints two columns; key and val.

My further question is when I have multiple columns in data, how should I modify this script to print all columns.

    { cnt[$1,$2]++; max[$1]=$2 }
    END { for (key in max) { val=max[key]; for (i=1; i<=cnt[key,val]; i++) print key, val } }
' 

Thank you all for great helps!!!

英文:

I appreciate that I was touched with many solutions promptly from many contributors!!! (https://stackoverflow.com/questions/75392860/awk-print-all-rows-with-max-value-in-one-field-per-the-other-field-including-id/75401868#75401868)

This question include data with one more column and I'd like to keep the rows with highest value in column 2 per column 1 including identical rows with max value from the data containing multiple columns, and print all columns.

Data

a	130	data1
a	55	data2
a	66	data3
b	88	data4
b	99	data5
b	99	data6
c	110	data7
c	130	data8
c	130	data9

Desired output

a	130	data1
b	99	data5
b	99	data6
c	130	data8
c	130	data9

Code from @jared_mamrot works perfectly and print out all columns.

awk &#39;NR==FNR{if($2 &gt; max[$1]){max[$1]=$2}; next} max[$1] == $2&#39; file file

Code @Andre Wildberg provided also works perfectly and print out all columns.

awk &#39;arr[$1] &lt; $2{arr[$1] = $2}
       arr[$1] == $2{n[$1,arr[$1]]++; line[$1,arr[$1],n[$1,arr[$1]]] = $0}
       END{for(i in arr){
             j=0; do{j++; print line[i,arr[i],j]} while(j &lt; n[i,arr[i]])}}&#39; file

The awk script below by @Ed Morton also works perfectly for my previous data with 2 columns. It prints two columns; key and val.

My further question is when I have multiple columns in data, how should I modify this script to print all columns.

sort file | awk &#39;
    { cnt[$1,$2]++; max[$1]=$2 }
    END { for (key in max) { val=max[key]; for (i=1; i&lt;=cnt[key,val]; i++) print key, val } }
&#39; 

Thank you all for great helps!!!

答案1

得分: 4

使用任何awk和sort命令:

$ sort -k1,1 -k2,2nr file | awk '!seen[$1]++{max=$2} $2==max'
a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

或者:

$ sort -k1,1 -k2,2nr file | awk '$1!=prev{prev=$1; max=$2} $2==max'
a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

在实际运行之前的原始脚本:

$ sort -k1,1 -k2,2nr file | awk '!seen[$1]++{key=$1; max=$2} $1==key && $2==max'
a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

seen[$1]++的值在输入中首次出现给定的$1值时为0,并在再次出现相同的$1时递增为非零值。因此,!seen[$1]++的值在条件上下文中为1(即true),第一次看到给定的$1时,之后为0(false)。因此,当第一次出现a作为$1时,我们将key设置为amax设置为$2的值,即在这种情况下为130。这就是!seen["a"]++的作用。

然后,从那时开始,我们只打印$1为a且$2为130的每一行,这在这种情况下只是输入的第一行。

然后,当b首次出现为$1时,发生相同的情况。

英文:

Using any awk and sort:

$ sort -k1,1 -k2,2nr file | awk &#39;!seen[$1]++{max=$2} $2==max&#39;
a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

or:

$ sort -k1,1 -k2,2nr file | awk &#39;$1!=prev{prev=$1; max=$2} $2==max&#39;
a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

original script before realising I'd over-thought it:

$ sort -k1,1 -k2,2nr file | awk &#39;!seen[$1]++{key=$1; max=$2} $1==key &amp;&amp; $2==max&#39;
a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

The value of seen[$1]++ is 0 the first time any given value of $1 appears in the input, and some incremental non-zero number when that same $1 appears again. So, the value of !seen[$1]++ is 1 (i.e. true in a conditional context) the first time a given $ is seen in the input, and 0 (false) afterwards. So, the first time a appears as $1 we set key to a and max to whatever value $2 has, i.e. 130 in this case. That's it for the involvement of !seen[&quot;a&quot;]++.

From then on we just print every line for which $1 is a and $2 is 130, which in this case is just the first line of input.

Then the same happens when b is first seen as $1.

答案2

得分: 3

只需添加一个关联数组,将第3列作为值,前两列作为键,并在cnt变量中计算一个运行计数器:

awk '{
   map[$1,$2,++cnt[$1,$2]] = $0
   max[$1] = ($2 > max[$1] ? $2 : max[$1])
}
END {
   for (key in max) {
      val = max[key]
      for (i=1; i<=cnt[key,val]; i++)
         print map[key,val,i]
   }
}' 文件名

a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

这个awk解决方案不需要对文件进行排序。

英文:

You just need one additional associative array to store 3rd column as value and key as first 2 columns and a running counter being computer in cnt variable:

awk &#39;{
   map[$1,$2,++cnt[$1,$2]] = $0
   max[$1] = ($2 &gt; max[$1] ? $2 : max[$1])
}
END {
   for (key in max) {
      val = max[key]
      for (i=1; i&lt;=cnt[key,val]; i++)
         print map[key,val,i]
   }
}&#39; file

a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

There is no need to sort the file for this awk solution.

答案3

得分: 2

假设一行可能有超过3个字段:

$ cat file
a   130 data1
a   55  data2
a   66  data3
b   88  data4
b   99  data5
b   99  data6
c   110 data7
c   130 data8
c   130 data9 data10 data11

修改当前的 awk 代码的一个想法:

awk '
    { key=$1; val=$2                    # 保存前两个字段
      $1=$2=""                          # 清除前两个字段
      gsub(/^[[:space:]]+/,"")          # 移除行首空格

      ++cnt[key,val]
      max[key]=(val > max[key] ? val : max[key])
      row[key,val,cnt[key,val]]=$0      # 保存剩余的行
    }
END { for (key in max) { 
          val=max[key]
          for (i=1; i<=cnt[key,val]; i++) 
              print key, val, row[key,val,i]
      }
    }
'  file

生成的结果是:

a 66 data3
b 99 data5
b 99 data6
c 130 data8
c 130 data9 data10 data11
英文:

Assuming there may be more than 3 fields to a row:

$ cat file
a   130 data1
a   55  data2
a   66  data3
b   88  data4
b   99  data5
b   99  data6
c   110 data7
c   130 data8
c   130 data9 data10 data11

One idea for modifying the current awk code:

awk &#39;
    { key=$1; val=$2                    # save 1st two fields
      $1=$2=&quot;&quot;                          # clear 1st two fields
      gsub(/^[[:space:]]+/,&quot;&quot;)          # remove leading white space from line

      ++cnt[key,val]
      max[key]=(val &gt; max[key] ? val : max[key])
      row[key,val,cnt[key,val]]=$0      # save rest of line
    }
END { for (key in max) { 
          val=max[key]
          for (i=1; i&lt;=cnt[key,val]; i++) 
              print key, val, row[key,val,i]
      }
    }
&#39;  file

This generates:

a 66 data3
b 99 data5
b 99 data6
c 130 data8
c 130 data9 data10 data11

答案4

得分: 2

awk '
    $1 != firstcol{ firstcol=$1; max=$2; map[NR]=$0 } 
    $1 == firstcol{
        if($2>max){ map[NR--]=$0; max=$2 }
        if($2==max) map[NR]=$0
    }
    END{
        for(i in map) print map[i]
    }
' inputfile

a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9
英文:
awk &#39;
    $1 != firstcol{ firstcol=$1; max=$2; map[NR]=$0 } 
    $1 == firstcol{
        if($2&gt;max){ map[NR--]=$0; max=$2 }
        if($2==max) map[NR]=$0
    }
    END{
        for(i in map) print map[i]
    }
&#39; inputfile

a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

答案5

得分: 1

相同的 Ruby 代码只需进行细微调整:

ruby -e '
grps=$<.read.split(/\R/).
	group_by{|line| line[/^\S+/]}
# {"a"=>["a   130 data1", "a   55  data2", "a   66  data3"], "b"=>["b   88  data4", "b   99  data5", "b   99  data6"], "c"=>["c   110 data7", "c   130 data8", "c   130 data9"]}

maxes=grps.map{|k,v| v.max_by{|s| s.split[1].to_f}}.map{|s| s.split[0..1] }
# [["a", "130"], ["b", "99"], ["c", "130"]}

grps.values.flatten.each{|s| puts s if maxes.include?(s.split[0..1])}
' file

输出结果:

a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

一旦你需要处理三列或更多列的数据,使用 Ruby(或 Perl、Python 等)更容易,因为它支持切片、分组和连接数组。

英文:

The same ruby works with minor adjustments:

ruby -e &#39;
grps=$&lt;.read.split(/\R/).
	group_by{|line| line[/^\S+/]}
# {&quot;a&quot;=&gt;[&quot;a   130 data1&quot;, &quot;a   55  data2&quot;, &quot;a   66  data3&quot;], &quot;b&quot;=&gt;[&quot;b   88  data4&quot;, &quot;b   99  data5&quot;, &quot;b   99  data6&quot;], &quot;c&quot;=&gt;[&quot;c   110 data7&quot;, &quot;c   130 data8&quot;, &quot;c   130 data9&quot;]}

maxes=grps.map{|k,v| v.max_by{|s| s.split[1].to_f}}.map{|s| s.split[0..1] }
# [[&quot;a&quot;, &quot;130&quot;], [&quot;b&quot;, &quot;99&quot;], [&quot;c&quot;, &quot;130&quot;]]

grps.values.flatten.each{|s| puts s if maxes.include?(s.split[0..1])}
&#39; file 

Prints:

a   130 data1
b   99  data5
b   99  data6
c   130 data8
c   130 data9

Once you start getting into 3 or more columns to manage, it is easier to use ruby (or Perl, Python, etc) because of the support for slicing, grouping and joining arrays.

huangapple
  • 本文由 发表于 2023年2月10日 03:59:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/75403832.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定