AWK: print ALL rows with MAX value in one field Per the other field including Identical Rows with Max value AND multiple columns

huangapple go评论90阅读模式
英文:

AWK: print ALL rows with MAX value in one field Per the other field including Identical Rows with Max value AND multiple columns

问题

I appreciate that I was touched with many solutions promptly from many contributors!!! (https://stackoverflow.com/questions/75392860/awk-print-all-rows-with-max-value-in-one-field-per-the-other-field-including-id/75401868#75401868)

This question include data with one more column and I'd like to keep the rows with highest value in column 2 per column 1 including identical rows with max value from the data containing multiple columns, and print all columns.

Data

  1. a 130 data1
  2. a 55 data2
  3. a 66 data3
  4. b 88 data4
  5. b 99 data5
  6. b 99 data6
  7. c 110 data7
  8. c 130 data8
  9. c 130 data9

Desired output

  1. a 130 data1
  2. b 99 data5
  3. b 99 data6
  4. c 130 data8
  5. c 130 data9

Code from @jared_mamrot works perfectly and print out all columns.

Code @Andre Wildberg provided also works perfectly and print out all columns.

  1. arr[$1] == $2{n[$1,arr[$1]]++; line[$1,arr[$1],n[$1,arr[$1]]] = $0}
  2. END{for(i in arr){
  3. j=0; do{j++; print line[i,arr[i],j]} while(j < n[i,arr[i]])}}' file

The awk script below by @Ed Morton also works perfectly for my previous data with 2 columns. It prints two columns; key and val.

My further question is when I have multiple columns in data, how should I modify this script to print all columns.

  1. { cnt[$1,$2]++; max[$1]=$2 }
  2. END { for (key in max) { val=max[key]; for (i=1; i<=cnt[key,val]; i++) print key, val } }
  3. '

Thank you all for great helps!!!

英文:

I appreciate that I was touched with many solutions promptly from many contributors!!! (https://stackoverflow.com/questions/75392860/awk-print-all-rows-with-max-value-in-one-field-per-the-other-field-including-id/75401868#75401868)

This question include data with one more column and I'd like to keep the rows with highest value in column 2 per column 1 including identical rows with max value from the data containing multiple columns, and print all columns.

Data

  1. a 130 data1
  2. a 55 data2
  3. a 66 data3
  4. b 88 data4
  5. b 99 data5
  6. b 99 data6
  7. c 110 data7
  8. c 130 data8
  9. c 130 data9

Desired output

  1. a 130 data1
  2. b 99 data5
  3. b 99 data6
  4. c 130 data8
  5. c 130 data9

Code from @jared_mamrot works perfectly and print out all columns.

  1. awk &#39;NR==FNR{if($2 &gt; max[$1]){max[$1]=$2}; next} max[$1] == $2&#39; file file

Code @Andre Wildberg provided also works perfectly and print out all columns.

  1. awk &#39;arr[$1] &lt; $2{arr[$1] = $2}
  2. arr[$1] == $2{n[$1,arr[$1]]++; line[$1,arr[$1],n[$1,arr[$1]]] = $0}
  3. END{for(i in arr){
  4. j=0; do{j++; print line[i,arr[i],j]} while(j &lt; n[i,arr[i]])}}&#39; file

The awk script below by @Ed Morton also works perfectly for my previous data with 2 columns. It prints two columns; key and val.

My further question is when I have multiple columns in data, how should I modify this script to print all columns.

  1. sort file | awk &#39;
  2. { cnt[$1,$2]++; max[$1]=$2 }
  3. END { for (key in max) { val=max[key]; for (i=1; i&lt;=cnt[key,val]; i++) print key, val } }
  4. &#39;

Thank you all for great helps!!!

答案1

得分: 4

使用任何awk和sort命令:

  1. $ sort -k1,1 -k2,2nr file | awk '!seen[$1]++{max=$2} $2==max'
  2. a 130 data1
  3. b 99 data5
  4. b 99 data6
  5. c 130 data8
  6. c 130 data9

或者:

  1. $ sort -k1,1 -k2,2nr file | awk '$1!=prev{prev=$1; max=$2} $2==max'
  2. a 130 data1
  3. b 99 data5
  4. b 99 data6
  5. c 130 data8
  6. c 130 data9

在实际运行之前的原始脚本:

  1. $ sort -k1,1 -k2,2nr file | awk '!seen[$1]++{key=$1; max=$2} $1==key && $2==max'
  2. a 130 data1
  3. b 99 data5
  4. b 99 data6
  5. c 130 data8
  6. c 130 data9

seen[$1]++的值在输入中首次出现给定的$1值时为0,并在再次出现相同的$1时递增为非零值。因此,!seen[$1]++的值在条件上下文中为1(即true),第一次看到给定的$1时,之后为0(false)。因此,当第一次出现a作为$1时,我们将key设置为amax设置为$2的值,即在这种情况下为130。这就是!seen["a"]++的作用。

然后,从那时开始,我们只打印$1为a且$2为130的每一行,这在这种情况下只是输入的第一行。

然后,当b首次出现为$1时,发生相同的情况。

英文:

Using any awk and sort:

  1. $ sort -k1,1 -k2,2nr file | awk &#39;!seen[$1]++{max=$2} $2==max&#39;
  2. a 130 data1
  3. b 99 data5
  4. b 99 data6
  5. c 130 data8
  6. c 130 data9

or:

  1. $ sort -k1,1 -k2,2nr file | awk &#39;$1!=prev{prev=$1; max=$2} $2==max&#39;
  2. a 130 data1
  3. b 99 data5
  4. b 99 data6
  5. c 130 data8
  6. c 130 data9

original script before realising I'd over-thought it:

  1. $ sort -k1,1 -k2,2nr file | awk &#39;!seen[$1]++{key=$1; max=$2} $1==key &amp;&amp; $2==max&#39;
  2. a 130 data1
  3. b 99 data5
  4. b 99 data6
  5. c 130 data8
  6. c 130 data9

The value of seen[$1]++ is 0 the first time any given value of $1 appears in the input, and some incremental non-zero number when that same $1 appears again. So, the value of !seen[$1]++ is 1 (i.e. true in a conditional context) the first time a given $ is seen in the input, and 0 (false) afterwards. So, the first time a appears as $1 we set key to a and max to whatever value $2 has, i.e. 130 in this case. That's it for the involvement of !seen[&quot;a&quot;]++.

From then on we just print every line for which $1 is a and $2 is 130, which in this case is just the first line of input.

Then the same happens when b is first seen as $1.

答案2

得分: 3

只需添加一个关联数组,将第3列作为值,前两列作为键,并在cnt变量中计算一个运行计数器:

  1. awk '{
  2. map[$1,$2,++cnt[$1,$2]] = $0
  3. max[$1] = ($2 > max[$1] ? $2 : max[$1])
  4. }
  5. END {
  6. for (key in max) {
  7. val = max[key]
  8. for (i=1; i<=cnt[key,val]; i++)
  9. print map[key,val,i]
  10. }
  11. }' 文件名
  12. a 130 data1
  13. b 99 data5
  14. b 99 data6
  15. c 130 data8
  16. c 130 data9

这个awk解决方案不需要对文件进行排序。

英文:

You just need one additional associative array to store 3rd column as value and key as first 2 columns and a running counter being computer in cnt variable:

  1. awk &#39;{
  2. map[$1,$2,++cnt[$1,$2]] = $0
  3. max[$1] = ($2 &gt; max[$1] ? $2 : max[$1])
  4. }
  5. END {
  6. for (key in max) {
  7. val = max[key]
  8. for (i=1; i&lt;=cnt[key,val]; i++)
  9. print map[key,val,i]
  10. }
  11. }&#39; file
  12. a 130 data1
  13. b 99 data5
  14. b 99 data6
  15. c 130 data8
  16. c 130 data9

There is no need to sort the file for this awk solution.

答案3

得分: 2

假设一行可能有超过3个字段:

  1. $ cat file
  2. a 130 data1
  3. a 55 data2
  4. a 66 data3
  5. b 88 data4
  6. b 99 data5
  7. b 99 data6
  8. c 110 data7
  9. c 130 data8
  10. c 130 data9 data10 data11

修改当前的 awk 代码的一个想法:

  1. awk '
  2. { key=$1; val=$2 # 保存前两个字段
  3. $1=$2="" # 清除前两个字段
  4. gsub(/^[[:space:]]+/,"") # 移除行首空格
  5. ++cnt[key,val]
  6. max[key]=(val > max[key] ? val : max[key])
  7. row[key,val,cnt[key,val]]=$0 # 保存剩余的行
  8. }
  9. END { for (key in max) {
  10. val=max[key]
  11. for (i=1; i<=cnt[key,val]; i++)
  12. print key, val, row[key,val,i]
  13. }
  14. }
  15. ' file

生成的结果是:

  1. a 66 data3
  2. b 99 data5
  3. b 99 data6
  4. c 130 data8
  5. c 130 data9 data10 data11
英文:

Assuming there may be more than 3 fields to a row:

  1. $ cat file
  2. a 130 data1
  3. a 55 data2
  4. a 66 data3
  5. b 88 data4
  6. b 99 data5
  7. b 99 data6
  8. c 110 data7
  9. c 130 data8
  10. c 130 data9 data10 data11

One idea for modifying the current awk code:

  1. awk &#39;
  2. { key=$1; val=$2 # save 1st two fields
  3. $1=$2=&quot;&quot; # clear 1st two fields
  4. gsub(/^[[:space:]]+/,&quot;&quot;) # remove leading white space from line
  5. ++cnt[key,val]
  6. max[key]=(val &gt; max[key] ? val : max[key])
  7. row[key,val,cnt[key,val]]=$0 # save rest of line
  8. }
  9. END { for (key in max) {
  10. val=max[key]
  11. for (i=1; i&lt;=cnt[key,val]; i++)
  12. print key, val, row[key,val,i]
  13. }
  14. }
  15. &#39; file

This generates:

  1. a 66 data3
  2. b 99 data5
  3. b 99 data6
  4. c 130 data8
  5. c 130 data9 data10 data11

答案4

得分: 2

  1. awk '
  2. $1 != firstcol{ firstcol=$1; max=$2; map[NR]=$0 }
  3. $1 == firstcol{
  4. if($2>max){ map[NR--]=$0; max=$2 }
  5. if($2==max) map[NR]=$0
  6. }
  7. END{
  8. for(i in map) print map[i]
  9. }
  10. ' inputfile
  11. a 130 data1
  12. b 99 data5
  13. b 99 data6
  14. c 130 data8
  15. c 130 data9
英文:
  1. awk &#39;
  2. $1 != firstcol{ firstcol=$1; max=$2; map[NR]=$0 }
  3. $1 == firstcol{
  4. if($2&gt;max){ map[NR--]=$0; max=$2 }
  5. if($2==max) map[NR]=$0
  6. }
  7. END{
  8. for(i in map) print map[i]
  9. }
  10. &#39; inputfile
  11. a 130 data1
  12. b 99 data5
  13. b 99 data6
  14. c 130 data8
  15. c 130 data9

答案5

得分: 1

相同的 Ruby 代码只需进行细微调整:

  1. ruby -e '
  2. grps=$<.read.split(/\R/).
  3. group_by{|line| line[/^\S+/]}
  4. # {"a"=>["a 130 data1", "a 55 data2", "a 66 data3"], "b"=>["b 88 data4", "b 99 data5", "b 99 data6"], "c"=>["c 110 data7", "c 130 data8", "c 130 data9"]}
  5. maxes=grps.map{|k,v| v.max_by{|s| s.split[1].to_f}}.map{|s| s.split[0..1] }
  6. # [["a", "130"], ["b", "99"], ["c", "130"]}
  7. grps.values.flatten.each{|s| puts s if maxes.include?(s.split[0..1])}
  8. ' file

输出结果:

  1. a 130 data1
  2. b 99 data5
  3. b 99 data6
  4. c 130 data8
  5. c 130 data9

一旦你需要处理三列或更多列的数据,使用 Ruby(或 Perl、Python 等)更容易,因为它支持切片、分组和连接数组。

英文:

The same ruby works with minor adjustments:

  1. ruby -e &#39;
  2. grps=$&lt;.read.split(/\R/).
  3. group_by{|line| line[/^\S+/]}
  4. # {&quot;a&quot;=&gt;[&quot;a 130 data1&quot;, &quot;a 55 data2&quot;, &quot;a 66 data3&quot;], &quot;b&quot;=&gt;[&quot;b 88 data4&quot;, &quot;b 99 data5&quot;, &quot;b 99 data6&quot;], &quot;c&quot;=&gt;[&quot;c 110 data7&quot;, &quot;c 130 data8&quot;, &quot;c 130 data9&quot;]}
  5. maxes=grps.map{|k,v| v.max_by{|s| s.split[1].to_f}}.map{|s| s.split[0..1] }
  6. # [[&quot;a&quot;, &quot;130&quot;], [&quot;b&quot;, &quot;99&quot;], [&quot;c&quot;, &quot;130&quot;]]
  7. grps.values.flatten.each{|s| puts s if maxes.include?(s.split[0..1])}
  8. &#39; file

Prints:

  1. a 130 data1
  2. b 99 data5
  3. b 99 data6
  4. c 130 data8
  5. c 130 data9

Once you start getting into 3 or more columns to manage, it is easier to use ruby (or Perl, Python, etc) because of the support for slicing, grouping and joining arrays.

huangapple
  • 本文由 发表于 2023年2月10日 03:59:10
  • 转载请务必保留本文链接:https://go.coder-hub.com/75403832.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定