AWK: print ALL rows with MAX value in one field Per the other field including Identical Rows with Max value AND multiple columns
Question
I appreciate the many solutions that contributors promptly provided for my previous question!!! (https://stackoverflow.com/questions/75392860/awk-print-all-rows-with-max-value-in-one-field-per-the-other-field-including-id/75401868#75401868)
This question uses data with one more column. I'd like to keep the rows with the highest value in column 2 for each value of column 1, including identical rows that share the max value, and print all columns.
Data
a 130 data1
a 55 data2
a 66 data3
b 88 data4
b 99 data5
b 99 data6
c 110 data7
c 130 data8
c 130 data9
Desired output
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
The code from @jared_mamrot works perfectly and prints all columns.
awk 'NR==FNR{if($2 > max[$1]){max[$1]=$2}; next} max[$1] == $2' file file
The code @Andre Wildberg provided also works perfectly and prints all columns.
awk 'arr[$1] < $2{arr[$1] = $2}
arr[$1] == $2{n[$1,arr[$1]]++; line[$1,arr[$1],n[$1,arr[$1]]] = $0}
END{for(i in arr){
j=0; do{j++; print line[i,arr[i],j]} while(j < n[i,arr[i]])}}' file
The awk script below by @Ed Morton also works perfectly for my previous data with 2 columns, but it prints only two columns: key and val.
My further question: when the data has more columns, how should I modify this script to print all of them?
sort file | awk '
{ cnt[$1,$2]++; max[$1]=$2 }
END { for (key in max) { val=max[key]; for (i=1; i<=cnt[key,val]; i++) print key, val } }
'
Thank you all for the great help!!!
Answer 1
Score: 4
Using any awk and sort:
$ sort -k1,1 -k2,2nr file | awk '!seen[$1]++{max=$2} $2==max'
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
or:
$ sort -k1,1 -k2,2nr file | awk '$1!=prev{prev=$1; max=$2} $2==max'
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
Original script, before realising I'd over-thought it:
$ sort -k1,1 -k2,2nr file | awk '!seen[$1]++{key=$1; max=$2} $1==key && $2==max'
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
The value of seen[$1]++ is 0 the first time any given value of $1 appears in the input, and some incremental non-zero number when that same $1 appears again. So, the value of !seen[$1]++ is 1 (i.e. true in a conditional context) the first time a given $1 is seen in the input, and 0 (false) afterwards. So, the first time a appears as $1 we set key to a and max to whatever value $2 has, i.e. 130 in this case. That's it for the involvement of !seen["a"]++.
From then on we just print every line for which $1 is a and $2 is 130, which in this case is just the first line of input.
Then the same happens when b is first seen as $1.
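To make the !seen[$1]++ behaviour described above concrete, here is a minimal sketch that prints the value of that expression in front of every input line. The function name flag_first_seen is made up for this illustration and is not part of the answer.

```shell
# flag_first_seen (hypothetical helper): prefix each line with the value of
# !seen[$1]++, i.e. 1 the first time a given $1 appears and 0 afterwards.
flag_first_seen() {
  awk '{ first = !(seen[$1]++); print first, $0 }' "$@"
}

# Example run on a few rows of the sample data:
printf 'a 130 data1\na 55 data2\nb 99 data5\nb 99 data6\n' | flag_first_seen
# 1 a 130 data1
# 0 a 55 data2
# 1 b 99 data5
# 0 b 99 data6
```

Because the input is sorted with the largest $2 first within each key, the line flagged 1 is also the line that carries that key's maximum.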
Answer 2
Score: 3
You just need one additional associative array that stores each whole line, keyed by the first two columns plus a running counter kept in the cnt array:
awk '{
map[$1,$2,++cnt[$1,$2]] = $0
max[$1] = ($2 > max[$1] ? $2 : max[$1])
}
END {
for (key in max) {
val = max[key]
for (i=1; i<=cnt[key,val]; i++)
print map[key,val,i]
}
}' file
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
There is no need to sort the file for this awk solution.
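One caveat with any END loop of the form for (key in max): awk does not specify the order in which it visits array keys, so the groups may not come out in input order on every awk. If that matters, a possible variation (a sketch only, not the answerer's code) is to record each key the first time it appears. The function name max_rows_in_order and the order/nkeys variables are assumptions introduced here; the +0 is added to force a numeric comparison.

```shell
# max_rows_in_order (hypothetical helper): unsorted input, groups printed in
# the order their key ($1) first appears.
max_rows_in_order() {
  awk '
    !($1 in max)         { order[++nkeys] = $1; max[$1] = $2 }  # first time this key is seen
    $2 + 0 > max[$1] + 0 { max[$1] = $2 }                       # new numeric maximum for the key
                         { map[$1, $2, ++cnt[$1, $2]] = $0 }    # keep every line, indexed by key, value, counter
    END {
      for (k = 1; k <= nkeys; k++) {
        key = order[k]; val = max[key]
        for (i = 1; i <= cnt[key, val]; i++)
          print map[key, val, i]
      }
    }
  ' "$@"
}
```

It can be called as max_rows_in_order file, or fed from a pipe. Like the script above, it holds every input line in memory until END.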
Answer 3
Score: 2
Assuming there may be more than 3 fields to a row:
$ cat file
a 130 data1
a 55 data2
a 66 data3
b 88 data4
b 99 data5
b 99 data6
c 110 data7
c 130 data8
c 130 data9 data10 data11
One idea for modifying the current awk code:
awk '
{ key=$1; val=$2                      # save 1st two fields
  $1=$2=""                            # clear 1st two fields
  gsub(/^[[:space:]]+/,"")            # remove leading white space from line
  ++cnt[key,val]
  max[key]=(val+0 > max[key]+0 ? val : max[key])   # +0 forces a numeric comparison
  row[key,val,cnt[key,val]]=$0        # save rest of line
}
END { for (key in max) {
          val=max[key]
          for (i=1; i<=cnt[key,val]; i++)
              print key, val, row[key,val,i]
      }
}
' file
This generates:
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9 data10 data11
Answer 4
Score: 2
awk '
$1 != firstcol{ firstcol=$1; max=$2; map[NR]=$0 }
$1 == firstcol{
if($2>max){ map[NR--]=$0; max=$2 }
if($2==max) map[NR]=$0
}
END{
for(i in map) print map[i]
}
' inputfile
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
Answer 5
Score: 1
The same Ruby works with minor adjustments:
ruby -e '
grps=$<.read.split(/\R/).
group_by{|line| line[/^\S+/]}
# {"a"=>["a 130 data1", "a 55 data2", "a 66 data3"], "b"=>["b 88 data4", "b 99 data5", "b 99 data6"], "c"=>["c 110 data7", "c 130 data8", "c 130 data9"]}
maxes=grps.map{|k,v| v.max_by{|s| s.split[1].to_f}}.map{|s| s.split[0..1] }
# [["a", "130"], ["b", "99"], ["c", "130"]]
grps.values.flatten.each{|s| puts s if maxes.include?(s.split[0..1])}
' file
Prints:
a 130 data1
b 99 data5
b 99 data6
c 130 data8
c 130 data9
Once you start getting into 3 or more columns to manage, it is easier to use Ruby (or Perl, Python, etc.) because of the support for slicing, grouping and joining arrays.