在AWK中带有字段标题的类似Countif的函数

huangapple go评论55阅读模式
英文:

Countif like function in AWK with field headers

问题

我正在寻找一种方法来计算 CSV 文件中字段范围内值出现的次数,类似于 Excel 中的 COUNTIF,尽管我希望如果可能的话使用 awk 命令。

因此,第6列应该有值的范围,第7列应该有该值在第7列中出现的次数,如下所示:

awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ

awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ

我想要的是:

f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

要实现这个目标,您可以使用以下 awk 命令:

awk -F, -v OFS=, 'NR==1{ print $0, "count" }
NR>1{ count[$6]++ }
{ print $0, count[$6] }' file3

这将在第一行添加一个名为 "count" 的新字段,并在每行的最后一列显示相应的计数值。

英文:

I am looking for a way of counting the number of times a value in a field appears in a range of fields in a csv file much the same as countif in excel although I would like to use an awk command if possible.

So column 6 should have the range of values and column 7 would have the times the value appears in column 7, as per below

>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ

>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ

What i want is:

f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

#adds field name count that I want:

 awk -F, -v OFS=, 'NR==1{ print $0, "count"}
 NR>1{ print $0}' file3

Ho do I get the output I want?

I have tried this from previous/similar question but no joy,

>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c
展开收缩
]}}' file3 row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ, , , , , ,

very similar question to this one
similar python related Q, for my ref

答案1

得分: 1

你没有正确复制链接问题中的代码。为什么要将 l[NR] 改为 l[NR>1] 呢?另一方面,你应该将 s[1] 改为 s[6],因为第六个字段包含了你要计数的键:

awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]", "c

展开收缩
]}}'

你还可以输出具有新字段名称的标题:

awk -F, -vOFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i],c

展开收缩
]}}'

英文:

You did not copy the code from the linked question properly. Why change l[NR] to l[NR&gt;1] at all? On the other hand, you should change s[1] to s[6] since it's the sixth field that has the key you're counting:

awk -F, &#39;NR&gt;1{c[$6]++;l[NR]=$0}END{for(i=0;i++&lt;NR;){split(l[i],s,&quot;,&quot;);print l[i]&quot;,&quot;c
展开收缩
]}}&#39;

You can also output the header with the new field name:

awk -F, -vOFS=, &#39;NR==1{print $0,&quot;count&quot;}NR&gt;1{c[$6]++;l[NR]=$0}END{for(i=0;i++&lt;NR;){split(l[i],s,&quot;,&quot;);print l[i],c
展开收缩
]}}&#39;

答案2

得分: 1

我会使用 GNU AWK 来完成这个任务,如下所示,假设 file.txt 的内容如下:

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ

然后运行以下代码:

awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt

输出如下:

f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

解释:这是一个两遍的方法,因此 file.txt 出现了两次。我告诉 GNU AWK , 是字段分隔符 (FS) 也是输出字段分隔符 (OFS),然后对于第一行(标题行),我将其 print 出来,后跟 count,并指示 GNU AWK 前往下一行,因此对于第一行没有做其他处理。然后在第一遍遍历时,即全局行号 (NR) 等于文件中的行号 (FNR) 时,我统计第六个字段中的值的出现次数,并将它们作为数组 arr 中的值存储,然后指示 GNU AWK 前往下一行,因此在此遍历中没有做其他处理。在第二遍遍历中,对于所有大于 1 的行 (FNR>1),我 print 整行 ($0),后跟数组 arr 中相应的值。

(在 GNU Awk 5.0.1 中测试过)

英文:

I would harness GNU AWK for this task following way, let file.txt content be

f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ

then

awk &#39;BEGIN{FS=OFS=&quot;,&quot;}NR==1{print $0,&quot;count&quot;;next}FNR==NR{arr[$6]+=1;next}FNR&gt;1{print $0,arr[$6]}&#39; file.txt file.txt

gives output

f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

Explanation: this is two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both field separator (FS) and output field separator (OFS), then for first line (header) I print it followed by count and instruct GNU AWK to go to next line, so nothing other is done regarding 1st line, then for first pass, i.e. where global number of line (NR) is equal to number of line in file (FNR) I count number of occurences of values in 6th field and store them as values in array arr, then instruct GNU AWK to get to next line, so onthing other is done in this pass. During second pass for all lines after 1st (FNR&gt;1) I print whole line ($0) followed by corresponding value from array arr

(tested in GNU Awk 5.0.1)

答案3

得分: 1

以下是您要求的代码部分的中文翻译:

一种 `awk` 的想法:

    awk '
    BEGIN { FS=OFS="," }                        # 将输入/输出字段分隔符定义为逗号
          { lines[NR]=$0
            if (NR==1) next
            col6[NR]=$6                         # 复制字段6,这样我们就不必在END块中解析lines[]的内容
            cnt[$6]++
          }
    END   { for (i=1;i<=NR;i++)
                print lines[i], (i==1 ? "count" : cnt[col6[i]] )
          }
    ' file3

这将生成:

    f1,f2,f3,f4,f5,test,count
    row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
    row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
    row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
    row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
    row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
英文:

One awk idea:

awk &#39;
BEGIN { FS=OFS=&quot;,&quot; }                        # define input/output field delimiters as comma
      { lines[NR]=$0
        if (NR==1) next
        col6[NR]=$6                         # copy field 6 so we do not have to parse the contents of lines[] in the END block
        cnt[$6]++
      }
END   { for (i=1;i&lt;=NR;i++)
            print lines[i], (i==1 ? &quot;count&quot; : cnt[col6[i]] )
      }
&#39; file3

This generates:

f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

huangapple
  • 本文由 发表于 2023年1月9日 11:05:47
  • 转载请务必保留本文链接:https://go.coder-hub.com/75052833.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定