我需要在多个文件中对一列中的所有数值进行求和。

huangapple go评论46阅读模式
英文:

I need to sum all the values in a column across multiple files

问题

我有一个包含多个 CSV 文本文件的目录,每个文件都是单行格式为:

'field1,field2,field3,560'

我需要输出目录中所有文件第四个字段的总和(可能有数百或数千个文件)。举例来说:

file1.txt
field1,field2,field3,560

file2.txt
field1,field2,field3,415

file3.txt
field1,field2,field3,672

输出应该是:
1647

我尝试了几种不同的方法,最有希望的是我在这里找到的一个 awk 命令,是对另一个用户问题的回答。它并不能完全满足我的需求,而且我是 awk 的新手,不太确定如何修改才能让它适用于我的目的:

awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt

这条命令能正确输出 '975'。

然而,如果我尝试传递第三个文件,它不是将三个文件的第四个字段相加,而是先将文件1加到文件2,然后将文件1加到文件3:

awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt file3.txt
975
1232

是否有人能展示我如何修改这个 awk 语句以接受超过两个文件,或者更理想的是,因为有成千上万个文件需要求和,如何使用 * 来输出目录中所有文件第四个字段的总和?

谢谢你的时间和帮助。

英文:

I have a directory with multiple csv text files, each with a single line in the format:

field1,field2,field3,560

I need to output the sum of the fourth field across all files in a directory (can be hundreds or thousands of files). So for an example of:

file1.txt
field1,field2,field3,560

file2.txt
field1,field2,field3,415

file3.txt
field1,field2,field3,672

The output would simply be:
1647

I've been trying a few different things, with the most promising being an awk command that I found here in response to another user's question. It doesn't quite do what I need it to do, and I am an awk newb so I'm unsure how to modify it to work for my purpose:

awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]:' file1.txt file2.txt

This correctly outputs 975.

However if I try pass it a 3rd file, rather than add field 4 from all 3 files, it adds file1 to file2, then file1 to file3:

awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]:' file1.txt file2.txt file3.txt
975
1232

Can anyone show me how I can modify this awk statement to accept more than two files or, ideally because there are thousands of files to sum up, an * to output the sum of the fourth field of all files in the directory?

Thank you for your time and assistance.

答案1

得分: 2

这是翻译好的代码部分:

这里我们开始(在连接中不需要测试`NR==FNR`):

$ cat file{1,2,3}.txt | awk -F, '{count+=$4}END{print count}'
1647

或者同样的 🇹🇭(不浪费一些管道(s)):

$ awk -F, '{count+=$4}END{print count}' file{1,2,3}.txt
1647
英文:

Here we go (no need to test NR==FNR in a concatenation):

$ cat file{1,2,3}.txt | awk -F, '{count+=$4}END{print count}' 
1647

Or same-same 🇹🇭 (without wasting some pipe(s)):

$ awk -F, '{count+=$4}END{print count}' file{1,2,3}.txt
1647

答案2

得分: 2

关于当前代码的一些问题:

  • NR==FNR 用于指示对第一个文件进行特殊处理;在这种情况下,对于第一个文件没有什么需要特殊处理的(即,所有文件都要进行相同的处理)
  • 使用数组(例如,a[NR])来维护一组值;在这种情况下,您只需要维护一个全局值,所以不需要数组

由于您只寻找一个全局总和,更简单的代码应该足够:

$ awk -F',' '{sum+=$4} END {print sum+0}' file{1..3}.txt
1647

注意:

  • 在(不太可能的情况下?)所有文件都为空的情况下,sum 将未定义,因此 print sum 将显示一个空链接;sum+0 确保如果 sum 保持未定义(即,所有文件都为空),我们会打印 0
  • 对于不定数量的文件,file{1..3}.txt 可以替换为匹配所需文件集的任何模式,例如 file*.txt*.txt 等。
英文:

A couple issues with the current code:

  • NR==FNR is used to indicate special processing for the 1st file; in this case there is no processing that is 'special' for just the 1st file (ie, all files are to be processed the same)
  • an array (eg, a[NR]) is used to maintain a set of values; in this case you only have one global value to maintain so there is no need for an array

Since you're only looking for one global sum, a bit more simpler code should suffice:

$ awk -F',' '{sum+=$4} END {print sum+0}' file{1..3}.txt
1647

NOTES:

  • in the (unlikely?) case all files are empty, sum will be undefined so print sum will display a blank link; sum+0 insures we print 0 if sum remains undefined (ie, all files are empty)
  • for a variable number of files file{1..3}.txt can be replaced with whatever pattern will match on the desired set of files, eg, file*.txt, *.txt, etc

答案3

得分: 0

$ perl -MList::Util=sum0 -F, -lane'push @a,$F[3];END{print sum0 @a}' file{1..3}.txt
1647

$ perl -F, -lane'push @a,$F[3];END{foreach(@a){ $sum +=$_ };print "$sum"}' file{1..3}.txt
1647

英文:
$ perl -MList::Util=sum0 -F, -lane'push @a,$F[3];END{print sum0 @a}' file{1..3}.txt
1647

$ perl -F, -lane'push @a,$F[3];END{foreach(@a){ $sum +=$_ };print "$sum"}' file{1..3}.txt
1647

答案4

得分: 0

$ cut -d, -f4 file{1..3}.txt | paste -sd+ - | bc
1647

英文:
$ cut -d, -f4 file{1..3}.txt | paste -sd+ - | bc
1647

答案5

得分: 0

代码部分已被排除,以下是您要翻译的内容:

"The paste method is only viable for smaller inputs. When row count gets somewhat larger, like the example below of 34,999,999, compared to awk,

  • gnu-paste is ~2.83x slower
  • bsd-paste is ~3.72x slower

** adding perl for completeness ::: perl 5.36.0 is ~1.54x slower

perl -MList::Util=sum0 -F, -lane'push @a,$_;END{print sum0 @a}'

__='34999999'

( time ( jot "$__" | pvE0 | mawk2 'END { print +_ }{ _+=$__ }' ) )

sleep 1

( time ( jot "$__" | pvE0 | LC_ALL=C paste -sd+ - | bc ) )

sleep 1

( time ( jot "$__" | pvE0 | LC_ALL=C gpaste -sd+ - | bc ) )

|"

"在处理较小的输入时,paste 方法才是可行的。当行数变得相对较大,比如下面的示例中的 34,999,999,与 awk 相比,

  • gnu-paste 较慢,大约2.83倍的速度
  • bsd-paste 较慢,大约3.72倍的速度

** 为完整性添加 perl ::: perl 5.36.0 较慢,大约1.54倍的速度

perl -MList::Util=sum0 -F, -lane'push @a,$_;END{print sum0 @a}'

__='34999999'

( time ( jot "$__" | pvE0 | mawk2 'END { print +_ }{ _+=$__ }' ) )

等待 1 秒

( time ( jot "$__" | pvE0 | LC_ALL=C paste -sd+ - | bc ) )

等待 1 秒

( time ( jot "$__" | pvE0 | LC_ALL=C gpaste -sd+ - | bc ) )

|"

英文:

The paste method is only viable for smaller inputs. When row count gets somewhat larger, like the example below of 34,999,999, compared to awk,

  • gnu-paste is ~2.83x slower
  • bsd-paste is ~3.72x slower

** adding perl for completeness ::: perl 5.36.0 is ~1.54x slower

perl -MList::Util=sum0 -F, -lane'push @a,$_;END{print sum0 @a}'

__='34999999'

( time ( jot "$__" | pvE0 | mawk2 'END { print +_ }{ _+=$__ }' ) )

sleep 1

( time ( jot "$__" | pvE0 | LC_ALL=C paste -sd+ - | bc ) )

sleep 1

( time ( jot "$__" | pvE0 | LC_ALL=C gpaste -sd+ - | bc ) )

|

      in0:  289MiB 0:00:06 [47.9MiB/s] [47.9MiB/s] [<=> ]
     1	612499982500000

( jot "$__" | pvE 0.1 in0 | mawk2 'END { print +_ }{ _+=$__ }'; )

  8.51s user 0.22s system 143% cpu 6.072 total

      in0:  289MiB 0:00:11 [25.5MiB/s] [25.5MiB/s] [ <=> ]
     1	612499982500000

( jot "$__" | pvE 0.1 in0 | LC_ALL=C paste -sd+ - | bc; ) 

27.87s user 1.15s system 128% cpu 22.594 total

      in0:  289MiB 0:00:06 [48.3MiB/s] [48.3MiB/s] [ <=> ]
     1	612499982500000

( jot "$__" | pvE 0.1 in0 | LC_ALL=C gpaste -sd+ - | bc; )

17.14s user 1.10s system 106% cpu 17.185 total

huangapple
  • 本文由 发表于 2023年2月18日 00:53:35
  • 转载请务必保留本文链接:https://go.coder-hub.com/75487127.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定