2023年2月18日 00:53:35go评论50阅读模式

英文:

I need to sum all the values in a column across multiple files

问题

我有一个包含多个 CSV 文本文件的目录，每个文件都是单行格式为：

'field1,field2,field3,560'

我需要输出目录中所有文件第四个字段的总和（可能有数百或数千个文件）。举例来说：

file1.txt
field1,field2,field3,560

file2.txt
field1,field2,field3,415

file3.txt
field1,field2,field3,672

输出应该是：
1647

我尝试了几种不同的方法，最有希望的是我在这里找到的一个 awk 命令，是对另一个用户问题的回答。它并不能完全满足我的需求，而且我是 awk 的新手，不太确定如何修改才能让它适用于我的目的：

awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt

这条命令能正确输出 '975'。

然而，如果我尝试传递第三个文件，它不是将三个文件的第四个字段相加，而是先将文件1加到文件2，然后将文件1加到文件3：

awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt file3.txt
975
1232

是否有人能展示我如何修改这个 awk 语句以接受超过两个文件，或者更理想的是，因为有成千上万个文件需要求和，如何使用 * 来输出目录中所有文件第四个字段的总和？

谢谢你的时间和帮助。

英文:

I have a directory with multiple csv text files, each with a single line in the format:

field1,field2,field3,560

I need to output the sum of the fourth field across all files in a directory (can be hundreds or thousands of files). So for an example of:

file1.txt
field1,field2,field3,560

file2.txt
field1,field2,field3,415

file3.txt
field1,field2,field3,672

The output would simply be:
1647

I've been trying a few different things, with the most promising being an awk command that I found here in response to another user's question. It doesn't quite do what I need it to do, and I am an awk newb so I'm unsure how to modify it to work for my purpose:

awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]:' file1.txt file2.txt

This correctly outputs 975.

However if I try pass it a 3rd file, rather than add field 4 from all 3 files, it adds file1 to file2, then file1 to file3:

awk -F&quot;,&quot; &#39;NR==FNR{a[NR]=$4;next}{print $4+a[FNR]:&#39; file1.txt file2.txt file3.txt
975
1232

Can anyone show me how I can modify this awk statement to accept more than two files or, ideally because there are thousands of files to sum up, an * to output the sum of the fourth field of all files in the directory?

Thank you for your time and assistance.

答案1

得分: 2

这是翻译好的代码部分：

这里我们开始（在连接中不需要测试`NR==FNR`）：

$ cat file{1,2,3}.txt | awk -F, '{count+=$4}END{print count}'
1647

或者同样的 &#127481;&#127469;（不浪费一些管道(s)）：

$ awk -F, '{count+=$4}END{print count}' file{1,2,3}.txt
1647

英文:

Here we go (no need to test NR==FNR in a concatenation):

$ cat file{1,2,3}.txt | awk -F, &#39;{count+=$4}END{print count}&#39; 
1647

Or same-same 🇹🇭 (without wasting some pipe(s)):

$ awk -F, &#39;{count+=$4}END{print count}&#39; file{1,2,3}.txt
1647

答案2

得分: 2

关于当前代码的一些问题：

NR==FNR 用于指示对第一个文件进行特殊处理；在这种情况下，对于第一个文件没有什么需要特殊处理的（即，所有文件都要进行相同的处理）
使用数组（例如，a[NR]）来维护一组值；在这种情况下，您只需要维护一个全局值，所以不需要数组

由于您只寻找一个全局总和，更简单的代码应该足够：

$ awk -F',' '{sum+=$4} END {print sum+0}' file{1..3}.txt
1647

注意：

在（不太可能的情况下？）所有文件都为空的情况下，sum 将未定义，因此 print sum 将显示一个空链接；sum+0 确保如果 sum 保持未定义（即，所有文件都为空），我们会打印 0
对于不定数量的文件，file{1..3}.txt 可以替换为匹配所需文件集的任何模式，例如 file*.txt，*.txt 等。

英文:

A couple issues with the current code:

NR==FNR is used to indicate special processing for the 1st file; in this case there is no processing that is 'special' for just the 1st file (ie, all files are to be processed the same)
an array (eg, a[NR]) is used to maintain a set of values; in this case you only have one global value to maintain so there is no need for an array

Since you're only looking for one global sum, a bit more simpler code should suffice:

$ awk -F&#39;,&#39; &#39;{sum+=$4} END {print sum+0}&#39; file{1..3}.txt
1647

NOTES:

in the (unlikely?) case all files are empty, sum will be undefined so print sum will display a blank link; sum+0 insures we print 0 if sum remains undefined (ie, all files are empty)
for a variable number of files file{1..3}.txt can be replaced with whatever pattern will match on the desired set of files, eg, file*.txt, *.txt, etc

答案3

得分: 0

$ perl -MList::Util=sum0 -F, -lane'push @a,$F[3];END{print sum0 @a}' file{1..3}.txt
1647

$ perl -F, -lane'push @a,$F[3];END{foreach(@a){ $sum +=$_ };print "$sum"}' file{1..3}.txt
1647

英文:

$ perl -MList::Util=sum0 -F, -lane&#39;push @a,$F[3];END{print sum0 @a}&#39; file{1..3}.txt
1647

$ perl -F, -lane&#39;push @a,$F[3];END{foreach(@a){ $sum +=$_ };print &quot;$sum&quot;}&#39; file{1..3}.txt
1647

答案4

得分: 0

$ cut -d, -f4 file{1..3}.txt | paste -sd+ - | bc
1647

英文:

$ cut -d, -f4 file{1..3}.txt | paste -sd+ - | bc
1647

答案5

得分: 0

代码部分已被排除，以下是您要翻译的内容：

"The paste method is only viable for smaller inputs. When row count gets somewhat larger, like the example below of 34,999,999, compared to awk,

gnu-paste is ~2.83x slower
bsd-paste is ~3.72x slower

** adding perl for completeness ::: perl 5.36.0 is ~1.54x slower

perl -MList::Util=sum0 -F, -lane&#39;push @a,$_;END{print sum0 @a}&#39;

__=&#39;34999999&#39;

( time ( jot &quot;$__&quot; | pvE0 | mawk2 &#39;END { print +_ }{ _+=$__ }&#39; ) )

sleep 1

( time ( jot &quot;$__&quot; | pvE0 | LC_ALL=C paste -sd+ - | bc ) )

sleep 1

( time ( jot &quot;$__&quot; | pvE0 | LC_ALL=C gpaste -sd+ - | bc ) )

"在处理较小的输入时，paste 方法才是可行的。当行数变得相对较大，比如下面的示例中的 34,999,999，与 awk 相比，

gnu-paste 较慢，大约2.83倍的速度
bsd-paste 较慢，大约3.72倍的速度

** 为完整性添加 perl ::: perl 5.36.0 较慢，大约1.54倍的速度

perl -MList::Util=sum0 -F, -lane&#39;push @a,$_;END{print sum0 @a}&#39;

__=&#39;34999999&#39;

( time ( jot &quot;$__&quot; | pvE0 | mawk2 &#39;END { print +_ }{ _+=$__ }&#39; ) )

等待 1 秒

( time ( jot &quot;$__&quot; | pvE0 | LC_ALL=C paste -sd+ - | bc ) )

等待 1 秒

( time ( jot &quot;$__&quot; | pvE0 | LC_ALL=C gpaste -sd+ - | bc ) )

英文:

The paste method is only viable for smaller inputs. When row count gets somewhat larger, like the example below of 34,999,999, compared to awk,

gnu-paste is ~2.83x slower
bsd-paste is ~3.72x slower

** adding perl for completeness ::: perl 5.36.0 is ~1.54x slower

perl -MList::Util=sum0 -F, -lane&#39;push @a,$_;END{print sum0 @a}&#39;

__=&#39;34999999&#39;

( time ( jot &quot;$__&quot; | pvE0 | mawk2 &#39;END { print +_ }{ _+=$__ }&#39; ) )

sleep 1

( time ( jot &quot;$__&quot; | pvE0 | LC_ALL=C paste -sd+ - | bc ) )

sleep 1

( time ( jot &quot;$__&quot; | pvE0 | LC_ALL=C gpaste -sd+ - | bc ) )

      in0:  289MiB 0:00:06 [47.9MiB/s] [47.9MiB/s] [&lt;=&gt; ]
     1	612499982500000

( jot &quot;$__&quot; | pvE 0.1 in0 | mawk2 &#39;END { print +_ }{ _+=$__ }&#39;; )

  8.51s user 0.22s system 143% cpu 6.072 total

      in0:  289MiB 0:00:11 [25.5MiB/s] [25.5MiB/s] [ &lt;=&gt; ]
     1	612499982500000

( jot &quot;$__&quot; | pvE 0.1 in0 | LC_ALL=C paste -sd+ - | bc; ) 

27.87s user 1.15s system 128% cpu 22.594 total

      in0:  289MiB 0:00:06 [48.3MiB/s] [48.3MiB/s] [ &lt;=&gt; ]
     1	612499982500000

( jot &quot;$__&quot; | pvE 0.1 in0 | LC_ALL=C gpaste -sd+ - | bc; )

17.14s user 1.10s system 106% cpu 17.185 total

通过集体智慧和协作来改善编程学习和解决问题的方式。致力于成为全球开发者共同参与的知识库，让每个人都能够通过互相帮助和分享经验来进步。

我需要在多个文件中对一列中的所有数值进行求和。

问题

答案1

答案2

答案3

答案4

答案5

根据输入的列顺序重新排列 CSV 文件中的列/值。

从文件中获取特定文本。

模板变量替换而已。文本/模板是否适合？

使用Python，我想知道如何删除文件中两个字符串之间第一次出现的字符。

What's the correct way to type hint an empty list as a literal in python?

如何在Highcharts Gantt中更改本地化的星期名称

如何在同一个流中使用多个过滤器和映射函数？

如何使用Map/Set来将代码优化到O(n)？

.NET MAUI Android在GitHub Actions上构建失败，错误代码为1。

如何在Playwright视觉比较中屏蔽多个定位器？

在C++中，可以使用可变模板参数来检索类型的内部类型。

selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found

Creating and opening a URL to log in to Website via Basic Auth with Robot Framework/Selenium (Python)

AG Grid 在上下文菜单中以大文本形式打开

发表评论