Generalize an awk program that concatenates csv files and adds a column

Question

I have these two csv files:

$ cat numbers.csv
h1,h2
1, 2
3, 4

$ cat letters.csv
h1,h2
a, b
c, d

I want to concatenate both CSV files and add a new column whose value is letters if the row comes from letters.csv, or numbers if it comes from numbers.csv. I wrote this awk command (because I'm trying out awk), which does the job:

$ awk 'NR==1 {print new_column","$0} fname != FILENAME { fname = FILENAME; idx++ } FNR > 1 && idx == 1 { print group1","$0 } FNR > 1 && idx == 2 { print group2","$0 }' new_column=type group1=letters group2=numbers letters.csv numbers.csv

new_column is the name of the column that is going to be added.

group1 is the value of new_column if the row comes from the letters.csv file.

group2 is the value of new_column if the row comes from the numbers.csv file.

The output I get is:

type,h1,h2
letters,a, b
letters,c, d
numbers,1, 2
numbers,3, 4

The problem is that I don't know whether there is a way to make the awk program run with n arguments and n files, without having to rewrite it every time, like so:

$ awk '....' new_column=group_name group1=letters group2=numbers ... groupN=whatever  letters.csv numbers.csv .... morefiles_to_N.csv

Answer 1

Score: 2

Maybe something like this:

% awk 'NR==1{print "type,"$0} 
       FNR==1{nm=FILENAME; sub(/\..*/,"",nm); next} 
             {print nm","$0}' num.csv let.csv
type,h1,h2
num,1, 2
num,3, 4
let,a, b
let,c, d
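
One caveat worth sketching: sub(/\..*/,"",nm) removes everything from the first dot onward, so a path such as ./let.csv would yield an empty label and num.v2.csv would become num. The variant below is a minimal sketch, not part of the original answer, that first drops the directory part and then only the last extension. The data directory and the num.v2.csv file name are made up for illustration.

```shell
#!/bin/sh
# Sketch: derive the label from the file name, ignoring directories and
# keeping any dots that are not part of the final extension.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
mkdir data
printf 'h1,h2\n1, 2\n' > data/num.v2.csv   # hypothetical sample file
printf 'h1,h2\na, b\n' > data/let.csv      # hypothetical sample file

out=$(awk '
FNR == 1 {
    nm = FILENAME
    sub(/.*\//, "", nm)        # drop any leading directory part
    sub(/\.[^.]*$/, "", nm)    # drop only the last extension
    if (NR == 1) print "type," $0   # print the header once
    next                       # skip the header of every file
}
{ print nm "," $0 }
' ./data/num.v2.csv ./data/let.csv)
printf '%s\n' "$out"
```

With these two sample files the labels come out as num.v2 and let rather than empty strings.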

Answer 2

Score: 2

I suggest packing all the needed names into a single variable and then deserializing it with the split function. Consider a simple example: say I have file1.txt with the following content

1
2
3

file10.txt with the following content

10
20

and file100.txt with the following content

100

and I want to merge them, giving the names ABLE, BAKER, and CHARLIE to the data from the respective files. Then I could do

awk -v names=ABLE:BAKER:CHARLIE 'BEGIN{split(names,arr,/:/)}FNR==1{i+=1}{print arr[i] "," $0}' file1.txt file10.txt file100.txt

which gives the output

ABLE,1
ABLE,2
ABLE,3
BAKER,10
BAKER,20
CHARLIE,100

Explanation: I chose : to separate the names; when choosing a separator, keep in mind that it MUST NOT appear in any name. In BEGIN I split names at : into the array arr, so arr[1] becomes ABLE, arr[2] becomes BAKER, and arr[3] becomes CHARLIE. For the first row of each file (FNR==1) I increase i by 1. For each row I print the name concatenated with the whole line ($0), joined by a , character.

(tested in GNU Awk 5.1.0)
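
One thing to keep in mind with the FNR==1 counter: an empty input file never produces a first row, so every later file would get the label of its predecessor. A possible workaround, sketched here as an assumption rather than as part of the original answer, is to pair names with file operands in BEGIN via ARGV and look the label up by FILENAME. The empty.txt file is made up for illustration, and the sketch assumes there are no var=value operands between the files.

```shell
#!/bin/sh
# Sketch: map each file operand to its label up front, so that an empty
# file cannot shift the labels of the files that follow it.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '1\n2\n3\n' > file1.txt
: > empty.txt                  # hypothetical empty file
printf '100\n' > file100.txt

out=$(awk -v names=ABLE:BAKER:CHARLIE '
BEGIN {
    split(names, arr, ":")
    # ARGV[1] .. ARGV[ARGC-1] are the file operands in this invocation.
    for (k = 1; k < ARGC; k++) label[ARGV[k]] = arr[k]
}
{ print label[FILENAME] "," $0 }
' file1.txt empty.txt file100.txt)
printf '%s\n' "$out"
```

Here BAKER is assigned to empty.txt and never printed, while file100.txt still gets CHARLIE. If the same file is listed twice, it ends up with the label of its last occurrence.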

Answer 3

Score: 1

awk -v header=type -v groups=letters,numbers '
    BEGIN{ OFS=","; split(groups, grps,",") } 
    FNR==1{i++} 
    NR==1{ print header,$0; next } 
    FNR!=1{ print grps[i],$0 }
' letters.csv numbers.csv

type,h1,h2
letters,a, b
letters,c, d
numbers,1, 2
numbers,3, 4
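
Since the program above only depends on the header and groups variables, it can be reused for any number of files without editing it. The following is a minimal sketch of that idea wrapped in a shell function; concat_csv is a hypothetical helper name and symbols.csv is a made-up third file, neither of which appears in the original answer.

```shell
#!/bin/sh
# concat_csv is a hypothetical helper, not part of the original answer.
# Usage: concat_csv HEADER GROUP1,GROUP2,... FILE1 FILE2 ...
concat_csv() {
    header=$1 groups=$2
    shift 2
    awk -v header="$header" -v groups="$groups" '
    BEGIN    { OFS = ","; split(groups, grps, ",") }
    FNR == 1 { i++ }                      # a new input file starts
    NR == 1  { print header, $0; next }   # header line, printed once
    FNR > 1  { print grps[i], $0 }        # data rows get the file label
    ' "$@"
}

tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'h1,h2\na,b\nc,d\n' > letters.csv
printf 'h1,h2\n1,2\n3,4\n' > numbers.csv
printf 'h1,h2\n+,-\n'      > symbols.csv   # hypothetical third file

out=$(concat_csv type letters,numbers,symbols letters.csv numbers.csv symbols.csv)
printf '%s\n' "$out"
```

The group names are matched to files by position, so the comma-separated list should contain one name per file, in the same order as the file arguments.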

Answer 4

Score: 1

I'm adding a non-awk answer because I know you are also looking for solutions using other tools (such as qsv).

The tool I use for this kind of task is Miller. Running

mlr --csv put '$type=sub(FILENAME,"\..+","")' letters.csv numbers.csv

you get

h1,h2,type
a, b,letters
c, d,letters
1, 2,numbers
3, 4,numbers

put is the verb used to build Miller expressions. Here I create the new field type and set it to the FILENAME built-in variable, using the sub function to remove the file extension.

If you want type as the first field, simply add reorder -f type:

mlr --csv put '$type=sub(FILENAME,"\..+","")' then reorder -f type letters.csv numbers.csv

huangapple
  • Posted on June 19, 2023, 19:58:00
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76506396.html