Generalize an awk program that concatenates csv files and adds a column
Question
I have these two csv files:
$ cat numbers.csv
h1,h2
1, 2
3, 4
$ cat letters.csv
h1,h2
a, b
c, d
I want to concatenate both CSV files and add a new column whose value is letters if the row comes from letters.csv, or numbers if it comes from numbers.csv, so I wrote this awk command (because I'm trying out awk) that does the job:
$ awk 'NR==1 {print new_column","$0} fname != FILENAME { fname = FILENAME; idx++ } FNR > 1 && idx == 1 { print group1","$0 } FNR > 1 && idx == 2 { print group2","$0 }' new_column=type group1=letters group2=numbers letters.csv numbers.csv
new_column is the name of the column that is going to be added
group1 is the value of new_column if the row comes from the letters.csv file
group2 is the value of new_column if the row comes from the numbers.csv file
The output I get is:
type,h1,h2
letters,a, b
letters,c, d
numbers,1, 2
numbers,3, 4
The problem is that I don't know whether there is a way to make the awk program run with n arguments and n files without having to rewrite it every time, like so:
$ awk '....' new_column=group_name group1=letters group2=numbers ... groupN=whatever letters.csv numbers.csv .... morefiles_to_N.csv
Answer 1
Score: 2
Maybe like this
% awk 'NR==1{print "type,"$0}
FNR==1{nm=FILENAME; sub(/\..*/,"",nm); next}
{print nm","$0}' num.csv let.csv
type,h1,h2
num,1, 2
num,3, 4
let,a, b
let,c, d
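One caveat with deriving the label from FILENAME: the pattern /\..*/ matches from the first dot, so a path such as ./letters.csv would be reduced to an empty string. A minimal sketch of a variant that strips the directory part first and then only the last extension is shown here; the data/ directory, the sample contents, and the output file name combined_by_name.csv are assumptions made for illustration and are not part of the answer above.

```shell
# Hypothetical sample data in a subdirectory, to show the path handling.
mkdir -p data
printf 'h1,h2\na,b\nc,d\n' > data/letters.csv
printf 'h1,h2\n1,2\n3,4\n' > data/numbers.csv

awk '
FNR == 1 {
    nm = FILENAME
    sub(/.*\//, "", nm)             # drop any leading directory part
    sub(/\.[^.]*$/, "", nm)         # drop only the last extension
    if (NR == 1) print "type," $0   # keep the header of the first file only
    next                            # skip every header line
}
{ print nm "," $0 }                 # prefix each data row with the label
' ./data/letters.csv ./data/numbers.csv > combined_by_name.csv

cat combined_by_name.csv
```

With these inputs the labels should come out as letters and numbers even though the files are addressed through a relative path.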
Answer 2
Score: 2
I suggest using a single variable to hold all the needed names and then deserializing it with the split function. Consider a simple example: say I have file1.txt with the following content
1
2
3
file10.txt with the following content
10
20
and file100.txt with the following content
100
and I want to merge them, giving the names ABLE, BAKER, and CHARLIE to the data from the respective files. Then I could do
awk -v names=ABLE:BAKER:CHARLIE 'BEGIN{split(names,arr,/:/)}FNR==1{i+=1}{print arr[i] "," $0}' file1.txt file10.txt file100.txt
which gives the output
ABLE,1
ABLE,2
ABLE,3
BAKER,10
BAKER,20
CHARLIE,100
Explanation: I chose : to separate the names; when choosing a separator, keep in mind that it MUST NOT be present in any name. In BEGIN I split names at : into the array arr, so arr[1] becomes ABLE, arr[2] becomes BAKER, and arr[3] becomes CHARLIE. On the first row of each file (FNR==1) I increase i by 1. For each row I print the name concatenated with the whole line ($0), joined by a , character.
(tested in GNU Awk 5.1.0)
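If fewer names than files are supplied, arr[i] is empty for the extra files and their rows get a blank label. A small sketch of one way to guard against that, assuming it is acceptable to fall back to the file name itself; the fallback rule and the output file labelled.txt are assumptions for illustration, not part of the answer above.

```shell
# Recreate the three sample files from the example above.
printf '1\n2\n3\n' > file1.txt
printf '10\n20\n' > file10.txt
printf '100\n' > file100.txt

# Only two names for three files: the third file falls back to its file name.
awk -v names=ABLE:BAKER '
BEGIN    { n = split(names, arr, ":") }    # n is the number of names given
FNR == 1 { i++ }                           # first line of a file: next name
{
    label = (i <= n) ? arr[i] : FILENAME   # fall back when names run out
    print label "," $0
}
' file1.txt file10.txt file100.txt > labelled.txt

cat labelled.txt
```

Because the counter only advances on a file's first line, an empty input file would not advance it, and the later names would be attached to the wrong files.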
Answer 3
Score: 1
awk -v header=type -v groups=letters,numbers '
BEGIN{ OFS=","; split(groups, grps,",") }
FNR==1{i++}
NR==1{ print header,$0; next }
FNR!=1{ print grps[i],$0 }
' letters.csv numbers.csv
type,h1,h2
letters,a, b
letters,c, d
numbers,1, 2
numbers,3, 4
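A related sketch that stays closer to the command in the question: awk processes var=value operands in the order they appear among the file names, each one just before the next file is read, so a single variable can be reassigned in front of every file. The variable names header and group and the output file combined.csv are chosen here for illustration and do not come from the answer above.

```shell
# Recreate the two sample files from the question (without the stray spaces).
printf 'h1,h2\na,b\nc,d\n' > letters.csv
printf 'h1,h2\n1,2\n3,4\n' > numbers.csv

# Each group=... operand takes effect just before the file that follows it.
awk '
NR == 1 { print header "," $0 }   # header line of the first file only
FNR > 1 { print group "," $0 }    # data rows get the current value of group
' header=type group=letters letters.csv group=numbers numbers.csv > combined.csv

cat combined.csv
```

This scales to any number of files without changing the program: each additional file only needs its own group=value operand placed in front of it.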
Answer 4
Score: 1
I'm adding a non-awk answer because I know you are also open to solutions that use other tools (such as qsv).
The tool I use for this kind of task is Miller. Running
mlr --csv put '$type=sub(FILENAME,"\..+","")' letters.csv numbers.csv
you get
h1,h2,type
a, b,letters
c, d,letters
1, 2,numbers
3, 4,numbers
put is the verb for building Miller expressions. Here I create the new field type and set it to the FILENAME built-in variable, using the sub function to remove the file extension.
If you want type as the first field, simply add reorder -f type:
mlr --csv put '$type=sub(FILENAME,"\..+","")' then reorder -f type letters.csv numbers.csv