How to ignore a specific column when comparing two files using awk
Question
File1:
1|footbal|play1
2|cricket|play2
3|tennis|play3
5|golf|play5
File2:
1|footbal|play1
2|cricket|play2
3|tennis1|play3
4|soccer|play4
5|golf|play6
Output File:
4|soccer|play4
5|golf|play6
I am comparing all columns of file1 and file2, but I need to ignore the second column while comparing.
awk 'NR==FNR {exclude[$0];next} !($0 in exclude)' file1 file2 > file3
Answer 1
Score: 4
To ignore the second field:
$ awk -F\| '
NR==FNR { # first file
$2="" # empty the unwanted field; to empty several, chain them: $2=$3=...$n=""
a[$0] # hash on the rebuilt $0
next # next record
}
{ # second file
b=$0 # back up the original record to b
$2="" # empty the same field
if(!($0 in a)) # record (minus field 2) was not seen in the first file
print b # output the backup
}' file1 file2
Output:
4|soccer|play4
5|golf|play6
Of course, this only makes sense when NF is much larger than the number of fields being emptied. Otherwise, use one of the other solutions.
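The backup in b matters because assigning to any field makes awk rebuild $0 with OFS, which is a single space by default. A minimal sketch of that behaviour, assuming a POSIX awk and using one sample record from the question (the variable name rebuilt is only for illustration):

```shell
# Emptying field 2 rebuilds $0 with the default OFS (a space),
# so the original "|" separators are gone from the printed record.
rebuilt=$(printf '3|tennis1|play3\n' | awk -F'|' '{ $2 = ""; print }')
printf '%s\n' "$rebuilt"
```

The rebuilt record still works as a lookup key, since both files are rebuilt the same way, but it is not what you want to print.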
Answer 2
Score: 2
The comma (,) operator is handy here to treat the fields individually. If you only want to compare the first and third fields, you can use the same pattern you already have:
$ awk -F\| 'NR==FNR {exclude[$1,$3];next} !(($1,$3) in exclude)' file1 file2
4|soccer|play4
5|golf|play6
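For context, a subscript written as [$1,$3] is stored as the two values joined by awk's built-in SUBSEP string, which is why ($1,$3) in exclude matches on both fields at once. A small sketch, assuming a POSIX awk; the names k, parts and joined are hypothetical and exist only for this illustration:

```shell
# Store a key built from fields 1 and 3, then split it back apart on SUBSEP
# to show that the key really holds the two field values.
joined=$(printf '5|golf|play6\n' | awk -F'|' '{
    k[$1,$3]
    for (key in k) {
        n = split(key, parts, SUBSEP)
        print n, parts[1], parts[2]
    }
}')
printf '%s\n' "$joined"
```

Field 2 never becomes part of the key, so differences in that column cannot affect the comparison.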
Answer 3
Score: 2
EDIT2 (generic solution): To nullify more than one column in both input files, you could try the following. The fields to ignore are passed in as variables, so you do not need to hard-code them in the program. List all field numbers, separated by commas, in the -v file1_ignore and file2_ignore variables of this awk program.
awk -v file1_ignore="2,3" -v file2_ignore="2,3" '
BEGIN{
FS=OFS="|"
num1=split(file1_ignore,array1,",")
num2=split(file2_ignore,array2,",")
}
FNR==NR{
for(i=1;i<=num1;i++){
$array1[i]=""
}
a[$0]
next
}
{
val=$0
for(i=1;i<=num2;i++){
$array2[i]=""
}
}
!($0 in a){
print val
val=""
}
' file1 file2
Explanation: a detailed explanation of the code above.
awk -v file1_ignore="2,3" -v file2_ignore="2,3" '  ##Start the awk program and set the variables file1_ignore (fields to ignore in file1) and file2_ignore (fields to ignore in file2).
BEGIN{  ##Start of the BEGIN section.
FS=OFS="|"  ##Set the field separator and the output field separator to |.
num1=split(file1_ignore,array1,",")  ##Split the file1_ignore variable on commas into array1.
num2=split(file2_ignore,array2,",")  ##Split the file2_ignore variable on commas into array2.
}  ##End of the BEGIN block.
FNR==NR{  ##This condition is TRUE only while the first input file, file1, is being read.
for(i=1;i<=num1;i++){  ##Loop over the num1 field numbers stored in array1.
$array1[i]=""  ##Empty the field whose number is stored in array1[i].
}  ##End of the for loop.
a[$0]  ##Create an entry in array a indexed by the modified line.
next  ##Skip all remaining statements for this record.
}  ##End of the FNR==NR block.
{
val=$0  ##Save the original line in the variable val.
for(i=1;i<=num2;i++){  ##Loop over the num2 field numbers stored in array2.
$array2[i]=""  ##Empty the field whose number is stored in array2[i].
}  ##End of the for loop.
}
!($0 in a){  ##If the modified line is NOT present in array a, run the statements below.
print val  ##Print the saved original line.
val=""  ##Reset the variable val.
}
' file1 file2  ##Names of the input files.
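As a rough check of how the ignore list changes the result, the sketch below runs a condensed variant of the program above against the sample data from the question. It is not the answer's exact code: it assumes the same ignore list applies to both files, and the helper name compare and the temporary directory are hypothetical.

```shell
# Sample data copied from the question, written to a throwaway directory.
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
1|footbal|play1
2|cricket|play2
3|tennis|play3
5|golf|play5
EOF
cat > "$tmp/file2" <<'EOF'
1|footbal|play1
2|cricket|play2
3|tennis1|play3
4|soccer|play4
5|golf|play6
EOF

# Hypothetical helper: $1 is a comma-separated list of field numbers to ignore.
compare() {
    awk -v ignore="$1" '
    BEGIN { FS = OFS = "|"; n = split(ignore, f, ",") }
    { val = $0; for (i = 1; i <= n; i++) $(f[i]) = "" }  # keep original, blank ignored fields
    FNR == NR { a[$0]; next }                            # first file: remember blanked lines
    !($0 in a) { print val }                             # second file: print unseen originals
    ' "$tmp/file1" "$tmp/file2"
}

only_col2=$(compare 2)     # ignore column 2 only
cols_2_3=$(compare 2,3)    # ignore columns 2 and 3
printf '%s\n---\n%s\n' "$only_col2" "$cols_2_3"
rm -rf "$tmp"
```

Ignoring only column 2 reproduces the output requested in the question. Ignoring columns 2 and 3 compares on the first field alone, so the 5|golf|play6 row is no longer reported.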
EDIT1: To ignore a column that may differ between the two files, try the following. This example nullifies column 2 in both files; change the values to suit your needs.
awk -v file1_ignore="2" -v file2_ignore="2" '
BEGIN{
FS=OFS="|"
}
FNR==NR{
$file1_ignore=""
a[$0]
next
}
{
val=$0
$file2_ignore=""
}
!($0 in a){
print val
val=""
}
' file1 file2
Could you please try the following:
awk 'BEGIN{FS="|"}FNR==NR{a[$1,$3];next} !(($1,$3) in a)' file1 file2