How to ignore a specific column when comparing two files using awk

Question

File1:

1|footbal|play1
2|cricket|play2
3|tennis|play3
5|golf|play5

File2:

1|footbal|play1
2|cricket|play2
3|tennis1|play3
4|soccer|play4
5|golf|play6

Output File:

4|soccer|play4
5|golf|play6

I am comparing all columns of file1 and file2, but I need to ignore the second column while comparing.

awk 'NR==FNR {exclude[$0];next} !($0 in exclude)' file1 file2 > file3

Answer 1

Score: 4

To ignore the second field:

$ awk -F\| '
NR==FNR {           # first file
    $2=""           # empty the unwanted field; to empty several, chain them: $2=$3=...=$n=""
    a[$0]           # hash on the rebuilt $0
    next            # next record
}
{                   # second file
    b=$0            # back up the original record to b
    $2=""           # empty the same field
    if(!($0 in a))  # rebuilt record was not seen in the first file
        print b     # output the backup
}' file1 file2

Output:

4|soccer|play4
5|golf|play6

This approach only makes sense when the number of fields (NF) is much larger than the number of fields to blank out. Otherwise, use one of the other solutions.
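
The comment about chaining assignments can be sketched as follows. This is a minimal illustration, not part of the original answer: the five-column sample data, the temporary directory, and the names seen, orig, and out are all assumptions made up for the example. It blanks fields 2 and 4 in one chained assignment and compares the remaining fields.

```shell
#!/bin/sh
# Hypothetical five-column sample data (not from the question).
tmp=$(mktemp -d)
printf '%s\n' '1|a|x|p|k' '2|b|y|q|l' > "$tmp/file1"
printf '%s\n' '1|z|x|r|k' '2|b|y|q|m' '3|c|w|s|n' > "$tmp/file2"

# Blank fields 2 and 4 with a chained assignment, key on the rebuilt record,
# and print the untouched copy of each file2 record whose key is new.
out=$(awk -F'|' '
NR==FNR { $2=$4=""; seen[$0]; next }   # first file: remember the reduced records
{ orig=$0; $2=$4="" }                  # second file: keep a copy, then reduce
!($0 in seen) { print orig }           # reduced record not in file1: print the copy
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With this sample data, the first file2 record differs from file1 only in the ignored columns, so it is filtered out, while the other two records are printed unchanged.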

Answer 2

Score: 2

The comma operator (,) is handy here for treating the fields individually. If you only want to compare the first and third fields, you can keep the same pattern you already have:

$ awk -F\| 'NR==FNR {exclude[$1,$3];next} !(($1,$3) in exclude)' file1 file2
4|soccer|play4
5|golf|play6
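
For anyone who wants to try this without creating the files by hand, here is a self-contained sketch that rebuilds the question's sample files in a temporary directory and runs the same command. The temporary directory and the variable name out are assumptions for the example. In awk, a subscript written as $1,$3 is stored as the two values joined by the built-in SUBSEP string, so the second column never becomes part of the key.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '1|footbal|play1' '2|cricket|play2' '3|tennis|play3' '5|golf|play5' > "$tmp/file1"
printf '%s\n' '1|footbal|play1' '2|cricket|play2' '3|tennis1|play3' '4|soccer|play4' '5|golf|play6' > "$tmp/file2"

# Key each file1 record on fields 1 and 3 only (joined internally by SUBSEP),
# then print the file2 records whose (field 1, field 3) pair was never seen.
out=$(awk -F'|' '
NR==FNR { exclude[$1,$3]; next }
!(($1,$3) in exclude)
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Unlike blanking a field, this never modifies $0, so the original record can be printed directly without keeping a backup copy.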

Answer 3

Score: 2

EDIT2 (generic solution): To blank out more than one column in both input files, try the following. The fields to ignore are passed in as variables, so you do not need to hard-code them in the program. List the field numbers, separated by commas, in the -v file1_ignore and -v file2_ignore variables of this awk program.

awk -v file1_ignore="2,3" -v file2_ignore="2,3" '
BEGIN{
  FS=OFS="|"
  num1=split(file1_ignore,array1,",")
  num2=split(file2_ignore,array2,",")
}
FNR==NR{
  for(i=1;i<=num1;i++){
    $array1[i]=""
  }
  a[$0]
  next
}
{
  val=$0
  for(i=1;i<=num2;i++){
    $array2[i]=""
  }
}
!($0 in a){
  print val
  val=""
}
' file1 file2

Explanation: A detailed, line-by-line explanation of the code above.

awk -v file1_ignore="2,3" -v file2_ignore="2,3" '    ##Start the awk program and set the variables file1_ignore (fields to ignore in file1) and file2_ignore (fields to ignore in file2).
BEGIN{                                               ##Start the BEGIN section.
  FS=OFS="|"                                         ##Set the field separator and the output field separator to |.
  num1=split(file1_ignore,array1,",")                ##Split the file1_ignore variable into array1 on commas; num1 is the number of elements.
  num2=split(file2_ignore,array2,",")                ##Split the file2_ignore variable into array2 on commas; num2 is the number of elements.
}                                                    ##Close the BEGIN block.
FNR==NR{                                             ##This condition is TRUE only while the first input file (file1) is being read.
  for(i=1;i<=num1;i++){                              ##Loop from 1 to num1.
    $array1[i]=""                                    ##Blank out the field whose number is stored in array1[i].
  }                                                  ##Close the for loop.
  a[$0]                                              ##Create an entry in array a indexed by the rebuilt current line.
  next                                               ##Skip all further statements for this record.
}                                                    ##Close the FNR==NR block.
{
  val=$0                                             ##Save the original current line in the variable val.
  for(i=1;i<=num2;i++){                              ##Loop from 1 to num2.
    $array2[i]=""                                    ##Blank out the field whose number is stored in array2[i].
  }                                                  ##Close the for loop.
}
!($0 in a){                                          ##If the rebuilt current line is NOT an index of array a, run the following statements.
  print val                                          ##Print the saved original line.
  val=""                                             ##Reset the variable val.
}
' file1 file2                                        ##Name the input files.
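
To see how the choice of ignored fields changes the result, here is a small self-contained sketch that runs the generic program on the question's sample data twice. The compare helper, the temporary directory, and the variable names ignore_2 and ignore_2_3 are assumptions introduced only for this example, and the sketch passes the same list to both files for simplicity.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '1|footbal|play1' '2|cricket|play2' '3|tennis|play3' '5|golf|play5' > "$tmp/file1"
printf '%s\n' '1|footbal|play1' '2|cricket|play2' '3|tennis1|play3' '4|soccer|play4' '5|golf|play6' > "$tmp/file2"

# compare LIST: run the generic program, ignoring the comma-separated
# field numbers in LIST in both files (hypothetical helper for this sketch).
compare() {
  awk -v file1_ignore="$1" -v file2_ignore="$1" '
  BEGIN {
    FS = OFS = "|"
    num1 = split(file1_ignore, array1, ",")
    num2 = split(file2_ignore, array2, ",")
  }
  FNR == NR { for (i = 1; i <= num1; i++) $(array1[i]) = ""; a[$0]; next }
  { val = $0; for (i = 1; i <= num2; i++) $(array2[i]) = "" }
  !($0 in a) { print val }
  ' "$tmp/file1" "$tmp/file2"
}

ignore_2=$(compare 2)        # compare on columns 1 and 3
ignore_2_3=$(compare 2,3)    # compare on column 1 only

printf 'ignoring 2:\n%s\n' "$ignore_2"
printf 'ignoring 2,3:\n%s\n' "$ignore_2_3"
rm -rf "$tmp"
```

Ignoring only column 2 reproduces the output requested in the question. Ignoring columns 2 and 3 leaves just the first column as the key, so only the record with id 4 is reported as new.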


EDIT1: To ignore a single column in each file, where the column may differ between the two files, try the following. This example blanks column 2 in both files; change the values to suit your needs.

awk -v file1_ignore="2" -v file2_ignore="2" '
BEGIN{
  FS=OFS="|"
}
FNR==NR{
  $file1_ignore=""
  a[$0]
  next
}
{
  val=$0
  $file2_ignore=""
}
!($0 in a){
  print val
  val=""
}
' file1 file2


Could you please try the following.

awk 'BEGIN{FS="|"}FNR==NR{a[$1,$3];next} !(($1,$3) in a)' file1 file2

huangapple
  • Posted on January 6, 2020, 23:35:54
  • When reposting, please keep the link to this article: https://go.coder-hub.com/59614851.html