awk函数循环基于第二个文件获得的值筛选文件的每一列。

huangapple go评论55阅读模式
英文:

awk function to filter each column of a file in a loop based on value obtained from a second file

问题

我正在寻找一种awk解决方案,可以对file2.txt进行两个条件的筛选。

我有两个文件:
file1.txt

t1 9
t2 8
t3 5

file2.txt

t1 t2 t3
A/A:10,0:10 B/B:0,2:2 n/n
A/B:10,8:18 A/B:2,8:10 n/n
A/B:0,2:2 n/n B/B:0,1:1
t1 t2 t3
A/A:10,0:10 B/B:0,2:2 n/n
A/B:10,8:18 A/B:2,8:10 n/n
A/B:0,2:2 n/n B/B:0,1:1

desired output:

t1 t2 t3
A/A:10,0:10 n/n n/n
A/B:10,8:18 n/n n/n
n/n n/n n/n
t1 t2 t3
A/A:10,0:10 n/n n/n
A/B:10,8:18 n/n n/n
n/n n/n n/n

根据file1.txt的值,我能够创建一个数组来存储B列中的值

awk 'BEGIN { FS="\t"} FILENAME=="file1.txt" { descr[$1]=$2 ; next }

我想通过遍历列来过滤file2.txt,以满足以下条件,以获得所需的输出。file2.txt标题中的所有示例都包含在file1.txt中,并且顺序相同。

  1. 第三个值(由*:分隔)应该在基于file1.txt对应值的特定区间内。例如,t1样本的第一个值[10]:10>=(9/2) && 10<=(92) - 在区间内,因此保持字符串不变。在除法和乘法中使用的数字2是硬编码的,而数字9来自于file1.txt。如果不在区间内,则写入n/n(例如样本t2的第一个值)。

  2. 如果第一个值/(第二字段的值总和)的比例包含在区间0.1-0.25或0.75-0.9内,则写入n/n(在这种情况下由*,*分隔)。

例如,t2样本的第二个值[2,8]具有比率2/10=0.2,该比率包含在区间内,因此需要将该字段设置为n/n。数字10来自以逗号分隔的字段中的两个值的总和。

我尝试了一些for循环和if/split条件的组合,但我在使用数组进行迭代时遇到了一些问题,结合了条件。感谢您提前提供帮助!

英文:

I am looking for an awk solution that enable to filter file2.txt for two conditions.

I have two files:
file1.txt

t1 9
t2 8
t3 5

file2.txt

t1 t2 t3
A/A:10,0:10 B/B:0,2:2 n/n
A/B:10,8:18 A/B:2,8:10 n/n
A/B:0,2:2 n/n B/B:0,1:1
t1 t2 t3
A/A:10,0:10 B/B:0,2:2 n/n
A/B:10,8:18 A/B:2,8:10 n/n
A/B:0,2:2 n/n B/B:0,1:1

desired output:

t1 t2 t3
A/A:10,0:10 n/n n/n
A/B:10,8:18 n/n n/n
n/n n/n n/n
t1 t2 t3
A/A:10,0:10 n/n n/n
A/B:10,8:18 n/n n/n
n/n n/n n/n

Based on the values of file1.txt , for which I was able to create an array which store the value in B column

awk &#39;BEGIN { FS=&quot;\t&quot;}
FILENAME==&quot;file1.txt&quot; { descr[$1]=$2 ; next }

I would like to filter the file2.txt, by iterating through the columns for the following conditions, in order to obtained what reported in the desired output. All samples in the header of file2.txt are included in file1.txt and the order is the same.

  1. the third value (split by :) should be within a specific interval based on corresponding value from file1.txt
    For example the first value of sample t1 [10]:
    10>=(9/2) && 10<=(9*2) - is within the interval and thus keep the string as it is. The number 2 used in the division and multiplication is hard-coded and the number 9 is obtained from file1.txt. If not within the interval, then write n/n (for example first value for sample t2).

  2. if the ratio of the first value/(sum of values in the second field) is included within an interval 0.1-0.25 or 0.75-0.9, then write n/n (split by , in this case).

For example second value of sample t2 [2,8] has a ratio 2/10=0.2, which is included in the interval, and thus need to set to n/n the field. The number 10 is derived from the sum of the two values in the field, separated by comma.

I tried a combination of for loop and if / split conditions, but I have some issues in order to iterate through the columns using the array, combined with the conditions.

Thank you in advance for any help!

答案1

得分: 1

假设:

  • 值2、0.1、0.25、0.75和0.9是硬编码的。
  • 区间是closed,意味着端点包括在内(如果不是这样,请在&lt;&gt;旁边去掉=符号)。
  • file1.txt中的列名可能与file2.txt中的标题不按相同顺序排列。

然后请尝试以下操作:

awk -v OFS="	" -v nn="n/n" '
    NR==FNR {a[$1] = $2; next}                                  
    FNR==1 {
        for (i = 1; i <= NF; i++) b[i] = a[$i]                  
        print                                                   
        next
    }
    {                                                           
        for (i = 1; i <= NF; i++) {                             
            if ($i != nn) {
                split($i, c, /:/)
                split(c[2], d, /,/)
                if (! (c[3] >= b[i] / 2 && c[3] <= b[i] * 2))   
                    $i = nn
                if (d[1] + d[2] > 0) {                          
                    ratio = d[1] / (d[1] + d[2])
                    if (ratio >= 0.1 && ratio <= 0.25 || ratio >= 0.75 && ratio <= 0.9)
                        $i = nn
                }
            }
        }
        print                                                   
    }
' file1.txt file2.txt

输出:

t1	t2	t3
A/A:10,0:10	n/n	n/n
A/B:10,8:18	n/n	n/n
n/n	n/n	n/n

顺便说一下,期望输出中第2行的第2列B/B:0,10:10n/n的拼写错误。

英文:

Assuming:

  • the values: 2, 0.1, 0.25, 0.75 and 0.9 are hard-coded.
  • the intervals are closed, meaning the endpoints are included
    (if not, drop the = signs next to &lt; and &gt;).
  • the column names in file1.txt may or may not be in the same order
    as the heder in file2.txt.

then would you please try the following:

awk -v OFS=&quot;\t&quot; -v nn=&quot;n/n&quot; &#39;
    NR==FNR {a[$1] = $2; next}                                  # read file1.txt to associate the numerator value with the column name
    FNR==1 {
        for (i = 1; i &lt;= NF; i++) b[i] = a[$i]                  # read the header line in file2.txt to associate the numerator value with the field number
        print                                                   # print the header line
        next
    }
    {                                                           # process bodies in file2.txt
        for (i = 1; i &lt;= NF; i++) {                             # loop over the fields
            if ($i != nn) {
                split($i, c, /:/)
                split(c[2], d, /,/)
                if (! (c[3] &gt;= b[i] / 2 &amp;&amp; c[3] &lt;= b[i] * 2))   # test the condition 1
                    $i = nn
                if (d[1] + d[2] &gt; 0) {                          # avoid division-by-zero
                    ratio = d[1] / (d[1] + d[2])
                    if (ratio &gt;= 0.1 &amp;&amp; ratio &lt;= 0.25 || ratio &gt;= 0.75 &amp;&amp; ratio &lt;= 0.9)
                                                                # test the condition 2
                        $i = nn
                }
            }
        }
        print                                                   # print the line
    }
&#39; file1.txt file2.txt

Output:

t1      t2      t3
A/A:10,0:10     n/n     n/n
A/B:10,8:18     n/n     n/n
n/n     n/n     n/n

BTW the 2nd column of the 2nd line B/B:0,10:10 in the desired output
should be a typo for n/n.

huangapple
  • 本文由 发表于 2023年4月17日 21:15:17
  • 转载请务必保留本文链接:https://go.coder-hub.com/76035578.html
匿名

发表评论

匿名网友

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen:

确定