April 17, 2023, 21:15:17

awk function to filter each column of a file in a loop based on value obtained from a second file

Question

I am looking for an awk solution that filters file2.txt on two conditions.

I have two files:
file1.txt

t1 9
t2 8
t3 5

file2.txt

t1	t2	t3
A/A:10,0:10	B/B:0,2:2	n/n
A/B:10,8:18	A/B:2,8:10	n/n
A/B:0,2:2	n/n	B/B:0,1:1

desired output:

t1	t2	t3
A/A:10,0:10	n/n	n/n
A/B:10,8:18	n/n	n/n
n/n	n/n	n/n

Based on file1.txt, I was able to create an array that stores the value from the second column, keyed by sample name:

awk 'BEGIN { FS="\t"} FILENAME=="file1.txt" { descr[$1]=$2 ; next }

I would like to filter file2.txt by iterating through its columns and applying the following conditions, in order to obtain the desired output shown above. All samples in the header of file2.txt are included in file1.txt, and the order is the same.

1. The third value (split by :) should be within a specific interval based on the corresponding value from file1.txt. For example, for the first value of sample t1 [10]: 10>=(9/2) && 10<=(9*2) holds, so the value is within the interval and the string is kept as it is. The number 2 used in the division and multiplication is hard-coded, and the number 9 is obtained from file1.txt. If the value is not within the interval, write n/n (for example, the first value for sample t2).
2. If the ratio of the first value to the sum of the values in the second field (split by , in this case) is within the interval 0.1-0.25 or 0.75-0.9, write n/n.

For example, the second value of sample t2 [2,8] has a ratio of 2/10=0.2, which is within the interval, so the field needs to be set to n/n. The number 10 is the sum of the two comma-separated values in the field.
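
To make the two conditions concrete, here is a minimal sketch that evaluates a single field on its own. The helper name check_field and its two arguments (the field text and the file1.txt value for that sample) are assumptions made for illustration; they are not part of the original attempt. It treats both intervals as closed.

```shell
# Hypothetical helper: prints the field unchanged if it passes both
# conditions, otherwise prints n/n.
#   $1 = one cell of file2.txt, e.g. "A/B:2,8:10"
#   $2 = the file1.txt value for that sample, e.g. 8 for t2
check_field() {
    awk -v field="$1" -v ref="$2" 'BEGIN {
        split(field, c, ":")      # c[1]=genotype, c[2]="2,8", c[3]=third value
        split(c[2], d, ",")       # d[1] and d[2] are the two counts
        # Condition 1: third value must lie within [ref/2, ref*2].
        in_range = (c[3] >= ref / 2 && c[3] <= ref * 2)
        # Condition 2: ratio d[1]/(d[1]+d[2]) must avoid 0.1-0.25 and 0.75-0.9.
        total = d[1] + d[2]
        ratio = (total > 0) ? d[1] / total : -1     # -1 means "no ratio to test"
        bad_ratio = (ratio >= 0.1 && ratio <= 0.25) || (ratio >= 0.75 && ratio <= 0.9)
        print ((in_range && !bad_ratio) ? field : "n/n")
    }'
}

check_field "A/B:2,8:10" 8      # ratio 2/10 = 0.2 is inside 0.1-0.25
check_field "A/A:10,0:10" 9     # 10 is within 4.5..18 and the ratio is 1
```

The first call prints n/n and the second prints the field unchanged, matching the t2 and t1 examples described above.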

I tried a combination of for loops and if / split conditions, but I had trouble iterating through the columns using the array while also applying the conditions.

Thank you in advance for any help!

Answer 1

Score: 1

Assuming:

the values 2, 0.1, 0.25, 0.75 and 0.9 are hard-coded.
the intervals are closed, meaning the endpoints are included (if not, drop the = signs next to < and >).
the column names in file1.txt may or may not be in the same order as the header in file2.txt.

then would you please try the following:

awk -v OFS="\t" -v nn="n/n" '
    NR==FNR {a[$1] = $2; next}                                  # read file1.txt to associate the file1.txt value with the column name
    FNR==1 {
        for (i = 1; i <= NF; i++) b[i] = a[$i]                  # read the header line in file2.txt to associate the file1.txt value with the field number
        print                                                   # print the header line
        next
    }
    {                                                           # process the body lines of file2.txt
        for (i = 1; i <= NF; i++) {                             # loop over the fields
            if ($i != nn) {
                split($i, c, /:/)                               # c[2] holds the two counts, c[3] the third value
                split(c[2], d, /,/)
                if (! (c[3] >= b[i] / 2 && c[3] <= b[i] * 2))   # test condition 1
                    $i = nn
                if (d[1] + d[2] > 0) {                          # avoid division by zero
                    ratio = d[1] / (d[1] + d[2])
                                                                # test condition 2
                    if (ratio >= 0.1 && ratio <= 0.25 || ratio >= 0.75 && ratio <= 0.9)
                        $i = nn
                }
            }
        }
        print                                                   # print the line
    }
' file1.txt file2.txt

Output:

t1      t2      t3
A/A:10,0:10     n/n     n/n
A/B:10,8:18     n/n     n/n
n/n     n/n     n/n

BTW, the value B/B:0,10:10 in the 2nd column of the 2nd line of the desired output should be a typo for n/n.
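
As a variation, the two tests can be moved into a user-defined awk function so that the main loop only decides whether to overwrite a field. The sketch below is one possible arrangement rather than a replacement for the script above: the function name keep, the array names ref_of and col_ref, and the scratch directory are assumptions made for illustration. It recreates the sample files from the question (written with tabs) and uses the same hard-coded thresholds and closed intervals.

```shell
# Recreate the sample inputs from the question in a scratch directory.
workdir=$(mktemp -d)
printf 't1\t9\nt2\t8\nt3\t5\n' > "$workdir/file1.txt"
printf 't1\tt2\tt3\n'                      >  "$workdir/file2.txt"
printf 'A/A:10,0:10\tB/B:0,2:2\tn/n\n'     >> "$workdir/file2.txt"
printf 'A/B:10,8:18\tA/B:2,8:10\tn/n\n'    >> "$workdir/file2.txt"
printf 'A/B:0,2:2\tn/n\tB/B:0,1:1\n'       >> "$workdir/file2.txt"

result=$(awk -v OFS="\t" -v nn="n/n" '
    # keep(field, ref): returns 1 if the field passes both conditions, else 0.
    # The extra parameters c, d, total, ratio act as local variables.
    function keep(field, ref,    c, d, total, ratio) {
        split(field, c, ":")
        split(c[2], d, ",")
        if (c[3] < ref / 2 || c[3] > ref * 2) return 0    # condition 1 fails
        total = d[1] + d[2]
        if (total == 0) return 1                          # no ratio to test
        ratio = d[1] / total
        return !((ratio >= 0.1 && ratio <= 0.25) || (ratio >= 0.75 && ratio <= 0.9))
    }
    NR == FNR { ref_of[$1] = $2; next }                   # file1.txt: name -> value
    FNR == 1  {                                           # header of file2.txt
        for (i = 1; i <= NF; i++) col_ref[i] = ref_of[$i]
        print
        next
    }
    {
        for (i = 1; i <= NF; i++)
            if ($i != nn && !keep($i, col_ref[i])) $i = nn
        print
    }
' "$workdir/file1.txt" "$workdir/file2.txt")

printf '%s\n' "$result"
rm -r "$workdir"
```

Listing c, d, total and ratio after the real parameters is the usual awk idiom for local variables, which keeps the function from clobbering globals between calls.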
