June 29, 2023 00:11:25

awk to get values between start and end pattern if a floating number is larger than threshold can't handle 4 digits before point

Question

I have some text files that contain multiple tables and matrices. I need to target one of them and print a line if the value in its second column (a floating-point number) is smaller than a given threshold.

The data looks like this (original format):

(some tables, matrices etc)
        Dataset of interest            :     
                  0    
      0       0.000000
      1       0.000000
      6      41.572543
      7      47.082104
      8      66.534233
      9     112.409794
     10     121.771594
     11     134.398445
     20     302.705391
     21     304.451176
     28     474.496684
     29     495.522236
     30     502.182750
     31     522.627098
     32     544.878024
     56     1334.207831
     97     3214.104108
        more properties                :          certainly
(even more tables, matrices etc)

Expected output (only lines from "Dataset of interest" with $2 < 200.0 and $2 > 0.0):

      6      41.572543
      7      47.082104
      8      66.534233
      9     112.409794
     10     121.771594
     11     134.398445

$2 > 0 will keep the 0.000000 lines in output, so I tried this:

awk '/Dataset of interest/,/more properties/ { if ( $2 > 0.000001 && $2 < 500) print $1, $2}' file.txt

and this:

awk '/Dataset of interest/,/more properties/ {for(i = 2 ; i <= NF ; i++) { if ($i ~ /^[0-9]+[.][0-9]+$/ && $i > 0.000001 && $i < 500 ) print $1, $2}}' file.txt

Both print the same:

      6      41.572543
      7      47.082104
      8      66.534233
      9     112.409794
     10     121.771594
     11     134.398445
     20     302.705391
     21     304.451176
     28     474.496684
     29     495.522236
     97     3214.104108

The 3214 line remains, even though the 1334 line is removed. After testing, I can confirm that values above the limit are dropped as long as they are below 2000, but anything at 2000 or higher remains in the output, which doesn't make any sense to me. The upper threshold should be a variable, but changing the limit to, for example, $2 < 200 prints nothing at all. What did I do wrong?

Edit:
Thanks for the quick responses!

awk '$2 < 200 && $2 > 0' file

returns:

0      0.000000
1      0.000000
9      112.409794
10     121.771594
11     134.398445
56     1334.207831

Sorry for not specifying earlier: I'm using mawk 1.3.4.

grep -E '544|1334|3214' file.txt | xxd
00000000: 3332 2020 2020 2035 3434 2e38 3738 3032  32     544.87802
00000010: 340a 3536 2020 2020 2031 3333 342e 3230  4.56     1334.20
00000020: 3738 3331 0a20 3937 2020 2020 3332 3134  7831. 97    3214
00000030: 2e31 3034 3130 380a                      .104108.
grep -E '544|1334|3214' file.txt | od -c
0000000   3   2                       5   4   4   .   8   7   8   0   2
0000020   4  \n   5   6                       1   3   3   4   .   2   0
0000040   7   8   3   1  \n       9   7                   3   2   1   4
0000060   .   1   0   4   1   0   8  \n
0000070

The leading space before 97 should not affect the column assignment (or does it?). The bytes look fine to me.

Edit2:

Thanks @shellter, that was it! (feeling a bit stupid now...)

awk '/Dataset of interest/,/more properties/ {for(i = 2 ; i <= NF ; i++) { if ($i ~ /^[0-9]+[.][0-9]+$/ && $i+0 > 0 && $i+0 < 500 ) print $1, $2}}' file.txt

Changing the conditions from $i > ... and $i < ... to $i+0 > ... and $i+0 < ... does exactly what I need. The regex check $i ~ /^[0-9]+[.][0-9]+$/ on its own does not force a numeric comparison. By the way, $i+0.0 also works and yields the same output.
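
For reference, here is a minimal sketch of the same fix with the upper limit passed in as a variable. The function name filter_dataset, the variable names max and inside, and the shortened sample data are made up for illustration. Adding 0 forces a numeric comparison, as in the command above; how the fractional part is parsed can still depend on the locale, as the answer below points out.

```shell
#!/bin/sh
# Hypothetical helper: print rows of the "Dataset of interest" block whose
# second column is strictly between 0 and the limit passed as $1.
filter_dataset() {
    awk -v max="$1" '
        /more properties/     { inside = 0 }   # end marker: stop printing
        inside && NF == 2 && $2 + 0 > 0 && $2 + 0 < max + 0 { print $1, $2 }
        /Dataset of interest/ { inside = 1 }   # start marker: rows follow
    '
}

# Shortened sample input (assumed layout, taken from the question).
printf '%s\n' \
    '  other table' \
    '      3     150.000000' \
    '        Dataset of interest            :' \
    '                  0' \
    '      0       0.000000' \
    '      6      41.572543' \
    '     11     134.398445' \
    '     20     302.705391' \
    '     97     3214.104108' \
    '        more properties                :          certainly' |
    filter_dataset 200
```

With the limit set to 200, only the 41.572543 and 134.398445 rows of the sample pass the filter. The row from the other table is ignored because it comes before the start marker.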

Answer 1

Score: 2

The setting of LC_NUMERIC in the environment affects how mawk parses numbers from input data.

If you use a locale where numbers are presented differently, mawk may not consider the second field of your data to be a number, and comparisons against it will then be done as string comparisons.
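
To make the difference concrete, here is a small sketch that compares a few values from the question against 200, once as text and once as numbers. The function name compare_demo and the variables as_text and as_number are made up for illustration. Concatenating an empty string forces a string comparison, and adding 0 forces a numeric one.

```shell
#!/bin/sh
# Hypothetical demo: compare each input value against 200 as text and as a number.
# LC_ALL=C keeps both the byte-wise string ordering and the "." decimal point.
compare_demo() {
    LC_ALL=C awk '{
        as_text   = (($1 "") < "200") ? "yes" : "no"   # string vs string
        as_number = (($1 + 0) < 200)  ? "yes" : "no"   # number vs number
        printf "%s  text:%s  number:%s\n", $1, as_text, as_number
    }'
}

printf '%s\n' 41.572543 112.409794 1334.207831 3214.104108 | compare_demo
```

As text, 112.409794 and 1334.207831 sort before "200" because they start with "1", while 41.572543 and 3214.104108 do not. That is consistent with the output of awk '$2 < 200 && $2 > 0' shown in the question, where the 112 and 1334 lines are printed but the 41 and 3214 lines are not.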

Consider:

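#!/bin/sh
# Print the date (to show the active locale), then feed three sample values
# to mawk and report how each one is converted and compared.
t(){
    date
    printf '1.00\n2.01\n3,01\n' | mawk '
        {
            printf "%s -> %f; >%d? %s\n",
                $1, ($1+0), NR, ($1>NR ? "yes" : "no")
        }
    '
    echo
}
# LANG only takes effect when LC_ALL and LC_NUMERIC are unset.
LANG=C t
LANG=en_US.utf8 t
LANG=de_DE.utf8 t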

which, with the locale definitions on this machine, prints:

Thu Jun 29 04:54:42 BST 2023
1.00 -> 1.000000; >1? no
2.01 -> 2.010000; >2? yes
3,01 -> 3.000000; >3? yes
Thu Jun 29 04:54:42 AM BST 2023
1.00 -> 1.000000; >1? no
2.01 -> 2.010000; >2? yes
3,01 -> 3.000000; >3? yes
Do 29. Jun 04:54:42 BST 2023
1.00 -> 1,000000; >1? yes
2.01 -> 2,000000; >2? yes
3,01 -> 3,010000; >3? yes

Note that casting does not necessarily produce the number you expect: any tail of the string that cannot be treated as part of a number is simply discarded.

Instead of casting, the better fix is to specify an appropriate locale when running mawk.
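
A minimal sketch of that approach, assuming the data uses "." as the decimal point: set LC_ALL=C for the awk command only, since LC_ALL overrides both LC_NUMERIC and LANG. The function name filter_c_locale and the shortened sample input are made up for illustration, and plain awk stands in for mawk here.

```shell
#!/bin/sh
# Hypothetical helper: run the range filter from the question under the C locale,
# so that fields such as 41.572543 are recognised as numbers without a cast.
filter_c_locale() {
    LC_ALL=C awk '/Dataset of interest/,/more properties/ {
        # NF == 2 skips the two marker lines and the lone "0" header row
        if (NF == 2 && $2 > 0 && $2 < 200) print $1, $2
    }' "$@"
}

# Shortened sample input (assumed layout, taken from the question).
printf '%s\n' \
    '        Dataset of interest            :' \
    '                  0' \
    '      0       0.000000' \
    '      6      41.572543' \
    '     20     302.705391' \
    '     97     3214.104108' \
    '        more properties                :          certainly' |
    filter_c_locale
```

For a real file the call would be filter_c_locale file.txt. The locale change applies only to that one awk process, so the rest of the shell session keeps its usual settings.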
