awk: printing values between a start and end pattern when a floating-point number is below a threshold fails for values with 4 digits before the decimal point


Question

I have some text files that contain multiple tables and matrices. I need to target one of them and print each line whose second-column value (a floating-point number) is smaller than a given threshold.

Looks like this (original format):

(some tables, matrices etc)
        Dataset of interest            :     
                  0    
      0       0.000000
      1       0.000000
      6      41.572543
      7      47.082104
      8      66.534233
      9     112.409794
     10     121.771594
     11     134.398445
     20     302.705391
     21     304.451176
     28     474.496684
     29     495.522236
     30     502.182750
     31     522.627098
     32     544.878024
     56     1334.207831
     97     3214.104108
        more properties                :          certainly
(even more tables, matrices etc)

Example for expected output (lines with $2 < 200.0 but > 0.0 from "Dataset of interest" only):

      6      41.572543
      7      47.082104
      8      66.534233
      9     112.409794
     10     121.771594
     11     134.398445

$2 > 0 will keep the 0.000000 lines in output, so I tried this:

awk '/Dataset of interest/,/more properties/ { if ( $2 > 0.000001 && $2 < 500) print $1, $2}' file.txt

and this:

awk '/Dataset of interest/,/more properties/ {for(i = 2 ; i <= NF ; i++) { if ($i ~ /^[0-9]+[.][0-9]+$/ && $i > 0.000001 && $i < 500 ) print $1, $2}}' file.txt

which both print the same:

      6      41.572543
      7      47.082104
      8      66.534233
      9     112.409794
     10     121.771594
     11     134.398445
     20     302.705391
     21     304.451176
     28     474.496684
     29     495.522236
     97     3214.104108

The 3214 line remains, even though the 1334 line is dropped. After testing, I can confirm that values below 2000 are dropped as expected, but values of 2000 or higher remain in the output, which doesn't make any sense to me. The upper threshold should be a variable, but changing the limit to, for example, $2 < 200 prints nothing at all. What did I do wrong?

Edit:
Thanks for the quick responses!

awk '$2 < 200 && $2 > 0' file

returns:

0      0.000000
1      0.000000
9      112.409794
10     121.771594
11     134.398445
56     1334.207831

Sorry for not specifying earlier, I'm using mawk 1.3.4

grep -E '544|1334|3214' file.txt | xxd
00000000: 3332 2020 2020 2035 3434 2e38 3738 3032  32     544.87802
00000010: 340a 3536 2020 2020 2031 3333 342e 3230  4.56     1334.20
00000020: 3738 3331 0a20 3937 2020 2020 3332 3134  7831. 97    3214
00000030: 2e31 3034 3130 380a                      .104108.

grep -E '544|1334|3214' file.txt | od -c
0000000   3   2                       5   4   4   .   8   7   8   0   2
0000020   4  \n   5   6                       1   3   3   4   .   2   0
0000040   7   8   3   1  \n       9   7                   3   2   1   4
0000060   .   1   0   4   1   0   8  \n
0000070

The space in front of 97 should not affect the column assignment (or does it?); the bytes look fine to me.

Edit2:

Thanks @shellter, that was it! (feeling a bit stupid now...)

awk '/Dataset of interest/,/more properties/ {for(i = 2 ; i <= NF ; i++) { if ($i ~ /^[0-9]+[.][0-9]+$/ && $i+0 > 0 && $i+0 < 500 ) print $1, $2}}' file.txt

This version (with the conditions changed from $i >/< to $i+0 >/<) does exactly what I need. The regex test $i ~ /^[0-9]+[.][0-9]+$/ alone doesn't cover it. By the way, $i+0.0 also works and yields the same output.
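
For the stated goal of a variable upper threshold, a minimal sketch along these lines might work. It assumes any POSIX awk on the PATH (not specifically mawk), passes the limit in with -v, and pins LC_ALL=C for the awk call. The names filter_dataset, max and inside are made up for illustration, and the sample data is a shortened copy of the table above.

```shell
#!/bin/sh
# Sketch: print rows of one table whose second column lies strictly
# between 0 and a threshold passed in as an awk variable.
# filter_dataset, max and inside are hypothetical names, not from the post.

filter_dataset() {
    # $1 = upper threshold; the table is read from stdin
    LC_ALL=C awk -v max="$1" '
        /Dataset of interest/ { inside = 1; next }   # start after the header line
        /more properties/     { inside = 0 }         # stop at the end marker
        # "+ 0" forces a numeric comparison even if awk treats the field as a string
        inside && NF == 2 && $2 + 0 > 0 && $2 + 0 < max + 0 { print $1, $2 }
    '
}

# Shortened sample data in the layout shown in the question
sample='        Dataset of interest            :
                  0
      0       0.000000
      6      41.572543
     11     134.398445
     20     302.705391
     97     3214.104108
        more properties                :          certainly'

printf '%s\n' "$sample" | filter_dataset 200
```

With the sample above and a threshold of 200, only the 41.572543 and 134.398445 rows should be printed. The NF == 2 test is one way to skip the lone "0" line under the header; whether that suits the real files is an assumption.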


Answer 1

Score: 2


The setting of LC_NUMERIC in the environment affects how mawk parses numbers from input data.

If you use a locale where numbers are presented differently, mawk may not consider the second field of your data to be a number and comparisons against it will be done as a textual string.
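
To make the difference between the two kinds of comparison concrete, here is a small sketch that forces each kind explicitly, so it does not depend on the locale at all. It assumes a POSIX awk; compare_both, as_string and as_number are made-up names, and the two input values are taken from the question.

```shell
#!/bin/sh
# Contrast string and numeric comparison of the same field against 200.
# ($1 "") forces a string comparison, ($1 + 0) forces a numeric one.
# compare_both is a hypothetical helper name.

compare_both() {
    LC_ALL=C awk '{
        as_string = (($1 "") < "200") ? "yes" : "no"
        as_number = (($1 + 0) < 200)  ? "yes" : "no"
        print $1, "string:" as_string, "numeric:" as_number
    }'
}

printf '41.572543\n1334.207831\n' | compare_both
```

As strings, 41.572543 sorts after 200 (because "4" comes after "2") while 1334.207831 sorts before it, which is the reverse of the numeric result. That pattern matches the output of the `$2 < 200 && $2 > 0` test shown in the question's first edit.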

Consider:

#!/bin/sh

t(){
    date
    printf '1.00\n2.01\n3,01\n' | mawk '
        {
            printf "%s -> %f; >%d? %s\n",
                $1, ($1+0), NR, ($1>NR ? "yes" : "no")
        }
    '
    echo
}

LANG=C t
LANG=en_US.utf8 t
LANG=de_DE.utf8 t

which, with the locale definitions on this machine, prints:

Thu Jun 29 04:54:42 BST 2023
1.00 -> 1.000000; >1? no
2.01 -> 2.010000; >2? yes
3,01 -> 3.000000; >3? yes

Thu Jun 29 04:54:42 AM BST 2023
1.00 -> 1.000000; >1? no
2.01 -> 2.010000; >2? yes
3,01 -> 3.000000; >3? yes

Do 29. Jun 04:54:42 BST 2023
1.00 -> 1,000000; >1? yes
2.01 -> 2,000000; >2? yes
3,01 -> 3,010000; >3? yes

Observe that casting does not necessarily result in the number you may expect as any tail of the string that cannot be treated as part of a number is simply discarded.

Instead of casting, the better fix is to specify an appropriate locale when running mawk.
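
One way to do that, as a sketch rather than the only option, is to set the locale for the single awk invocation. The example below uses LC_ALL=C, which overrides both LANG and LC_NUMERIC, and calls plain awk as a stand-in for mawk. filter_c_locale is a made-up name and the three input rows are copied from the question.

```shell
#!/bin/sh
# Run awk in the C locale for this one command only, so "." is the
# decimal point whatever the surrounding environment says.
# filter_c_locale is a hypothetical helper name.

filter_c_locale() {
    LC_ALL=C awk '$2 > 0 && $2 < 200 { print $1, $2 }'
}

printf '      1       0.000000\n      9     112.409794\n     56     1334.207831\n' | filter_c_locale
```

With the locale pinned this way, the plain `$2 > 0 && $2 < 200` test should keep only the 112.409794 row, with no `+0` needed.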

Posted by huangapple on 2023-06-29 00:11:25
Source: https://go.coder-hub.com/76574979.html