awk to get values between start and end pattern if a floating number is larger than threshold can't handle 4 digits before point

Question

I have some text files that contain multiple tables and matrices. I need to target one of them and print each line whose second-column value (a floating-point number) is smaller than a given threshold.

The file looks like this (original format):

    (some tables, matrices etc)
    Dataset of interest :
    0
    0 0.000000
    1 0.000000
    6 41.572543
    7 47.082104
    8 66.534233
    9 112.409794
    10 121.771594
    11 134.398445
    20 302.705391
    21 304.451176
    28 474.496684
    29 495.522236
    30 502.182750
    31 522.627098
    32 544.878024
    56 1334.207831
    97 3214.104108
    more properties : certainly
    (even more tables, matrices etc)

Example for expected output (lines with $2 < 200.0 but > 0.0 from "Dataset of interest" only):

    6 41.572543
    7 47.082104
    8 66.534233
    9 112.409794
    10 121.771594
    11 134.398445

$2 > 0 will keep the 0.000000 lines in output, so I tried this:

    awk '/Dataset of interest/,/more properties/ { if ( $2 > 0.000001 && $2 < 500) print $1, $2}' file.txt

and this:

    awk '/Dataset of interest/,/more properties/ {for(i = 2 ; i <= NF ; i++) { if ($i ~ /^[0-9]+[.][0-9]+$/ && $i > 0.000001 && $i < 500 ) print $1, $2}}' file.txt

which both print the same:

    6 41.572543
    7 47.082104
    8 66.534233
    9 112.409794
    10 121.771594
    11 134.398445
    20 302.705391
    21 304.451176
    28 474.496684
    29 495.522236
    97 3214.104108

The 3214 line remains, even though this gets rid of the 1334 line. After testing, I can confirm that it gets rid of anything lower than 2000, but 2000 or higher will remain in output, which doesn't make any sense to me. The upper threshold should be a variable, but changing the limit for example to $2 < 200 won't print anything at all. What did I do wrong?

Edit:
Thanks for the quick responses!

    awk '$2 < 200 && $2 > 0' file

returns:

    0 0.000000
    1 0.000000
    9 112.409794
    10 121.771594
    11 134.398445
    56 1334.207831

Sorry for not specifying earlier, I'm using mawk 1.3.4

    grep -E '544|1334|3214' file.txt | xxd
    00000000: 3332 2020 2020 2035 3434 2e38 3738 3032  32     544.87802
    00000010: 340a 3536 2020 2020 2031 3333 342e 3230  4.56     1334.20
    00000020: 3738 3331 0a20 3937 2020 2020 3332 3134  7831. 97    3214
    00000030: 2e31 3034 3130 380a                      .104108.

    grep -E '544|1334|3214' file.txt | od -c
    0000000   3   2                       5   4   4   .   8   7   8   0   2
    0000020   4  \n   5   6                       1   3   3   4   .   2   0
    0000040   7   8   3   1  \n       9   7                   3   2   1   4
    0000060   .   1   0   4   1   0   8  \n
    0000070

The space in front of 97 should not affect the column assignment (or does it?); the rest looks fine to me.

Edit2:

Thanks @shellter, that was it! (feeling a bit stupid now...)

    awk '/Dataset of interest/,/more properties/ {for(i = 2 ; i <= NF ; i++) { if ($i ~ /^[0-9]+[.][0-9]+$/ && $i+0 > 0 && $i+0 < 500 ) print $1, $2}}' file.txt

Changing the comparisons from $i > ... and $i < ... to $i+0 > ... and $i+0 < ... does exactly what I need. The $i ~ /^[0-9]+[.][0-9]+$/ check alone doesn't cover it. By the way, $i+0.0 also works and yields the same output.
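
Since the upper threshold is meant to be a variable, the same coercion can take its limits from the command line. This is a minimal sketch, not a tested drop-in: the names min and max are hypothetical, the limits are passed with -v, and the sample data is a copy of the excerpt from the top of the question.

```shell
#!/bin/sh
# Sample input: a copy of the excerpt from the question.
data='(some tables, matrices etc)
Dataset of interest :
0
0 0.000000
1 0.000000
6 41.572543
7 47.082104
8 66.534233
9 112.409794
10 121.771594
11 134.398445
20 302.705391
21 304.451176
28 474.496684
29 495.522236
30 502.182750
31 522.627098
32 544.878024
56 1334.207831
97 3214.104108
more properties : certainly
(even more tables, matrices etc)'

# min and max are hypothetical names for the limits, passed in with -v.
# Adding 0 on both sides forces a numeric comparison; the regex keeps the
# header and footer lines of the section out of the output.
filtered=$(printf '%s\n' "$data" | awk -v min=0 -v max=200 '
  /Dataset of interest/,/more properties/ {
    if ($2 ~ /^[0-9]+[.][0-9]+$/ && $2 + 0 > min + 0 && $2 + 0 < max + 0)
      print $1, $2
  }')

printf '%s\n' "$filtered"
```

With min=0 and max=200 this should print the six lines from 41.572543 to 134.398445. Changing the limit then only means changing the -v value, not the program text.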

Answer 1

Score: 2

The setting of LC_NUMERIC in the environment affects how mawk parses numbers from input data.

If you use a locale where numbers are presented differently, mawk may not consider the second field of your data to be a number and comparisons against it will be done as a textual string.
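
To see what a textual comparison does to this data, the sketch below forces one on purpose. It does not depend on any particular locale: concatenating an empty string makes awk treat $2 as text. That trick is an illustration chosen for this sketch and not necessarily what mawk does internally. The rows are copied from the question's table, and LC_ALL=C is set only to keep the run reproducible.

```shell
#!/bin/sh
# Sample rows copied from the "Dataset of interest" table in the question.
data='0 0.000000
1 0.000000
6 41.572543
7 47.082104
8 66.534233
9 112.409794
10 121.771594
11 134.398445
20 302.705391
21 304.451176
28 474.496684
29 495.522236
30 502.182750
31 522.627098
32 544.878024
56 1334.207831
97 3214.104108'

# Concatenating "" turns $2 into a string, and the quoted limits are strings
# too, so awk compares character by character instead of by value.
as_text=$(printf '%s\n' "$data" |
  LC_ALL=C awk '($2 "") < "200" && ($2 "") > "0"')

# Adding 0 turns $2 into a number, so the comparison is by value.
as_number=$(printf '%s\n' "$data" |
  LC_ALL=C awk '($2 + 0) < 200 && ($2 + 0) > 0')

printf 'text comparison:\n%s\n\nnumeric comparison:\n%s\n' "$as_text" "$as_number"
```

Under these assumptions the text comparison should return the same six lines as the asker's first edit (both 0.000000 rows, 112, 121, 134 and 1334), because "1334.207831" sorts before "200" character by character. The numeric comparison should return the six expected lines from 41.572543 to 134.398445.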

Consider:

    #!/bin/sh
    # Run the same mawk program under whichever locale is set for the call.
    t(){
        date
        printf '1.00\n2.01\n3,01\n' | mawk '
            {
                # field, field coerced to a number, record number, is field > NR
                printf "%s -> %f; >%d? %s\n",
                    $1, ($1+0), NR, ($1>NR ? "yes" : "no")
            }
        '
        echo
    }
    LANG=C t
    LANG=en_US.utf8 t
    LANG=de_DE.utf8 t

which, with the locale definitions on this machine, prints:

    Thu Jun 29 04:54:42 BST 2023
    1.00 -> 1.000000; >1? no
    2.01 -> 2.010000; >2? yes
    3,01 -> 3.000000; >3? yes

    Thu Jun 29 04:54:42 AM BST 2023
    1.00 -> 1.000000; >1? no
    2.01 -> 2.010000; >2? yes
    3,01 -> 3.000000; >3? yes

    Do 29. Jun 04:54:42 BST 2023
    1.00 -> 1,000000; >1? yes
    2.01 -> 2,000000; >2? yes
    3,01 -> 3,010000; >3? yes

Observe that casting does not necessarily result in the number you may expect as any tail of the string that cannot be treated as part of a number is simply discarded.
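
The discarded tail is easy to check in isolation. This is a small sketch, assuming the C locale (set here with LC_ALL=C, which is an assumption of this sketch and not part of the test above) and a few made-up inputs:

```shell
#!/bin/sh
# LC_ALL=C is assumed so that "." is the only decimal separator.
# The inputs are made up for illustration.
converted=$(printf '1.00\n3,01\n12abc\nabc\n' |
  LC_ALL=C awk '{ print $1 " -> " ($1 + 0) }')

printf '%s\n' "$converted"
```

With these inputs awk should report 1, 3, 12 and 0. Only the leading part that parses as a number survives, and a string with no numeric prefix becomes 0 without any warning.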

Instead of casting, the better fix is to specify an appropriate locale when running mawk.
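
One way to do that is to prefix the command with LC_ALL=C, so the setting applies to that single invocation only. This sketch assumes the data always uses "." as the decimal separator; the C locale is a choice made for this example, not something the question confirms. The input is a shortened excerpt of the question's data.

```shell
#!/bin/sh
# Shortened excerpt of the data in the question, enough to cover each case.
data='Dataset of interest :
0 0.000000
6 41.572543
11 134.398445
20 302.705391
56 1334.207831
 97    3214.104108
more properties : certainly'

# LC_ALL=C overrides LC_NUMERIC and LANG for this one command only, so the
# fields are read with "." as the decimal separator and compared as numbers.
result=$(printf '%s\n' "$data" |
  LC_ALL=C awk '/Dataset of interest/,/more properties/ {
    if ($2 > 0 && $2 < 200) print $1, $2
  }')

printf '%s\n' "$result"
```

For this excerpt the command should print only the 41.572543 and 134.398445 lines, with no +0 coercion in the program.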

huangapple
  • Posted on June 29, 2023, 00:11:25
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76574979.html