awk: printing lines between a start and an end pattern when a floating-point value is below a threshold fails for values with 4 digits before the decimal point
Question
I have some text files that contain multiple tables and matrices. I need to target one of them and print a line if its second-column value (a floating-point number) is smaller than a given threshold.
The data looks like this (original format):
(some tables, matrices etc)
Dataset of interest :
0
0 0.000000
1 0.000000
6 41.572543
7 47.082104
8 66.534233
9 112.409794
10 121.771594
11 134.398445
20 302.705391
21 304.451176
28 474.496684
29 495.522236
30 502.182750
31 522.627098
32 544.878024
56 1334.207831
97 3214.104108
more properties : certainly
(even more tables, matrices etc)
Example of the expected output (lines with $2 < 200.0 but > 0.0 from "Dataset of interest" only):
6 41.572543
7 47.082104
8 66.534233
9 112.409794
10 121.771594
11 134.398445
$2 > 0 will keep the 0.000000 lines in output, so I tried this:
awk '/Dataset of interest/,/more properties/ { if ( $2 > 0.000001 && $2 < 500) print $1, $2}' file.txt
and this:
awk '/Dataset of interest/,/more properties/ {for(i = 2 ; i <= NF ; i++) { if ($i ~ /^[0-9]+[.][0-9]+$/ && $i > 0.000001 && $i < 500 ) print $1, $2}}' file.txt
which both print the same:
6 41.572543
7 47.082104
8 66.534233
9 112.409794
10 121.771594
11 134.398445
20 302.705391
21 304.451176
28 474.496684
29 495.522236
97 3214.104108
The 3214 line remains, even though this gets rid of the 1334 line. After testing, I can confirm that it gets rid of anything lower than 2000, but 2000 or higher will remain in output, which doesn't make any sense to me. The upper threshold should be a variable, but changing the limit for example to $2 < 200 won't print anything at all. What did I do wrong?
Edit:
Thanks for the quick responses!
awk '$2 < 200 && $2 > 0' file
returns:
0 0.000000
1 0.000000
9 112.409794
10 121.771594
11 134.398445
56 1334.207831
Sorry for not specifying earlier, I'm using mawk 1.3.4
grep -E '544|1334|3214' file.txt | xxd
00000000: 3332 2020 2020 2035 3434 2e38 3738 3032 32 544.87802
00000010: 340a 3536 2020 2020 2031 3333 342e 3230 4.56 1334.20
00000020: 3738 3331 0a20 3937 2020 2020 3332 3134 7831. 97 3214
00000030: 2e31 3034 3130 380a .104108.
grep -E '544|1334|3214' file.txt | od -c
0000000 3 2 5 4 4 . 8 7 8 0 2
0000020 4 \n 5 6 1 3 3 4 . 2 0
0000040 7 8 3 1 \n 9 7 3 2 1 4
0000060 . 1 0 4 1 0 8 \n
0000070
The space in front of 97 should not affect the column assignment (or does it?); the data looks fine to me.
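On the leading-space question, a minimal sketch (assuming awk's default field separator, i.e. no -F or FS override; the variable name fields is only illustrative) suggests that leading blanks are dropped before the record is split, so 97 still lands in $1:

```shell
#!/bin/sh
# Feed the line roughly as it appears in the hex dump: one leading space,
# then a run of spaces between the two columns.
fields=$(printf ' 97    3214.104108\n' | awk '{ print NF ": [" $1 "] [" $2 "]" }')
printf '%s\n' "$fields"
```

This prints 2: [97] [3214.104108], so the leading space by itself is unlikely to be the cause.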
Edit2:
Thanks @shellter, that was it! (feeling a bit stupid now...)
awk '/Dataset of interest/,/more properties/ {for(i = 2 ; i <= NF ; i++) { if ($i ~ /^[0-9]+[.][0-9]+$/ && $i+0 > 0 && $i+0 < 500 ) print $1, $2}}' file.txt
This version (with the conditions changed from $i >/< to $i+0 >/<) does exactly what I need. The check $i ~ /^[0-9]+[.][0-9]+$/ alone is not enough, because it does not force a numeric comparison. By the way, $i+0.0 also works and yields the same output.
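The difference between the two kinds of comparison can be shown in isolation. The sketch below is only an illustration: it forces a string comparison by concatenating an empty string and comparing against the string constant "200", and it does not reproduce whatever made mawk choose string comparison in the first place. The variable name report is hypothetical.

```shell
#!/bin/sh
# Compare each value against 200 twice: once forced to a string comparison
# (concatenate "" and use a string constant), once forced numeric (+0).
report=$(printf '%s\n' 41.572543 112.409794 1334.207831 3214.104108 |
  LC_ALL=C awk '{
    s = (($1 "") < "200") ? "yes" : "no"   # character-by-character comparison
    n = (($1 + 0) < 200)  ? "yes" : "no"   # arithmetic comparison
    print $1, "string:", s, "numeric:", n
  }')
printf '%s\n' "$report"
```

Compared as strings, 1334.207831 sorts before 200 (because "1" < "2") while 41.572543 does not (because "4" > "2"). That matches the pattern in the output of awk '$2 < 200 && $2 > 0' shown in the first edit.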
Answer 1
Score: 2
The setting of LC_NUMERIC in the environment affects how mawk parses numbers from input data.
If you use a locale where numbers are presented differently, mawk may not consider the second field of your data to be a number, and comparisons against it will then be done as textual string comparisons.
Consider:
#!/bin/sh
t(){
    date
    printf '1.00\n2.01\n3,01\n' | mawk '
        {
            printf "%s -> %f; >%d? %s\n",
                $1, ($1+0), NR, ($1>NR ? "yes" : "no")
        }
    '
    echo
}
LANG=C t
LANG=en_US.utf8 t
LANG=de_DE.utf8 t
which, with the locale definitions on this machine, prints:
Thu Jun 29 04:54:42 BST 2023
1.00 -> 1.000000; >1? no
2.01 -> 2.010000; >2? yes
3,01 -> 3.000000; >3? yes
Thu Jun 29 04:54:42 AM BST 2023
1.00 -> 1.000000; >1? no
2.01 -> 2.010000; >2? yes
3,01 -> 3.000000; >3? yes
Do 29. Jun 04:54:42 BST 2023
1.00 -> 1,000000; >1? yes
2.01 -> 2,000000; >2? yes
3,01 -> 3,010000; >3? yes
Observe that casting does not necessarily result in the number you may expect as any tail of the string that cannot be treated as part of a number is simply discarded.
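A small sketch of that truncation, pinned to the C locale so that "." is the decimal separator (the variable name converted is illustrative, and plain awk stands in for mawk here):

```shell
#!/bin/sh
# String-to-number conversion stops at the first character that cannot
# continue a number; the rest of the string is silently ignored.
converted=$(LC_ALL=C awk 'BEGIN { print "3,01" + 0; print "12abc" + 0; print "abc" + 0 }')
printf '%s\n' "$converted"
```

This prints 3, 12 and 0 on separate lines: the ",01" and "abc" tails are dropped without any warning, so a +0 cast can hide malformed input instead of flagging it.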
Instead of casting, the better fix is to specify an appropriate locale when running mawk.
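As a hedged sketch of what that could look like for the data in the question, the command can be run with LC_ALL=C (which overrides LC_NUMERIC and LANG) and the upper threshold passed in with -v. The trimmed sample data and the names sample, max and result are assumptions for illustration, and plain awk is used so the snippet runs where mawk is not installed; substitute mawk as needed. The regex guard and the +0 from the question are kept as a belt-and-braces measure.

```shell
#!/bin/sh
# Hypothetical sample: a trimmed copy of the data from the question.
sample='(some tables, matrices etc)
Dataset of interest :
0
0 0.000000
1 0.000000
6 41.572543
11 134.398445
20 302.705391
56 1334.207831
 97 3214.104108
more properties : certainly
(even more tables, matrices etc)'

max=200   # upper threshold, passed to awk as a variable

# LC_ALL=C pins number parsing to "." as the decimal separator.
result=$(printf '%s\n' "$sample" | LC_ALL=C awk -v max="$max" '
  /Dataset of interest/,/more properties/ {
    # Only treat $2 as a value if it looks like a decimal number.
    if ($2 ~ /^[0-9]+[.][0-9]+$/ && $2 + 0 > 0 && $2 + 0 < max + 0)
      print $1, $2
  }')
printf '%s\n' "$result"
```

With max=200 this prints only the 41.572543 and 134.398445 lines; the 0.000000 rows, the section header lines, and everything at or above the threshold are skipped.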