Delete all the lines including and after nth occurrence of pattern

Question

Basically what the title says: I have some large TSV files (approx. 20k lines) and I want to delete the rest of each file once a specific column matches a string for the second time (including that line).

Answer 1

Score: 1

awk '{print $0} $1=="yourstring"{if(++found==2)exit}' test.tsv

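Where $1 is the "specific column" and yourstring is the string you are searching for. Because awk splits fields on any whitespace by default, add -F'\t' if fields in your TSV can contain spaces.

This prints each line and then checks whether the first column equals yourstring. On a match it increments the variable found and tests whether it has reached 2. If so, awk exits, so the second matching line is the last one printed.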

Edit: If you instead want to delete the second occurrence as well (along with everything after it), swap the two blocks:

awk '$1=="yourstring"{if(++found==2)exit} {print $0}' test.tsv
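
The same idea can be parameterized so that the column, the string, and the occurrence count are not hard-coded. This is only a sketch: the variable names col, str and n and the sample data are made up for illustration, and -F'\t' is added on the assumption that the input is strictly tab-separated. Note that awk interprets backslash escapes in values passed with -v, so a search string containing backslashes would need extra care.

```shell
#!/bin/sh
# Hypothetical sample data: the second column holds the value to match.
tmp=$(mktemp)
printf 'id\tname\n1\tfoo bar\n2\tstop\n3\tbaz\n4\tstop\n5\tqux\n' > "$tmp"

# col, str and n are illustrative names, not part of the original answer.
result=$(awk -F'\t' -v col=2 -v str='stop' -v n=2 '
    $col == str && ++found == n { exit }  # nth match: quit before printing it
    { print }                             # every earlier line is kept
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

With this sample input the header line and rows 1 to 3 are printed; row 4 (the second "stop") and everything after it are dropped.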

Answer 2

Score: 0

I would do it the following way. Let the content of file.txt be

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25

then

awk '{c+=/0$/}c>=2{exit}{print}' file.txt

gives output

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Explanation: a regular expression used as a value evaluates to 1 if the line matches and to 0 otherwise, so c grows by 1 on each matching line and stays unchanged on the others. I use 0$ (a line ending in zero) purely for demonstration. Once c is greater than or equal to 2, awk exits, so the line containing the 2nd match and everything after it are dropped. If exit has not happened yet, the line is printed as-is.

(tested in GNU Awk 5.1.0)
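
To tie the counting trick to a specific column of a TSV, as the question asks, the match can be applied to a field instead of the whole line. A minimal sketch, assuming tab-separated input; the two-column sample data (a number and the fixed label "item") is invented for the example:

```shell
#!/bin/sh
# Hypothetical data: column 1 is a number from 1 to 25, column 2 is a fixed label.
tmp=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 25; i++) printf "%d\titem\n", i }' > "$tmp"

# Count only matches in column 1; a whole-line /0$/ would never match here,
# because every line ends with the label from column 2.
result=$(awk -F'\t' '{ c += ($1 ~ /0$/) } c >= 2 { exit } { print }' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

This prints the rows numbered 1 to 19: row 10 is the first match and is kept, while row 20 is the second match and triggers the exit.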

Answer 3

Score: 0

Assuming the tsv does not have quoted newlines or tabs, the logic for a (GNU) sed solution is quite simple but finding the column is harder than with awk:

sed -n -e '
    0,/BRE/bp
    //q
    :p
    p
' <test.tsv >out.tsv

With POSIX sed, it is a bit more complicated:

sed -n -e '
    /BRE/!bp
    :r
    p
    n
    //q
    br
    :p
    p
' <test.tsv >out.tsv

These methods don't scale well if a higher number of matches is wanted, as sed can't really count.

In the code above, BRE should be replaced by a regex that matches n-1 columns and then the string being sought. For example, to find the.needle as the 4th column (of 5 or more):

^\([^\t]*\t\)\{3\}the\.needle\t

If the 4th column is the final column:

\tthe\.needle$
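
As far as I know, \t inside a regex is a GNU sed extension and is not defined for POSIX basic regular expressions, so a portable script needs a literal tab character. The sketch below builds the 4th-column BRE with a literal tab and feeds it to the POSIX variant of the script. The sample data and the shell variable names tab and bre are made up for illustration.

```shell
#!/bin/sh
# Hypothetical data: 5 tab-separated columns, the.needle in column 4 on rows 2 and 5,
# and in column 2 on row 4 (which must not count as a match).
tmp=$(mktemp)
{
    printf '1\ta\tb\tother\tz\n'
    printf '2\ta\tb\tthe.needle\tz\n'
    printf '3\ta\tb\tother\tz\n'
    printf '4\tthe.needle\tb\tother\tz\n'
    printf '5\ta\tb\tthe.needle\tz\n'
    printf '6\ta\tb\tother\tz\n'
} > "$tmp"

tab=$(printf '\t')   # a literal tab, since \t in a BRE is not portable
bre="^\\([^${tab}]*${tab}\\)\\{3\\}the\\.needle${tab}"

# Same logic as the POSIX script above, split into -e fragments.
result=$(sed -n -e "/${bre}/!bp" -e ':r' -e 'p' -e 'n' -e '//q' \
             -e 'br' -e ':p' -e 'p' < "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Rows 1 to 4 are printed here: row 2 is the first match in column 4, row 4 is ignored because the.needle sits in column 2, and row 5 is the second match, at which point sed quits without printing it.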

Posted by huangapple on 2023-06-01 23:10:58
When reposting, please keep the link to this article: https://go.coder-hub.com/76383358.html