Delete all the lines including and after nth occurrence of pattern
Question
Basically what the title says: I have some large TSV files (approx. 20k lines each) and I want to delete the rest of each file once a specific column matches a string for the second time (including the matching line).
Answer 1
Score: 1
awk '{print $0} $1=="yourstring"{if(++found==2)exit}' test.tsv
Where $1 is the "specific column" and yourstring is the string you are searching for. Because awk splits fields on any whitespace by default, add -F'\t' if fields in your TSV can contain spaces.
This prints each line and then checks whether the first column equals yourstring. If it does, it increments a variable found and tests whether it has reached 2. If so, awk exits. Because the line is printed before the check, the second matching line itself is still printed.
Edit: If you instead want to delete the second occurrence as well (along with everything after it), which is what the question asks for, flipping the two blocks around will accomplish this:
awk '$1=="yourstring"{if(++found==2)exit} {print $0}' test.tsv
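As a minimal sketch of the flipped form on made-up data: the three-column sample, the choice of column 2, and the string stop are all assumptions for illustration, not taken from the question. It also passes -F'\t' so that awk splits on tabs only.

```shell
# Hypothetical 3-column TSV; "stop" is the string looked for in column 2.
tsv=$(printf 'id\tstate\tnote\n1\tgo\tfirst row\n2\tstop\thas spaces here\n3\tgo\tstop\n4\tstop\tsecond match\n5\tgo\tnever printed\n')

# -F'\t' splits on tabs only, so a field like "has spaces here" stays one field.
# The match block runs before the print block, so the 2nd matching line
# (and everything after it) is never printed.
kept=$(printf '%s\n' "$tsv" | awk -F'\t' '$2=="stop"{if(++found==2)exit} {print}')

printf '%s\n' "$kept"
```

Row 3 is kept because its stop sits in column 3 rather than column 2; the output ends with that row.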
Answer 2
Score: 0
I would do it the following way. Let the content of file.txt be:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
then running
awk '{c+=/0$/}c>=2{exit}{print}' file.txt
gives the output:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Explanation: a regular expression used as a value evaluates to 1 if the current line matches and to 0 otherwise, so c is increased by 1 when there is a match and by 0 when there is not. I use 0$, meaning a line ending with zero, for demonstration purposes. If c is greater than or equal to 2, I exit, so the line with the 2nd match and all following lines are dropped. If exit has not been executed yet, I print the line as-is.
(tested in GNU Awk 5.1.0)
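Since the title asks about the nth occurrence, the same counting idea can be parameterised. The following is only a sketch: the variable name n, passed with -v, and the 1..35 sample input are assumptions for illustration, and the match is written out as ($0 ~ /0$/), which is equivalent to the bare /0$/ used above.

```shell
# Hypothetical input: the numbers 1..35, one per line.
nums=$(awk 'BEGIN { for (i = 1; i <= 35; i++) print i }')

# n is the occurrence at which output stops (passed in with -v).
# c counts matching lines; once c reaches n, awk exits before printing.
out=$(printf '%s\n' "$nums" | awk -v n=3 '{ c += ($0 ~ /0$/) } c >= n { exit } { print }')

printf '%s\n' "$out" | tail -n 3
```

With n=3 the third line ending in zero is 30, so the output stops at 29.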
Answer 3
Score: 0
Assuming the TSV does not have quoted newlines or tabs, the logic for a (GNU) sed solution is quite simple, but finding the column is harder than with awk:
sed -n -e '
0,/BRE/bp
//q
:p
p
' <test.tsv >out.tsv
With POSIX sed, it is a bit more complicated:
sed -n -e '
/BRE/!bp
:r
p
n
//q
br
:p
p
' <test.tsv >out.tsv
These methods wouldn't really scale if a higher number of matches were desired, as sed can't really count.
In the code above, BRE should be replaced by a regex that matches n-1 columns followed by the string being sought. For example, to find the.needle as the 4th column (of 5 or more):
^\([^\t]*\t\)\{3\}the\.needle\t
If the 4th column is the final column:
\tthe\.needle$
Note that \t inside a regex is a GNU sed extension; with a strict POSIX sed, use a literal tab character instead.
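To see the pieces fit together, here is a sketch of the POSIX variant with the 4th-column regex substituted for BRE. The sample rows and the shell variables tab and bre are made up for illustration; the tab is inserted literally via printf so that the regex does not depend on \t being understood by sed.

```shell
# A literal tab character, so the regex works without the \t escape.
tab=$(printf '\t')

# Three "field + tab" groups, then the.needle as the 4th field, then a tab.
bre="^\([^${tab}]*${tab}\)\{3\}the\.needle${tab}"

# Hypothetical 5-column sample: rows 2 and 4 have the.needle in column 4;
# row 3 has it in column 1 only, so it must not count as a match.
sample=$(printf 'a\tb\tc\tx\te\na\tb\tc\tthe.needle\te\nthe.needle\tb\tc\ty\te\na\tb\tc\tthe.needle\te\na\tb\tc\tz\te\n')

# Same commands as the POSIX script above, passed as separate -e fragments.
sed_result=$(printf '%s\n' "$sample" | sed -n \
  -e "/${bre}/"'!bp' \
  -e ':r' -e 'p' -e 'n' -e '//q' -e 'br' \
  -e ':p' -e 'p')

printf '%s\n' "$sed_result"
```

Only the first three rows are printed: sed quits when it reads row 4, the second row whose 4th column is the.needle.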