June 1, 2023 23:10:58

Delete all the lines including and after the nth occurrence of a pattern

Question

Basically what the title says: I have some large TSV files (approx. 20k lines each) and I want to delete the rest of each file once a specific column matches a string for the second time (including that line).

Answer 1

Score: 1

awk '{print $0} $1=="yourstring"{if(++found==2)exit}' test.tsv

Where $1 is the "specific column" and yourstring is the string you are searching for. By default awk splits fields on any whitespace; for a TSV whose fields may contain spaces, add -F'\t' so that only tabs separate columns.

This prints each line and then checks whether the first column equals yourstring. If it does, awk increments the variable found and tests whether it has reached 2. If so, awk exits, so the second matching line is the last one printed.

Edit: If instead you want to delete the second occurrence as well (along with everything after it), flipping the two blocks around will accomplish this:

awk '$1=="yourstring"{if(++found==2)exit} {print $0}' test.tsv
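A minimal sketch of the same idea, generalized to the nth occurrence in a tab-separated file. The file name sample.tsv, its contents, the function name truncate_at_nth, and the variables col, pat, and n are all made up for illustration. The -F'\t' option is an addition, not part of the answer above.

```shell
# Hypothetical TSV (id <TAB> label); "stop" appears in column 2 three times.
printf '1\tgo\n2\tstop\n3\tgo\n4\tstop\n5\tgo\n6\tstop\n' > sample.tsv

# Usage: truncate_at_nth FILE COLUMN STRING N
# Prints lines until COLUMN equals STRING for the N-th time;
# that line and everything after it are dropped.
truncate_at_nth() {
    # ($col "") turns the field into a string, forcing a string comparison.
    awk -F'\t' -v col="$2" -v pat="$3" -v n="$4" \
        '($col "") == pat && ++found == n { exit } { print }' "$1"
}

truncate_at_nth sample.tsv 2 stop 2
```

With this input the call prints rows 1 to 3. To change the file itself, redirect the output to a temporary file and then move it over the original; redirecting straight onto the input file would empty it before awk reads it.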

Answer 2

Score: 0

I would do it the following way. Let the content of file.txt be

then

awk '{c+=/0$/}c>=2{exit}{print}' file.txt

gives output

Explanation: a regular expression used as a value evaluates to 1 if the line matches and 0 otherwise, so c increases by 1 on a match and by 0 when there is none. I use 0$, meaning a line ending in zero, for demonstration purposes. Once c is greater than or equal to 2, I exit, so the line containing the 2nd match and all following lines are dropped. If exit has not happened yet, I print the line as-is.

(tested in GNU Awk 5.1.0)
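The sample input and output did not survive in this copy of the answer. As a stand-in, here is a hypothetical run: the file name demo.txt and its contents are invented purely to show how the counter behaves.

```shell
# Hypothetical input: the lines ending in 0 are 10, 20 and 30.
printf '%s\n' 5 10 7 20 8 30 > demo.txt

# c counts lines ending in 0; on the second such line awk exits before printing it.
awk '{c+=/0$/} c>=2{exit} {print}' demo.txt > demo.out

cat demo.out
```

Here demo.out ends up containing 5, 10 and 7: the first match (10) is kept, and the second match (20) and everything after it are gone.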

Answer 3

Score: 0

Assuming the TSV does not have quoted newlines or tabs, the logic for a (GNU) sed solution is quite simple, but finding the column is harder than with awk:

sed -n -e '
    0,/BRE/bp
    //q
    :p
    p
' <test.tsv >out.tsv

With POSIX sed, it is a bit more complicated:

sed -n -e '
    /BRE/!bp
    :r
    p
    n
    //q
    br
    :p
    p
' <test.tsv >out.tsv

These methods wouldn't really scale if the number of matches desired were higher, as sed can't really count.

In the code above, BRE should be replaced by a regex that matches n-1 columns followed by the string being sought. (\t for tab is a GNU sed extension; a strictly POSIX sed needs a literal tab character instead.) For example, to find the.needle as the 4th column (of 5 or more):

^\([^\t]*\t\)\{3\}the\.needle\t

If the 4th column is the final column:

\tthe\.needle$
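A rough sketch of the column regex plugged into the POSIX-style script. The file hay.tsv and its contents are hypothetical, and the regex is built with a literal tab held in a shell variable so that it does not depend on sed understanding \t.

```shell
# Hypothetical 5-column TSV; "the.needle" is in column 4 on rows b and d.
# Row c holds "theXneedle", which the escaped dot must not match.
printf 'a\t1\t2\tx\tend\nb\t1\t2\tthe.needle\tend\nc\t1\t2\ttheXneedle\tend\nd\t1\t2\tthe.needle\tend\ne\t1\t2\tx\tend\n' > hay.tsv

# A literal tab character, captured from printf.
tab=$(printf '\t')

# Three "field + tab" groups, then the needle, then the tab that ends column 4.
bre="^\\([^${tab}]*${tab}\\)\\{3\\}the\\.needle${tab}"

# Print up to, but not including, the second line whose 4th column is the.needle.
sed -n -e "
    /${bre}/!bp
    :r
    p
    n
    //q
    br
    :p
    p
" < hay.tsv > hay.out

cat hay.out
```

With this input, hay.out keeps rows a, b and c: row b is the first match and is printed, and row d is the second match, where sed quits.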
