Sed is not removing the pattern for a list of headers

Question

Greetings,

I have the following headers in a file with multiple DNA sequences:

>10 AC_000167.1
>11 AC_000168.1
>12 AC_000169.1
>MT NC_006853.1
>X AC_000187.1
>GPS_000341582.1 NW_003097887.1
>GPS_000341583.1 NW_003097888.1
>GPS_000341584.1 NW_003097889.1
>GPS_000341585.1 NW_003097890.1
>GPS_000341586.1 NW_003097891.1

I am using the following sed command to remove everything after the first whitespace character:

sed -i 's/[^(>\d+?MT?X?GPS_\d+\.\d+)]\S..\d+\.\d+//g' newHeader.txt

The output should look like this:

>10
>11
>12
>MT
>X
>GPS_000341582.1
>GPS_000341583.1
>GPS_000341584.1
>GPS_000341585.1

However, the command does not seem to change anything, and it does not report any error. How can I fix this?

Answer 1

Score: 1

If the intent is to strip everything after the first whitespace character (including that character), but only on certain header lines, then based on the non-working sed command you provided, this may be what you want:

# with a sed that supports -i and -E:
# (\1 in the replacement puts back the captured header name)
sed -i -E 's/^(>([0-9]+|MT|X|GPS_[0-9]+\.[0-9]+))[[:space:]].*/\1/' infile

In the default syntax (BRE), several sed metacharacters are only special when written after a \. With -E (ERE), the backslash is not needed:

  • ^ - match start of line
  • (...) - grouping
  • | - alternation (may not be understood without -E)
  • [list] - any single character from list
    • inside brackets [:space:] matches "space" characters (tab, newline, space, etc)
  • {min,max} - from min to max repetitions of preceding
  • * - zero or more of preceding
  • + - one or more of preceding (not understood without -E)
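
To make the list above concrete, here is a minimal sketch that writes the same substitution once in ERE and once in BRE. It uses a simplified pattern (any run of non-whitespace after `>`), not the exact alternation from the answer, and the variable names are only illustrative.

```shell
# Sample headers taken from the question.
headers='>10 AC_000167.1
>MT NC_006853.1
>GPS_000341582.1 NW_003097887.1'

# ERE (-E): +, ( ) and | are written without backslashes.
ere_out=$(printf '%s\n' "$headers" | sed -E 's/^(>[^[:space:]]+)[[:space:]].*/\1/')

# BRE (default): groups are \( \), and "one or more" is spelled \{1,\}.
bre_out=$(printf '%s\n' "$headers" | sed 's/^\(>[^[:space:]]\{1,\}\)[[:space:]].*/\1/')

printf '%s\n' "$ere_out"
```

Both commands should produce the same three shortened headers; only the spelling of the metacharacters differs.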

Warning: Using -i is quite dangerous. Make sure you have backups of the original file in case something goes wrong.
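
One way to get such a backup is to attach a suffix to -i. The sketch below assumes a sed that accepts the attached form `-i.bak` (GNU sed and BSD sed both do, but -i itself is not part of POSIX). The temporary directory and file contents are made up for the demonstration.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
printf '%s\n' '>10 AC_000167.1' '>X AC_000187.1' > "$workdir/newHeader.txt"

# -i.bak edits newHeader.txt in place and keeps the original as newHeader.txt.bak.
sed -i.bak -E 's/^(>[^[:space:]]+)[[:space:]].*/\1/' "$workdir/newHeader.txt"

edited=$(cat "$workdir/newHeader.txt")
backup=$(cat "$workdir/newHeader.txt.bak")
printf 'edited:\n%s\nbackup:\n%s\n' "$edited" "$backup"
```

If the result is wrong, the untouched original is still available under the .bak name.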


Versions of sed that only support POSIX BRE do not support alternation (\|).
With these, one can test each alternative separately:

# with any POSIX sed:
sed '
    # if line matches, branch to label s
    /^\(>X\)[[:space:]].*/bs
    /^\(>MT\)[[:space:]].*/bs
    /^\(>[0-9]\{1,\}\)[[:space:]].*/bs
    /^\(>GPS_[0-9]\{1,\}\.[0-9]\{1,\}\)[[:space:]].*/bs

    # if we got here nothing matched
    # so branch to the end of the script: the line is printed unchanged
    # and the next cycle starts
    b

    :s
    # empty regex reuses the previous one; \1 keeps the captured header name
    s//\1/
' infile >tmpfile && mv tmpfile infile
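
A flatter variant of the same idea, offered only as a sketch, is to give each alternative its own substitution with -e. It avoids labels and the empty-regex shortcut at the cost of repeating the trailing part of the pattern. The sequence line and the `>chrUn` header in the sample are invented to show that lines matching none of the patterns pass through unchanged.

```shell
# Sample: two headers from the question plus two made-up lines.
input='>X AC_000187.1
ACGTNACGT
>GPS_000341582.1 NW_003097887.1
>chrUn extra text'

# One BRE substitution per alternative; a line is only changed by the one it matches.
posix_out=$(printf '%s\n' "$input" | sed \
    -e 's/^\(>X\)[[:space:]].*/\1/' \
    -e 's/^\(>MT\)[[:space:]].*/\1/' \
    -e 's/^\(>[0-9]\{1,\}\)[[:space:]].*/\1/' \
    -e 's/^\(>GPS_[0-9]\{1,\}\.[0-9]\{1,\}\)[[:space:]].*/\1/')

printf '%s\n' "$posix_out"
```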

Answer 2

Score: 0

With sed:

sed -i -E 's/^([^ ]+) .*/\1/' file

The regular expression matches as follows:

Node     Explanation
^        anchor: the beginning of the line
(        group and capture to \1:
[^ ]+    any character except a space (1 or more times, matching as many as possible)
)        end of \1
' '      a space
.*       any character except \n (0 or more times, matching as many as possible)
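
Note that this pattern looks for a literal space only. The sketch below uses a made-up tab-separated header to show the difference from a pattern built on the POSIX class [[:space:]]. Whether real headers ever contain tabs is an assumption, not something stated in the question.

```shell
# A header whose fields are separated by a tab instead of a space (made-up input).
tab_header=$(printf '>10\tAC_000167.1')

# Literal-space pattern: there is no space in the line, so nothing is replaced.
space_only=$(printf '%s\n' "$tab_header" | sed -E 's/^([^ ]+) .*/\1/')

# POSIX class pattern: the tab counts as whitespace, so the tail is removed.
any_space=$(printf '%s\n' "$tab_header" | sed -E 's/^([^[:space:]]+)[[:space:]].*/\1/')

printf 'space only: %s\nany space:  %s\n' "$space_only" "$any_space"
```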

With grep:

grep -oP '^>\S+' file
>10
>11
>12
>MT
>X
>GPS_000341582.1
>GPS_000341583.1
>GPS_000341584.1
>GPS_000341585.1
>GPS_000341586.1

The regular expression matches as follows:

Node     Explanation
^        anchor: the beginning of the line
>        a literal > character
\S+      non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times, matching as many as possible)

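If you want to edit in place (sponge is part of the moreutils package; note that grep -o keeps only the matched header lines, so any other lines in the file are dropped):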

 grep -oP '^>\S+' file | sponge file

huangapple
  • Published on May 11, 2023, 06:29:03
  • When reposting, please keep the link to this article: https://go.coder-hub.com/76222969.html