Sed is not removing the pattern for a list of headers
Question
Greetings,
I have the following headers in a file with multiple DNA sequences:
>10 AC_000167.1
>11 AC_000168.1
>12 AC_000169.1
>MT NC_006853.1
>X AC_000187.1
>GPS_000341582.1 NW_003097887.1
>GPS_000341583.1 NW_003097888.1
>GPS_000341584.1 NW_003097889.1
>GPS_000341585.1 NW_003097890.1
>GPS_000341586.1 NW_003097891.1
I am using the following sed command to remove everything after the first whitespace:
sed -i 's/[^(>\d+?MT?X?GPS_\d+\.\d+)]\S..\d+\.\d+//g' newHeader.txt
The output should look like this:
>10
>11
>12
>MT
>X
>GPS_000341582.1
>GPS_000341583.1
>GPS_000341584.1
>GPS_000341585.1
However, the command does not seem to work, and it does not report any error. How can I fix this?
Answer 1
Score: 1
If the intent is to strip everything after the first space (including the space), but only on some specific lines, then, based on the non-working sed command you provided, this may be what you want:
# with a sed that supports -i and -E:
sed -i -E 's/^(>([0-9]+|MT|X|GPS_[0-9]+\.[0-9]+))[[:space:]].*/\1/' infile
Without -E, several of the metacharacters below (grouping, alternation, intervals, +) must be preceded by \ to be special. The backslash is not needed with -E:
- ^ - match start of line
- (...) - grouping
- | - alternation (may not be understood without -E)
- [list] - any single character from list
  - inside brackets, [:space:] matches "space" characters (tab, newline, space, etc)
- {min,max} - from min to max repetitions of preceding
- * - zero or more of preceding
- + - one or more of preceding (not understood without -E)
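A minimal sketch of the backslash difference described above, assuming a sed that accepts -E (GNU and BSD sed do). The sample line and the variable names are only illustrative; the pattern is a simplified one, not the exact command from this answer.

```shell
# One sample header in the question's format.
line='>GPS_000341582.1 NW_003097887.1'

# BRE (default): grouping and the interval need backslashes.
bre_out=$(printf '%s\n' "$line" | sed 's/^\(>[^[:space:]]\{1,\}\)[[:space:]].*/\1/')

# ERE (-E): the same pattern with the backslashes removed.
ere_out=$(printf '%s\n' "$line" | sed -E 's/^(>[^[:space:]]{1,})[[:space:]].*/\1/')

# Both commands print the header up to the first whitespace.
printf '%s\n%s\n' "$bre_out" "$ere_out"
```

Both forms keep the captured group via \1 in the replacement; only the escaping of the metacharacters differs.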
Warning: Using -i is quite dangerous. Make sure you have backups of the original file in case something goes wrong.
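One way to reduce that risk is to give -i a backup suffix, which both GNU and BSD sed accept when it is attached directly to the option. The sketch below works on a throwaway file in a scratch directory; the file name sample.fa and the simplified pattern are assumptions for illustration.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
printf '%s\n' '>10 AC_000167.1' 'ACGT' > "$workdir/sample.fa"

# -i.bak edits in place and keeps the original as sample.fa.bak.
sed -i.bak -E 's/^(>[^[:space:]]+)[[:space:]].*/\1/' "$workdir/sample.fa"

cat "$workdir/sample.fa"       # edited file
cat "$workdir/sample.fa.bak"   # untouched original
```

If the result is wrong, the .bak copy can simply be moved back over the edited file.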
Versions of sed that only support POSIX BRE do not support alternation (\|). With these, one can test each alternative separately:
# with any POSIX sed:
sed '
# if line matches, branch to label s
/^\(>X\)[[:space:]].*/bs
/^\(>MT\)[[:space:]].*/bs
/^\(>[0-9]\{1,\}\)[[:space:]].*/bs
/^\(>GPS_[0-9]\{1,\}\.[0-9]\{1,\}\)[[:space:]].*/bs
# if we got here nothing matched
# so just print line and start next cycle
b
:s
# empty regex reuses the previous one; \1 keeps the captured header
s//\1/
' infile >tmpfile && mv tmpfile infile
Answer 2
Score: 0
With sed:
sed -i -E 's/^([^ ]+) .*/\1/' file
The regular expression matches as follows:
Node | Explanation |
---|---|
^ | the beginning of the string anchor |
( | group and capture to \1: |
[^ ]+ | any character except: space (1 or more times (matching the most amount possible)) |
) | end of \1 |
' ' | space |
.* | any character except \n (0 or more times (matching the most amount possible)) |
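As a quick check of the pattern in the table, the sketch below runs it on a few of the question's headers without -i, so nothing is modified. The variable names are illustrative, and the replacement uses \1 so that the captured part before the first space is kept.

```shell
# A few header lines copied from the question.
headers='>10 AC_000167.1
>MT NC_006853.1
>X AC_000187.1
>GPS_000341582.1 NW_003097887.1'

# Keep group \1 (everything before the first space), drop the rest.
trimmed=$(printf '%s\n' "$headers" | sed -E 's/^([^ ]+) .*/\1/')

printf '%s\n' "$trimmed"
```

Note that this pattern is not anchored to >, so it would also trim any other line that contains a space.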
With grep:
grep -oP '^>\S+' file
>10
>11
>12
>MT
>X
>GPS_000341582.1
>GPS_000341583.1
>GPS_000341584.1
>GPS_000341585.1
>GPS_000341586.1
The regular expression matches as follows:
Node | Explanation |
---|---|
^ | the beginning of the string anchor |
> | > |
\S+ | non-whitespace (all but \n, \r, \t, \f, and " ") (1 or more times (matching the most amount possible)) |
If you want to edit in place (sponge is part of the moreutils package):
grep -oP '^>\S+' file | sponge file
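If sponge is not installed, a temporary file gives a similar result. This is a sketch under a few assumptions: it works on a throwaway copy in a scratch directory, and it swaps -P and \S+ for an equivalent POSIX bracket expression so that it does not depend on PCRE support in grep.

```shell
# Scratch copy of a header-only file, named after the one in the question.
workdir=$(mktemp -d)
printf '%s\n' '>10 AC_000167.1' '>X AC_000187.1' > "$workdir/newHeader.txt"

# Write to a temporary file first; replace the original only if grep succeeded.
# grep -o prints only the matched part of matching lines, so any line that
# does not start with > (for example sequence data) would be dropped.
grep -o '^>[^[:space:]]\{1,\}' "$workdir/newHeader.txt" > "$workdir/newHeader.tmp" &&
  mv "$workdir/newHeader.tmp" "$workdir/newHeader.txt"

cat "$workdir/newHeader.txt"
```

Because non-matching lines are discarded, this approach suits a file that contains only headers; for a full FASTA file the sed commands above are the safer choice.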